├── .gitignore ├── README.md ├── RESOURCES.md ├── data ├── biden_inauguration_millercenter.txt └── trump_inauguration_millercenter.txt ├── figs ├── aliasing.png ├── conditional_statements.png ├── control_flow.png ├── decomposition_abstraction.png ├── functions.png ├── good_programmer.jpg ├── iteration.png ├── markup_lang.png ├── person_greta.png ├── procedural_object-oriented.png ├── program_lang.png ├── python.png ├── python_workshop.png └── sets.png ├── python_example.ipynb ├── python_intro.ipynb └── speech_analysis.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 92 | #Pipfile.lock 93 | 94 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow 95 | __pypackages__/ 96 | 97 | # Celery stuff 98 | celerybeat-schedule 99 | celerybeat.pid 100 | 101 | # SageMath parsed files 102 | *.sage.py 103 | 104 | # Environments 105 | .env 106 | .venv 107 | env/ 108 | venv/ 109 | ENV/ 110 | env.bak/ 111 | venv.bak/ 112 | 113 | # Spyder project settings 114 | .spyderproject 115 | .spyproject 116 | 117 | # Rope project settings 118 | .ropeproject 119 | 120 | # mkdocs documentation 121 | /site 122 | 123 | # mypy 124 | .mypy_cache/ 125 | .dmypy.json 126 | dmypy.json 127 | 128 | # Pyre type checker 129 | .pyre/ 130 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Introduction to Programming with Python 2 | 3 | ## Workshop Information 4 | 5 | * The live session will take place on June 16, 2021. 6 | 7 | 8 | ## Instructors 9 | 10 | * [Milena Tsvetkova](mailto:m.tsvetkova@lse.ac.uk), Assistant Professor of Computational Social Science, Department of Methodology, LSE 11 | * Yuanmo He (GTA), PhD student in Social Research Methodology, Department of Methodology, LSE 12 | 13 | ## Description 14 | 15 | This workshop introduces students to the fundamentals of computer programming in Python. The workshop is intended for students who lack a formal background in the field. Topics include data types, control structures, functions, and an introduction to the principles of object-oriented programming. We will learn to design and write simple computer programs, using a practical example from computational social science. 16 | 17 | 18 | ## Prerequisites 19 | 20 | This is an introductory workshop and no prior experience with programming or Python is required. 21 | 22 | 23 | ## Software 24 | 25 | The workshop will use Jupyter Notebooks to write and edit code. Students have two options: 26 | 27 | 1. 
Use [Google Colab](https://colab.research.google.com/notebooks/intro.ipynb#recent=true) to work in the cloud via your web browser. 28 | 1. You will need a Google account to use Colab, so create one if you don't already have it. Other than that, you do not need to install any specialized software. 29 | 2. Open the [Colab file](https://drive.google.com/file/d/1ou6DKxFaAVKBrn9AEgWEjdpNe76bltP9/view?usp=sharing) in your browser by first clicking on the link and then on `Open with Google Colaboratory` at the top of the page. 30 | 3. Click on `Copy to Drive` to create your own copy, which you can edit and save. 31 | 32 | 2. Alternatively, you can pre-install Python and Jupyter Notebooks locally on your personal computer and run them from there. Unfortunately, we will be unable to provide support for this option, so choose it only if you already have some experience with the software. 33 | 1. Install [Anaconda](https://www.anaconda.com/products/individual), which comes with both, as well as with the most common data science packages. 34 | 2. Clone/download this repository to your computer. 35 | 3. Run the Jupyter server and open the file `python_intro.ipynb` from the cloned repository. 36 | 37 | ## Materials 38 | 39 | All materials for the workshop are available at [https://github.com/social-research/python-workshop](https://github.com/social-research/python-workshop). 40 | 41 | Additional optional resources include: 42 | * [RESOURCES.md file](https://github.com/social-research/python-workshop/blob/main/RESOURCES.md) 43 | * [Python Wikibook](https://en.wikibooks.org/wiki/Python_Programming) 44 | * Matthes, Eric. [*Python Crash Course Cheat Sheet*](https://ehmatthes.github.io/pcc/cheatsheets/README.html). 45 | * [Intermediate and advanced Python documentation](http://docs.python.org/3/) 46 | 47 | 48 | ## Detailed outline 49 | 50 | 1. Introduction to programming languages 51 | * Introduction to Jupyter Notebooks/Google Colab 52 | * Markup vs. programming languages 53 | 2. 
Primitives in Python 54 | * Scalar data types, operators, expressions, and value assignment to variables 55 | * Non-scalar data types: `str`, `list`, `tuple`, `set`, and `dict` 56 | * Sequence operations and methods, aliasing vs. cloning 57 | 3. Control flow in Python 58 | * Branching with `if`, `elif`, and `else` 59 | * Iteration with `while` and `for` 60 | * `range()` and list comprehensions 61 | 4. Functions in Python 62 | * Function arguments and variable scope 63 | * Modules 64 | 5. Data abstraction with objects and classes 65 | -------------------------------------------------------------------------------- /RESOURCES.md: -------------------------------------------------------------------------------- 1 | # Resources for learning and practicing Python 2 | 3 | 4 | ## Exercises 5 | 6 | * [How to Think Like a Computer Scientist: Interactive Edition](http://interactivepython.org/runestone/static/thinkcspy/index.html) 7 | * [Python Exercises, Practice, Solution](https://www.w3resource.com/python-exercises/) 8 | * [CodingBat](https://codingbat.com/python) 9 | * [HackerRank](https://www.hackerrank.com/dashboard) Has lots of Python exercises for all levels, and is often used by recruiters! 
10 | 11 | ## Online courses 12 | 13 | * [Introduction to Computer Science and Programming in Python](https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-0001-introduction-to-computer-science-and-programming-in-python-fall-2016/index.htm) 14 | 15 | * [Composing Programs](http://composingprograms.com/) - The online textbook to UC Berkeley's [Intro to Programming](https://inst.eecs.berkeley.edu/~cs61a/sp18/) course 16 | 17 | * [Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook/tree/master/notebooks) - A whole textbook available for free as a collection of Jupyter Notebooks 18 | 19 | ## Algorithms 20 | 21 | * [IDEA](https://idea-instructions.com/) – An ongoing series of nonverbal algorithm assembly instructions 22 | * [VISUALGO](https://visualgo.net/en) – Visualising data structures and algorithms through animation 23 | 24 | ## Advanced optimization 25 | 26 | * [Performance tips](https://wiki.python.org/moin/PythonSpeed/PerformanceTips) – Performance tips from the Python wiki 27 | 28 | ## Packages 29 | 30 | ### NumPy 31 | 32 | * [NumPy quickstart tutorial](https://docs.scipy.org/doc/numpy/user/quickstart.html) 33 | * [NumPy tutorial on DataCamp](https://www.datacamp.com/community/tutorials/python-numpy-tutorial) 34 | * [NumPy API](https://docs.scipy.org/doc/numpy/reference/) 35 | 36 | ### Pandas 37 | 38 | * [First Pandas tutorial](https://pandas.pydata.org/pandas-docs/stable/10min.html) 39 | * [More advanced Pandas tutorial](https://pandas.pydata.org/pandas-docs/stable/cookbook.html) 40 | * [Pandas cheatsheet](http://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) 41 | * [Pandas API](https://pandas.pydata.org/pandas-docs/stable/api.html) 42 | 43 | ### Statsmodels 44 | 45 | * [Statsmodels examples](https://www.statsmodels.org/dev/examples/index.html) 46 | * [Statsmodels documentation](http://www.statsmodels.org/stable/index.html) 47 | 48 | ### Networkx 49 | 50 | * [NetworkX 
tutorial](https://networkx.github.io/documentation/stable/tutorial.html) 51 | * [NetworkX examples](https://networkx.github.io/documentation/stable/auto_examples/index.html) 52 | * [NetworkX reference](https://networkx.github.io/documentation/stable/reference/index.html) 53 | 54 | ### Scikit 55 | 56 | * [Scikit-learn tutorials](http://scikit-learn.org/stable/tutorial/index.html) 57 | * [Scikit-learn examples](http://scikit-learn.org/stable/auto_examples/index.html) 58 | * [Scikit-learn API](http://scikit-learn.org/stable/modules/classes.html) 59 | 60 | ### Matplotlib 61 | 62 | * [Scipy lecture notes](http://www.scipy-lectures.org/intro/matplotlib/matplotlib.html) 63 | * [Matplotlib gallery](http://matplotlib.org/gallery.html) 64 | * [Pyplot API](http://matplotlib.org/api/pyplot_summary.html) 65 | * [Axes API](https://matplotlib.org/api/axes_api.html) 66 | -------------------------------------------------------------------------------- /data/biden_inauguration_millercenter.txt: -------------------------------------------------------------------------------- 1 | Chief Justice Roberts, Vice President Harris, Speaker Pelosi, Leader Schumer, Leader McConnell, Vice President Pence, distinguished guests, and my fellow Americans. 2 | 3 | This is America’s day. 4 | 5 | This is democracy’s day. 6 | 7 | A day of history and hope. 8 | 9 | Of renewal and resolve. 10 | 11 | Through a crucible for the ages America has been tested anew and America has risen to the challenge. 12 | 13 | Today, we celebrate the triumph not of a candidate, but of a cause, the cause of democracy. 14 | 15 | The will of the people has been heard and the will of the people has been heeded. 16 | 17 | We have learned again that democracy is precious. 18 | 19 | Democracy is fragile. 20 | 21 | And at this hour, my friends, democracy has prevailed. 
22 | 23 | So now, on this hallowed ground where just days ago violence sought to shake this Capitol’s very foundation, we come together as one nation, under God, indivisible, to carry out the peaceful transfer of power as we have for more than two centuries. 24 | 25 | We look ahead in our uniquely American way – restless, bold, optimistic – and set our sights on the nation we know we can be and we must be. 26 | 27 | I thank my predecessors of both parties for their presence here. 28 | 29 | I thank them from the bottom of my heart. 30 | 31 | You know the resilience of our Constitution and the strength of our nation. 32 | 33 | As does President Carter, who I spoke to last night but who cannot be with us today, but whom we salute for his lifetime of service. 34 | 35 | I have just taken the sacred oath each of these patriots took — an oath first sworn by George Washington. 36 | 37 | But the American story depends not on any one of us, not on some of us, but on all of us. 38 | 39 | On “We the People” who seek a more perfect Union. 40 | 41 | This is a great nation and we are a good people. 42 | 43 | Over the centuries through storm and strife, in peace and in war, we have come so far. But we still have far to go. 44 | 45 | We will press forward with speed and urgency, for we have much to do in this winter of peril and possibility. 46 | 47 | Much to repair. 48 | 49 | Much to restore. 50 | 51 | Much to heal. 52 | 53 | Much to build. 54 | 55 | And much to gain. 56 | 57 | Few periods in our nation’s history have been more challenging or difficult than the one we’re in now. 58 | 59 | A once-in-a-century virus silently stalks the country. 60 | 61 | It’s taken as many lives in one year as America lost in all of World War II. 62 | 63 | Millions of jobs have been lost. 64 | 65 | Hundreds of thousands of businesses closed. 66 | 67 | A cry for racial justice some 400 years in the making moves us. The dream of justice for all will be deferred no longer. 
68 | 69 | A cry for survival comes from the planet itself. A cry that can’t be any more desperate or any more clear. 70 | 71 | And now, a rise in political extremism, white supremacy, domestic terrorism that we must confront and we will defeat. 72 | 73 | To overcome these challenges – to restore the soul and to secure the future of America – requires more than words. 74 | 75 | It requires that most elusive of things in a democracy: 76 | 77 | Unity. 78 | 79 | Unity. 80 | 81 | In another January in Washington, on New Year’s Day 1863, Abraham Lincoln signed the Emancipation Proclamation. 82 | 83 | When he put pen to paper, the President said, “If my name ever goes down into history it will be for this act and my whole soul is in it.” 84 | 85 | My whole soul is in it. 86 | 87 | Today, on this January day, my whole soul is in this: 88 | 89 | Bringing America together. 90 | 91 | Uniting our people. 92 | 93 | And uniting our nation. 94 | 95 | I ask every American to join me in this cause. 96 | 97 | Uniting to fight the common foes we face: 98 | 99 | Anger, resentment, hatred. 100 | 101 | Extremism, lawlessness, violence. 102 | 103 | Disease, joblessness, hopelessness. 104 | 105 | With unity we can do great things. Important things. 106 | 107 | We can right wrongs. 108 | 109 | We can put people to work in good jobs. 110 | 111 | We can teach our children in safe schools. 112 | 113 | We can overcome this deadly virus. 114 | 115 | We can reward work, rebuild the middle class, and make health care 116 | secure for all. 117 | 118 | We can deliver racial justice. 119 | 120 | We can make America, once again, the leading force for good in the world. 121 | 122 | I know speaking of unity can sound to some like a foolish fantasy. 123 | 124 | I know the forces that divide us are deep and they are real. 125 | 126 | But I also know they are not new. 
127 | 128 | Our history has been a constant struggle between the American ideal that we are all created equal and the harsh, ugly reality that racism, nativism, fear, and demonization have long torn us apart. 129 | 130 | The battle is perennial. 131 | 132 | Victory is never assured. 133 | 134 | Through the Civil War, the Great Depression, World War, 9/11, through struggle, sacrifice, and setbacks, our “better angels” have always prevailed. 135 | 136 | In each of these moments, enough of us came together to carry all of us forward. 137 | 138 | And, we can do so now. 139 | 140 | History, faith, and reason show the way, the way of unity. 141 | 142 | We can see each other not as adversaries but as neighbors. 143 | 144 | We can treat each other with dignity and respect. 145 | 146 | We can join forces, stop the shouting, and lower the temperature. 147 | 148 | For without unity, there is no peace, only bitterness and fury. 149 | 150 | No progress, only exhausting outrage. 151 | 152 | No nation, only a state of chaos. 153 | 154 | This is our historic moment of crisis and challenge, and unity is the path forward. 155 | 156 | And, we must meet this moment as the United States of America. 157 | 158 | If we do that, I guarantee you, we will not fail. 159 | 160 | We have never, ever, ever failed in America when we have acted together. 161 | 162 | And so today, at this time and in this place, let us start afresh. 163 | 164 | All of us. 165 | 166 | Let us listen to one another. 167 | 168 | Hear one another. 169 | See one another. 170 | 171 | Show respect to one another. 172 | 173 | Politics need not be a raging fire destroying everything in its path. 174 | 175 | Every disagreement doesn’t have to be a cause for total war. 176 | 177 | And, we must reject a culture in which facts themselves are manipulated and even manufactured. 178 | 179 | My fellow Americans, we have to be different than this. 180 | 181 | America has to be better than this. 
182 | 183 | And, I believe America is better than this. 184 | 185 | Just look around. 186 | 187 | Here we stand, in the shadow of a Capitol dome that was completed amid the Civil War, when the Union itself hung in the balance. 188 | 189 | Yet we endured and we prevailed. 190 | 191 | Here we stand looking out to the great Mall where Dr. King spoke of his dream. 192 | 193 | Here we stand, where 108 years ago at another inaugural, thousands of protestors tried to block brave women from marching for the right to vote. 194 | 195 | Today, we mark the swearing-in of the first woman in American history elected to national office – Vice President Kamala Harris. 196 | 197 | Don’t tell me things can’t change. 198 | 199 | Here we stand across the Potomac from Arlington National Cemetery, where heroes who gave the last full measure of devotion rest in eternal peace. 200 | 201 | And here we stand, just days after a riotous mob thought they could use violence to silence the will of the people, to stop the work of our democracy, and to drive us from this sacred ground. 202 | 203 | That did not happen. 204 | 205 | It will never happen. 206 | 207 | Not today. 208 | 209 | Not tomorrow. 210 | 211 | Not ever. 212 | 213 | To all those who supported our campaign I am humbled by the faith you have placed in us. 214 | 215 | To all those who did not support us, let me say this: Hear me out as we move forward. Take a measure of me and my heart. 216 | 217 | And if you still disagree, so be it. 218 | 219 | That’s democracy. That’s America. The right to dissent peaceably, within the guardrails of our Republic, is perhaps our nation’s greatest strength. 220 | 221 | Yet hear me clearly: Disagreement must not lead to disunion. 222 | 223 | And I pledge this to you: I will be a President for all Americans. 224 | 225 | I will fight as hard for those who did not support me as for those who did. 
226 | 227 | Many centuries ago, Saint Augustine, a saint of my church, wrote that a people was a multitude defined by the common objects of their love. 228 | 229 | What are the common objects we love that define us as Americans? 230 | 231 | I think I know. 232 | 233 | Opportunity. 234 | 235 | Security. 236 | 237 | Liberty. 238 | 239 | Dignity. 240 | 241 | Respect. 242 | 243 | Honor. 244 | 245 | And, yes, the truth. 246 | 247 | Recent weeks and months have taught us a painful lesson. 248 | 249 | There is truth and there are lies. 250 | 251 | Lies told for power and for profit. 252 | 253 | And each of us has a duty and responsibility, as citizens, as Americans, and especially as leaders – leaders who have pledged to honor our Constitution and protect our nation — to defend the truth and to defeat the lies. 254 | 255 | I understand that many Americans view the future with some fear and trepidation. 256 | 257 | I understand they worry about their jobs, about taking care of their families, about what comes next. 258 | 259 | I get it. 260 | 261 | But the answer is not to turn inward, to retreat into competing factions, distrusting those who don’t look like you do, or worship the way you do, or don’t get their news from the same sources you do. 262 | 263 | We must end this uncivil war that pits red against blue, rural versus urban, conservative versus liberal. 264 | 265 | We can do this if we open our souls instead of hardening our hearts. 266 | 267 | If we show a little tolerance and humility. 268 | 269 | If we’re willing to stand in the other person’s shoes just for a moment. 270 | Because here is the thing about life: There is no accounting for what fate will deal you. 271 | 272 | There are some days when we need a hand. 273 | 274 | There are other days when we’re called on to lend one. 275 | 276 | That is how we must be with one another. 277 | 278 | And, if we are this way, our country will be stronger, more prosperous, more ready for the future. 
279 | 280 | My fellow Americans, in the work ahead of us, we will need each other. 281 | 282 | We will need all our strength to persevere through this dark winter. 283 | 284 | We are entering what may well be the toughest and deadliest period of the virus. 285 | 286 | We must set aside the politics and finally face this pandemic as one nation. 287 | 288 | I promise you this: as the Bible says weeping may endure for a night but joy cometh in the morning. 289 | 290 | We will get through this, together 291 | 292 | The world is watching today. 293 | 294 | So here is my message to those beyond our borders: America has been tested and we have come out stronger for it. 295 | 296 | We will repair our alliances and engage with the world once again. 297 | 298 | Not to meet yesterday’s challenges, but today’s and tomorrow’s. 299 | 300 | We will lead not merely by the example of our power but by the power of our example. 301 | 302 | We will be a strong and trusted partner for peace, progress, and security. 303 | 304 | We have been through so much in this nation. 305 | 306 | And, in my first act as President, I would like to ask you to join me in a moment of silent prayer to remember all those we lost this past year to the pandemic. 307 | 308 | To those 400,000 fellow Americans – mothers and fathers, husbands and wives, sons and daughters, friends, neighbors, and co-workers. 309 | 310 | We will honor them by becoming the people and nation we know we can and should be. 311 | 312 | Let us say a silent prayer for those who lost their lives, for those they left behind, and for our country. 313 | 314 | Amen. 315 | 316 | This is a time of testing. 317 | 318 | We face an attack on democracy and on truth. 319 | 320 | A raging virus. 321 | 322 | Growing inequity. 323 | 324 | The sting of systemic racism. 325 | 326 | A climate in crisis. 327 | 328 | America’s role in the world. 329 | 330 | Any one of these would be enough to challenge us in profound ways. 
331 | 332 | But the fact is we face them all at once, presenting this nation with the gravest of responsibilities. 333 | 334 | Now we must step up. 335 | 336 | All of us. 337 | 338 | It is a time for boldness, for there is so much to do. 339 | 340 | And, this is certain. 341 | 342 | We will be judged, you and I, for how we resolve the cascading crises of our era. 343 | 344 | Will we rise to the occasion? 345 | 346 | Will we master this rare and difficult hour? 347 | 348 | Will we meet our obligations and pass along a new and better world for our children? 349 | 350 | I believe we must and I believe we will. 351 | 352 | And when we do, we will write the next chapter in the American story. 353 | 354 | It’s a story that might sound something like a song that means a lot to me. 355 | 356 | It’s called “American Anthem” and there is one verse stands out for me: 357 | 358 | “The work and prayers 359 | of centuries have brought us to this day 360 | What shall be our legacy? 361 | What will our children say?… 362 | Let me know in my heart 363 | When my days are through 364 | America 365 | America 366 | I gave my best to you.” 367 | 368 | Let us add our own work and prayers to the unfolding story of our nation. 369 | 370 | If we do this then when our days are through our children and our children’s children will say of us they gave their best. 371 | 372 | They did their duty. 373 | 374 | They healed a broken land. 375 | My fellow Americans, I close today where I began, with a sacred oath. 376 | 377 | Before God and all of you I give you my word. 378 | 379 | I will always level with you. 380 | 381 | I will defend the Constitution. 382 | 383 | I will defend our democracy. 384 | 385 | I will defend America. 386 | 387 | I will give my all in your service thinking not of power, but of possibilities. 388 | 389 | Not of personal interest, but of the public good. 390 | 391 | And together, we shall write an American story of hope, not fear. 392 | 393 | Of unity, not division. 
394 | 395 | Of light, not darkness. 396 | 397 | An American story of decency and dignity. 398 | 399 | Of love and of healing. 400 | 401 | Of greatness and of goodness. 402 | 403 | May this be the story that guides us. 404 | 405 | The story that inspires us. 406 | 407 | The story that tells ages yet to come that we answered the call of history. 408 | 409 | We met the moment. 410 | 411 | That democracy and hope, truth and justice, did not die on our watch but thrived. 412 | 413 | That our America secured liberty at home and stood once again as a beacon to the world. 414 | 415 | That is what we owe our forebearers, one another, and generations to follow. 416 | 417 | So, with purpose and resolve we turn to the tasks of our time. 418 | 419 | Sustained by faith. 420 | 421 | Driven by conviction. 422 | 423 | And, devoted to one another and to this country we love with all our hearts. 424 | 425 | May God bless America and may God protect our troops. 426 | 427 | Thank you, America. -------------------------------------------------------------------------------- /data/trump_inauguration_millercenter.txt: -------------------------------------------------------------------------------- 1 | Chief Justice Roberts, President Carter, President Clinton, President Bush, President Obama, fellow Americans, and people of the world: thank you. 2 | 3 | We, the citizens of America, are now joined in a great national effort to rebuild our country and to restore its promise for all of our people. 4 | 5 | Together, we will determine the course of America and the world for years to come. 6 | 7 | We will face challenges. We will confront hardships. But we will get the job done. 8 | 9 | Every four years, we gather on these steps to carry out the orderly and peaceful transfer of power, and we are grateful to President Obama and First Lady Michelle Obama for their gracious aid throughout this transition. They have been magnificent. 10 | 11 | Today’s ceremony, however, has very special meaning. 
Because today we are not merely transferring power from one Administration to another, or from one party to another – but we are transferring power from Washington, D.C. and giving it back to you, the American People. 12 | 13 | For too long, a small group in our nation’s Capital has reaped the rewards of government while the people have borne the cost. 14 | 15 | Washington flourished – but the people did not share in its wealth. 16 | 17 | Politicians prospered – but the jobs left, and the factories closed. 18 | 19 | The establishment protected itself, but not the citizens of our country. 20 | 21 | Their victories have not been your victories; their triumphs have not been your triumphs; and while they celebrated in our nation’s Capital, there was little to celebrate for struggling families all across our land. 22 | 23 | That all changes – starting right here, and right now, because this moment is your moment: it belongs to you. 24 | 25 | It belongs to everyone gathered here today and everyone watching all across America. 26 | 27 | This is your day. This is your celebration. 28 | 29 | And this, the United States of America, is your country. 30 | 31 | What truly matters is not which party controls our government, but whether our government is controlled by the people. 32 | 33 | January 20th 2017, will be remembered as the day the people became the rulers of this nation again. 34 | 35 | The forgotten men and women of our country will be forgotten no longer. 36 | 37 | Everyone is listening to you now. 38 | 39 | You came by the tens of millions to become part of a historic movement the likes of which the world has never seen before. 40 | 41 | At the center of this movement is a crucial conviction: that a nation exists to serve its citizens. 42 | 43 | Americans want great schools for their children, safe neighborhoods for their families, and good jobs for themselves. 44 | 45 | These are the just and reasonable demands of a righteous public. 
46 | 47 | But for too many of our citizens, a different reality exists: Mothers and children trapped in poverty in our inner cities; rusted-out factories scattered like tombstones across the landscape of our nation; an education system, flush with cash, but which leaves our young and beautiful students deprived of knowledge; and the crime and gangs and drugs that have stolen too many lives and robbed our country of so much unrealized potential. 48 | 49 | This American carnage stops right here and stops right now. 50 | 51 | We are one nation – and their pain is our pain. Their dreams are our dreams; and their success will be our success. We share one heart, one home, and one glorious destiny. 52 | 53 | The oath of office I take today is an oath of allegiance to all Americans. 54 | 55 | For many decades, we’ve enriched foreign industry at the expense of American industry; 56 | 57 | Subsidized the armies of other countries while allowing for the very sad depletion of our military; 58 | 59 | We've defended other nation’s borders while refusing to defend our own; 60 | 61 | And spent trillions of dollars overseas while America's infrastructure has fallen into disrepair and decay. 62 | 63 | We’ve made other countries rich while the wealth, strength, and confidence of our country has disappeared over the horizon. 64 | 65 | One by one, the factories shuttered and left our shores, with not even a thought about the millions upon millions of American workers left behind. 66 | 67 | The wealth of our middle class has been ripped from their homes and then redistributed across the entire world. 68 | 69 | But that is the past. And now we are looking only to the future. 70 | 71 | We assembled here today are issuing a new decree to be heard in every city, in every foreign capital, and in every hall of power. 72 | 73 | From this day forward, a new vision will govern our land. 74 | 75 | From this moment on, it’s going to be America First. 
76 | 77 | Every decision on trade, on taxes, on immigration, on foreign affairs, will be made to benefit American workers and American families. 78 | 79 | We must protect our borders from the ravages of other countries making our products, stealing our companies, and destroying our jobs. Protection will lead to great prosperity and strength. 80 | 81 | I will fight for you with every breath in my body – and I will never, ever let you down. 82 | 83 | America will start winning again, winning like never before. 84 | 85 | We will bring back our jobs. We will bring back our borders. We will bring back our wealth. And we will bring back our dreams. 86 | 87 | We will build new roads, and highways, and bridges, and airports, and tunnels, and railways all across our wonderful nation. 88 | 89 | We will get our people off of welfare and back to work – rebuilding our country with American hands and American labor. 90 | 91 | We will follow two simple rules: Buy American and Hire American. 92 | 93 | We will seek friendship and goodwill with the nations of the world – but we do so with the understanding that it is the right of all nations to put their own interests first. 94 | 95 | We do not seek to impose our way of life on anyone, but rather to let it shine as an example for everyone to follow. 96 | 97 | We will reinforce old alliances and form new ones – and unite the civilized world against Radical Islamic Terrorism, which we will eradicate completely from the face of the Earth. 98 | 99 | At the bedrock of our politics will be a total allegiance to the United States of America, and through our loyalty to our country, we will rediscover our loyalty to each other. 100 | 101 | When you open your heart to patriotism, there is no room for prejudice. 102 | 103 | The Bible tells us, “how good and pleasant it is when God’s people live together in unity.” 104 | 105 | We must speak our minds openly, debate our disagreements honestly, but always pursue solidarity. 
106 | 107 | When America is united, America is totally unstoppable. 108 | 109 | There should be no fear – we are protected, and we will always be protected. 110 | 111 | We will be protected by the great men and women of our military and law enforcement and, most importantly, we are protected by God. 112 | 113 | Finally, we must think big and dream even bigger. 114 | 115 | In America, we understand that a nation is only living as long as it is striving. 116 | 117 | We will no longer accept politicians who are all talk and no action – constantly complaining but never doing anything about it. 118 | 119 | The time for empty talk is over. 120 | 121 | Now arrives the hour of action. 122 | 123 | Do not let anyone tell you it cannot be done. No challenge can match the heart and fight and spirit of America. 124 | 125 | We will not fail. Our country will thrive and prosper again. 126 | 127 | We stand at the birth of a new millennium, ready to unlock the mysteries of space, to free the Earth from the miseries of disease, and to harness the energies, industries and technologies of tomorrow. 128 | 129 | A new national pride will stir our souls, lift our sights, and heal our divisions. 130 | 131 | It is time to remember that old wisdom our soldiers will never forget: that whether we are black or brown or white, we all bleed the same red blood of patriots, we all enjoy the same glorious freedoms, and we all salute the same great American Flag. 132 | 133 | And whether a child is born in the urban sprawl of Detroit or the windswept plains of Nebraska, they look up at the same night sky, they fill their heart with the same dreams, and they are infused with the breath of life by the same almighty Creator. 134 | 135 | So to all Americans, in every city near and far, small and large, from mountain to mountain, and from ocean to ocean, hear these words: 136 | 137 | You will never be ignored again. 138 | 139 | Your voice, your hopes, and your dreams, will define our American destiny. 
And your courage and goodness and love will forever guide us along the way. 140 | 141 | Together, We Will Make America Strong Again. 142 | 143 | We Will Make America Wealthy Again. 144 | 145 | We Will Make America Proud Again. 146 | 147 | We Will Make America Safe Again. 148 | 149 | And, Yes, Together, We Will Make America Great Again. Thank you, God Bless You, And God Bless America. 150 | -------------------------------------------------------------------------------- /figs/aliasing.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/social-research/python-workshop/79be98ba6bdee5428be3b99e8a36d2fb8dbf1d8d/figs/aliasing.png -------------------------------------------------------------------------------- /figs/conditional_statements.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/social-research/python-workshop/79be98ba6bdee5428be3b99e8a36d2fb8dbf1d8d/figs/conditional_statements.png -------------------------------------------------------------------------------- /figs/control_flow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/social-research/python-workshop/79be98ba6bdee5428be3b99e8a36d2fb8dbf1d8d/figs/control_flow.png -------------------------------------------------------------------------------- /figs/decomposition_abstraction.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/social-research/python-workshop/79be98ba6bdee5428be3b99e8a36d2fb8dbf1d8d/figs/decomposition_abstraction.png -------------------------------------------------------------------------------- /figs/functions.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/social-research/python-workshop/79be98ba6bdee5428be3b99e8a36d2fb8dbf1d8d/figs/functions.png 
-------------------------------------------------------------------------------- /figs/good_programmer.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/social-research/python-workshop/79be98ba6bdee5428be3b99e8a36d2fb8dbf1d8d/figs/good_programmer.jpg -------------------------------------------------------------------------------- /figs/iteration.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/social-research/python-workshop/79be98ba6bdee5428be3b99e8a36d2fb8dbf1d8d/figs/iteration.png -------------------------------------------------------------------------------- /figs/markup_lang.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/social-research/python-workshop/79be98ba6bdee5428be3b99e8a36d2fb8dbf1d8d/figs/markup_lang.png -------------------------------------------------------------------------------- /figs/person_greta.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/social-research/python-workshop/79be98ba6bdee5428be3b99e8a36d2fb8dbf1d8d/figs/person_greta.png -------------------------------------------------------------------------------- /figs/procedural_object-oriented.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/social-research/python-workshop/79be98ba6bdee5428be3b99e8a36d2fb8dbf1d8d/figs/procedural_object-oriented.png -------------------------------------------------------------------------------- /figs/program_lang.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/social-research/python-workshop/79be98ba6bdee5428be3b99e8a36d2fb8dbf1d8d/figs/program_lang.png -------------------------------------------------------------------------------- 
/figs/python.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/social-research/python-workshop/79be98ba6bdee5428be3b99e8a36d2fb8dbf1d8d/figs/python.png -------------------------------------------------------------------------------- /figs/python_workshop.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/social-research/python-workshop/79be98ba6bdee5428be3b99e8a36d2fb8dbf1d8d/figs/python_workshop.png -------------------------------------------------------------------------------- /figs/sets.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/social-research/python-workshop/79be98ba6bdee5428be3b99e8a36d2fb8dbf1d8d/figs/sets.png -------------------------------------------------------------------------------- /python_example.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "slideshow": { 7 | "slide_type": "slide" 8 | } 9 | }, 10 | "source": [ 11 | "# Example Project: Comparing Trump's and Biden's Inaugural Speeches\n", 12 | "\n", 13 | "We will use a mini-project as an extended practical example to demonstrate the concepts we are learning in the workshop. The project aims to analyze and compare the inaugural speeches of the current and last US presidents.\n", 14 | "\n", 15 | "The speech transcripts were obtained from https://millercenter.org/the-presidency/presidential-speeches and copied in the text files `biden_inauguration_millercenter.txt` and `trump_inauguration_millercenter.txt` in the `data` folder." 
16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": { 21 | "slideshow": { 22 | "slide_type": "slide" 23 | } 24 | }, 25 | "source": [ 26 | "## Straight-line programming\n", 27 | "\n", 28 | "Even with just a basic understanding of data types, operations, and methods, we can already extract useful information from data. Below, we will:\n", 29 | "1. Open the text file with one of the speeches\n", 30 | "2. Clean up the text and extract a list of all the words used in the speech\n", 31 | "3. Estimate the length of the speech and the number of unique words used" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 1, 37 | "metadata": { 38 | "scrolled": false, 39 | "slideshow": { 40 | "slide_type": "-" 41 | } 42 | }, 43 | "outputs": [ 44 | { 45 | "data": { 46 | "text/plain": [ 47 | "'Chief Justice Roberts, President Carter, President Clinton, President Bush, President Obama, fellow Americans, and people of the world: thank you.\\n\\nWe, the citizens of America, are now joined in a great national effort to rebuild our country and to restore its promise for all of our people.\\n\\nTogether, we will determine the course of America and the world for years to come.\\n\\nWe will face challenges. We will confront hardships. 
But we will get the job done.\\n\\nEvery four years, we gather on these st'" 48 | ] 49 | }, 50 | "execution_count": 1, 51 | "metadata": {}, 52 | "output_type": "execute_result" 53 | } 54 | ], 55 | "source": [ 56 | "# Open the file and get the text into a string variable called txt\n", 57 | "with open('data/trump_inauguration_millercenter.txt') as f:\n", 58 | " txt = f.read()\n", 59 | "txt[:500] # Show the first 500 characters of the txt variable" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": 2, 65 | "metadata": { 66 | "scrolled": true, 67 | "slideshow": { 68 | "slide_type": "-" 69 | } 70 | }, 71 | "outputs": [ 72 | { 73 | "name": "stdout", 74 | "output_type": "stream", 75 | "text": [ 76 | "['2017', '20th', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'about', 'about', 'accept', 'across', 'across', 'across', 'across', 'across', 'action', 'action', 'administration', 'affairs', 'again', 'again', 'again', 'again', 'again', 'again', 'again', 'again', 'again', 'against', 'aid', 'airports', 'all', 'all', 'all', 'all', 'all', 'all', 'all', 'all', 'all', 'all', 'all', 'all', 'allegiance', 'allegiance', 'alliances', 'allowing', 'almighty', 'along', 'always', 'always', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'american', 'american', 'american', 'american', 'american', 'american', 'american', 'american', 'american', 'american', 'american', 'american', 'americans', 'americans', 'americans', 'americans', 'an', 'an', 'an', 'and', 'and']\n", 77 | "1436\n", 78 | "536\n" 79 | ] 80 | } 81 | ], 82 | "source": [ 83 | "# Remove paragraphs and format consistently\n", 84 | "txt = txt.strip().replace('\\n', ' ').replace(\"’\", \"'\")\n", 85 | "\n", 86 | "# Get rid of possessives and expand contractions\n", 87 | "txt = txt.replace(\"'s\", '').replace(\"'ve\", ' 
have').replace(\"'re\", ' are')\n", 88 | "txt = txt.replace(\"can't\", 'can not').replace(\"n't\", ' not')\n", 89 | "\n", 90 | "# Remove punctuation\n", 91 | "txt = txt.replace('—', '').replace('–', '')\n", 92 | "txt = txt.replace('.', '').replace(',', '').replace(':', '').replace(';', '').replace('…', '')\n", 93 | "txt = txt.replace(\"”\", '').replace(\"“\", '')\n", 94 | "\n", 95 | "# Convert to lower-case\n", 96 | "txt = txt.lower()\n", 97 | "\n", 98 | "# Break into words\n", 99 | "wrds = txt.split()\n", 100 | "print(sorted(wrds)[:100])\n", 101 | "\n", 102 | "# Count the number of words in the speech\n", 103 | "print(len(wrds))\n", 104 | "\n", 105 | "# Count the number of unique words\n", 106 | "print(len(set(wrds)))" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": { 112 | "slideshow": { 113 | "slide_type": "slide" 114 | } 115 | }, 116 | "source": [ 117 | "## Control flow with conditionals and loops\n", 118 | "\n", 119 | "Branching and iteration allow us to employ more complex logic in our data processing and analysis: e.g., repeat operations or set conditions to select data. Below, we will:\n", 120 | "1. Count the number of times each unique word is mentioned in the speech\n", 121 | "2. Exclude non-meaningful words such as articles and prepositions\n", 122 | "3. 
Identify the most commonly used meaningful words to reveal the theme and tone of the speech" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 3, 128 | "metadata": { 129 | "scrolled": false, 130 | "slideshow": { 131 | "slide_type": "-" 132 | } 133 | }, 134 | "outputs": [ 135 | { 136 | "data": { 137 | "text/plain": [ 138 | "[('and', 74),\n", 139 | " ('the', 70),\n", 140 | " ('we', 49),\n", 141 | " ('of', 48),\n", 142 | " ('our', 48),\n", 143 | " ('will', 40),\n", 144 | " ('to', 37),\n", 145 | " ('is', 21),\n", 146 | " ('america', 18),\n", 147 | " ('a', 15)]" 148 | ] 149 | }, 150 | "execution_count": 3, 151 | "metadata": {}, 152 | "output_type": "execute_result" 153 | } 154 | ], 155 | "source": [ 156 | "# Create dictionary with word:count\n", 157 | "word_counts = {}\n", 158 | "\n", 159 | "for i in wrds:\n", 160 | " if i not in word_counts:\n", 161 | " word_counts[i] = 1\n", 162 | " else:\n", 163 | " word_counts[i] += 1\n", 164 | "\n", 165 | "# Sort the words by count in decreasing order of popularity\n", 166 | "# Note this produces a list of tuples\n", 167 | "sorted_word_counts = sorted(word_counts.items(), key=lambda i: i[1], reverse=True)\n", 168 | "\n", 169 | "sorted_word_counts[:10]" 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": 4, 175 | "metadata": { 176 | "scrolled": false, 177 | "slideshow": { 178 | "slide_type": "-" 179 | } 180 | }, 181 | "outputs": [ 182 | { 183 | "data": { 184 | "text/plain": [ 185 | "[('we', 49),\n", 186 | " ('our', 48),\n", 187 | " ('will', 40),\n", 188 | " ('america', 18),\n", 189 | " ('you', 12),\n", 190 | " ('all', 12),\n", 191 | " ('american', 12),\n", 192 | " ('their', 11),\n", 193 | " ('your', 11),\n", 194 | " ('people', 9)]" 195 | ] 196 | }, 197 | "execution_count": 4, 198 | "metadata": {}, 199 | "output_type": "execute_result" 200 | } 201 | ], 202 | "source": [ 203 | "# We will create a list of (word, count) tuples for all words mentioned more than once, excluding stop words\n", 204 | "# 
Stop words are common words that are not meaningful in this context\n", 205 | "stop_words = ['a', 'about', 'across', 'after', 'an', 'and', 'any', 'are', 'as', 'at', \n", 206 | " 'be', 'because', 'but', 'by', 'did', 'do', 'does', 'for', 'from',\n", 207 | " 'get', 'has', 'have', 'if', 'in', 'is', 'it', 'its',\n", 208 | " 'many', 'more', 'much', 'no', 'not', 'of', 'on', 'or', 'out',\n", 209 | " 'so', 'some', 'than', 'the', 'this', 'that', 'those', 'through', 'to',\n", 210 | " 'very', 'what', 'where', 'whether', 'which', 'while', 'who', 'with']\n", 211 | "\n", 212 | "common_words = []\n", 213 | "for i in sorted_word_counts:\n", 214 | " if i[0] not in stop_words:\n", 215 | " if i[1] > 1:\n", 216 | " common_words.append(i)\n", 217 | " else:\n", 218 | " break\n", 219 | " \n", 220 | "# Alternatively, we can use a list comprehension for the code block above\n", 221 | "# common_words = [i for i in sorted_word_counts if i[0] not in stop_words and i[1] > 1] \n", 222 | " \n", 223 | "common_words[:10]" 224 | ] 225 | }, 226 | { 227 | "cell_type": "markdown", 228 | "metadata": { 229 | "slideshow": { 230 | "slide_type": "slide" 231 | } 232 | }, 233 | "source": [ 234 | "## Functions\n", 235 | "\n", 236 | "Once we understand conditionals, loops, and functions, we can improve the code above and make it more efficient and modular. This will allow us to apply it to multiple data files, without the need to duplicate large chunks of code. Below, we will:\n", 237 | "1. Create a function to extract words from text and another function to count words in a text\n", 238 | "2. Apply the functions to each president's speech\n", 239 | "3. 
Compare the length and repetitiveness of the speeches, the most common words and the unique words" 240 | ] 241 | }, 242 | { 243 | "cell_type": "code", 244 | "execution_count": 5, 245 | "metadata": { 246 | "scrolled": false, 247 | "slideshow": { 248 | "slide_type": "-" 249 | } 250 | }, 251 | "outputs": [ 252 | { 253 | "name": "stdout", 254 | "output_type": "stream", 255 | "text": [ 256 | "['chief', 'justice', 'roberts', 'president', 'carter', 'president', 'clinton', 'president', 'bush', 'president', 'obama', 'fellow', 'americans', 'and', 'people', 'of', 'the', 'world', 'thank', 'you', 'we', 'the', 'citizens', 'of', 'america', 'are', 'now', 'joined', 'in', 'a', 'great', 'national', 'effort', 'to', 'rebuild', 'our', 'country', 'and', 'to', 'restore', 'its', 'promise', 'for', 'all', 'of', 'our', 'people', 'together', 'we', 'will', 'determine', 'the', 'course', 'of', 'america', 'and', 'the', 'world', 'for', 'years', 'to', 'come', 'we', 'will', 'face', 'challenges', 'we', 'will', 'confront', 'hardships', 'but', 'we', 'will', 'get', 'the', 'job', 'done', 'every', 'four', 'years', 'we', 'gather', 'on', 'these', 'steps', 'to', 'carry', 'out', 'the', 'orderly', 'and', 'peaceful', 'transfer', 'of', 'power', 'and', 'we', 'are', 'grateful', 'to']\n" 257 | ] 258 | } 259 | ], 260 | "source": [ 261 | "import string # See https://docs.python.org/3/library/string.html\n", 262 | "\n", 263 | "# This will now be a global variable so we will follow the convention and \n", 264 | "# name it in all caps\n", 265 | "STOP_WORDS = ['a', 'about', 'across', 'after', 'an', 'and', 'any', 'are', 'as', 'at', \n", 266 | " 'be', 'because', 'but', 'by', 'did', 'do', 'does', 'for', 'from',\n", 267 | " 'get', 'has', 'have', 'if', 'in', 'is', 'it', 'its',\n", 268 | " 'many', 'more', 'much', 'no', 'not', 'of', 'on', 'or', 'out',\n", 269 | " 'so', 'some', 'than', 'the', 'this', 'that', 'those', 'through', 'to',\n", 270 | " 'very', 'what', 'where', 'whether', 'which', 'while', 'who', 'with']\n", 271 | "\n", 
272 | "def get_tokens(fname):\n", 273 | " \"\"\"Read given text file and return a list with all words in lowercase\n", 274 | " in the order they appear in the text. Common contractions are expanded\n", 275 | " and hyphenated words are combined in one word.\n", 276 | " \"\"\"\n", 277 | " with open(fname) as f:\n", 278 | " txt = f.read()\n", 279 | " \n", 280 | " # Remove paragraphs and format consistently\n", 281 | " txt = txt.strip().replace('\\n', ' ').replace(\"’\", \"'\")\n", 282 | " \n", 283 | " # Get rid of possessives and expand contractions\n", 284 | " txt = txt.replace(\"'s\", '').replace(\"'ve\", ' have').replace(\"'re\", ' are')\n", 285 | " txt = txt.replace(\"can't\", 'can not').replace(\"n't\", ' not')\n", 286 | "\n", 287 | " # Remove punctuation and convert to lower-case\n", 288 | " exclude = set(string.punctuation) | {\"”\", \"“\", \"…\", '–'}\n", 289 | " txt = ''.join(ch.lower() for ch in txt if ch not in exclude)\n", 290 | "\n", 291 | " # Break into words\n", 292 | " wrds = txt.split()\n", 293 | " \n", 294 | " return wrds\n", 295 | "\n", 296 | "\n", 297 | "def get_word_counts(tokens):\n", 298 | " \"\"\"Take tokens and return a list of (word, count) tuples, excluding stop words,\n", 299 | " sorted by the number of times each word is repeated, in decreasing order.\n", 300 | " \"\"\"\n", 301 | " # Create dictionary with word:count\n", 302 | " word_counts = {}\n", 303 | "\n", 304 | " for i in tokens:\n", 305 | " if i not in STOP_WORDS:\n", 306 | " if i not in word_counts:\n", 307 | " word_counts[i] = 1\n", 308 | " else:\n", 309 | " word_counts[i] += 1\n", 310 | "\n", 311 | " # Get the words with counts in decreasing order of popularity\n", 312 | " # Note this produces a list of tuples\n", 313 | " sorted_word_counts = sorted(word_counts.items(), key=lambda i: i[1], reverse=True)\n", 314 | " \n", 315 | " return sorted_word_counts\n", 316 | "\n", 317 | "\n", 318 | "trump_tokens = get_tokens('data/trump_inauguration_millercenter.txt')\n", 319 | "biden_tokens = 
get_tokens('data/biden_inauguration_millercenter.txt')\n", 320 | "\n", 321 | "print(trump_tokens[:100])\n" 322 | ] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "execution_count": 6, 327 | "metadata": { 328 | "scrolled": false, 329 | "slideshow": { 330 | "slide_type": "-" 331 | } 332 | }, 333 | "outputs": [ 334 | { 335 | "name": "stdout", 336 | "output_type": "stream", 337 | "text": [ 338 | "1436 2382\n", 339 | "536 721\n", 340 | "2.6791044776119404 3.30374479889043\n", 341 | "\n", 342 | "[('we', 49), ('our', 48), ('will', 40), ('america', 18), ('you', 12), ('all', 12), ('american', 12), ('their', 11), ('your', 11), ('people', 9), ('country', 9), ('nation', 9), ('again', 9), ('one', 8), ('every', 7), ('world', 6), ('now', 6), ('great', 6), ('back', 6), ('never', 6)]\n", 343 | "\n", 344 | "[('we', 91), ('our', 43), ('will', 33), ('i', 33), ('us', 27), ('my', 20), ('america', 20), ('can', 18), ('you', 17), ('all', 17), ('one', 15), ('nation', 14), ('democracy', 11), ('me', 11), ('must', 10), ('americans', 9), ('today', 9), ('people', 9), ('american', 9), ('story', 9)]\n" 345 | ] 346 | } 347 | ], 348 | "source": [ 349 | "# Biden's speech is longer\n", 350 | "print(len(trump_tokens), len(biden_tokens))\n", 351 | "print(len(set(trump_tokens)), len(set(biden_tokens)))\n", 352 | "\n", 353 | "# Biden's speech is also more repetitive (more tokens per unique word)\n", 354 | "print(len(trump_tokens)/len(set(trump_tokens)), len(biden_tokens)/len(set(biden_tokens)))\n", 355 | "\n", 356 | "print() # Add an empty line to separate results\n", 357 | "\n", 358 | "# The twenty most common words for Trump and Biden\n", 359 | "trump_wcounts = get_word_counts(trump_tokens)\n", 360 | "biden_wcounts = get_word_counts(biden_tokens)\n", 361 | "\n", 362 | "# Biden's top words include the first-person singular 'i', 'my', and 'me'; Trump's do not\n", 363 | "print(trump_wcounts[:20])\n", 364 | "\n", 365 | "print() # Add an empty line to separate results\n", 366 | "\n", 367 | "print(biden_wcounts[:20])\n" 368 | ] 369 | }, 370 | { 371 | "cell_type": "code", 372 | 
"execution_count": 7, 373 | "metadata": { 374 | "scrolled": false 375 | }, 376 | "outputs": [ 377 | { 378 | "name": "stdout", 379 | "output_type": "stream", 380 | "text": [ 381 | "[('back', 6), ('protected', 5), ('dreams', 5), ('wealth', 4), ('everyone', 4), ('bring', 4), ('obama', 3), ('too', 3), ('capital', 3), ('government', 3), ('factories', 3), ('foreign', 3), ('countries', 3)]\n", 382 | "\n", 383 | "[('democracy', 11), ('me', 11), ('story', 9), ('know', 8), ('history', 7), ('war', 7), ('days', 6), ('truth', 5), ('may', 5), ('cause', 4), ('centuries', 4), ('peace', 4), ('virus', 4), ('lost', 4), ('soul', 4), ('things', 4), ('once', 4), ('better', 4), ('need', 4), ('say', 4), ('vice', 3), ('hope', 3), ('resolve', 3), ('prevailed', 3), ('ago', 3), ('violence', 3), ('them', 3), ('constitution', 3), ('sacred', 3), ('year', 3), ('cry', 3), ('whole', 3), ('uniting', 3), ('join', 3), ('common', 3), ('faith', 3), ('show', 3), ('dignity', 3), ('respect', 3), ('meet', 3), ('believe', 3), ('yet', 3), ('gave', 3), ('honor', 3), ('lies', 3)]\n" 384 | ] 385 | } 386 | ], 387 | "source": [ 388 | "# Get repeated words and check the difference\n", 389 | "trump_100 = set([i[0] for i in trump_wcounts])\n", 390 | "biden_100 = set([i[0] for i in biden_wcounts])\n", 391 | "\n", 392 | "# Unique words only for Trump\n", 393 | "print([i for i in trump_wcounts if i[0] in (trump_100-biden_100) and i[1] > 2])\n", 394 | "\n", 395 | "print()\n", 396 | "\n", 397 | "# Unique words only for Biden\n", 398 | "print([i for i in biden_wcounts if i[0] in (biden_100- trump_100) and i[1] > 2])\n" 399 | ] 400 | }, 401 | { 402 | "cell_type": "markdown", 403 | "metadata": { 404 | "slideshow": { 405 | "slide_type": "slide" 406 | } 407 | }, 408 | "source": [ 409 | "## Classes\n", 410 | "\n", 411 | "What we did above is known as procedural programming – we keep functions and data separate and pass the data to the functions. 
Alternatively, we can employ the approach of object-oriented programming – we can bundle up the data and functions into classes. In this case, the functions become methods and they belong only to this particular data type. We cannot call them independently, on other types of data, for example." 412 | ] 413 | }, 414 | { 415 | "cell_type": "code", 416 | "execution_count": 8, 417 | "metadata": {}, 418 | "outputs": [ 419 | { 420 | "name": "stdout", 421 | "output_type": "stream", 422 | "text": [ 423 | "Chief Justice Roberts, President Carter, President Clinton, President Bush, President Obama, fellow Americans, and people of the world: thank you.\n", 424 | "\n", 425 | "We, the citizens of America, are now joined in a gre...\n", 426 | "1436\n", 427 | "\n", 428 | "Chief Justice Roberts, Vice President Harris, Speaker Pelosi, Leader Schumer, Leader McConnell, Vice President Pence, distinguished guests, and my fellow Americans.\n", 429 | "\n", 430 | "This is America’s day.\n", 431 | "\n", 432 | "This is de...\n", 433 | "2382\n" 434 | ] 435 | } 436 | ], 437 | "source": [ 438 | "class Speech(object):\n", 439 | " \n", 440 | " def __init__(self, fname):\n", 441 | " \"\"\"Creates a speech using the text in file fname.\"\"\"\n", 442 | " \n", 443 | " with open(fname) as f:\n", 444 | " self.txt = f.read()\n", 445 | " self.tokens = None\n", 446 | " self.word_counts = None\n", 447 | " \n", 448 | " # Populate the empty attributes above by processing the text\n", 449 | " self.process_tokens() \n", 450 | " self.process_word_counts()\n", 451 | " \n", 452 | " \n", 453 | " # The following two methods are called when you initialize a new object\n", 454 | " \n", 455 | " def process_tokens(self):\n", 456 | " \"\"\"Extracts the tokens in the text and assigns them to \n", 457 | " the attribute 'tokens'. 
'tokens' is a list of strings.\n", 458 | " \"\"\"\n", 459 | " \n", 460 | " # Remove paragraphs and format consistently\n", 461 | " txt = self.txt.strip().replace('\\n', ' ').replace(\"’\", \"'\")\n", 462 | " \n", 463 | " # Get rid of possessives and expand contractions\n", 464 | " txt = txt.replace(\"'s\", '').replace(\"'ve\", ' have').replace(\"'re\", ' are')\n", 465 | " txt = txt.replace(\"can't\", 'can not').replace(\"n't\", ' not')\n", 466 | " \n", 467 | " # Remove punctuation and convert to lower-case\n", 468 | " exclude = set(string.punctuation) | {\"”\", \"“\", \"…\", '–'}\n", 469 | " txt = ''.join(ch.lower() for ch in txt if ch not in exclude)\n", 470 | "\n", 471 | " # Break into words\n", 472 | " wrds = txt.split()\n", 473 | "\n", 474 | " self.tokens = wrds\n", 475 | " \n", 476 | " \n", 477 | " def process_word_counts(self):\n", 478 | " \"\"\"Counts the number of times each word, excluding stop words,\n", 479 | " appears in the speech and assigns the counts to the attribute 'word_counts'.\n", 480 | " 'word_counts' is a list of tuples in the form (token, count).\n", 481 | " \"\"\"\n", 482 | " # Create dictionary with word:count\n", 483 | " word_counts = {}\n", 484 | "\n", 485 | " for i in self.tokens:\n", 486 | " if i not in STOP_WORDS:\n", 487 | " if i not in word_counts:\n", 488 | " word_counts[i] = 1\n", 489 | " else:\n", 490 | " word_counts[i] += 1\n", 491 | "\n", 492 | " # Get the words with counts in decreasing order of popularity\n", 493 | " # Note this produces a list of tuples\n", 494 | " sorted_word_counts = sorted(word_counts.items(), key=lambda i: i[1], reverse=True)\n", 495 | " self.word_counts = sorted_word_counts\n", 496 | " \n", 497 | " \n", 498 | " # Use getter methods to provide an interface for interacting with the objects\n", 499 | " \n", 500 | " def get_text(self):\n", 501 | " return self.txt\n", 502 | " \n", 503 | " def get_tokens(self):\n", 504 | " \"\"\"Get the tokens in the speech as a list of strings.\"\"\"\n", 505 | " # Avoid 
returning mutable objects as they could be modified in undesirable ways\n", 506 | " return self.tokens[:]\n", 507 | " \n", 508 | " def get_word_counts(self):\n", 509 | " \"\"\"Get each unique word in the speech and the number of times it appears in the speech.\n", 510 | " Return a list of tuples in the form (token, count).\n", 511 | " \"\"\"\n", 512 | " # Avoid returning mutable objects as they could be modified in undesirable ways\n", 513 | " return self.word_counts[:]\n", 514 | " \n", 515 | " # You can make your code even more interactive by providing extra methods for\n", 516 | " # common and useful operations\n", 517 | " \n", 518 | " def get_speech_length(self):\n", 519 | " \"\"\"Get the number of tokens in the speech.\"\"\"\n", 520 | " return len(self.tokens)\n", 521 | " \n", 522 | " def get_number_unique_tokens(self):\n", 523 | " \"\"\"Gets the number of unique words used in the speech,\n", 524 | " including stop words.\n", 525 | " \"\"\"\n", 526 | " return len(set(self.tokens))\n", 527 | " \n", 528 | " def __str__(self):\n", 529 | " \"\"\"Returns the first 200 characters of the speech.\"\"\"\n", 530 | " return self.txt[:200] + '...'\n", 531 | "\n", 532 | " \n", 533 | "# Create an object of class Speech for Trump's inaugural speech\n", 534 | "trump = Speech('data/trump_inauguration_millercenter.txt')\n", 535 | "print(trump)\n", 536 | "# Get the length of the speech (the text was already processed on initialization)\n", 537 | "print(trump.get_speech_length())\n", 538 | "\n", 539 | "print()\n", 540 | "\n", 541 | "# Create another Speech object for Biden's inaugural speech\n", 542 | "biden = Speech('data/biden_inauguration_millercenter.txt')\n", 543 | "print(biden)\n", 544 | "print(biden.get_speech_length())" 545 | ] 546 | } 547 | ], 548 | "metadata": { 549 | "celltoolbar": "Slideshow", 550 | "kernelspec": { 551 | "display_name": "Python 3", 552 | "language": "python", 553 | "name": "python3" 554 | }, 555 | "language_info": { 556 | "codemirror_mode": { 557 | "name": "ipython", 558 | 
"version": 3 559 | }, 560 | "file_extension": ".py", 561 | "mimetype": "text/x-python", 562 | "name": "python", 563 | "nbconvert_exporter": "python", 564 | "pygments_lexer": "ipython3", 565 | "version": "3.7.4" 566 | } 567 | }, 568 | "nbformat": 4, 569 | "nbformat_minor": 2 570 | } 571 | -------------------------------------------------------------------------------- /python_intro.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "slideshow": { 7 | "slide_type": "slide" 8 | } 9 | }, 10 | "source": [ 11 | "![Introduction to Programming with Python](figs/python_workshop.png \"Introduction to Programming with Python\")" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": { 17 | "slideshow": { 18 | "slide_type": "slide" 19 | } 20 | }, 21 | "source": [ 22 | "# Overview\n", 23 | "\n", 24 | "\n", 25 | "* Instructor: **Milena Tsvetkova**\n", 26 | "* Teaching Assistant: **Yuanmo He**\n", 27 | "\n", 28 | "**10:00 – 12:00 CET**\n", 29 | "\n", 30 | "* Straight-line programming\n", 31 | " * Data types\n", 32 | " * Operations and methods\n", 33 | "* Control flow\n", 34 | " * Conditional statements\n", 35 | "\n", 36 | "\n", 37 | "**12:00 – 12:15 CET** Break\n", 38 | "\n", 39 | "**12:15 – 14:15 CET**\n", 40 | "\n", 41 | "* Control flow\n", 42 | " * Iteration\n", 43 | " * Functions\n", 44 | "* Classes\n", 45 | "* Concluding remarks\n" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": { 51 | "slideshow": { 52 | "slide_type": "slide" 53 | } 54 | }, 55 | "source": [ 56 | "## Why Do Social Scientists Need Computer Programming?\n", 57 | "\n", 58 | "* Collect data\n", 59 | " * Crawling websites and using APIs\n", 60 | " * Online surveys and experiments\n", 61 | " * Computational models and simulations\n", 62 | "* Manage, analyze, and visualize data\n", 63 | " * Large data\n", 64 | " * Non-rectangular data (e.g. 
networks, text)" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": { 70 | "slideshow": { 71 | "slide_type": "fragment" 72 | } 73 | }, 74 | "source": [ 75 | "* Be autonomous and work independently\n", 76 | "* Learn from and collaborate with engineers and scientists" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": { 82 | "slideshow": { 83 | "slide_type": "fragment" 84 | } 85 | }, 86 | "source": [ 87 | "* Generate and share reproducible workflows" 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": { 93 | "slideshow": { 94 | "slide_type": "slide" 95 | } 96 | }, 97 | "source": [ 98 | "## Markup vs. Programming Languages\n", 99 | "\n", 100 | "\n", 101 | "| | Markup Languages | Programming Languages \n", 102 | "| :----------- |:---------------- | :----------------------\n", 103 | "| |![Markup languages](figs/markup_lang.png \"Markup languages\") | ![Programming languages](figs/program_lang.png \"Programming languages\")\n", 104 | "| **Examples** | TeX, HTML, XML, **Markdown** | C, Java, JavaScript, R, **Python** \n", 105 | "| **Use** | Structure and present data | Transform and generate data \n", 106 | "| **Execution**| Program (e.g. a browser) | Computer hardware \n", 107 | "| **Structure**| Inline tags | Primitive constructs, syntax, static semantics, semantics \n", 108 | "\n", 109 | "(Image sources: Wikimedia)" 110 | ] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "metadata": { 115 | "slideshow": { 116 | "slide_type": "notes" 117 | } 118 | }, 119 | "source": [ 120 | "A programming language is a formal language used to specify a set of instructions for a computer to execute. 
It has:\n", 121 | "\n", 122 | "* Primitive constructs – literals (characters, numbers) and operators\n", 123 | "* Syntax – rules for putting primitives together\n", 124 | "* Static semantics – rules for forming meaningful commands\n", 125 | "* Semantics – the meaning of commands" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "metadata": { 131 | "slideshow": { 132 | "slide_type": "slide" 133 | } 134 | }, 135 | "source": [ 136 | "## Why Python?\n", 137 | "\n", 138 | "![Python](figs/python.png \"Python\")\n", 139 | "\n", 140 | "* Open-source – free and well-documented\n", 141 | "* Simple and concise syntax\n", 142 | "* Many useful libraries\n", 143 | "* Cross-platform\n", 144 | "* [Widely used in industry and science](https://youtu.be/cKzP61Gjf00)" 145 | ] 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "metadata": { 150 | "slideshow": { 151 | "slide_type": "slide" 152 | } 153 | }, 154 | "source": [ 155 | "## Programming with Python on Google Colab\n", 156 | "\n", 157 | "Jupyter Notebook is an open document format based on JSON that combines live code, equations, visualizations, and explanatory text. 
\n", 158 | "\n", 159 | "Google Colab allows you to run a Jupyter notebook in the cloud via your browser, no installation required.\n", 160 | "\n", 161 | "**Go to https://github.com/social-research/python-workshop and open the [Google Colab link](https://drive.google.com/file/d/1ou6DKxFaAVKBrn9AEgWEjdpNe76bltP9/view?usp=sharing) under Software.**\n", 162 | "\n", 163 | "(Alternatively, if you have Jupyter pre-installed, you can clone the repository locally and run the Jupyter server to open the file `python_intro.ipynb`.)" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": { 169 | "slideshow": { 170 | "slide_type": "slide" 171 | } 172 | }, 173 | "source": [ 174 | "## Using Colab Notebooks\n", 175 | "\n", 176 | "* Text cells\n", 177 | " * Double-click to inspect and edit Markdown\n", 178 | " * See cheatsheet: https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet\n", 179 | "* Code cells\n", 180 | " * Press the Play button (or `CTRL/CMD + ENTER`) to run\n", 181 | " * If you run code above, you can use the results below\n", 182 | "* Use `+ Code` and `+ Text` buttons to add new cells\n", 183 | "* If you get in trouble: `Runtime` → `Interrupt execution`" 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "metadata": { 189 | "slideshow": { 190 | "slide_type": "slide" 191 | } 192 | }, 193 | "source": [ 194 | "# Objects, Data Types, and Expressions\n", 195 | "\n", 196 | "* Computer programs manipulate data in the form of objects\n", 197 | "* Objects have types\n", 198 | " * Scalar — indivisible\n", 199 | " * Non-scalar — with internal structure, can be ordered/unordered and mutable/immutable\n", 200 | "* We can do things with objects\n", 201 | " * Use variables to associate them with names\n", 202 | " * Combine objects and operators to evaluate expressions\n", 203 | " * Call methods on objects\n", 204 | " * Pass objects to functions" 205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "metadata": { 210 | "slideshow": { 
211 | "slide_type": "slide" 212 | } 213 | }, 214 | "source": [ 215 | "## Data Types in Python\n", 216 | "\n", 217 | "\n", 218 | "| Type | Scalar | Mutability | Order \n", 219 | "| :------: |:----------:|:----------:| :---------:\n", 220 | "| `int` | scalar | immutable | \n", 221 | "| `float` | scalar | immutable | \n", 222 | "| `bool` | scalar | immutable | \n", 223 | "| `None` | scalar | immutable | \n", 224 | "| `str` | non-scalar | immutable | ordered\n", 225 | "| `tuple` | non-scalar | immutable | ordered\n", 226 | "| `list` | non-scalar | mutable | ordered\n", 227 | "| `set` | non-scalar | mutable | unordered\n", 228 | "| `dict` | non-scalar | mutable | unordered" 229 | ] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "metadata": { 234 | "slideshow": { 235 | "slide_type": "slide" 236 | } 237 | }, 238 | "source": [ 239 | "## Scalar Data Types\n", 240 | "\n", 241 | "* Integer\n", 242 | "* Float\n", 243 | "* Boolean\n", 244 | "* NoneType" 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "execution_count": 9, 250 | "metadata": { 251 | "slideshow": { 252 | "slide_type": "-" 253 | } 254 | }, 255 | "outputs": [ 256 | { 257 | "data": { 258 | "text/plain": [ 259 | "True" 260 | ] 261 | }, 262 | "execution_count": 9, 263 | "metadata": {}, 264 | "output_type": "execute_result" 265 | } 266 | ], 267 | "source": [ 268 | "int_var = 2 # int <-- text after the hash sign # is a comment and will not be executed as code\n", 269 | "float_var = 0.125 # float\n", 270 | "true_var = True # bool \n", 271 | "none_var = None # NoneType\n", 272 | "\n", 273 | "true_var # A bare expression on the last line of a cell displays its value" 274 | ] 275 | }, 276 | { 277 | "cell_type": "code", 278 | "execution_count": 11, 279 | "metadata": { 280 | "scrolled": true, 281 | "slideshow": { 282 | "slide_type": "-" 283 | } 284 | }, 285 | "outputs": [ 286 | { 287 | "name": "stdout", 288 | "output_type": "stream", 289 | "text": [ 290 | "0.125\n" 291 | ] 292 | } 293 | ], 294 | "source": [ 295 | "print(float_var) # Prints 
a string representation of the value of the variable" 296 | ] 297 | }, 298 | { 299 | "cell_type": "markdown", 300 | "metadata": { 301 | "slideshow": { 302 | "slide_type": "slide" 303 | } 304 | }, 305 | "source": [ 306 | "## Non-Scalar Data Types\n", 307 | "\n", 308 | "* String – sequence of characters (immutable, ordered)\n", 309 | "* List – sequence of values (mutable, ordered)\n", 310 | "* Tuple – sequence of values (immutable, ordered)\n", 311 | "* Set – collection of unique values (mutable, unordered)\n", 312 | "* Dictionary – a set of key/value pairs (mutable, unordered)" 313 | ] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "execution_count": 7, 318 | "metadata": { 319 | "slideshow": { 320 | "slide_type": "-" 321 | } 322 | }, 323 | "outputs": [ 324 | { 325 | "name": "stdout", 326 | "output_type": "stream", 327 | "text": [ 328 | "[1, 2, 2, 'a', 'a'] {'b', 1, 2, 'a'}\n" 329 | ] 330 | } 331 | ], 332 | "source": [ 333 | "str_var = 'This is a string.' # str\n", 334 | "list_var = [1, 2, 2, 'a', 'a'] # list\n", 335 | "tuple_var = (1, 2, 'a', 'b') # tuple\n", 336 | "set_var = {1, 2, 2, 'a', 'b'} # set \n", 337 | "dict_var = {1: 'a', 2: 'b', 3: ['c', 'd']} # dict\n", 338 | "print(list_var, set_var)" 339 | ] 340 | }, 341 | { 342 | "cell_type": "markdown", 343 | "metadata": { 344 | "slideshow": { 345 | "slide_type": "slide" 346 | } 347 | }, 348 | "source": [ 349 | "## Using Operators with Objects\n", 350 | "\n", 351 | "* Arithmetic: `+`, `-`, `*`, `/`, `**` exponent, `%` modulus, `//` floor division\n", 352 | "* Boolean: `and`, `or`, `not`\n", 353 | "* Comparison: `==`, `!=` does not equal, `>`, `<=`\n", 354 | "* Assignment: `=` , `+=`, `-=`\n", 355 | "* Membership: `in`" 356 | ] 357 | }, 358 | { 359 | "cell_type": "code", 360 | "execution_count": 7, 361 | "metadata": { 362 | "slideshow": { 363 | "slide_type": "-" 364 | } 365 | }, 366 | "outputs": [ 367 | { 368 | "name": "stdout", 369 | "output_type": "stream", 370 | "text": [ 371 | "4\n", 372 | "abc\n", 373 | 
"6\n", 374 | "aaah!\n" 375 | ] 376 | } 377 | ], 378 | "source": [ 379 | "# Note that the arithmetic operators + and * have different meanings \n", 380 | "# depending on the types of objects with which they are used\n", 381 | "print(2 + 2)\n", 382 | "print('a' + 'bc')\n", 383 | "print(3*2)\n", 384 | "print(3*'a' + 'h!')" 385 | ] 386 | }, 387 | { 388 | "cell_type": "code", 389 | "execution_count": 8, 390 | "metadata": { 391 | "slideshow": { 392 | "slide_type": "-" 393 | } 394 | }, 395 | "outputs": [ 396 | { 397 | "name": "stdout", 398 | "output_type": "stream", 399 | "text": [ 400 | "False\n", 401 | "True\n", 402 | "True\n" 403 | ] 404 | } 405 | ], 406 | "source": [ 407 | "# Boolean operators return bool\n", 408 | "print(True and False)\n", 409 | "print(not False)" 410 | ] 411 | }, 412 | { 413 | "cell_type": "code", 414 | "execution_count": 12, 415 | "metadata": { 416 | "slideshow": { 417 | "slide_type": "-" 418 | } 419 | }, 420 | "outputs": [ 421 | { 422 | "name": "stdout", 423 | "output_type": "stream", 424 | "text": [ 425 | "5\n", 426 | "False\n" 427 | ] 428 | } 429 | ], 430 | "source": [ 431 | "a = 2 # This is assignment\n", 432 | "a += 3 # This assignment is equivalent to a = a + 3\n", 433 | "print(a)\n", 434 | "\n", 435 | "print(a == 1) # This is test for equality. It returns bool." 436 | ] 437 | }, 438 | { 439 | "cell_type": "markdown", 440 | "metadata": { 441 | "slideshow": { 442 | "slide_type": "slide" 443 | } 444 | }, 445 | "source": [ 446 | "## Unordered Types vs. 
Sequences\n", 447 | "\n", 448 | "* Unordered types: `set`, `dict`\n", 449 | "* Ordered types (sequences): `str`, `list`, `tuple`\n", 450 | " " 451 | ] 452 | }, 453 | { 454 | "cell_type": "code", 455 | "execution_count": 9, 456 | "metadata": { 457 | "slideshow": { 458 | "slide_type": "-" 459 | } 460 | }, 461 | "outputs": [ 462 | { 463 | "name": "stdout", 464 | "output_type": "stream", 465 | "text": [ 466 | "{'b', 1, 2, 'a'}\n" 467 | ] 468 | } 469 | ], 470 | "source": [ 471 | "st = {1, 2, 2, 'a', 'b'} # sets are unordered\n", 472 | "print(st)" 473 | ] 474 | }, 475 | { 476 | "cell_type": "markdown", 477 | "metadata": { 478 | "slideshow": { 479 | "slide_type": "slide" 480 | } 481 | }, 482 | "source": [ 483 | "## Dictionary Operations: Indexing\n", 484 | "\n", 485 | "* Dictionaries are indexed by keys" 486 | ] 487 | }, 488 | { 489 | "cell_type": "code", 490 | "execution_count": 20, 491 | "metadata": { 492 | "slideshow": { 493 | "slide_type": "-" 494 | } 495 | }, 496 | "outputs": [ 497 | { 498 | "name": "stdout", 499 | "output_type": "stream", 500 | "text": [ 501 | "astrophysicist\n" 502 | ] 503 | } 504 | ], 505 | "source": [ 506 | "mydic = {'Howard': 'aerospace engineer', 'Leonard': 'physicist', 'Sheldon': 'physicist', \n", 507 | " 'Penny': 'waitress', 'Raj': 'astrophysicist'}\n", 508 | "print(mydic['Raj'])" 509 | ] 510 | }, 511 | { 512 | "cell_type": "markdown", 513 | "metadata": { 514 | "slideshow": { 515 | "slide_type": "slide" 516 | } 517 | }, 518 | "source": [ 519 | "## Sequence Operations: Indexing and Slicing\n", 520 | "\n", 521 | "* Lists, tuples, and strings are indexed by numbers. 
**Indexing in Python starts from 0!**\n", 522 | "* Use `elem[index]` to extract individual sub-elements\n", 523 | "* Use `elem[start:end]` to get sub-sequence starting from index `start` and ending at index `end-1`\n", 524 | "* Use `elem[start:end:step]` to get sub-sequence starting from index `start`, in steps of `step`, ending at index `end-1`" 525 | ] 526 | }, 527 | { 528 | "cell_type": "code", 529 | "execution_count": 42, 530 | "metadata": { 531 | "slideshow": { 532 | "slide_type": "-" 533 | } 534 | }, 535 | "outputs": [ 536 | { 537 | "name": "stdout", 538 | "output_type": "stream", 539 | "text": [ 540 | "a\n", 541 | "c\n" 542 | ] 543 | }, 544 | { 545 | "ename": "IndexError", 546 | "evalue": "list index out of range", 547 | "output_type": "error", 548 | "traceback": [ 549 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 550 | "\u001b[0;31mIndexError\u001b[0m Traceback (most recent call last)", 551 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m \u001b[0;34m'abc'\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0;34m'a'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'b'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'c'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;31m# use negative numbers to index from the end\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m'a'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'b'\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0;34m'c'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m3\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 552 | "\u001b[0;31mIndexError\u001b[0m: list index out of range" 553 | ] 554 | } 555 | ], 556 | "source": [ 557 | "print( 'abc'[0] ) \n", 558 | "print( ('a', 'b', 'c')[-1]) # use negative numbers to index from the end\n", 559 | "print( ['a', 'b', 'c'][3])" 560 | ] 561 | }, 562 | { 563 | "cell_type": "code", 564 | "execution_count": 1, 565 | "metadata": { 566 | "scrolled": true, 567 | "slideshow": { 568 | "slide_type": "-" 569 | } 570 | }, 571 | "outputs": [ 572 | { 573 | "name": "stdout", 574 | "output_type": "stream", 575 | "text": [ 576 | "[20, 30, 40]\n", 577 | "[10, 20, 30]\n", 578 | "[20, 30, 40, 50]\n" 579 | ] 580 | } 581 | ], 582 | "source": [ 583 | "ls = [10, 20, 30, 40, 50]\n", 584 | "print( ls[1:4] ) \n", 585 | "print( ls[:3] )\n", 586 | "print( ls[1:] )" 587 | ] 588 | }, 589 | { 590 | "cell_type": "code", 591 | "execution_count": 2, 592 | "metadata": { 593 | "scrolled": true, 594 | "slideshow": { 595 | "slide_type": "-" 596 | } 597 | }, 598 | "outputs": [ 599 | { 600 | "name": "stdout", 601 | "output_type": "stream", 602 | "text": [ 603 | "[10, 30, 50]\n", 604 | "[50, 40, 30, 20, 10]\n", 605 | "[10, 20, 30, 40, 50]\n" 606 | ] 607 | } 608 | ], 609 | "source": [ 610 | "ls = [10, 20, 30, 40, 50]\n", 611 | "print( ls[::2] ) # get elements with even indices\n", 612 | "print( ls[::-1] ) # get elements in reverse order\n", 613 | "print( ls[:] ) # get a copy of the sequence" 614 | ] 615 | }, 616 | { 617 | "cell_type": "markdown", 618 | "metadata": { 619 | "slideshow": { 620 | "slide_type": "slide" 621 | } 622 | }, 623 | "source": [ 624 | ">## EXAMPLE PROJECT: Comparing Trump's and Biden's Inaugural Speeches\n", 625 | ">\n", 626 | ">We will use a mini-project as an extended practical example to demonstrate the concepts we are learning. 
The project aims to analyze and compare the inaugural speeches of the current and last US presidents.\n", 627 | ">\n", 628 | ">The speech transcripts were obtained from https://millercenter.org/the-presidency/presidential-speeches and copied in the text files `biden_inauguration_millercenter.txt` and `trump_inauguration_millercenter.txt` in the `data` folder." 629 | ] 630 | }, 631 | { 632 | "cell_type": "code", 633 | "execution_count": 5, 634 | "metadata": { 635 | "slideshow": { 636 | "slide_type": "-" 637 | } 638 | }, 639 | "outputs": [ 640 | { 641 | "data": { 642 | "text/plain": [ 643 | "'Chief Justice Roberts, President Carter, President Clinton, President Bush, President Obama, fellow Americans, and people of the world: thank you.\\n\\nWe, the citizens of America, are now joined in a great national effort to rebuild our country and to restore its promise for all of our people.\\n\\nTogether, we will determine the course of America and the world for years to come.\\n\\nWe will face challenges. We will confront hardships. 
But we will get the job done.\\n\\nEvery four years, we gather on these st'" 644 | ] 645 | }, 646 | "execution_count": 5, 647 | "metadata": {}, 648 | "output_type": "execute_result" 649 | } 650 | ], 651 | "source": [ 652 | "# Open one of the files and read its text into a string variable called txt\n", 653 | "with open('data/trump_inauguration_millercenter.txt') as f:\n", 654 | "    txt = f.read()\n", 655 | " \n", 656 | "txt[:500] # Show the first 500 characters of the txt variable\n" 657 | ] 658 | }, 659 | { 660 | "cell_type": "markdown", 661 | "metadata": { 662 | "slideshow": { 663 | "slide_type": "slide" 664 | } 665 | }, 666 | "source": [ 667 | "## Evaluating Functions with Objects \n", 668 | "\n", 669 | "* Use the name of a type to convert values to that type\n", 670 | "* The `len()` function returns the length of an object, i.e. the number of items it contains" 671 | ] 672 | }, 673 | { 674 | "cell_type": "code", 675 | "execution_count": 6, 676 | "metadata": { 677 | "slideshow": { 678 | "slide_type": "-" 679 | } 680 | }, 681 | "outputs": [ 682 | { 683 | "name": "stdout", 684 | "output_type": "stream", 685 | "text": [ 686 | "123.0 32\n" 687 | ] 688 | } 689 | ], 690 | "source": [ 691 | "a = float(123)\n", 692 | "b = int('32')\n", 693 | "print(a, b)" 694 | ] 695 | }, 696 | { 697 | "cell_type": "code", 698 | "execution_count": 12, 699 | "metadata": { 700 | "slideshow": { 701 | "slide_type": "-" 702 | } 703 | }, 704 | "outputs": [ 705 | { 706 | "name": "stdout", 707 | "output_type": "stream", 708 | "text": [ 709 | "(1, 2, 3) {1: 'a', 2: 'b', 3: 'c'}\n" 710 | ] 711 | } 712 | ], 713 | "source": [ 714 | "c = tuple([1, 2, 3]) \n", 715 | "d = dict( [(1, 'a'), (2, 'b'), (3, 'c')] )\n", 716 | "print(c, d)" 717 | ] 718 | }, 719 | { 720 | "cell_type": "code", 721 | "execution_count": 13, 722 | "metadata": { 723 | "slideshow": { 724 | "slide_type": "-" 725 | } 726 | }, 727 | "outputs": [ 728 | { 729 | "name": "stdout", 730 | "output_type": "stream", 731 | "text": [ 732 | "3\n", 733 | "2\n", 734 | "5\n", 735 | 
"2\n" 736 | ] 737 | } 738 | ], 739 | "source": [ 740 | "print( len( [0, 1, 2] ) )\n", 741 | "print( len('ab') )\n", 742 | "print( len( (1, 2, 3, 4, 'a') ) )\n", 743 | "print( len( {1:'a', 2:'b'} ) )" 744 | ] 745 | }, 746 | { 747 | "cell_type": "markdown", 748 | "metadata": { 749 | "slideshow": { 750 | "slide_type": "slide" 751 | } 752 | }, 753 | "source": [ 754 | "## Calling Methods on Objects\n", 755 | "\n", 756 | "### `object.method()`\n", 757 | "\n", 758 | "Use the period `.` to link the method to the object." 759 | ] 760 | }, 761 | { 762 | "cell_type": "code", 763 | "execution_count": 11, 764 | "metadata": { 765 | "slideshow": { 766 | "slide_type": "-" 767 | } 768 | }, 769 | "outputs": [ 770 | { 771 | "data": { 772 | "text/plain": [ 773 | "'HELLO'" 774 | ] 775 | }, 776 | "execution_count": 11, 777 | "metadata": {}, 778 | "output_type": "execute_result" 779 | } 780 | ], 781 | "source": [ 782 | "string1 = 'Hello'\n", 783 | "\n", 784 | "string1 + '!' # This is an operator. Operators combine objects in expressions.\n", 785 | "len(string1) # This is a function. Functions take objects as arguments.\n", 786 | "string1.upper() # This is a method. Methods are attached to objects." 
787 | ] 788 | }, 789 | { 790 | "cell_type": "markdown", 791 | "metadata": { 792 | "slideshow": { 793 | "slide_type": "slide" 794 | } 795 | }, 796 | "source": [ 797 | "## [String Methods](http://docs.python.org/3/library/stdtypes.html#string-methods)\n", 798 | "\n", 799 | "* `S.upper()` – change to upper case\n", 800 | "* `S.lower()` – change to lower case\n", 801 | "* `S.capitalize()` – capitalize the first character and lower-case the rest\n", 802 | "* `S.find(S1)` – return the index of the first occurrence of S1, or -1 if S1 is not found\n", 803 | "* `S.replace(S1, S2)` – find all instances of S1 and change to S2\n", 804 | "* `S.strip()` – remove whitespace characters from the beginning and end of a string (useful when reading in from a file); `S.strip(S1)` removes the characters in S1 instead\n", 805 | "* `S.split(S1)` – split the string into a list, using S1 as the separator\n", 806 | "* `S.join(L)` – combine the strings in the sequence L into a single string, using S as the separator" 807 | ] 808 | }, 809 | { 810 | "cell_type": "code", 811 | "execution_count": 15, 812 | "metadata": { 813 | "slideshow": { 814 | "slide_type": "-" 815 | } 816 | }, 817 | "outputs": [ 818 | { 819 | "name": "stdout", 820 | "output_type": "stream", 821 | "text": [ 822 | "MAKE ME SCREAM!\n", 823 | "Make this into a proper sentence.\n", 824 | "1\n" 825 | ] 826 | } 827 | ], 828 | "source": [ 829 | "print('Make me scream!'.upper())\n", 830 | "x = 'make this into a proper sentence'\n", 831 | "print(x.capitalize() + '.')\n", 832 | "\n", 833 | "print('Find the first \"i\" in this sentence.'.find('i'))" 834 | ] 835 | }, 836 | { 837 | "cell_type": "code", 838 | "execution_count": 16, 839 | "metadata": { 840 | "slideshow": { 841 | "slide_type": "-" 842 | } 843 | }, 844 | "outputs": [ 845 | { 846 | "name": "stdout", 847 | "output_type": "stream", 848 | "text": [ 849 | " ThiS iS a long Sentence that we will uSe aS an example.\n", 850 | "\n", 851 | "This is a long sentence that we will use as an example.\n", 852 | "Thisisalongsentencethatwewilluseasanexample.\n", 853 | "\n" 854 | ] 855 | } 856 | ], 857 | "source": [ 858 | "x = ' This is a long sentence that we 
will use as an example.\\n'\n", 859 | "print(x.replace('s', 'S'))\n", 860 | "print(x.strip())\n", 861 | "print(x.replace(' ', ''))" 862 | ] 863 | }, 864 | { 865 | "cell_type": "code", 866 | "execution_count": 17, 867 | "metadata": { 868 | "slideshow": { 869 | "slide_type": "-" 870 | } 871 | }, 872 | "outputs": [ 873 | { 874 | "name": "stdout", 875 | "output_type": "stream", 876 | "text": [ 877 | "['this', 'is', 'a', 'collection', 'of', 'words', 'i', 'would', 'like', 'to', 'break', 'it', 'into', 'tokens']\n", 878 | "['this is a c', 'llecti', 'n ', 'f w', 'rds i w', 'uld like t', ' break it int', ' t', 'kens']\n", 879 | "this-is-a-collection-of-words-i-would-like-to-break-it-into-tokens\n" 880 | ] 881 | } 882 | ], 883 | "source": [ 884 | "x = 'this is a collection of words i would like to break it into tokens'\n", 885 | "y = x.split() # default is to split on any run of whitespace\n", 886 | "print(y)\n", 887 | "print(x.split('o')) \n", 888 | "\n", 889 | "x_new = '-'.join(y)\n", 890 | "print(x_new)" 891 | ] 892 | }, 893 | { 894 | "cell_type": "markdown", 895 | "metadata": { 896 | "slideshow": { 897 | "slide_type": "slide" 898 | } 899 | }, 900 | "source": [ 901 | ">## EXAMPLE PROJECT: Comparing Trump's and Biden's Inaugural Speeches\n", 902 | ">\n", 903 | ">Remember, our goal is to analyze and compare the inaugural speeches of the current and last US presidents. Using string methods, we can:\n", 904 | ">\n", 905 | ">1. Clean up the text\n", 906 | ">2. Extract a list of all the words used in the speech\n", 907 | ">3. Estimate the length of the speech\n", 908 | ">4. 
Estimate the number of unique words used in the speech" 909 | ] 910 | }, 911 | { 912 | "cell_type": "code", 913 | "execution_count": 4, 914 | "metadata": { 915 | "slideshow": { 916 | "slide_type": "-" 917 | } 918 | }, 919 | "outputs": [ 920 | { 921 | "name": "stdout", 922 | "output_type": "stream", 923 | "text": [ 924 | "['2017', '20th', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'about', 'about', 'accept', 'across', 'across', 'across', 'across', 'across', 'action', 'action', 'administration', 'affairs', 'again', 'again', 'again', 'again', 'again', 'again', 'again', 'again', 'again', 'against', 'aid', 'airports', 'all', 'all', 'all', 'all', 'all', 'all', 'all', 'all', 'all', 'all', 'all', 'all', 'allegiance', 'allegiance', 'alliances', 'allowing', 'almighty', 'along', 'always', 'always', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'american', 'american', 'american', 'american', 'american', 'american', 'american', 'american', 'american', 'american', 'american', 'american', 'americans', 'americans', 'americans', 'americans', 'an', 'an', 'an', 'and', 'and']\n", 925 | "1436\n", 926 | "536\n" 927 | ] 928 | } 929 | ], 930 | "source": [ 931 | "# Open one of the files and read its text into a string variable called txt\n", 932 | "with open('data/trump_inauguration_millercenter.txt') as f:\n", 933 | "    txt = f.read()\n", 934 | " \n", 935 | "# Remove paragraph breaks and normalize apostrophes\n", 936 | "txt = txt.strip().replace('\\n', ' ').replace(\"’\", \"'\")\n", 937 | "\n", 938 | "# Get rid of possessives and expand contractions\n", 939 | "txt = txt.replace(\"'s\", '').replace(\"'ve\", ' have').replace(\"'re\", ' are')\n", 940 | "txt = txt.replace(\"can't\", 'can not').replace(\"n't\", ' not')\n", 941 | "\n", 942 | "# Remove punctuation\n", 943 | "txt = txt.replace('—', '').replace('–', '')\n", 
944 | "txt = txt.replace('.', '').replace(',', '').replace(':', '').replace(';', '').replace('…', '')\n", 945 | "txt = txt.replace(\"”\", '').replace(\"“\", '')\n", 946 | "\n", 947 | "# Convert to lower-case\n", 948 | "txt = txt.lower()\n", 949 | "\n", 950 | "# Break into words\n", 951 | "wrds = txt.split()\n", 952 | "print(sorted(wrds)[:100])\n", 953 | "\n", 954 | "# Count the number of words in the speech\n", 955 | "print(len(wrds))\n", 956 | "\n", 957 | "# Count the number of unique words\n", 958 | "print(len(set(wrds)))\n" 959 | ] 960 | }, 961 | { 962 | "cell_type": "markdown", 963 | "metadata": { 964 | "slideshow": { 965 | "slide_type": "slide" 966 | } 967 | }, 968 | "source": [ 969 | "## Set Methods\n", 970 | "\n", 971 | "![Set operations](figs/sets.png \"Set operations\")\n", 972 | "\n", 973 | "* `S1.union(S2)`, `S1|S2` — elements in S1 or S2, or both\n", 974 | "* `S1.intersection(S2)`, `S1&S2` — elements in both S1 and S2\n", 975 | "* `S1.difference(S2)`, `S1-S2` — elements in S1 but not in S2\n", 976 | "* `S1.symmetric_difference(S2)`, `S1^S2` — elements in S1 or S2 but not both" 977 | ] 978 | }, 979 | { 980 | "cell_type": "code", 981 | "execution_count": 19, 982 | "metadata": { 983 | "slideshow": { 984 | "slide_type": "-" 985 | } 986 | }, 987 | "outputs": [ 988 | { 989 | "name": "stdout", 990 | "output_type": "stream", 991 | "text": [ 992 | "{'m', 'e', 't', 'r'}\n" 993 | ] 994 | } 995 | ], 996 | "source": [ 997 | "st1 = set('homophily')\n", 998 | "st2 = set('heterophily')\n", 999 | "print(st1^st2)" 1000 | ] 1001 | }, 1002 | { 1003 | "cell_type": "markdown", 1004 | "metadata": { 1005 | "slideshow": { 1006 | "slide_type": "slide" 1007 | } 1008 | }, 1009 | "source": [ 1010 | "## Mutability\n", 1011 | "\n", 1012 | "* Immutable types: `str`, `tuple`, and all scalars\n", 1013 | "* Mutable types: `list`, `set`, `dict`\n", 1014 | "\n", 1015 | "**Objects of mutable types can be modified once they are created.**" 1016 | ] 1017 | }, 1018 | { 1019 | "cell_type": 
"code", 1020 | "execution_count": 25, 1021 | "metadata": { 1022 | "slideshow": { 1023 | "slide_type": "-" 1024 | } 1025 | }, 1026 | "outputs": [ 1027 | { 1028 | "name": "stdout", 1029 | "output_type": "stream", 1030 | "text": [ 1031 | "{1: 'a', 2: 'b', 3: 'c'}\n", 1032 | "[1, 2, 3, 4, 5]\n" 1033 | ] 1034 | } 1035 | ], 1036 | "source": [ 1037 | "dic = {1:'a', 2:'b'}\n", 1038 | "dic[3] = 'c'\n", 1039 | "print(dic)\n", 1040 | "\n", 1041 | "ls = [5, 4, 1, 3, 2]\n", 1042 | "ls.sort()\n", 1043 | "print(ls)" 1044 | ] 1045 | }, 1046 | { 1047 | "cell_type": "markdown", 1048 | "metadata": { 1049 | "slideshow": { 1050 | "slide_type": "slide" 1051 | } 1052 | }, 1053 | "source": [ 1054 | "## [List Methods](http://docs.python.org/3/library/stdtypes.html#mutable-sequence-types)\n", 1055 | "\n", 1056 | "* `L.append(e)`\n", 1057 | "* `L.insert(i, e)`\n", 1058 | "* `L.remove(e)`\n", 1059 | "* `L.extend(L1)`\n", 1060 | "* `L.pop(i)`\n", 1061 | "* `L.sort()`\n", 1062 | "* `L.reverse()`" 1063 | ] 1064 | }, 1065 | { 1066 | "cell_type": "code", 1067 | "execution_count": 13, 1068 | "metadata": { 1069 | "slideshow": { 1070 | "slide_type": "-" 1071 | } 1072 | }, 1073 | "outputs": [ 1074 | { 1075 | "name": "stdout", 1076 | "output_type": "stream", 1077 | "text": [ 1078 | "[1, 2, 3, 4]\n", 1079 | "[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\n" 1080 | ] 1081 | } 1082 | ], 1083 | "source": [ 1084 | "ls1 = [1, 2, 3]\n", 1085 | "ls1.append(4)\n", 1086 | "print(ls1)\n", 1087 | "\n", 1088 | "ls1.extend([5, 6, 7, 8, 9, 10])\n", 1089 | "print(ls1)" 1090 | ] 1091 | }, 1092 | { 1093 | "cell_type": "code", 1094 | "execution_count": 13, 1095 | "metadata": { 1096 | "slideshow": { 1097 | "slide_type": "-" 1098 | } 1099 | }, 1100 | "outputs": [ 1101 | { 1102 | "name": "stdout", 1103 | "output_type": "stream", 1104 | "text": [ 1105 | "[2, 3, 4]\n", 1106 | "3 [2, 4]\n" 1107 | ] 1108 | } 1109 | ], 1110 | "source": [ 1111 | "mylist = [1, 2, 3, 4]\n", 1112 | "\n", 1113 | "mylist.remove(1)\n", 1114 | "print(mylist)\n", 1115 
| "\n", 1116 | "popped = mylist.pop(1)\n", 1117 | "print(popped, mylist)" 1118 | ] 1119 | }, 1120 | { 1121 | "cell_type": "code", 1122 | "execution_count": 31, 1123 | "metadata": { 1124 | "slideshow": { 1125 | "slide_type": "-" 1126 | } 1127 | }, 1128 | "outputs": [ 1129 | { 1130 | "name": "stdout", 1131 | "output_type": "stream", 1132 | "text": [ 1133 | "[1, 2, 3, 4, 5]\n", 1134 | "[10, 9, 6, 8, 7]\n", 1135 | "[10, 9, 6, 8, 7] [6, 7, 8, 9, 10]\n" 1136 | ] 1137 | } 1138 | ], 1139 | "source": [ 1140 | "mylist = [4, 5, 2, 1, 3]\n", 1141 | "mylist.sort() # Sorts in-place. It is more efficient but overwrites the input.\n", 1142 | "print(mylist)\n", 1143 | "\n", 1144 | "mylist = [10, 9, 6, 8, 7]\n", 1145 | "sorted(mylist) \n", 1146 | "print(mylist)\n", 1147 | "\n", 1148 | "newlist = sorted(mylist) # Creates a new list that is sorted, not changing the original.\n", 1149 | "print(mylist, newlist)" 1150 | ] 1151 | }, 1152 | { 1153 | "cell_type": "markdown", 1154 | "metadata": { 1155 | "slideshow": { 1156 | "slide_type": "slide" 1157 | } 1158 | }, 1159 | "source": [ 1160 | "## Mutability Can Be Dangerous" 1161 | ] 1162 | }, 1163 | { 1164 | "cell_type": "code", 1165 | "execution_count": 14, 1166 | "metadata": { 1167 | "slideshow": { 1168 | "slide_type": "-" 1169 | } 1170 | }, 1171 | "outputs": [ 1172 | { 1173 | "name": "stdout", 1174 | "output_type": "stream", 1175 | "text": [ 1176 | "[1, 2, 3, [4, 5, 6, 7]]\n", 1177 | "[1, 2, 3, [4, 5, 6, 7, 8, 9, 10]]\n" 1178 | ] 1179 | } 1180 | ], 1181 | "source": [ 1182 | "ls1 = [1, 2, 3]\n", 1183 | "ls2 = [4, 5, 6, 7]\n", 1184 | "\n", 1185 | "ls1.append(ls2)\n", 1186 | "print(ls1)\n", 1187 | "\n", 1188 | "ls2.extend([8, 9, 10])\n", 1189 | "print(ls1)" 1190 | ] 1191 | }, 1192 | { 1193 | "cell_type": "markdown", 1194 | "metadata": { 1195 | "slideshow": { 1196 | "slide_type": "slide" 1197 | } 1198 | }, 1199 | "source": [ 1200 | "## Aliasing vs. 
Cloning\n", 1201 | "\n", 1202 | "![Aliasing](figs/aliasing.png \"Aliasing\")" 1203 | ] 1204 | }, 1205 | { 1206 | "cell_type": "code", 1207 | "execution_count": 16, 1208 | "metadata": { 1209 | "slideshow": { 1210 | "slide_type": "-" 1211 | } 1212 | }, 1213 | "outputs": [ 1214 | { 1215 | "name": "stdout", 1216 | "output_type": "stream", 1217 | "text": [ 1218 | "[1, 2, 3]\n" 1219 | ] 1220 | } 1221 | ], 1222 | "source": [ 1223 | "ls1 = [1, 2, 3]\n", 1224 | "ls2 = ls1[:] # Using [:] is one way to clone\n", 1225 | "\n", 1226 | "ls1.reverse()\n", 1227 | "print(ls2)" 1228 | ] 1229 | }, 1230 | { 1231 | "cell_type": "markdown", 1232 | "metadata": { 1233 | "slideshow": { 1234 | "slide_type": "slide" 1235 | } 1236 | }, 1237 | "source": [ 1238 | ">## QUIZ QUESTION\n", 1239 | ">\n", 1240 | ">What will the following program print?\n", 1241 | ">\n", 1242 | ">```\n", 1243 | ">ls1 = [1, 2, 3, 4, 5]\n", 1244 | ">ls2 = ls1\n", 1245 | ">ls2[2] = 0\n", 1246 | ">print(ls1)\n", 1247 | ">```\n", 1248 | ">\n", 1249 | ">* (A) `[1, 2, 3, 4, 5]`\n", 1250 | ">* (B) `[1, 0, 3, 4, 5]`\n", 1251 | ">* (C) `[1, 2, 0, 4, 5]`\n", 1252 | ">* (D) `0`" 1253 | ] 1254 | }, 1255 | { 1256 | "cell_type": "markdown", 1257 | "metadata": { 1258 | "slideshow": { 1259 | "slide_type": "slide" 1260 | } 1261 | }, 1262 | "source": [ 1263 | "## So Far, We Learned How to Write Straight-Line Programs" 1264 | ] 1265 | }, 1266 | { 1267 | "cell_type": "code", 1268 | "execution_count": 1, 1269 | "metadata": { 1270 | "scrolled": true, 1271 | "slideshow": { 1272 | "slide_type": "-" 1273 | } 1274 | }, 1275 | "outputs": [ 1276 | { 1277 | "name": "stdout", 1278 | "output_type": "stream", 1279 | "text": [ 1280 | "There are 12 words in the sentence.\n" 1281 | ] 1282 | } 1283 | ], 1284 | "source": [ 1285 | "s = 'All animals are equal, but some animals are more equal than others.'\n", 1286 | "s = s.rstrip('.').lower()\n", 1287 | "s_tokens = s.split()\n", 1288 | "print('There are', len(s_tokens), 'words in the sentence.')\n" 1289 | ] 
1290 | }, 1291 | { 1292 | "cell_type": "markdown", 1293 | "metadata": { 1294 | "slideshow": { 1295 | "slide_type": "fragment" 1296 | } 1297 | }, 1298 | "source": [ 1299 | "In straight-line programs, code is executed line by line, from top to bottom and within a line, from left to right (unless overridden with brackets).\n", 1300 | "\n", 1301 | "Statements can be executed in more complex order, however, and the control flow determines how this is done." 1302 | ] 1303 | }, 1304 | { 1305 | "cell_type": "markdown", 1306 | "metadata": { 1307 | "slideshow": { 1308 | "slide_type": "slide" 1309 | } 1310 | }, 1311 | "source": [ 1312 | "## [Control Flow](https://www.youtube.com/watch?v=k0xgjUhEG3U)\n", 1313 | "\n", 1314 | "* Control flow is the order in which statements are executed or evaluated\n", 1315 | "* In Python, there are three main categories of control flow:\n", 1316 | " * **Branches** (conditional statements) – execute only if some condition is met\n", 1317 | " * **Loops** (iteration) – execute repeatedly \n", 1318 | " * **Function calls** – execute a set of distant statements and return back to the control flow" 1319 | ] 1320 | }, 1321 | { 1322 | "cell_type": "markdown", 1323 | "metadata": { 1324 | "slideshow": { 1325 | "slide_type": "-" 1326 | } 1327 | }, 1328 | "source": [ 1329 | "![Three categories of control flow](figs/control_flow.png \"Three categories of control flow\")\n" 1330 | ] 1331 | }, 1332 | { 1333 | "cell_type": "markdown", 1334 | "metadata": { 1335 | "slideshow": { 1336 | "slide_type": "slide" 1337 | } 1338 | }, 1339 | "source": [ 1340 | "# Conditional Statements\n", 1341 | "\n", 1342 | "![Conditional statements](figs/conditional_statements.png \"Conditional statements\")" 1343 | ] 1344 | }, 1345 | { 1346 | "cell_type": "markdown", 1347 | "metadata": { 1348 | "slideshow": { 1349 | "slide_type": "slide" 1350 | } 1351 | }, 1352 | "source": [ 1353 | "## Conditional Statements" 1354 | ] 1355 | }, 1356 | { 1357 | "cell_type": "markdown", 1358 | 
"metadata": { 1359 | "slideshow": { 1360 | "slide_type": "-" 1361 | } 1362 | }, 1363 | "source": [ 1364 | "```\n", 1365 | "if *Boolean expression*:\n", 1366 | " *block of code*\n", 1367 | "```" 1368 | ] 1369 | }, 1370 | { 1371 | "cell_type": "markdown", 1372 | "metadata": { 1373 | "slideshow": { 1374 | "slide_type": "-" 1375 | } 1376 | }, 1377 | "source": [ 1378 | "```\n", 1379 | "if *Boolean expression*:\n", 1380 | " *block of code*\n", 1381 | "else:\n", 1382 | " *block of code*\n", 1383 | "```" 1384 | ] 1385 | }, 1386 | { 1387 | "cell_type": "markdown", 1388 | "metadata": { 1389 | "slideshow": { 1390 | "slide_type": "-" 1391 | } 1392 | }, 1393 | "source": [ 1394 | "```\n", 1395 | "if *Boolean expression*:\n", 1396 | " *block of code*\n", 1397 | "elif *Boolean expression*:\n", 1398 | " *block of code*\n", 1399 | "else:\n", 1400 | " *block of code*\n", 1401 | "```" 1402 | ] 1403 | }, 1404 | { 1405 | "cell_type": "code", 1406 | "execution_count": 4, 1407 | "metadata": { 1408 | "scrolled": true, 1409 | "slideshow": { 1410 | "slide_type": "-" 1411 | } 1412 | }, 1413 | "outputs": [ 1414 | { 1415 | "name": "stdout", 1416 | "output_type": "stream", 1417 | "text": [ 1418 | "Negative\n" 1419 | ] 1420 | } 1421 | ], 1422 | "source": [ 1423 | "x = -2\n", 1424 | "if x > 0:\n", 1425 | " print('Positive')\n", 1426 | "elif x < 0:\n", 1427 | " print('Negative')\n", 1428 | "else:\n", 1429 | " print('Zero')\n", 1430 | " " 1431 | ] 1432 | }, 1433 | { 1434 | "cell_type": "markdown", 1435 | "metadata": { 1436 | "slideshow": { 1437 | "slide_type": "slide" 1438 | } 1439 | }, 1440 | "source": [ 1441 | "## Indentation in Python Code\n", 1442 | "\n", 1443 | "* Indentation is semantically meaningful in Python\n", 1444 | "* You can use [tabs or spaces](https://www.youtube.com/watch?v=SsoOG6ZeyUI)" 1445 | ] 1446 | }, 1447 | { 1448 | "cell_type": "markdown", 1449 | "metadata": { 1450 | "slideshow": { 1451 | "slide_type": "fragment" 1452 | } 1453 | }, 1454 | "source": [ 1455 | "* Obviously(!), 
tabs are preferable\n", 1456 | "* However, it does not really matter in Jupyter as Jupyter converts tabs to spaces by default" 1457 | ] 1458 | }, 1459 | { 1460 | "cell_type": "markdown", 1461 | "metadata": { 1462 | "slideshow": { 1463 | "slide_type": "slide" 1464 | } 1465 | }, 1466 | "source": [ 1467 | "## You Can Nest Conditional Statements\n" 1468 | ] 1469 | }, 1470 | { 1471 | "cell_type": "code", 1472 | "execution_count": 6, 1473 | "metadata": { 1474 | "scrolled": true, 1475 | "slideshow": { 1476 | "slide_type": "-" 1477 | } 1478 | }, 1479 | "outputs": [ 1480 | { 1481 | "name": "stdout", 1482 | "output_type": "stream", 1483 | "text": [ 1484 | "This is a negative number\n" 1485 | ] 1486 | } 1487 | ], 1488 | "source": [ 1489 | "x = -100\n", 1490 | "\n", 1491 | "if type(x) == int or type(x) == float:\n", 1492 | " if x >= 0:\n", 1493 | " print('This is a nonnegative number.')\n", 1494 | " else:\n", 1495 | " print('This is a negative number.')\n", 1496 | "elif type(x) == str:\n", 1497 | " print('This is a string.')\n", 1498 | "else:\n", 1499 | " print(\"I don't know what this is.\")\n", 1500 | " " 1501 | ] 1502 | }, 1503 | { 1504 | "cell_type": "markdown", 1505 | "metadata": { 1506 | "slideshow": { 1507 | "slide_type": "slide" 1508 | } 1509 | }, 1510 | "source": [ 1511 | "# Iteration\n", 1512 | "\n", 1513 | "![Iteration](figs/iteration.png \"Iteration\")" 1514 | ] 1515 | }, 1516 | { 1517 | "cell_type": "markdown", 1518 | "metadata": { 1519 | "slideshow": { 1520 | "slide_type": "slide" 1521 | } 1522 | }, 1523 | "source": [ 1524 | "## Iteration: `while` vs. 
`for`" 1525 | ] 1526 | }, 1527 | { 1528 | "cell_type": "markdown", 1529 | "metadata": { 1530 | "slideshow": { 1531 | "slide_type": "-" 1532 | } 1533 | }, 1534 | "source": [ 1535 | "```\n", 1536 | "while *Boolean expression*:\n", 1537 | " *block of code*\n", 1538 | "```" 1539 | ] 1540 | }, 1541 | { 1542 | "cell_type": "markdown", 1543 | "metadata": { 1544 | "slideshow": { 1545 | "slide_type": "-" 1546 | } 1547 | }, 1548 | "source": [ 1549 | "```\n", 1550 | "for *element* in *sequence*:\n", 1551 | " *block of code*\n", 1552 | "```" 1553 | ] 1554 | }, 1555 | { 1556 | "cell_type": "markdown", 1557 | "metadata": { 1558 | "slideshow": { 1559 | "slide_type": "slide" 1560 | } 1561 | }, 1562 | "source": [ 1563 | "## Iteration: `while` with decrementing function\n", 1564 | "\n", 1565 | "The decrementing function is a function that maps variables to an integer that is initially non-negative but that decreases with every pass through the loop; the loop ends when the integer is 0." 1566 | ] 1567 | }, 1568 | { 1569 | "cell_type": "code", 1570 | "execution_count": 7, 1571 | "metadata": { 1572 | "scrolled": true, 1573 | "slideshow": { 1574 | "slide_type": "-" 1575 | } 1576 | }, 1577 | "outputs": [ 1578 | { 1579 | "name": "stdout", 1580 | "output_type": "stream", 1581 | "text": [ 1582 | "1\n", 1583 | "2\n", 1584 | "3\n", 1585 | "4\n", 1586 | "5\n" 1587 | ] 1588 | } 1589 | ], 1590 | "source": [ 1591 | "# decrementing function: 5 - x\n", 1592 | "x = 0\n", 1593 | "while x < 5: \n", 1594 | " x += 1\n", 1595 | " print(x)\n", 1596 | " " 1597 | ] 1598 | }, 1599 | { 1600 | "cell_type": "markdown", 1601 | "metadata": { 1602 | "slideshow": { 1603 | "slide_type": "slide" 1604 | } 1605 | }, 1606 | "source": [ 1607 | "## Iteration: `while` with conditional statements\n" 1608 | ] 1609 | }, 1610 | { 1611 | "cell_type": "code", 1612 | "execution_count": 8, 1613 | "metadata": { 1614 | "scrolled": true, 1615 | "slideshow": { 1616 | "slide_type": "-" 1617 | } 1618 | }, 1619 | "outputs": [ 1620 | { 
1621 | "name": "stdout", 1622 | "output_type": "stream", 1623 | "text": [ 1624 | "Guess which number from 1 to 100 I'm thinking of? 70\n", 1625 | "You are quite far. Try again.\n", 1626 | "Guess which number from 1 to 100 I'm thinking of? 24\n", 1627 | "You are very close. Try again.\n", 1628 | "Guess which number from 1 to 100 I'm thinking of? 25\n", 1629 | "That's right!\n" 1630 | ] 1631 | } 1632 | ], 1633 | "source": [ 1634 | "correct = 25\n", 1635 | "repeat = True\n", 1636 | "\n", 1637 | "while repeat:\n", 1638 | " guess = int(input(\"Guess which number from 1 to 100 I'm thinking of? \"))\n", 1639 | " \n", 1640 | " if guess > correct + 10 or guess < correct - 10:\n", 1641 | " print(\"You are quite far. Try again.\")\n", 1642 | " elif guess != correct:\n", 1643 | " print(\"You are very close. Try again.\")\n", 1644 | " else:\n", 1645 | " print(\"That's right!\")\n", 1646 | " repeat = False\n", 1647 | " " 1648 | ] 1649 | }, 1650 | { 1651 | "cell_type": "markdown", 1652 | "metadata": { 1653 | "slideshow": { 1654 | "slide_type": "slide" 1655 | } 1656 | }, 1657 | "source": [ 1658 | "## Iteration: `for` with sequences" 1659 | ] 1660 | }, 1661 | { 1662 | "cell_type": "code", 1663 | "execution_count": 9, 1664 | "metadata": { 1665 | "scrolled": true, 1666 | "slideshow": { 1667 | "slide_type": "-" 1668 | } 1669 | }, 1670 | "outputs": [ 1671 | { 1672 | "name": "stdout", 1673 | "output_type": "stream", 1674 | "text": [ 1675 | "1 2 3 4 5 " 1676 | ] 1677 | } 1678 | ], 1679 | "source": [ 1680 | "for i in [1, 2, 3, 4, 5]:\n", 1681 | " print(i, end=' ') \n", 1682 | " # Note that the \"end\" parameter replaces the default new line with a space\n", 1683 | " # This allows us to print on the same line\n", 1684 | " " 1685 | ] 1686 | }, 1687 | { 1688 | "cell_type": "markdown", 1689 | "metadata": { 1690 | "slideshow": { 1691 | "slide_type": "slide" 1692 | } 1693 | }, 1694 | "source": [ 1695 | "## Iteration: `for` with `range()`\n", 1696 | "\n", 1697 | "* In-built function that 
produces an immutable ordered non-scalar object of type `range`\n", 1698 | "* Call as `range([start], stop, [step])`. If omitted, `start = 0` and `step = 1`. \n", 1699 | "* The function produces the progression of integers `[start, start + step, start + 2*step, ..., start + i*step]`, stopping before `stop` is reached " 1700 | ] 1701 | }, 1702 | { 1703 | "cell_type": "code", 1704 | "execution_count": 10, 1705 | "metadata": { 1706 | "scrolled": true, 1707 | "slideshow": { 1708 | "slide_type": "-" 1709 | } 1710 | }, 1711 | "outputs": [ 1712 | { 1713 | "name": "stdout", 1714 | "output_type": "stream", 1715 | "text": [ 1716 | "range(0, 6)\n", 1717 | "[0, 1, 2, 3, 4, 5]\n" 1718 | ] 1719 | } 1720 | ], 1721 | "source": [ 1722 | "print(range(6))\n", 1723 | "print(list(range(6)))\n" 1724 | ] 1725 | }, 1726 | { 1727 | "cell_type": "code", 1728 | "execution_count": 11, 1729 | "metadata": { 1730 | "scrolled": true, 1731 | "slideshow": { 1732 | "slide_type": "-" 1733 | } 1734 | }, 1735 | "outputs": [ 1736 | { 1737 | "name": "stdout", 1738 | "output_type": "stream", 1739 | "text": [ 1740 | "0 1 2 3 4 5 \n", 1741 | "1 2 3 4 5 \n", 1742 | "1 3 5 " 1743 | ] 1744 | } 1745 | ], 1746 | "source": [ 1747 | "for i in range(6):\n", 1748 | " print(i, end=' ')\n", 1749 | "print() \n", 1750 | "\n", 1751 | "for i in range(1, 6):\n", 1752 | " print(i, end=' ')\n", 1753 | "print()\n", 1754 | " \n", 1755 | "for i in range(1, 6, 2):\n", 1756 | " print(i, end=' ')\n", 1757 | " " 1758 | ] 1759 | }, 1760 | { 1761 | "cell_type": "markdown", 1762 | "metadata": { 1763 | "slideshow": { 1764 | "slide_type": "slide" 1765 | } 1766 | }, 1767 | "source": [ 1768 | "## Indexing Lists with `range(len(L))`" 1769 | ] 1770 | }, 1771 | { 1772 | "cell_type": "code", 1773 | "execution_count": 12, 1774 | "metadata": { 1775 | "scrolled": false, 1776 | "slideshow": { 1777 | "slide_type": "-" 1778 | } 1779 | }, 1780 | "outputs": [ 1781 | { 1782 | "name": "stdout", 1783 | "output_type": "stream", 1784 | "text": [ 1785 | "index 0 - a\n", 1786 | "index 1 - b\n", 
1787 | "index 2 - c\n", 1788 | "index 3 - d\n" 1789 | ] 1790 | } 1791 | ], 1792 | "source": [ 1793 | "mylist = ['a', 'b', 'c', 'd']\n", 1794 | "for i in range(len(mylist)):\n", 1795 | " print('index', i, '-', mylist[i])\n", 1796 | " " 1797 | ] 1798 | }, 1799 | { 1800 | "cell_type": "markdown", 1801 | "metadata": { 1802 | "slideshow": { 1803 | "slide_type": "fragment" 1804 | } 1805 | }, 1806 | "source": [ 1807 | "* This is especially useful when you need to go simultaneously over two different lists of the same length" 1808 | ] 1809 | }, 1810 | { 1811 | "cell_type": "code", 1812 | "execution_count": 13, 1813 | "metadata": { 1814 | "scrolled": true, 1815 | "slideshow": { 1816 | "slide_type": "-" 1817 | } 1818 | }, 1819 | "outputs": [ 1820 | { 1821 | "name": "stdout", 1822 | "output_type": "stream", 1823 | "text": [ 1824 | "a1, b2, c3, d4, " 1825 | ] 1826 | } 1827 | ], 1828 | "source": [ 1829 | "mylist1 = ['a', 'b', 'c', 'd']\n", 1830 | "mylist2 = [1, 2, 3, 4]\n", 1831 | "for i in range(len(mylist1)):\n", 1832 | " print(mylist1[i] + str(mylist2[i]), end=', ')" 1833 | ] 1834 | }, 1835 | { 1836 | "cell_type": "markdown", 1837 | "metadata": { 1838 | "slideshow": { 1839 | "slide_type": "slide" 1840 | } 1841 | }, 1842 | "source": [ 1843 | "## Iteration: `break` and `continue`\n", 1844 | "\n", 1845 | "* Use `break` to exit a loop \n", 1846 | "* Use `continue` to go directly to next iteration" 1847 | ] 1848 | }, 1849 | { 1850 | "cell_type": "code", 1851 | "execution_count": 14, 1852 | "metadata": { 1853 | "scrolled": true, 1854 | "slideshow": { 1855 | "slide_type": "-" 1856 | } 1857 | }, 1858 | "outputs": [ 1859 | { 1860 | "name": "stdout", 1861 | "output_type": "stream", 1862 | "text": [ 1863 | "0\n", 1864 | "1\n", 1865 | "3\n", 1866 | "4\n" 1867 | ] 1868 | } 1869 | ], 1870 | "source": [ 1871 | "for i in range(5):\n", 1872 | " if i == 2:\n", 1873 | " continue # Now try with break\n", 1874 | " print(i)\n", 1875 | " " 1876 | ] 1877 | }, 1878 | { 1879 | "cell_type": 
"markdown", 1880 | "metadata": { 1881 | "slideshow": { 1882 | "slide_type": "slide" 1883 | } 1884 | }, 1885 | "source": [ 1886 | ">## EXAMPLE PROJECT: Comparing Trump's and Biden's Inaugural Speeches\n", 1887 | ">\n", 1888 | ">Above, we already cleaned up the text in Trump's speech and saved a list of all the words used in the speech in the variable `wrds`. Next, we will:\n", 1889 | ">\n", 1890 | ">1. Count the number of times each unique word is mentioned in the speech\n", 1891 | ">2. Exclude non-meaningful words such as articles and prepositions\n", 1892 | ">3. Identify the most commonly used meaningful words to reveal the theme and tone of the speech" 1893 | ] 1894 | }, 1895 | { 1896 | "cell_type": "code", 1897 | "execution_count": 6, 1898 | "metadata": { 1899 | "scrolled": true, 1900 | "slideshow": { 1901 | "slide_type": "-" 1902 | } 1903 | }, 1904 | "outputs": [ 1905 | { 1906 | "data": { 1907 | "text/plain": [ 1908 | "[('and', 74),\n", 1909 | " ('the', 70),\n", 1910 | " ('we', 49),\n", 1911 | " ('of', 48),\n", 1912 | " ('our', 48),\n", 1913 | " ('will', 40),\n", 1914 | " ('to', 37),\n", 1915 | " ('is', 21),\n", 1916 | " ('america', 18),\n", 1917 | " ('a', 15)]" 1918 | ] 1919 | }, 1920 | "execution_count": 6, 1921 | "metadata": {}, 1922 | "output_type": "execute_result" 1923 | } 1924 | ], 1925 | "source": [ 1926 | "# Create dictionary with word:count\n", 1927 | "word_counts = {}\n", 1928 | "\n", 1929 | "for i in wrds:\n", 1930 | " if i not in word_counts:\n", 1931 | " word_counts[i] = 1\n", 1932 | " else:\n", 1933 | " word_counts[i] += 1\n", 1934 | "\n", 1935 | "# Print the words with counts in decreasing order of popularity\n", 1936 | "# Note this produces a list of tuples\n", 1937 | "sorted_word_counts = sorted(word_counts.items(), key=lambda i: i[1], reverse=True)\n", 1938 | "\n", 1939 | "sorted_word_counts[:10]" 1940 | ] 1941 | }, 1942 | { 1943 | "cell_type": "code", 1944 | "execution_count": 7, 1945 | "metadata": { 1946 | "slideshow": { 1947 | 
"slide_type": "fragment" 1948 | } 1949 | }, 1950 | "outputs": [ 1951 | { 1952 | "data": { 1953 | "text/plain": [ 1954 | "[('we', 49),\n", 1955 | " ('our', 48),\n", 1956 | " ('will', 40),\n", 1957 | " ('america', 18),\n", 1958 | " ('you', 12),\n", 1959 | " ('all', 12),\n", 1960 | " ('american', 12),\n", 1961 | " ('their', 11),\n", 1962 | " ('your', 11),\n", 1963 | " ('people', 9)]" 1964 | ] 1965 | }, 1966 | "execution_count": 7, 1967 | "metadata": {}, 1968 | "output_type": "execute_result" 1969 | } 1970 | ], 1971 | "source": [ 1972 | "# We will create a dictionary of all words mentioned more than once without stop words\n", 1973 | "# Stop words are common words that are not meaningful in this context\n", 1974 | "stop_words = ['a', 'about', 'across', 'after', 'an', 'and', 'any', 'are', 'as', 'at', \n", 1975 | " 'be', 'because', 'but', 'by', 'did', 'do', 'does', 'for', 'from',\n", 1976 | " 'get', 'has', 'have', 'if', 'in', 'is', 'it', 'its',\n", 1977 | " 'many', 'more', 'much', 'no', 'not', 'of', 'on', 'or', 'out',\n", 1978 | " 'so', 'some', 'than', 'the', 'this', 'that', 'those', 'through', 'to',\n", 1979 | " 'very', 'what', 'where', 'whether', 'which', 'while', 'who', 'with']\n", 1980 | "\n", 1981 | "common_words = []\n", 1982 | "for i in sorted_word_counts:\n", 1983 | " if i[0] not in stop_words:\n", 1984 | " if i[1] > 1:\n", 1985 | " common_words.append(i)\n", 1986 | " else:\n", 1987 | " break\n", 1988 | " \n", 1989 | "common_words[:10]" 1990 | ] 1991 | }, 1992 | { 1993 | "cell_type": "markdown", 1994 | "metadata": { 1995 | "slideshow": { 1996 | "slide_type": "slide" 1997 | } 1998 | }, 1999 | "source": [ 2000 | "# List Comprehensions\n", 2001 | "\n", 2002 | "```\n", 2003 | "L = [*object, expression, or function* for *element* in *sequence*]\n", 2004 | "L = [*object, expression, or function* for *element* in *sequence* if *Boolean expression*]\n", 2005 | "L = [*object, expression, or function* for *element* in *sequence* for *element2* in *sequence2*]\n", 2006 | 
"```\n", 2007 | "\n", 2008 | "* Provide a concise way to create lists\n", 2009 | "* Faster because implemented in C\n", 2010 | "* Nested list comprehensions can be somewhat confusing\n" 2011 | ] 2012 | }, 2013 | { 2014 | "cell_type": "markdown", 2015 | "metadata": { 2016 | "slideshow": { 2017 | "slide_type": "slide" 2018 | } 2019 | }, 2020 | "source": [ 2021 | "## List Comprehensions" 2022 | ] 2023 | }, 2024 | { 2025 | "cell_type": "code", 2026 | "execution_count": 15, 2027 | "metadata": { 2028 | "scrolled": true, 2029 | "slideshow": { 2030 | "slide_type": "-" 2031 | } 2032 | }, 2033 | "outputs": [ 2034 | { 2035 | "name": "stdout", 2036 | "output_type": "stream", 2037 | "text": [ 2038 | "[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]\n", 2039 | "[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]\n" 2040 | ] 2041 | } 2042 | ], 2043 | "source": [ 2044 | "print([x**2 for x in range(1, 11)])\n", 2045 | "\n", 2046 | "ans = []\n", 2047 | "for x in range(1, 11):\n", 2048 | " ans.append(x**2)\n", 2049 | "print(ans)\n" 2050 | ] 2051 | }, 2052 | { 2053 | "cell_type": "code", 2054 | "execution_count": 16, 2055 | "metadata": { 2056 | "scrolled": true, 2057 | "slideshow": { 2058 | "slide_type": "fragment" 2059 | } 2060 | }, 2061 | "outputs": [ 2062 | { 2063 | "name": "stdout", 2064 | "output_type": "stream", 2065 | "text": [ 2066 | "[4, 16, 36, 64, 100]\n", 2067 | "['a1', 'a2', 'a3', 'b1', 'b2', 'b3', 'c1', 'c2', 'c3']\n" 2068 | ] 2069 | } 2070 | ], 2071 | "source": [ 2072 | "print([x**2 for x in range(1, 11) if x%2 == 0])\n", 2073 | "print([x + y for x in ['a', 'b', 'c'] for y in ['1','2', '3']])\n" 2074 | ] 2075 | }, 2076 | { 2077 | "cell_type": "markdown", 2078 | "metadata": { 2079 | "slideshow": { 2080 | "slide_type": "slide" 2081 | } 2082 | }, 2083 | "source": [ 2084 | "## Dictionary and Set Comprehensions" 2085 | ] 2086 | }, 2087 | { 2088 | "cell_type": "code", 2089 | "execution_count": 17, 2090 | "metadata": { 2091 | "scrolled": true, 2092 | "slideshow": { 2093 | "slide_type": "-" 2094 | } 
2095 | }, 2096 | "outputs": [ 2097 | { 2098 | "name": "stdout", 2099 | "output_type": "stream", 2100 | "text": [ 2101 | "{1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49, 8: 64, 9: 81, 10: 100}\n", 2102 | "{'a': 1, 'b': 2, 'c': 2}\n", 2103 | "{'o', 's', 't', 'm', 'n', 'g', 'e', 'a', 'r', 'i', 'd'}\n" 2104 | ] 2105 | } 2106 | ], 2107 | "source": [ 2108 | "print({x: x**2 for x in range(1, 11)})\n", 2109 | "print({x.lower(): y for x, y in [('A', 1), ('b', 2), ('C', 2)]})\n", 2110 | "\n", 2111 | "print({x.lower() for x in 'SomeRandomSTRING'})\n" 2112 | ] 2113 | }, 2114 | { 2115 | "cell_type": "markdown", 2116 | "metadata": { 2117 | "slideshow": { 2118 | "slide_type": "slide" 2119 | } 2120 | }, 2121 | "source": [ 2122 | ">## EXAMPLE PROJECT: Comparing Trump's and Biden's Inaugural Speeches\n", 2123 | ">\n", 2124 | ">We can use a list comprehension to rewrite some of the code we wrote above:\n", 2125 | ">\n", 2126 | ">```\n", 2127 | ">common_words = []\n", 2128 | ">for i in sorted_word_counts:\n", 2129 | "> if i[0] not in stop_words:\n", 2130 | "> if i[1] > 1:\n", 2131 | "> common_words.append(i)\n", 2132 | "> else:\n", 2133 | "> break\n", 2134 | ">```" 2135 | ] 2136 | }, 2137 | { 2138 | "cell_type": "code", 2139 | "execution_count": 8, 2140 | "metadata": { 2141 | "slideshow": { 2142 | "slide_type": "-" 2143 | } 2144 | }, 2145 | "outputs": [ 2146 | { 2147 | "data": { 2148 | "text/plain": [ 2149 | "[('we', 49),\n", 2150 | " ('our', 48),\n", 2151 | " ('will', 40),\n", 2152 | " ('america', 18),\n", 2153 | " ('you', 12),\n", 2154 | " ('all', 12),\n", 2155 | " ('american', 12),\n", 2156 | " ('their', 11),\n", 2157 | " ('your', 11),\n", 2158 | " ('people', 9)]" 2159 | ] 2160 | }, 2161 | "execution_count": 8, 2162 | "metadata": {}, 2163 | "output_type": "execute_result" 2164 | } 2165 | ], 2166 | "source": [ 2167 | "common_words = [i for i in sorted_word_counts if i[0] not in stop_words and i[1] > 1] \n", 2168 | "common_words[:10]" 2169 | ] 2170 | }, 2171 | { 2172 | "cell_type": 
"markdown", 2173 | "metadata": { 2174 | "slideshow": { 2175 | "slide_type": "notes" 2176 | } 2177 | }, 2178 | "source": [ 2179 | "In a list comprehension, we cannot use `break` to escape iterating until the end of `sorted_word_counts` but the performance of this syntax is still likely to be faster for most data." 2180 | ] 2181 | }, 2182 | { 2183 | "cell_type": "markdown", 2184 | "metadata": { 2185 | "slideshow": { 2186 | "slide_type": "slide" 2187 | } 2188 | }, 2189 | "source": [ 2190 | "# Functions\n", 2191 | "\n", 2192 | "![Functions](figs/functions.png \"Functions\") \n", 2193 | "\n", 2194 | "\n", 2195 | "* Built-in\n", 2196 | " * `len()`, `max()`, `range()`, `open()`, etc.\n", 2197 | "* User-defined\n", 2198 | " * By you, collaborators, or the open-source community" 2199 | ] 2200 | }, 2201 | { 2202 | "cell_type": "markdown", 2203 | "metadata": { 2204 | "slideshow": { 2205 | "slide_type": "slide" 2206 | } 2207 | }, 2208 | "source": [ 2209 | "## Defining and Calling Functions\n", 2210 | "\n", 2211 | "**Defining a function**\n", 2212 | "\n", 2213 | "```\n", 2214 | "def *function_name*(*list of parameters*):\n", 2215 | " *body of function*\n", 2216 | "```\n", 2217 | "\n", 2218 | "**Calling a function**\n", 2219 | "\n", 2220 | "```\n", 2221 | "*function_name*(*arguments*)\n", 2222 | "```\n" 2223 | ] 2224 | }, 2225 | { 2226 | "cell_type": "markdown", 2227 | "metadata": { 2228 | "slideshow": { 2229 | "slide_type": "slide" 2230 | } 2231 | }, 2232 | "source": [ 2233 | "## When the Function is Used, the Parameters are Bound to the Arguments\n", 2234 | "\n", 2235 | "```\n", 2236 | "def *function_name*(*list of parameters*):\n", 2237 | " *body of function*\n", 2238 | "\n", 2239 | "*function_name*(*arguments*)\n", 2240 | "```\n" 2241 | ] 2242 | }, 2243 | { 2244 | "cell_type": "code", 2245 | "execution_count": 1, 2246 | "metadata": { 2247 | "slideshow": { 2248 | "slide_type": "-" 2249 | } 2250 | }, 2251 | "outputs": [ 2252 | { 2253 | "name": "stdout", 2254 | "output_type": 
"stream", 2255 | "text": [ 2256 | "4\n" 2257 | ] 2258 | } 2259 | ], 2260 | "source": [ 2261 | "def get_larger(x, y):\n", 2262 | " \"\"\"Assumes x and y are of numeric type.\n", 2263 | " Returns the larger of x and y.\n", 2264 | " \"\"\"\n", 2265 | " if x > y:\n", 2266 | " # The execution of a `return` statement terminates the function call\n", 2267 | " return x\n", 2268 | " else:\n", 2269 | " return y\n", 2270 | " \n", 2271 | "m = get_larger(3, 4)\n", 2272 | "print(m)" 2273 | ] 2274 | }, 2275 | { 2276 | "cell_type": "markdown", 2277 | "metadata": { 2278 | "slideshow": { 2279 | "slide_type": "slide" 2280 | } 2281 | }, 2282 | "source": [ 2283 | "## A Function Call Always Returns a Value\n", 2284 | "\n", 2285 | "* The execution of a `return` statement terminates the function call\n", 2286 | "* The function call also terminates when there are no more statements to execute\n", 2287 | "* If no expression follows `return` or there is no `return` statement, the function returns `None` " 2288 | ] 2289 | }, 2290 | { 2291 | "cell_type": "code", 2292 | "execution_count": 11, 2293 | "metadata": { 2294 | "slideshow": { 2295 | "slide_type": "-" 2296 | } 2297 | }, 2298 | "outputs": [ 2299 | { 2300 | "name": "stdout", 2301 | "output_type": "stream", 2302 | "text": [ 2303 | "5 6 None\n" 2304 | ] 2305 | } 2306 | ], 2307 | "source": [ 2308 | "def get_larger(x, y):\n", 2309 | " if x > y:\n", 2310 | " return x\n", 2311 | " if y > x:\n", 2312 | " return y\n", 2313 | "\n", 2314 | "ex1 = get_larger(3, 5)\n", 2315 | "ex2 = get_larger(6, 4)\n", 2316 | "ex3 = get_larger(3, 3)\n", 2317 | "\n", 2318 | "print(ex1, ex2, ex3)" 2319 | ] 2320 | }, 2321 | { 2322 | "cell_type": "markdown", 2323 | "metadata": { 2324 | "slideshow": { 2325 | "slide_type": "slide" 2326 | } 2327 | }, 2328 | "source": [ 2329 | "## Functions Can Return Multiple Values" 2330 | ] 2331 | }, 2332 | { 2333 | "cell_type": "code", 2334 | "execution_count": 12, 2335 | "metadata": { 2336 | "slideshow": { 2337 | "slide_type": "-" 2338 
| } 2339 | }, 2340 | "outputs": [ 2341 | { 2342 | "name": "stdout", 2343 | "output_type": "stream", 2344 | "text": [ 2345 | "(10, 6)\n", 2346 | "10\n", 2347 | "6\n" 2348 | ] 2349 | } 2350 | ], 2351 | "source": [ 2352 | "def double_one(a):\n", 2353 | " return 2*a\n", 2354 | "\n", 2355 | "def double_two(a, b):\n", 2356 | " return 2*a, 2*b\n", 2357 | "\n", 2358 | "x = double_two(5, 3)\n", 2359 | "print(x)\n", 2360 | "\n", 2361 | "# You can unpack the tuple into two separate variables\n", 2362 | "x1, x2 = double_two(5, 3)\n", 2363 | "print(x1)\n", 2364 | "print(x2)" 2365 | ] 2366 | }, 2367 | { 2368 | "cell_type": "markdown", 2369 | "metadata": { 2370 | "slideshow": { 2371 | "slide_type": "slide" 2372 | } 2373 | }, 2374 | "source": [ 2375 | "## Positional vs. Keyword Arguments" 2376 | ] 2377 | }, 2378 | { 2379 | "cell_type": "code", 2380 | "execution_count": 13, 2381 | "metadata": { 2382 | "slideshow": { 2383 | "slide_type": "-" 2384 | } 2385 | }, 2386 | "outputs": [ 2387 | { 2388 | "name": "stdout", 2389 | "output_type": "stream", 2390 | "text": [ 2391 | "3 2 1\n", 2392 | "3 2 1\n", 2393 | "3 2 1\n" 2394 | ] 2395 | } 2396 | ], 2397 | "source": [ 2398 | "def print_reverse(first, second, third):\n", 2399 | " print(third, second, first)\n", 2400 | " \n", 2401 | "print_reverse(1, 2, 3)\n", 2402 | "print_reverse(third=3, second=2, first=1)\n", 2403 | "print_reverse(1, second=2, third=3)\n", 2404 | "\n", 2405 | "# Gives a syntax error because keyword arguments cannot come before positional arguments\n", 2406 | "# print_reverse(first=1, 2, 3) " 2407 | ] 2408 | }, 2409 | { 2410 | "cell_type": "markdown", 2411 | "metadata": { 2412 | "slideshow": { 2413 | "slide_type": "slide" 2414 | } 2415 | }, 2416 | "source": [ 2417 | "## Default Parameter Values\n", 2418 | "\n", 2419 | "* Default values allow you to call a function with fewer arguments than it has parameters\n", 2420 | "* Parameters with default values cannot come before parameters without default values" 2421 | ] 2422 | }, 2423 | { 2424 | "cell_type": "code", 
2425 | "execution_count": 2, 2426 | "metadata": { 2427 | "slideshow": { 2428 | "slide_type": "-" 2429 | } 2430 | }, 2431 | "outputs": [ 2432 | { 2433 | "name": "stdout", 2434 | "output_type": "stream", 2435 | "text": [ 2436 | "The quick brown fox jumps over the lazy dog.\n", 2437 | "The quick brown fox jumps over the lazy dog.\n", 2438 | "The quick brown fox jumps over the lazy dog\n" 2439 | ] 2440 | } 2441 | ], 2442 | "source": [ 2443 | "def pretty_print(lst, sep, fullstop=True, capitalize=True):\n", 2444 | " toprint = sep.join(lst)\n", 2445 | " if fullstop:\n", 2446 | " toprint += '.'\n", 2447 | " if capitalize:\n", 2448 | " toprint = toprint.capitalize()\n", 2449 | " print(toprint)\n", 2450 | "\n", 2451 | "wordlst = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog'] # an English pangram\n", 2452 | "\n", 2453 | "pretty_print(wordlst, ' ', True, True)\n", 2454 | "pretty_print(wordlst, ' ')\n", 2455 | "pretty_print(wordlst, ' ', False)\n" 2456 | ] 2457 | }, 2458 | { 2459 | "cell_type": "markdown", 2460 | "metadata": { 2461 | "slideshow": { 2462 | "slide_type": "slide" 2463 | } 2464 | }, 2465 | "source": [ 2466 | "## A Function Defines a New Scope\n", 2467 | "\n", 2468 | "* Scope = name space\n", 2469 | "* This means you can reuse your favorite variable names in different functions" 2470 | ] 2471 | }, 2472 | { 2473 | "cell_type": "code", 2474 | "execution_count": 16, 2475 | "metadata": { 2476 | "slideshow": { 2477 | "slide_type": "-" 2478 | } 2479 | }, 2480 | "outputs": [ 2481 | { 2482 | "name": "stdout", 2483 | "output_type": "stream", 2484 | "text": [ 2485 | "1\n" 2486 | ] 2487 | } 2488 | ], 2489 | "source": [ 2490 | "def func(x, y):\n", 2491 | " x += 1\n", 2492 | " # x is a parameter, z is a local variable\n", 2493 | " z = x + y # z, x, and y exist only in the scope of the definition of func\n", 2494 | " return z\n", 2495 | "\n", 2496 | "x = 1\n", 2497 | "res = func(x, 5)\n", 2498 | "\n", 2499 | "print(x) # x has not changed \n", 2500 | 
"#print(z) # Returns an error\n" 2501 | ] 2502 | }, 2503 | { 2504 | "cell_type": "markdown", 2505 | "metadata": { 2506 | "slideshow": { 2507 | "slide_type": "slide" 2508 | } 2509 | }, 2510 | "source": [ 2511 | "## The Global Scope" 2512 | ] 2513 | }, 2514 | { 2515 | "cell_type": "code", 2516 | "execution_count": 17, 2517 | "metadata": { 2518 | "slideshow": { 2519 | "slide_type": "-" 2520 | } 2521 | }, 2522 | "outputs": [ 2523 | { 2524 | "name": "stdout", 2525 | "output_type": "stream", 2526 | "text": [ 2527 | "3\n" 2528 | ] 2529 | } 2530 | ], 2531 | "source": [ 2532 | "GLOBVAR = 3 # It is conventional to use CAPITALS to name global variables\n", 2533 | "\n", 2534 | "def print_global():\n", 2535 | " # Since GLOBVAR is not defined in the function, it is treated as global\n", 2536 | " print(GLOBVAR) \n", 2537 | "\n", 2538 | "print_global()" 2539 | ] 2540 | }, 2541 | { 2542 | "cell_type": "markdown", 2543 | "metadata": { 2544 | "slideshow": { 2545 | "slide_type": "slide" 2546 | } 2547 | }, 2548 | "source": [ 2549 | ">## EXAMPLE PROJECT: Comparing Trump's and Biden's Inaugural Speeches\n", 2550 | ">\n", 2551 | ">With functions, we can make the code we have so far more modular so that you can easily apply to multiple data files. Below, we will:\n", 2552 | "1. Create a function to extract words from a text\n", 2553 | "2. Create another function to count words in a text\n", 2554 | "2. Apply the functions to each president's speech\n", 2555 | "3. 
Compare the length and repetitiveness of the speeches, the most common words, and the unique words" 2556 | ] 2557 | }, 2558 | { 2559 | "cell_type": "code", 2560 | "execution_count": 9, 2561 | "metadata": { 2562 | "slideshow": { 2563 | "slide_type": "-" 2564 | } 2565 | }, 2566 | "outputs": [], 2567 | "source": [ 2568 | "import string # See https://docs.python.org/3/library/string.html\n", 2569 | "\n", 2570 | "# This will now be a global variable so we will follow the convention and \n", 2571 | "# name it in all caps\n", 2572 | "STOP_WORDS = ['a', 'about', 'across', 'after', 'an', 'and', 'any', 'are', 'as', 'at', \n", 2573 | " 'be', 'because', 'but', 'by', 'did', 'do', 'does', 'for', 'from',\n", 2574 | " 'get', 'has', 'have', 'if', 'in', 'is', 'it', 'its',\n", 2575 | " 'many', 'more', 'much', 'no', 'not', 'of', 'on', 'or', 'out',\n", 2576 | " 'so', 'some', 'than', 'the', 'this', 'that', 'those', 'through', 'to',\n", 2577 | " 'very', 'what', 'where', 'whether', 'which', 'while', 'who', 'with']\n", 2578 | "\n", 2579 | "def get_tokens(fname):\n", 2580 | " \"\"\"Read given text file and return a list with all words in lowercase\n", 2581 | " in the order they appear in the text. 
Common contractions are expanded\n", 2582 | " and hyphenated words are combined into one word.\n", 2583 | " \"\"\"\n", 2584 | " with open(fname) as f:\n", 2585 | " txt = f.read()\n", 2586 | " \n", 2587 | " # Remove paragraph breaks and format apostrophes consistently\n", 2588 | " txt = txt.strip().replace('\\n', ' ').replace(\"’\", \"'\")\n", 2589 | " \n", 2590 | " # Get rid of possessives and expand contractions\n", 2591 | " txt = txt.replace(\"'s\", '').replace(\"'ve\", ' have').replace(\"'re\", ' are')\n", 2592 | " txt = txt.replace(\"can't\", 'can not').replace(\"n't\", ' not')\n", 2593 | "\n", 2594 | " # Remove punctuation and convert to lower-case\n", 2595 | " exclude = set(string.punctuation) | {\"”\", \"“\", \"…\", '–'}\n", 2596 | " txt = ''.join(ch.lower() for ch in txt if ch not in exclude)\n", 2597 | "\n", 2598 | " # Break into words\n", 2599 | " wrds = txt.split()\n", 2600 | " \n", 2601 | " return wrds\n", 2602 | "\n", 2603 | "\n", 2604 | "def get_word_counts(tokens):\n", 2605 | " \"\"\"Take tokens and return a list of (word, count) tuples, excluding stop words,\n", 2606 | " sorted in decreasing order of the number of times each word is repeated.\n", 2607 | " \"\"\"\n", 2608 | " # Create dictionary with word:count\n", 2609 | " word_counts = {}\n", 2610 | "\n", 2611 | " for i in tokens:\n", 2612 | " if i not in STOP_WORDS:\n", 2613 | " if i not in word_counts:\n", 2614 | " word_counts[i] = 1\n", 2615 | " else:\n", 2616 | " word_counts[i] += 1\n", 2617 | "\n", 2618 | " # Get the words with counts in decreasing order of popularity\n", 2619 | " # Note this produces a list of tuples\n", 2620 | " sorted_word_counts = sorted(word_counts.items(), key=lambda i: i[1], reverse=True)\n", 2621 | " \n", 2622 | " return sorted_word_counts\n", 2623 | "\n", 2624 | "\n", 2625 | "trump_tokens = get_tokens('data/trump_inauguration_millercenter.txt')\n", 2626 | "biden_tokens = get_tokens('data/biden_inauguration_millercenter.txt')" 2627 | ] 2628 | }, 2629 | { 2630 | "cell_type": "code", 2631 | "execution_count": 10, 
2632 | "metadata": { 2633 | "slideshow": { 2634 | "slide_type": "-" 2635 | } 2636 | }, 2637 | "outputs": [ 2638 | { 2639 | "name": "stdout", 2640 | "output_type": "stream", 2641 | "text": [ 2642 | "1436 2382\n", 2643 | "536 721\n", 2644 | "2.6791044776119404 3.30374479889043\n", 2645 | "\n", 2646 | "[('we', 49), ('our', 48), ('will', 40), ('america', 18), ('you', 12), ('all', 12), ('american', 12), ('their', 11), ('your', 11), ('people', 9), ('country', 9), ('nation', 9), ('again', 9), ('one', 8), ('every', 7), ('world', 6), ('now', 6), ('great', 6), ('back', 6), ('never', 6)]\n", 2647 | "\n", 2648 | "[('we', 91), ('our', 43), ('will', 33), ('i', 33), ('us', 27), ('my', 20), ('america', 20), ('can', 18), ('you', 17), ('all', 17), ('one', 15), ('nation', 14), ('democracy', 11), ('me', 11), ('must', 10), ('americans', 9), ('today', 9), ('people', 9), ('american', 9), ('story', 9)]\n" 2649 | ] 2650 | } 2651 | ], 2652 | "source": [ 2653 | "# Biden's speech is longer\n", 2654 | "print(len(trump_tokens), len(biden_tokens))\n", 2655 | "print(len(set(trump_tokens)), len(set(biden_tokens)))\n", 2656 | "# Biden's speech is also more repetitive\n", 2657 | "print(len(trump_tokens)/len(set(trump_tokens)), len(biden_tokens)/len(set(biden_tokens)))\n", 2658 | "\n", 2659 | "print() # Add an empty line to separate results\n", 2660 | "\n", 2661 | "# The 20 most common words for Trump and Biden\n", 2662 | "trump_wcounts = get_word_counts(trump_tokens)\n", 2663 | "biden_wcounts = get_word_counts(biden_tokens)\n", 2664 | "\n", 2665 | "# Biden's speech uses more first-person singular words ('i', 'my', 'me')\n", 2666 | "print(trump_wcounts[:20])\n", 2667 | "\n", 2668 | "print() # Add an empty line to separate results\n", 2669 | "\n", 2670 | "print(biden_wcounts[:20])" 2671 | ] 2672 | }, 2673 | { 2674 | "cell_type": "markdown", 2675 | "metadata": { 2676 | "slideshow": { 2677 | "slide_type": "slide" 2678 | } 2679 | }, 2680 | "source": [ 2681 | "## Modules\n", 2682 | "\n", 2683 | "* For large programs, store different 
parts in `.py` files\n", 2684 | "* Get access using `import` statements" 2685 | ] 2686 | }, 2687 | { 2688 | "cell_type": "code", 2689 | "execution_count": 1, 2690 | "metadata": { 2691 | "slideshow": { 2692 | "slide_type": "fragment" 2693 | } 2694 | }, 2695 | "outputs": [ 2696 | { 2697 | "name": "stdout", 2698 | "output_type": "stream", 2699 | "text": [ 2700 | "['chief', 'justice', 'roberts', 'president', 'carter', 'president', 'clinton', 'president', 'bush', 'president', 'obama', 'fellow', 'americans', 'and', 'people', 'of', 'the', 'world', 'thank', 'you']\n" 2701 | ] 2702 | } 2703 | ], 2704 | "source": [ 2705 | "import speech_analysis\n", 2706 | "\n", 2707 | "trump_tokens = speech_analysis.get_tokens('data/trump_inauguration_millercenter.txt')\n", 2708 | "print(trump_tokens[:20])\n" 2709 | ] 2710 | }, 2711 | { 2712 | "cell_type": "code", 2713 | "execution_count": 2, 2714 | "metadata": { 2715 | "slideshow": { 2716 | "slide_type": "fragment" 2717 | } 2718 | }, 2719 | "outputs": [], 2720 | "source": [ 2721 | "import speech_analysis as sa\n", 2722 | "\n", 2723 | "trump_tokens = sa.get_tokens('data/trump_inauguration_millercenter.txt')\n" 2724 | ] 2725 | }, 2726 | { 2727 | "cell_type": "code", 2728 | "execution_count": 3, 2729 | "metadata": { 2730 | "slideshow": { 2731 | "slide_type": "fragment" 2732 | } 2733 | }, 2734 | "outputs": [], 2735 | "source": [ 2736 | "# You should be careful with this one: there will be a conflict if you\n", 2737 | "# import a different module that also has a function called get_tokens()\n", 2738 | "from speech_analysis import *\n", 2739 | "\n", 2740 | "trump_tokens = get_tokens('data/trump_inauguration_millercenter.txt')\n" 2741 | ] 2742 | }, 2743 | { 2744 | "cell_type": "markdown", 2745 | "metadata": { 2746 | "slideshow": { 2747 | "slide_type": "slide" 2748 | } 2749 | }, 2750 | "source": [ 2751 | "## Useful Python Modules\n", 2752 | "\n", 2753 | "https://docs.python.org/3/library/\n", 2754 | "\n", 2755 | "* `re` – Regular expression 
operations\n", 2756 | "* `datetime` – Basic date and time types\n", 2757 | "* `math` – Mathematical functions\n", 2758 | "* `random` – Generate pseudo-random numbers\n", 2759 | "* `os.path` – Common pathname manipulations\n", 2760 | "* `pickle` – Python object serialization\n", 2761 | "* `csv` – CSV file reading and writing\n", 2762 | "* `json` – JSON encoder and decoder\n", 2763 | "* ..." 2764 | ] 2765 | }, 2766 | { 2767 | "cell_type": "markdown", 2768 | "metadata": { 2769 | "slideshow": { 2770 | "slide_type": "slide" 2771 | } 2772 | }, 2773 | "source": [ 2774 | "## Useful Python Packages\n", 2775 | "\n", 2776 | "* `numpy` – Scientific computing with multi-dimensional arrays\n", 2777 | "* `pandas` – Data analysis with table-like structures (much like R data frames)\n", 2778 | "* `statsmodels` – Statistical data analysis with linear models\n", 2779 | "* `scikit-learn` – Data mining and machine learning\n", 2780 | "* `networkx` – Network analysis\n", 2781 | "* `matplotlib` – Plotting\n", 2782 | "* ..." 
2783 | ] 2784 | }, 2785 | { 2786 | "cell_type": "markdown", 2787 | "metadata": { 2788 | "slideshow": { 2789 | "slide_type": "slide" 2790 | } 2791 | }, 2792 | "source": [ 2793 | "## Decomposition and Abstraction\n", 2794 | "\n", 2795 | "![Decomposition and abstraction](figs/decomposition_abstraction.png \"Decomposition and abstraction\")" 2796 | ] 2797 | }, 2798 | { 2799 | "cell_type": "markdown", 2800 | "metadata": { 2801 | "slideshow": { 2802 | "slide_type": "notes" 2803 | } 2804 | }, 2805 | "source": [ 2806 | "* **Decomposition creates structure** – it lets us break the program into self-contained parts\n", 2807 | "* **Abstraction hides detail** – it lets us use code as if it were a black box" 2808 | ] 2809 | }, 2810 | { 2811 | "cell_type": "markdown", 2812 | "metadata": { 2813 | "slideshow": { 2814 | "slide_type": "fragment" 2815 | } 2816 | }, 2817 | "source": [ 2818 | "We can achieve decomposition and abstraction with:\n", 2819 | "\n", 2820 | "* Functions\n", 2821 | "* Classes" 2822 | ] 2823 | }, 2824 | { 2825 | "cell_type": "markdown", 2826 | "metadata": { 2827 | "slideshow": { 2828 | "slide_type": "slide" 2829 | } 2830 | }, 2831 | "source": [ 2832 | "# Object-Oriented Programming\n", 2833 | "\n", 2834 | "A programming paradigm based on the concept of \"objects\"\n", 2835 | "\n", 2836 | "An object is a **data abstraction** that captures:\n", 2837 | "\n", 2838 | "* **Internal representation** (data attributes)\n", 2839 | "* **Interface** for interacting with object (methods)\n" 2840 | ] 2841 | }, 2842 | { 2843 | "cell_type": "markdown", 2844 | "metadata": { 2845 | "slideshow": { 2846 | "slide_type": "slide" 2847 | } 2848 | }, 2849 | "source": [ 2850 | "## Procedural vs. Object-Oriented Programming\n", 2851 | "\n", 2852 | "![Procedural vs. object-oriented programming](figs/procedural_object-oriented.png \"Procedural vs. 
object-oriented programming\")" 2853 | ] 2854 | }, 2855 | { 2856 | "cell_type": "markdown", 2857 | "metadata": { 2858 | "slideshow": { 2859 | "slide_type": "slide" 2860 | } 2861 | }, 2862 | "source": [ 2863 | "## Everything in Python Is an Object!\n", 2864 | "\n", 2865 | "* Objects have types (belong to classes)\n", 2866 | "* Objects also have a set of procedures for interacting with them (methods)" 2867 | ] 2868 | }, 2869 | { 2870 | "cell_type": "code", 2871 | "execution_count": 8, 2872 | "metadata": { 2873 | "slideshow": { 2874 | "slide_type": "-" 2875 | } 2876 | }, 2877 | "outputs": [ 2878 | { 2879 | "name": "stdout", 2880 | "output_type": "stream", 2881 | "text": [ 2882 | "\n", 2883 | "SOME STRING\n" 2884 | ] 2885 | } 2886 | ], 2887 | "source": [ 2888 | "s = 'some string'\n", 2889 | "print(type(s))\n", 2890 | "print(s.upper())" 2891 | ] 2892 | }, 2893 | { 2894 | "cell_type": "markdown", 2895 | "metadata": { 2896 | "slideshow": { 2897 | "slide_type": "slide" 2898 | } 2899 | }, 2900 | "source": [ 2901 | "## Defining Classes in Python\n" 2902 | ] 2903 | }, 2904 | { 2905 | "cell_type": "code", 2906 | "execution_count": 1, 2907 | "metadata": { 2908 | "slideshow": { 2909 | "slide_type": "-" 2910 | } 2911 | }, 2912 | "outputs": [ 2913 | { 2914 | "name": "stdout", 2915 | "output_type": "stream", 2916 | "text": [ 2917 | "Greta Thunberg 16\n" 2918 | ] 2919 | } 2920 | ], 2921 | "source": [ 2922 | "from datetime import date\n", 2923 | "\n", 2924 | "class Person(object):\n", 2925 | " \n", 2926 | " def __init__(self, f_name, l_name):\n", 2927 | " \"\"\"Creates a person using first and last names.\"\"\"\n", 2928 | " self.first_name = f_name\n", 2929 | " self.last_name = l_name\n", 2930 | " self.birthdate = None\n", 2931 | " \n", 2932 | " def get_name(self):\n", 2933 | " \"\"\"Gets self's full name.\"\"\"\n", 2934 | " return self.first_name + ' ' + self.last_name\n", 2935 | " \n", 2936 | " def get_age(self):\n", 2937 | " \"\"\"Gets self's age in years.\"\"\"\n", 2938 | " 
return date.today().year - self.birthdate.year\n", 2939 | " \n", 2940 | " def set_birthdate(self, dob):\n", 2941 | " \"\"\"Assumes dob is of type date.\n", 2942 | " Sets self's birthdate to dob.\n", 2943 | " \"\"\"\n", 2944 | " self.birthdate = dob\n", 2945 | " \n", 2946 | " def __str__(self):\n", 2947 | " \"\"\"Returns self's full name.\"\"\"\n", 2948 | " return self.first_name + ' ' + self.last_name\n", 2949 | " \n", 2950 | "p1 = Person('Greta', 'Thunberg')\n", 2951 | "p1.set_birthdate(date(2003, 1, 3))\n", 2952 | "print(p1, p1.get_age())" 2953 | ] 2954 | }, 2955 | { 2956 | "cell_type": "markdown", 2957 | "metadata": { 2958 | "slideshow": { 2959 | "slide_type": "slide" 2960 | } 2961 | }, 2962 | "source": [ 2963 | "## Defining Classes in Python\n", 2964 | "\n", 2965 | "* Data attributes — `first_name`, `last_name`, `birthdate`\n", 2966 | "* Methods\n", 2967 | " * `get_name()`, `get_age()`, `set_birthdate()`\n", 2968 | " * `__init__()` — called when a class is instantiated\n", 2969 | " * `__str__()` — called by `print()` and `str()`\n", 2970 | " \n", 2971 | "---\n", 2972 | "\n", 2973 | "* Operations\n", 2974 | " * Instantiation: `p1 = Person('Greta', 'Thunberg')` calls method `__init__()`\n", 2975 | " * Attribute/method reference: `p1.get_age()`" 2976 | ] 2977 | }, 2978 | { 2979 | "cell_type": "markdown", 2980 | "metadata": { 2981 | "slideshow": { 2982 | "slide_type": "slide" 2983 | } 2984 | }, 2985 | "source": [ 2986 | "## Classes vs. Objects\n", 2987 | "\n", 2988 | "* `Person` is a class\n", 2989 | "* `p1` is an instance of the class `Person`; it is an object of type `Person`\n", 2990 | "* Similarly, `str` is a class and `'Greta Thunberg'` is an object of type `str`\n", 2991 | "\n", 2992 | "![Class vs. object](figs/person_greta.png \"Class vs. 
object\")\n", 2993 | "\n", 2994 | "By Anders Hellberg - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=77270098\n", 2995 | "\n" 2996 | ] 2997 | }, 2998 | { 2999 | "cell_type": "markdown", 3000 | "metadata": { 3001 | "slideshow": { 3002 | "slide_type": "slide" 3003 | } 3004 | }, 3005 | "source": [ 3006 | ">## EXAMPLE PROJECT: Comparing Trump's and Biden's Inaugural Speeches\n", 3007 | ">\n", 3008 | ">We can rebundle the code we have written so far into a class, following the object-oriented programming paradigm. The data and the functions that operate on it are then encapsulated together. The functions become methods that belong to this particular data type: they can be called only on objects of the class, not independently or on other data types.\n" 3009 | ] 3010 | }, 3011 | { 3012 | "cell_type": "code", 3013 | "execution_count": 3, 3014 | "metadata": { 3015 | "slideshow": { 3016 | "slide_type": "-" 3017 | } 3018 | }, 3019 | "outputs": [ 3020 | { 3021 | "name": "stdout", 3022 | "output_type": "stream", 3023 | "text": [ 3024 | "Chief Justice Roberts, President Carter, President Clinton, President Bush, President Obama, fellow Americans, and people of the world: thank you.\n", 3025 | "\n", 3026 | "We, the citizens of America, are now joined in a gre...\n", 3027 | "1436\n", 3028 | "\n", 3029 | "Chief Justice Roberts, Vice President Harris, Speaker Pelosi, Leader Schumer, Leader McConnell, Vice President Pence, distinguished guests, and my fellow Americans.\n", 3030 | "\n", 3031 | "This is America’s day.\n", 3032 | "\n", 3033 | "This is de...\n", 3034 | "2382\n" 3035 | ] 3036 | } 3037 | ], 3038 | "source": [ 3039 | "import string\n", 3040 | "\n", 3041 | "STOP_WORDS = ['a', 'about', 'across', 'after', 'an', 'and', 'any', 'are', 'as', 'at', \n", 3042 | " 'be', 'because', 'but', 'by', 'did', 'do', 'does', 'for', 'from',\n", 3043 | " 'get', 'has', 'have', 'if', 'in', 'is', 'it', 'its',\n", 3044 | " 'many', 'more', 'much', 'no', 'not', 'of', 'on', 'or', 'out',\n", 3045 | " 
'so', 'some', 'than', 'the', 'this', 'that', 'those', 'through', 'to',\n", 3046 | " 'very', 'what', 'where', 'whether', 'which', 'while', 'who', 'with']\n", 3047 | "\n", 3048 | "class Speech(object):\n", 3049 | " \n", 3050 | " def __init__(self, fname):\n", 3051 | " \"\"\"Creates a speech using the text in file fname.\"\"\"\n", 3052 | " \n", 3053 | " with open(fname) as f:\n", 3054 | " self.txt = f.read()\n", 3055 | " self.tokens = None\n", 3056 | " self.word_counts = None\n", 3057 | " \n", 3058 | " # Populate the empty attributes above by processing the text\n", 3059 | " self.process_tokens() \n", 3060 | " self.process_word_counts()\n", 3061 | " \n", 3062 | " \n", 3063 | " # The following two methods are called when you initialize a new object\n", 3064 | " \n", 3065 | " def process_tokens(self):\n", 3066 | " \"\"\"Extracts the tokens in the text and assigns them to \n", 3067 | " the attribute 'tokens'. 'tokens' is a list of strings.\n", 3068 | " \"\"\"\n", 3069 | " \n", 3070 | " # Remove paragraphs and format consistently\n", 3071 | " txt = self.txt.strip().replace('\\n', ' ').replace(\"’\", \"'\")\n", 3072 | " \n", 3073 | " # Get rid of possessives and expand contractions\n", 3074 | " txt = txt.replace(\"'s\", '').replace(\"'ve\", ' have').replace(\"'re\", ' are')\n", 3075 | " txt = txt.replace(\"can't\", 'can not').replace(\"n't\", ' not')\n", 3076 | " \n", 3077 | " # Remove punctuation and convert to lower-case\n", 3078 | " exclude = set(string.punctuation) | {\"”\", \"“\", \"…\", '–'}\n", 3079 | " txt = ''.join(ch.lower() for ch in txt if ch not in exclude)\n", 3080 | "\n", 3081 | " # Break into words\n", 3082 | " wrds = txt.split()\n", 3083 | "\n", 3084 | " self.tokens = wrds\n", 3085 | " \n", 3086 | " \n", 3087 | " def process_word_counts(self):\n", 3088 | " \"\"\"Counts the number of times each word, excluding stop words,\n", 3089 | " appears in the speech and assigns the counts to the attribute 'word_counts'.\n", 3090 | " 'word_counts' is a list of tuples 
in the form (token, count).\n", 3091 | " \"\"\"\n", 3092 | " # Create dictionary with word:count\n", 3093 | " word_counts = {}\n", 3094 | "\n", 3095 | " for i in self.tokens:\n", 3096 | " if i not in STOP_WORDS:\n", 3097 | " if i not in word_counts:\n", 3098 | " word_counts[i] = 1\n", 3099 | " else:\n", 3100 | " word_counts[i] += 1\n", 3101 | "\n", 3102 | " # Get the words with counts in decreasing order of popularity\n", 3103 | " # Note this produces a list of tuples\n", 3104 | " sorted_word_counts = sorted(word_counts.items(), key=lambda i: i[1], reverse=True)\n", 3105 | " self.word_counts = sorted_word_counts\n", 3106 | " \n", 3107 | " \n", 3108 | " # Use get and set methods to provide an interface for interacting with the objects\n", 3109 | " \n", 3110 | " def get_text(self):\n", 3111 | " return self.txt\n", 3112 | " \n", 3113 | " def get_tokens(self):\n", 3114 | " \"\"\"Get the tokens in the speech as a list of strings.\"\"\"\n", 3115 | " # Return a copy so that callers cannot modify the internal list\n", 3116 | " return self.tokens[:]\n", 3117 | " \n", 3118 | " def get_word_counts(self):\n", 3119 | " \"\"\"Get each unique word in the speech and the number of times it appears in the speech.\n", 3120 | " Return a list of tuples in the form (token, count).\n", 3121 | " \"\"\"\n", 3122 | " # Return a copy so that callers cannot modify the internal list\n", 3123 | " return self.word_counts[:]\n", 3124 | " \n", 3125 | " # You can make your code even more interactive by providing extra methods for\n", 3126 | " # common and useful operations\n", 3127 | " \n", 3128 | " def get_speech_length(self):\n", 3129 | " \"\"\"Get the number of tokens in the speech.\"\"\"\n", 3130 | " return len(self.tokens)\n", 3131 | " \n", 3132 | " def get_number_unique_tokens(self):\n", 3133 | " \"\"\"Gets the number of unique words used in the speech,\n", 3134 | " including stop words.\n", 3135 | " \"\"\"\n", 3136 | " return len(set(self.tokens))\n", 3137 | " \n", 
3138 | " def __str__(self):\n", 3139 | " \"\"\"Returns the first 200 characters of the speech.\"\"\"\n", 3140 | " return self.txt[:200] + '...'\n", 3141 | "\n", 3142 | " \n", 3143 | "# Create an object of class Speech for Trump's inaugural speech\n", 3144 | "trump = Speech('data/trump_inauguration_millercenter.txt')\n", 3145 | "print(trump)\n", 3146 | "# Process the speech text and get the length of the speech\n", 3147 | "print(trump.get_speech_length())\n", 3148 | "\n", 3149 | "print()\n", 3150 | "\n", 3151 | "# Create another Speech object for Biden's inaugural speech\n", 3152 | "biden = Speech('data/biden_inauguration_millercenter.txt')\n", 3153 | "print(biden)\n", 3154 | "print(biden.get_speech_length())" 3155 | ] 3156 | }, 3157 | { 3158 | "cell_type": "markdown", 3159 | "metadata": { 3160 | "slideshow": { 3161 | "slide_type": "slide" 3162 | } 3163 | }, 3164 | "source": [ 3165 | "# Next Steps\n", 3166 | "\n", 3167 | "* Make use of other resources online\n", 3168 | " * [Coursera](https://www.coursera.org/)\n", 3169 | " * [MIT OpenCourseWare](https://ocw.mit.edu/index.htm)\n", 3170 | " * [Code School](http://tryr.codeschool.com/)\n", 3171 | " * ... 
and [many others](https://github.com/social-research/python-workshop/blob/main/RESOURCES.md)\n", 3172 | "* Write code at any opportunity\n", 3173 | "* Practice, practice, practice" 3174 | ] 3175 | }, 3176 | { 3177 | "cell_type": "markdown", 3178 | "metadata": { 3179 | "slideshow": { 3180 | "slide_type": "slide" 3181 | } 3182 | }, 3183 | "source": [ 3184 | "## Learn from Other Programmers\n", 3185 | "\n", 3186 | "![Not sure if I am a good programmer or just good at googling](figs/good_programmer.jpg \"Not sure if I am a good programmer or just good at googling\") " 3187 | ] 3188 | } 3189 | ], 3190 | "metadata": { 3191 | "celltoolbar": "Slideshow", 3192 | "kernelspec": { 3193 | "display_name": "Python 3", 3194 | "language": "python", 3195 | "name": "python3" 3196 | }, 3197 | "language_info": { 3198 | "codemirror_mode": { 3199 | "name": "ipython", 3200 | "version": 3 3201 | }, 3202 | "file_extension": ".py", 3203 | "mimetype": "text/x-python", 3204 | "name": "python", 3205 | "nbconvert_exporter": "python", 3206 | "pygments_lexer": "ipython3", 3207 | "version": "3.7.3" 3208 | } 3209 | }, 3210 | "nbformat": 4, 3211 | "nbformat_minor": 2 3212 | } 3213 | -------------------------------------------------------------------------------- /speech_analysis.py: -------------------------------------------------------------------------------- 1 | import string 2 | import urllib.request 3 | 4 | STOP_WORDS = ['a', 'about', 'across', 'after', 'an', 'and', 'any', 'are', 'as', 'at', 5 | 'be', 'because', 'but', 'by', 'did', 'do', 'does', 'for', 'from', 6 | 'get', 'has', 'have', 'if', 'in', 'is', 'it', 'its', 7 | 'many', 'more', 'much', 'no', 'not', 'of', 'on', 'or', 'out', 8 | 'so', 'some', 'than', 'the', 'this', 'that', 'those', 'through', 'to', 9 | 'very', 'what', 'where', 'whether', 'which', 'while', 'who', 'with'] 10 | 11 | 12 | def get_text(fname): 13 | """Open file given as either directory location or url by fname 14 | and return text as a string. 
15 | """ 16 | if fname.startswith('http://') or fname.startswith('https://'): 17 | txt = '' 18 | for line in urllib.request.urlopen(fname): 19 | txt += line.decode("utf-8") 20 | else: 21 | with open(fname) as f: 22 | txt = f.read() 23 | return txt 24 | 25 | 26 | def get_tokens(fname): 27 | """Read given text file and return a list with all words in lowercase 28 | in the order they appear in the text. Common contractions are expanded 29 | and hyphenated words are combined in one word. 30 | """ 31 | # Open file 32 | txt = get_text(fname) 33 | 34 | # Remove paragraphs and format consistently 35 | txt = txt.strip().replace('\n', ' ').replace("’", "'") 36 | 37 | # Get rid of possessives and expand contractions 38 | txt = txt.replace("'s", '').replace("'ve", ' have').replace("'re", ' are') 39 | txt = txt.replace("can't", 'can not').replace("n't", ' not') 40 | 41 | # Remove punctuation and convert to lower-case 42 | exclude = set(string.punctuation) | {"”", "“", "…", '–'} 43 | txt = ''.join(ch.lower() for ch in txt if ch not in exclude) 44 | 45 | # Break into words 46 | wrds = txt.split() 47 | 48 | return wrds 49 | 50 | 51 | def get_word_counts(tokens): 52 | """Take tokens and return a list of (word, count) tuples, excluding 53 | stop words, sorted by decreasing count. 54 | """ 55 | # Create dictionary with word:count 56 | word_counts = {} 57 | 58 | for i in tokens: 59 | if i not in STOP_WORDS: 60 | if i not in word_counts: 61 | word_counts[i] = 1 62 | else: 63 | word_counts[i] += 1 64 | 65 | # Get the words with counts in decreasing order of popularity 66 | # Note this produces a list of tuples 67 | sorted_word_counts = sorted(word_counts.items(), key=lambda i: i[1], reverse=True) 68 | 69 | return sorted_word_counts 70 | --------------------------------------------------------------------------------