├── .gitignore
├── .gitmodules
├── InstallationGuide.md
├── InstallationGuide.pdf
├── README.md
├── Week1
│   ├── examples_instructor.ipynb
│   ├── examples_student.ipynb
│   ├── exercises.ipynb
│   ├── lecture.html
│   ├── lecture.md
│   ├── lecture.pdf
│   ├── planning.md
│   └── solutions.ipynb
├── Week2
│   ├── data
│   │   ├── BES-2017-F2F-codebook.pdf
│   │   ├── bes_data.csv
│   │   ├── bes_data.feather
│   │   ├── bes_data.pickle
│   │   ├── bes_data_full_week2.csv
│   │   ├── bes_data_full_week2.feather
│   │   ├── bes_data_full_week2.json
│   │   ├── bes_data_subset_week2.csv
│   │   ├── bes_data_subset_week2.feather
│   │   ├── bes_data_subset_week2.json
│   │   ├── bes_f2f_2017_v1.3.dta
│   │   ├── bes_relabelling.R
│   │   ├── data_prep.py
│   │   └── data_week2.zip
│   ├── examples_instructor.ipynb
│   ├── examples_student.ipynb
│   ├── exercises.ipynb
│   ├── lecture.html
│   ├── lecture.md
│   ├── lecture.pdf
│   ├── planning.md
│   └── solutions.ipynb
├── Week3
│   ├── examples.ipynb
│   ├── examples_instructor.ipynb
│   ├── examples_student.ipynb
│   ├── exercises.ipynb
│   ├── groupby_example.png
│   ├── lecture.html
│   ├── lecture.md
│   ├── lecture.pdf
│   ├── planning.md
│   ├── solutions.ipynb
│   └── test.ipynb
├── Week4
│   ├── crosstab_heatmap.py
│   ├── examples_instructor.ipynb
│   ├── examples_student.ipynb
│   ├── exercises.ipynb
│   ├── extra_challenge.png
│   ├── figures.py
│   ├── figures
│   │   ├── lecture_box1.png
│   │   ├── lecture_emptysubplot1.png
│   │   ├── lecture_emptysubplot2.png
│   │   ├── lecture_emptysubplot3.png
│   │   ├── lecture_fig1.png
│   │   ├── lecture_heatmap1.png
│   │   ├── lecture_hist1.png
│   │   ├── lecture_hist2.png
│   │   ├── lecture_line1.png
│   │   ├── lecture_linescatter1.png
│   │   ├── lecture_linescatter2.png
│   │   ├── lecture_linescatter3.png
│   │   ├── lecture_linescatter4.png
│   │   ├── lecture_rotated_labels1.png
│   │   ├── lecture_scatter1.png
│   │   ├── lecture_swarm1.png
│   │   ├── lecture_swarm2.png
│   │   └── lecture_violin1.png
│   ├── latex_table.tex
│   ├── lecture.html
│   ├── lecture.md
│   ├── lecture.pdf
│   ├── planning.md
│   ├── solutions.ipynb
│   └── test.ipynb
├── Week5
│   ├── examples_instructor.ipynb
│   ├── examples_student.ipynb
│   ├── exercises.ipynb
│   ├── lecture.html
│   ├── lecture.md
│   ├── lecture.pdf
│   ├── local_plot_utils.py
│   ├── planning.md
│   └── solutions.ipynb
├── Week6
│   ├── examples.ipynb
│   ├── examples_instructor.ipynb
│   ├── examples_student.ipynb
│   ├── exercises.ipynb
│   ├── iris-TD-2.svg
│   ├── lecture.html
│   ├── lecture.md
│   └── lecture.pdf
├── Week7
│   ├── examples_instructor.ipynb
│   ├── examples_student.ipynb
│   ├── exercises.ipynb
│   ├── lecture.html
│   ├── lecture.md
│   ├── lecture.pdf
│   └── planning.md
├── Week8
│   ├── examples_selenium.ipynb
│   ├── examples_twitter.ipynb
│   ├── lecture.html
│   ├── lecture.md
│   ├── lecture.pdf
│   └── lecture_planning.md
├── Week8Old
│   ├── examples.ipynb
│   ├── figure1.dot
│   ├── figure1.png
│   ├── figure1.svg
│   ├── lecture.html
│   ├── lecture.md
│   ├── lecture.pdf
│   ├── planning.md
│   └── wmd_fig1.png
├── _config.yml
├── dpir-intro-theme.css
├── images
│   ├── anaconda_navigator_environments.png
│   ├── anaconda_navigator_screenshot.png
│   ├── atom_editor.png
│   ├── jupyter_lab_editor.png
│   └── jupyter_lab_launcher.png
├── index.md
├── ipynb_slideify.py
├── minimal-theme.css
├── misc_presentations
│   ├── ConfoundedMeasurement.png
│   ├── Measurement Bias-Confounding.png
│   ├── Measurement Bias-Relationship as Indicator.png
│   ├── Measurement Bias-Simple Diagram.png
│   ├── Measurement Bias.drawio
│   ├── Measurement Bias.xml
│   ├── MeasurementFig1.png
│   ├── a7.png
│   ├── cess-mt21-pres.html
│   ├── cess-mt21-pres.md
│   ├── compgov_revision.html
│   ├── comptext-pres.html
│   ├── draft4.pdf
│   ├── epsa.html
│   ├── fig_bernoulli.png
│   ├── figures.py
│   ├── figures
│   │   ├── allocations.pdf
│   │   ├── effect_of_targeting_presentation.png
│   │   ├── effect_of_targeting_total.pdf
│   │   ├── feature_importance.pdf
│   │   ├── heterogeneity_presentation.png
│   │   └── predicted_favorability_f_pid.pdf
│   ├── knn.png
│   ├── minimal-theme.css
│   ├── pip-colloquium.html
│   ├── planning.md
│   ├── presentation.html
│   ├── presentation.md
│   └── presentation_updated.md
├── syllabus.md
├── syllabus.pdf
└── teaching.yaml
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | pip-wheel-metadata/
24 | share/python-wheels/
25 | *.egg-info/
26 | .installed.cfg
27 | *.egg
28 | MANIFEST
29 |
30 | # PyInstaller
31 | # Usually these files are written by a python script from a template
32 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
33 | *.manifest
34 | *.spec
35 |
36 | # Installer logs
37 | pip-log.txt
38 | pip-delete-this-directory.txt
39 |
40 | # Unit test / coverage reports
41 | htmlcov/
42 | .tox/
43 | .nox/
44 | .coverage
45 | .coverage.*
46 | .cache
47 | nosetests.xml
48 | coverage.xml
49 | *.cover
50 | *.py,cover
51 | .hypothesis/
52 | .pytest_cache/
53 |
54 | # Translations
55 | *.mo
56 | *.pot
57 |
58 | # Django stuff:
59 | *.log
60 | local_settings.py
61 | db.sqlite3
62 | db.sqlite3-journal
63 |
64 | # Flask stuff:
65 | instance/
66 | .webassets-cache
67 |
68 | # Scrapy stuff:
69 | .scrapy
70 |
71 | # Sphinx documentation
72 | docs/_build/
73 |
74 | # PyBuilder
75 | target/
76 |
77 | # Jupyter Notebook
78 | .ipynb_checkpoints
79 |
80 | # IPython
81 | profile_default/
82 | ipython_config.py
83 |
84 | # pyenv
85 | # For a library or package, you might want to ignore these files since the code is
86 | # intended to run in multiple environments; otherwise, check them in:
87 | # .python-version
88 |
89 | # pipenv
90 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
91 | # However, in case of collaboration, if having platform-specific dependencies or dependencies
92 | # having no cross-platform support, pipenv may install dependencies that don't work, or not
93 | # install all needed dependencies.
94 | #Pipfile.lock
95 |
96 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow
97 | __pypackages__/
98 |
99 | # Celery stuff
100 | celerybeat-schedule
101 | celerybeat.pid
102 |
103 | # SageMath parsed files
104 | *.sage.py
105 |
106 | # Environments
107 | .env
108 | .venv
109 | env/
110 | venv/
111 | ENV/
112 | env.bak/
113 | venv.bak/
114 |
115 | # Spyder project settings
116 | .spyderproject
117 | .spyproject
118 |
119 | # Rope project settings
120 | .ropeproject
121 |
122 | # mkdocs documentation
123 | /site
124 |
125 | # mypy
126 | .mypy_cache/
127 | .dmypy.json
128 | dmypy.json
129 |
130 | # Pyre type checker
131 | .pyre/
132 |
133 | # pytype static type analyzer
134 | .pytype/
135 |
--------------------------------------------------------------------------------
/.gitmodules:
--------------------------------------------------------------------------------
1 | [submodule "reveal.js"]
2 | path = reveal.js
3 | url = https://github.com/hakimel/reveal.js
4 |
--------------------------------------------------------------------------------
/InstallationGuide.md:
--------------------------------------------------------------------------------
1 | # Python Installation Guide
2 | _Musashi Harukawa, DPIR_
3 |
4 | # Installing Anaconda
5 |
6 | Anaconda is a freely available data science platform and environment manager. It can be used to manage versions and editors for Python, `R`, and other popular data science languages (such as `Julia`).
7 |
8 | In order to simplify installing all the relevant packages and software, we will be using Anaconda for this class. The instructions on how to install Anaconda can be found at:
9 |
10 | - **Windows**: https://docs.anaconda.com/anaconda/install/windows/
11 | - **MacOS**: https://docs.anaconda.com/anaconda/install/mac-os/
12 | - **Linux**: https://docs.anaconda.com/anaconda/install/linux/
13 |
14 | Follow the instructions contained within the guide. Note the following:
15 |
16 | - Install Python version 3.7 or later. Do NOT install Python 2.7.
17 | - Unless you know what you are doing, I recommend that you install Anaconda to the default location.
18 | - You do not need to install as admin.
19 | - You do not need PyCharm, but if you have experience with Sublime or other industry-standard IDEs, you may prefer to use it (please note that you will still need to use Jupyter).
20 | - (Windows only): You do not need to add Anaconda to your PATH environment variable.
21 | - You do not need Anaconda Cloud.
22 |
23 | If you run into trouble, _first try Googling the answer_ (or whatever your preferred non-invasive search engine is), and then ask me. Chances are that somebody has run into the same problem as you, and the answer exists on the Internet. If the problem persists, then feel free to get in touch.
24 |
25 | # Verifying your Installation
26 |
27 | Once you have installed Anaconda, check that everything works.
28 |
29 | Open up Anaconda Navigator (like you would any other application). You should see a menu with a number of items including JupyterLab and Jupyter Notebooks. Try opening up either, and navigating to the directory (folder) where you will be keeping all of your notes for this course.
30 |
31 | In this directory, create a Python Notebook and rename it to `my_first_notebook`. The default file ending should be `.ipynb`. Then in the first cell of this notebook, type the following command:
32 |
33 | ```{python}
34 | print("Hello World!")
35 | ```
36 |
37 | Click on "Run Selected Cell" in the notebook menu. You should get the following output:
38 |
39 | ```
40 | > Hello World!
41 | ```
42 |
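As an optional extra check (a minimal sketch; `sys` is part of the Python standard library), you can also confirm from inside the notebook that you are running Python 3.7 or later:

```{python}
import sys

# Prints the interpreter version, e.g. sys.version_info(major=3, minor=7, ...)
print(sys.version_info)
```

The first two numbers printed should be `3` and `7` or higher, matching the requirement above.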
43 | If all of this works without issue, then you are ready to come to class for week 1! If this does not work, and you are unable to troubleshoot the issue, please contact me prior to coming to class (so we can minimize the amount of time we spend in class installing software and troubleshooting individual issues).
44 |
45 | See you on Wednesday!
46 |
--------------------------------------------------------------------------------
/InstallationGuide.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/InstallationGuide.pdf
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Introduction to Python for Social Science
2 |
3 | _Musashi Jacobs-Harukawa, Department of Politics and International Relations_
4 |
5 | ## Course Description
6 |
7 | _Introduction to Python for Social Science_ is an 8-week optional methods module aimed at social science researchers seeking to learn programming skills for their research. There will be weekly lectures, lasting 60 to 90 minutes, followed by a workshop, and supplemented by weekly office hours. All of the above will be conducted on Teams.
8 |
9 | The aim of this course is two-fold. The first goal is to teach students essential _data analysis_ and _scripting_ skills so that they are able to put together short programs and run their own analyses. The second aim is to give an introduction to the numerous techniques and technologies that researchers can integrate into their own research, and to provide incentives to invest in computational methods and skills. Some of the techniques that will be taught include:
10 |
11 | - Using Python as a Research and Development Tool
12 | - Data Cleaning and Merging with `pandas`
13 | - Static Data Visualisation with `matplotlib` and `seaborn`
14 | - Introduction to Machine Learning with `scikit-learn`
15 | - Introduction to Web Scraping with `beautifulsoup` and `selenium`
16 |
17 | Note that this course is not a course in _programming_. Students will learn how to use Python for data analysis and research, but the primary focus is on teaching them about the available methods and the bare minimum level of programming to implement these methods. Also note that this course is optional, and there will be no marked assignments, but there will be weekly tasks designed to aid learning. Students are encouraged to complete these tasks, and to ask questions about them during the workshop and clinic.
18 |
19 | This course is aimed at complete beginners, although experience with other programming languages (such as `R`) may provide some useful reference points. As spaces are limited, priority will be given to students without prior experience using Python, and those who have a use case for computational tools in their research.
20 |
21 | ## Using this Repository
22 |
23 | This repository contains all of the code, lecture slides, and Jupyter notebooks for the course. You are welcome to clone this repository/browse the material here, but I've also made the effort to let you browse the slides in the browser at [`muhark.github.io/dpir-intro-python`](https://muhark.github.io/dpir-intro-python). I am also working on Google Colab integration to allow students to work with the notebooks interactively from the website.
24 |
--------------------------------------------------------------------------------
/Week1/exercises.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Week 1 Lecture Exercises\n",
8 | "\n",
9 | "_Refer to the lecture notes, examples, and the Internet to help you complete these tasks.
\n",
10 | "Model solutions will be posted next week._"
11 | ]
12 | },
13 | {
14 | "cell_type": "markdown",
15 | "metadata": {},
16 | "source": [
17 | "## Task 1: Animal Sounds\n",
18 | "\n",
19 | "1. Create a dictionary of animals and their sounds and call it `animal_sounds`.\n",
20 | " - Each value should be the corresponding sound for the animal.\n",
21 | " - If your native language is not English, use the sounds from your language!\n",
22 | "2. Use a for loop to print the statement \"In my language, the **ANIMAL** makes the sound **SOUND**\" for each key-value pair in your dictionary.\n",
23 | "\n",
24 | "**_Extra Challenge_**:\n",
25 | "\n",
26 | "1. Create two separate lists, `animals` and `sounds`.\n",
27 | " - `animals` should be the list of the animals used in the previous task.\n",
28 | " - `sounds` should be the list of corresponding sounds.\n",
29 | " - Also: Make sure the `type` of `animals` and `sounds` is `list`!\n",
30 | "2. Create an empty dictionary called `animal_sounds`.\n",
31 | " - Hint: This can be done with `{}`\n",
32 | "3. Use a for loop to populate the dictionary with the information from animals and sounds.\n",
33 | " - _In the same for-loop, print the same statements as in the previous section_."
34 | ]
35 | },
36 | {
37 | "cell_type": "markdown",
38 | "metadata": {},
39 | "source": [
40 | "## Task 2: Writing a Menu\n",
41 | "\n",
42 | "A menu typically consists of the following information:\n",
43 | "\n",
44 | "- Course\n",
45 | "- Dish\n",
46 | "- Description\n",
47 | "- Price\n",
48 | "\n",
49 | "In this exercise, you will experiment with different ways of representing this information.\n",
50 | "\n",
51 | "1. Dictionary of Dictionaries (Nested Hierarchy)\n",
52 | " - Create a dictionary called `menu1`.\n",
53 | " - For each dish, create a second dictionary with the keys `'course'`, `'price'`, and `'description'`. Fill these in accordingly.\n",
54 | " \n",
55 | "2. Dictionary of Lists\n",
56 | " - Create a dictionary called `menu2`.\n",
57 | " - For each of the keys `'dish'`, `'course'`, `'description'` and `'price'`, write a list of all of the values.\n",
58 | " - Hint: `'course'` will contain many repeated values.\n",
59 | "\n",
60 | "**_Extra Challenge_**:\n",
61 | "\n",
62 | "- For both methods, find a way to iterate over the dictionary to print out a menu.\n",
63 | "- The fancier the better!\n",
64 | " - Note that you can get the length of a string using the `len` function. You can use this to create aligned columns!"
65 | ]
66 | },
67 | {
68 | "cell_type": "markdown",
69 | "metadata": {},
70 | "source": [
71 | "# Task 3:\n",
72 | "\n",
73 | "Similar to the exercise of making a sentence from the fewest letters possible. \n",
74 | "\n",
75 | "- Create a list of five letters and a space, call it `letters`.\n",
76 | "- Figure out the longest sentence you can make from those letters.\n",
77 | "- Use the indices of the list to write a sentence.\n",
78 | "- Create a new sentence using a for loop and the `join` function.\n",
79 | "\n",
80 | "**_Extra Challenge_**:\n",
81 | "\n",
82 | "There are other, smarter ways of doing this with dictionaries and lists. See if you can find a better method than the one below!"
83 | ]
84 | },
85 | {
86 | "cell_type": "markdown",
87 | "metadata": {},
88 | "source": [
89 | "# Task 4 (Bonus):\n",
90 | "\n",
91 | "A prime number is a natural number ($\\mathbb{N}$) that is greater than 1 and is not the product of two smaller natural numbers.\n",
92 | "\n",
93 | "Write code that prints all prime numbers less than 10000\n",
94 | "\n",
95 | "For an additional challenge, write `%%timeit` at the top of the codeblock to see how long your code takes to execute. See how fast you can make your code."
96 | ]
97 | },
98 | {
99 | "cell_type": "code",
100 | "execution_count": null,
101 | "metadata": {},
102 | "outputs": [],
103 | "source": []
104 | }
105 | ],
106 | "metadata": {
107 | "kernelspec": {
108 | "display_name": "teaching",
109 | "language": "python",
110 | "name": "teaching"
111 | },
112 | "language_info": {
113 | "codemirror_mode": {
114 | "name": "ipython",
115 | "version": 3
116 | },
117 | "file_extension": ".py",
118 | "mimetype": "text/x-python",
119 | "name": "python",
120 | "nbconvert_exporter": "python",
121 | "pygments_lexer": "ipython3",
122 | "version": "3.7.6"
123 | }
124 | },
125 | "nbformat": 4,
126 | "nbformat_minor": 4
127 | }
128 |
--------------------------------------------------------------------------------
/Week1/lecture.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week1/lecture.pdf
--------------------------------------------------------------------------------
/Week1/planning.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: Intro to Python Course Planning
3 | ---
4 |
5 | _From the syllabus_:
6 |
7 | ### _Week 1_: Introduction to Python and the Development Environment
8 |
9 | **Learning Aims**:
10 |
11 | 1. What is Python and what can I use it for?
12 | 2. What are the tools I can use to write Python code?
13 | 3. Writing your first Python script
14 |
15 | There are three learning goals in the first week. The first relates to what Python is, and how it can be useful for social science researchers. Students will learn about the various use cases for Python, and come up with ways that it may help them achieve their research aims.
16 |
17 | The second learning goal is to gain familiarity with the tools used to code in Python and present their research. These include Jupyter notebooks, IDEs and the terminal. Students will primarily use Jupyter notebooks in this course, but are welcome to use alternative development tools.
18 |
19 | The final goal is to write their first program in Python. Commands and operators such as `print`, `+`, `&` etc. will be introduced.
20 |
21 | ### Coding Goals
22 |
23 | - print()
24 | - import, from, as
25 | - int, float, str (and bool)
26 | - basic arithmetic
27 | - basic string operations
28 | - lists and dicts
29 |
30 | ### Next Week
31 |
32 | - Data I/O (requires knowledge of strings, paths)
33 | - Constructing pandas dataframes from dicts and lists
34 | - Selecting
35 |
36 |
37 | ### Lecture
38 |
39 | The lecture begins with a few administrative points, and then goes into the following parts:
40 |
41 | - _What is Python and what can I use it for_?
42 | - What is Python?
43 | - _General purpose scripting language with large data science community_
44 | - What is a script? For our purposes, it automates some task.
45 | - Usually some inputs and an output, but sometimes it can just generate things (from some external data source, e.g. the web).
46 | - Good to keep this input-output idea in your head. Each script should take some inputs, and give some outputs.
47 | - What can I use Python for?
48 | - List possible applications of Python for social science research. My primary goal is to motivate students, but also to try and illustrate the broad possibilities.
49 | - Aside: what do I use Python for?
50 | - General scripting
51 | - Quick data visualisation
52 | - Data Cleaning
53 | - Natural Language Processing
54 | - Web Scraping/Data Collection
55 | - If possible, try to find a number of papers that have used Python. This isn't always obvious, and `R` tends to be more popular in the computational political science community, whereas Python is more popular amongst the engineering/hard sciences.
56 | - Quick Aside: Python vs `R`
57 | - This question comes up frequently. My two cents on the debate is that whereas `R` is a language specifically for statistical computing, python is a general-purpose programming language popular with the data science community. There is a lot of overlap in the functionality between the two languages, and learning either one will enable you to do many things anyway.
58 | - _Basic Coding Tools_
59 | - _Aim of this is to understand a few things about the coding toolkit and interface_.
60 | - Anaconda: An environment manager. Python has many _libraries_ and _versions_; this software helps you keep them tidy.
61 | - Jupyter: Code editor (and executor). Takes the form of terminals, notebooks and lab.
62 | - How to start Jupyter (and why is this running in a browser??)
63 | - Navigating the Jupyter Lab interface
64 | - Some other IDEs (for nerds):
65 | - Atom: my preferred tool, for anyone who is interested.
66 | - PyCharm: most commonly used tool for development, good for people in the class with prior experience coding in other languages.
67 | - vim: if you're hardcore
68 | - Takeaway: Python is a language, which is separate from the tools you use to write and execute it.
69 | - _First Steps in Python_: (to be done in RISE)
70 | 1. `print()`
71 | - Note: the notebook returns the output of the last command; if this is just a variable, it returns its value. In general, if you want something displayed, use an explicit `print()` call.
72 | 2. variable assignment
73 | 3. binary operators: +, -, ==
74 | 4. 4 basic data types
75 | 5. lists and dicts
76 |
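A minimal sketch of this live-coding sequence (the variable names are illustrative, not from the lecture):

```{python}
# 1. print()
print("Hello World!")

# 2. Variable assignment
year = 1815

# 3. Binary operators
print(2 + 3)         # addition
print(2023 - year)   # arithmetic on a variable
print(year == 1815)  # comparison returns a bool

# 4. The four basic data types
n = 42        # int
x = 3.14      # float
s = "text"    # str
flag = True   # bool

# 5. Lists and dicts
animals = ["cat", "dog"]
sounds = {"cat": "meow", "dog": "woof"}
print(sounds["cat"])
```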
--------------------------------------------------------------------------------
/Week2/data/BES-2017-F2F-codebook.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week2/data/BES-2017-F2F-codebook.pdf
--------------------------------------------------------------------------------
/Week2/data/bes_data.feather:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week2/data/bes_data.feather
--------------------------------------------------------------------------------
/Week2/data/bes_data.pickle:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week2/data/bes_data.pickle
--------------------------------------------------------------------------------
/Week2/data/bes_data_full_week2.feather:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week2/data/bes_data_full_week2.feather
--------------------------------------------------------------------------------
/Week2/data/bes_data_subset_week2.feather:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week2/data/bes_data_subset_week2.feather
--------------------------------------------------------------------------------
/Week2/data/bes_f2f_2017_v1.3.dta:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week2/data/bes_f2f_2017_v1.3.dta
--------------------------------------------------------------------------------
/Week2/data/bes_relabelling.R:
--------------------------------------------------------------------------------
1 | library("FactoRMine")
2 | library("haven")
3 | library("plyr")
4 | library("dplyr")
5 |
6 | options(
7 | summary.stats.lm = c(
8 | "R-squared",
9 | "p",
10 | "Deviance",
11 | "AIC",
12 | "BIC"
13 | )
14 | )
15 |
16 | data <- read_stata("/home/lunayneko/Documents/Teaching/dpir-intro-python/Week2/data/bes_f2f_2017_v1.3.dta")
17 |
18 | #' B: RESPONDENT'S ELECTORAL BEHAVIOUR
19 | #'
20 | #' B1: Did you vote in 2017?
21 | #' B2: Who did you vote for?
22 | #' B4: If you had voted, who would you have voted for?
23 | #' ^^ The above two can likely be combined to a single preferred party variable.
24 | #' B6: Why did you vote the way you did?
25 | #' ^^ Answers (3, 4) indicate strategic voting, B6a checks actual preferred
26 | #' party.
27 | #' B12-B13: How closely do Tories(12)/Labour(13) look after the interests of:
28 | #' - BAME
29 | #' - Trade unions
30 | #' - Middle class people
31 | #' - Big business
32 | #' - Working class people
33 | #' - People who are unemployed or on benefits
34 | #' 1: Very closely<>Not at all closely :4
35 | #'
36 | #' C: ATTITUDES TOWARD VOTING
37 | #'
38 | #' C1: How interested were you in the general election? (1 very)
39 | #' C2_2: _It is every citizen's duty to vote in an election_ (1 strong disagree)
40 | #'
41 | #' D: PARTY ID
42 | #'
43 | #' D1: Which party do you think of yourself as?
44 | #' -> If no, D2: do you think of yourself as a little closer to one?
45 | #' -> If yes, D3: which party is that?
46 | #' D4: Strength of party identification. Can add option (4) based on non-
47 | #'     identification based on answer to D1/D2.
48 | #'
49 | #' E: LEFT-RIGHT
50 | #'
51 | #' E1: Left-Right self-placement
52 | #' ^^ Would be interesting to compare correlation of PCA with LR
53 | #'
54 | #' F1: Range of political statements:
55 | #' 01 Ordinary working people get their fair share of the nation's wealth
56 | #' 02 There is one law for the rich and one for the poor
57 | #' 03 Young people today don't have enough respect for traditional British
58 | #' values
59 | #' 04 Censorship of films and magazines is necessary to uphold moral standards
60 | #' 05 There is no need for strong trade unions to protect
61 | #' employees' working conditions and wages
62 | #' 06 Private enterprise is the best way to solve Britain's economic problems
63 | #' 07 Major public services and industries ought to be in state ownership
64 | #' 08 It is the government's responsibility to provide a job for everyone who
65 | #' wants one
66 | #' 09 People should be allowed to organise public meetings to protest against
67 | #' the government
68 | #' 10 People in Britain should be more tolerant of those who lead
69 | #' unconventional lives
70 | #' 11 For some crimes, the death penalty is the most appropriate sentence
71 | #' 12 People who break the law should be given stiffer sentences
72 | #'
73 | #' P: Europe 0-10, self and parties
74 | #'
75 | #' P1: How did you vote in the Brexit referendum?
76 | #' P2: How would you have voted in the Brexit referendum?
77 | #' P3_1: Own view, European integration
78 | #'
79 | #'
80 | #' W: CLASS
81 | #'
82 | #' W1: Class self-identification (1 middle, 2 working, 3 other)
83 | #' -> If not W1(1, 2), W2: if you had to choose, middle or working (upper is
84 | #' option but not mentioned).
85 | #'
86 | #' Y: DEMOGRAPHICS
87 | #'
88 | #' Y09: Gender
89 | #' Y13A: Highest education achieved
90 | #'
91 | # Let's predict Brexit vote and whether they voted Tory.
92 |
93 | vars <- c(
94 | "b01", "b02",
95 | "c01", "c02_2",
96 | "d01", "d03", "d02", "d04",
97 | "e01",
98 | "f01_1", "f01_2", "f01_3", "f01_4", "f01_5",
99 | "f01_6", "f01_7", "f01_8", "f01_9", "f01_10",
100 | "f01_11", "f01_12",
101 | "p01", "p02", "p03_1",
102 | "w01", "w02",
103 | "y09", "edlevel", "region")
104 |
105 | df <- data %>% select(all_of(vars))
106 |
107 | # Construct variables that are split over columns
108 |
109 | f_batt <- c(
110 | "f01_1", "f01_2", "f01_3", "f01_4", "f01_5",
111 | "f01_6", "f01_7", "f01_8", "f01_9", "f01_10",
112 | "f01_11", "f01_12")
113 | f_pca <- PCA(select(df, all_of(f_batt)), ncp=2)
114 |
115 |
116 | df <- df %>% transmute(
117 | voted = factor(as.integer(b01==1), levels=c(0,1)),
118 | tory_vote = factor(tidyr::replace_na(b02==2, 0), levels=c(0,1)),
119 | # lab_vote = factor(replace_na(b02==1, 0), levels=c(0,1)),
120 | election_interest = as.integer(c01),
121 | civic_duty = as.integer(ifelse(c02_2==-1, 3, c02_2)),
122 | party_id = select(df, c("d01", "d03")) %>%
123 | apply(function(x){ifelse(x[1]<=0 & !is.na(x[2]),x[2],x[1])}, MARGIN=1) %>%
124 | mapvalues(attr(data$d01, 'labels'), labels(attr(data$d01, 'labels'))) %>%
125 | factor(),
126 | ideo_lr = as.integer(ifelse(df$e01<0, 5, df$e01)),
127 | ideo_pc1 = f_pca$ind$coord[,"Dim.1"],
128 | ideo_pc2 = f_pca$ind$coord[,"Dim.2"],
129 | vote_leave = factor(as.integer(p01==2)),
130 | class = factor(ifelse(df$w01!=1&df$w01!=2, df$w02, df$w01)),
131 | female = factor(as.integer(y09==2)),
132 | edlevel = factor(tidyr::replace_na(edlevel, 0)),
133 | region = factor(region) %>% relevel(ref="London")
134 | )
135 |
136 | feather::write_feather(df, "bes_data.feather")
137 | write.csv(df, "bes_data.csv", row.names = FALSE)
138 |
139 |
--------------------------------------------------------------------------------
/Week2/data/data_prep.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 |
3 | # This script contains the commands used to create the BES data subsets we use
4 | # for this week's lecture from the original data download.
5 | # It's in the repo for future reference, or in case you were curious where the
6 | # files actually come from.
7 |
8 | import pandas as pd
9 | import re
10 | import numpy as np
11 |
12 | # Reading in source data file, in stata format >:(
13 | df = pd.read_stata("bes_f2f_2017_v1.3.dta")
14 |
15 | # Choosing subset of columns for week 2.
16 | cols = df.columns.tolist()
17 | pattern = re.compile(r"^[aekxy][0-1][0-9]$")
18 | subset = cols[0:1]+cols[337:338]+cols[340:346]+cols[355:356]+['Age'] + \
19 |     [col for col in cols if re.match(pattern, col)]
20 |
21 | # Fixing data for Feather conversion
22 |
23 | df.loc[:, 'Age'] = df['Age'].replace({"Refused": np.nan}).astype(float)
24 | df.loc[:, 'q25_cses'] = df['q25_cses'].replace(
25 |     {'Not stated': np.nan}).astype(float)
26 |
27 |
28 | # Creating a new dataframe with just these columns.
29 | week2 = df[subset].copy()
30 |
31 |
32 | # Saving to `csv`, `json`, `feather`; `hdf` requires too many dependencies
33 | df.to_csv("bes_data_full_week2.csv", index=False)
34 | df.to_json("bes_data_full_week2.json")
35 | df.to_feather("bes_data_full_week2.feather")
36 | week2.to_csv("bes_data_subset_week2.csv", index=False)
37 | week2.to_json("bes_data_subset_week2.json")
38 | week2.to_feather("bes_data_subset_week2.feather")
39 | # week2.to_hdf("bes_data_subset_week2.hdf", key="a_meta")
40 |
--------------------------------------------------------------------------------
/Week2/data/data_week2.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week2/data/data_week2.zip
--------------------------------------------------------------------------------
/Week2/exercises.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Week 2 Lecture Exercises\n",
8 | "\n",
9 | "\n",
10 | "We'll be working with the BES 2017 face-to-face cross-sectional survey extensively in this course.\n",
11 | "\n",
12 | "- You can download a zip folder containing the data from the website: https://muhark.github.io/dpir-intro-python/Week2/data/data_week2.zip\n",
13 | "- For these exercises, you can either use `bes_data_full_week2` or `bes_data_subset_week2`.\n",
14 | "- I've included the codebook (`BES-2017-F2F-codebook.pdf`). You'll need this to interpret the columns."
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "### Step 0: Read in the Data\n",
22 | "\n",
23 | "I've taken this first step for you because I'm hosting the data files online. Normally you would write a filepath to the location the file is being kept relative to where the script is being executed."
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": null,
29 | "metadata": {},
30 | "outputs": [],
31 | "source": [
32 | "import pandas as pd\n",
33 | "\n",
34 | "link = 'http://github.com/muhark/dpir-intro-python/raw/master/Week2/data/bes_data_subset_week2.feather'\n",
35 | "bes_df = pd.read_feather(link)"
36 | ]
37 | },
38 | {
39 | "cell_type": "markdown",
40 | "metadata": {},
41 | "source": [
42 | "## Exercise 1: First Look at the Data\n",
43 | "\n",
44 | "_Answer the following questions about the dataset_:\n",
45 | "\n",
46 | "- How many observations in the dataset?\n",
47 | "- How many variables?\n",
48 | "- How many variables contain numeric values?\n",
49 | "- How many variables are open-ended response?\n",
50 | "- How many categorical variables?"
51 | ]
52 | },
53 | {
54 | "cell_type": "code",
55 | "execution_count": null,
56 | "metadata": {},
57 | "outputs": [],
58 | "source": []
59 | },
60 | {
61 | "cell_type": "markdown",
62 | "metadata": {},
63 | "source": [
64 | "# Exercise 2: Clean Up Labels\n",
65 | "\n",
66 | "_It's annoying to have to always refer to the codebook. Choose a few sections from the survey (i.e. questions a, questions b, etc.) and give the columns short, meaningful titles._\n",
67 | "\n",
68 | "For this part of the assignment, \n",
69 | "\n",
70 | "For instance, `a01` asks \"First, I'd like to ask you a few questions about the issues and problems facing Britain today. As far as you're concerned, what is the single most important issue facing the country at the present time?\". I might rename this question `most_important_issue`, or even `top_issue`.\n",
71 | "\n",
72 | "Another example: `y01` could be renamed `income` or `annual_income`.\n",
73 | "\n",
74 | "To keep your code neat, I recommend that you first create a dictionary called something like `col_name_dict`, put the original and replacements in there, and then use the `df.rename()` function to substitute the column names.\n"
75 | ]
76 | },
77 | {
78 | "cell_type": "code",
79 | "execution_count": null,
80 | "metadata": {},
81 | "outputs": [],
82 | "source": []
83 | },
84 | {
85 | "cell_type": "markdown",
86 | "metadata": {},
87 | "source": [
88 | "## Exercise 3: Cursory Statistics\n",
89 | "\n",
90 | "There are a few things you can calculate fairly easily. For instance:\n",
91 | "\n",
92 | "- How many responses per region? per constituency?\n",
93 | "- (If using section y:) Median income bracket? Modal religion? Mean/median age?\n",
94 | "\n",
95 | "Here you want to be creative. What questions would you ask of your data? What would a reviewer or a client be likely to want to know?\n",
96 | "\n",
97 | "For an additional challenge, calculate each of the statistics per-region, e.g. median income bracket per-region."
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": null,
103 | "metadata": {},
104 | "outputs": [],
105 | "source": []
106 | }
107 | ],
108 | "metadata": {
109 | "kernelspec": {
110 | "display_name": "Python 3",
111 | "language": "python",
112 | "name": "python3"
113 | },
114 | "language_info": {
115 | "codemirror_mode": {
116 | "name": "ipython",
117 | "version": 3
118 | },
119 | "file_extension": ".py",
120 | "mimetype": "text/x-python",
121 | "name": "python",
122 | "nbconvert_exporter": "python",
123 | "pygments_lexer": "ipython3",
124 | "version": "3.7.9"
125 | }
126 | },
127 | "nbformat": 4,
128 | "nbformat_minor": 4
129 | }
130 |
--------------------------------------------------------------------------------
/Week2/lecture.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week2/lecture.pdf
--------------------------------------------------------------------------------
/Week2/planning.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: Intro to Python Course Planning
3 | ---
4 |
5 |
6 | # Lecture Structure
7 |
8 | - Recap
9 | - This Week
10 | - Data Structures (Theoretical)
11 | - Importance of Understanding Data
12 | - Value and Relation
13 | - Data Structures (Graph, Hierarchical, Tabular)
14 | - Data Formats
15 | - `csv`, `xls(x)`, `html`
16 | - Data I/O
17 | - `read_csv`
18 | - `read_xlsx`
19 | - `read_html`
20 | - The `pandas.DataFrame`
21 | - `.info()`
22 | - numpy dtypes
23 | - Slicing and indexing your dataframe
24 | - `[]`
25 | - `.loc[]`
26 | - `.iloc[]`
27 | - Views vs copies
28 | - Understanding your data
29 | - `.head()`
30 | - `.describe()`
31 | - `.unique()`
32 | - `.value_counts()`
33 | - Summary functions
34 | - `.mean()`
35 | - `.sum()`
36 |
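A minimal sketch tying this sequence together, assuming the Week2 data files sit in a local `data/` folder:

```{python}
import pandas as pd

# Data I/O: read a csv into a DataFrame
df = pd.read_csv("data/bes_data_subset_week2.csv")

# Understanding your data
df.info()                     # dtypes and non-null counts
df.head()                     # first five rows
df["region"].unique()         # distinct values in a column
df["region"].value_counts()   # frequency table

# Slicing and indexing
df["Age"]                       # single column with []
df.loc[0:4, ["region", "Age"]]  # label-based selection
df.iloc[0:5, 0:3]               # position-based selection

# Summary functions
df["Age"].mean()
df["Age"].sum()
```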
--------------------------------------------------------------------------------
/Week3/exercises.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Week 3 Exercises\n",
8 | "\n",
9 | "This week we learned how to do the following tasks:\n",
10 | "\n",
11 | "- Write functions.\n",
12 | "- Apply functions element-wise, cumulatively.\n",
13 | "- Calculate point and grouped summaries.\n",
14 | "- Concatenate and Merge Datasets\n"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "## Task 1: Functions\n",
22 | "\n",
23 | "### Task 1a: Numeric Functions\n",
24 | "\n",
25 | "In this exercise you write functions whose domain are either scalar numbers or numeric vectors.\n",
26 | "\n",
27 | "#### Scalar Functions\n",
28 | "\n",
29 | "- One Input: Absolute value\n",
30 | "- Two Inputs: Calculate the difference between the first input and the largest multiple of the second input that is less than the first input. Therefore, if the inputs are (41, 10), the function should calculate 41 - 4\\*10 = 1.\n",
31 | "- Challenge: Write a function that returns the factors of the input. For example, 132 = 2\\*2\\*3\\*11, so $f(132) = \\{2, 2, 3, 11\\}$\n",
32 | "\n",
33 | "#### Vector Functions\n",
34 | "\n",
35 | "- One Input: Write a summary statistics function. Given a vector, this function should return the following statistics in a `pd.Series` object with corresponding index labels: number of elements, sum, mean, median, variance, standard deviation, and any other statistics that you think are helpful.\n",
36 | "- Two Inputs: Write a function that given two equal-length inputs, determines whether each element in the first is divisible by the second. The output should be a vector of equal length to the inputs, indicating with True/False values whether the arguments of the first vector were divisible by the corresponding element in the second. CHALLENGE: Allow the function to take either a scalar or vector input as its second argument.\n",
37 | "\n",
38 | "### Task 1b: String Functions\n",
39 | "\n",
40 | "#### Scalar Functions\n",
41 | "\n",
42 | "- One Input: Write a function that divides a string into a list of words. Note: the `str.split()` function is useful here.\n",
43 | "- Two Inputs: Write a function that calculates the number of times the second argument occurs in the first. e.g. \"How many times does the letter e occur in this sentence?\"\n",
44 | "\n",
45 | "#### Vector Function\n",
46 | "\n",
47 | "- One Input: Write a function that, given a vector/list/series of strings, returns a series where the index is are the unique words in the input, and the values are the number of times that unique word occurs in the entire input. Therefore, if I took a list containing all of the State of the Union Address, I want a function that tells me a) what the unique words in the collection of all Addresses is, and b) how many times those words occur in the total collection.\n"
48 | ]
49 | },
50 | {
51 | "cell_type": "markdown",
52 | "metadata": {},
53 | "source": [
54 | "## Task 2: Apply\n",
55 | "\n",
56 | "### Task 2a: Element-Wise Operations\n",
57 | "\n",
58 | "1. Using the `Age` variable from the BES dataset, calculate the age of each respondent rounded down to the nearest multiple of 5. Try writing this both using a defined function and with a `lambda` function.\n",
59 | "2. Recode the column `y09` as 0 and 1.\n",
60 | "3. Write a function that gets the lower bound from the income bounds reported in column `y01`, and returns it as an integer.\n",
61 | "\n",
62 | "\n",
63 | "### Task 2b: Grouped Functions\n",
64 | "\n",
65 | "1. Calculate the summary statistics on `Age` for each region, and each region/constituency.\n",
66 | "2. Calculate the median income bracket (`y01`) per region and region/constituency.\n",
67 | "3. Calculate the most commonly given answer to `a02` per region and region/income bracket.\n",
68 | "4. Calculate the most commonly given answer to `a02` and `y06` per region."
69 | ]
70 | },
71 | {
72 | "cell_type": "code",
73 | "execution_count": null,
74 | "metadata": {},
75 | "outputs": [],
76 | "source": []
77 | }
78 | ],
79 | "metadata": {
80 | "kernelspec": {
81 | "display_name": "teaching",
82 | "language": "python",
83 | "name": "teaching"
84 | },
85 | "language_info": {
86 | "codemirror_mode": {
87 | "name": "ipython",
88 | "version": 3
89 | },
90 | "file_extension": ".py",
91 | "mimetype": "text/x-python",
92 | "name": "python",
93 | "nbconvert_exporter": "python",
94 | "pygments_lexer": "ipython3",
95 | "version": "3.7.6"
96 | }
97 | },
98 | "nbformat": 4,
99 | "nbformat_minor": 4
100 | }
101 |
--------------------------------------------------------------------------------
/Week3/groupby_example.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week3/groupby_example.png
--------------------------------------------------------------------------------
/Week3/lecture.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week3/lecture.pdf
--------------------------------------------------------------------------------
/Week3/planning.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: Week 3 Planning - Data Structures and Pandas II
3 | author: Musashi Harukawa
4 | ---
5 |
6 |
47 |
48 | # Methods Covered:
49 |
50 | - Writing functions
51 | - Column/DataFrame apply
52 | - Groupby-Apply
53 | - Append, concat and merge
54 | - Melt and pivot
55 |
56 | # Methods/Theory
57 |
58 | - What is a function?
59 | - Applying functions to vectors:
60 | - transformation, pointwise, and summaries
61 | - grouped summary
62 | - Combining datasets
63 | - Append vs concat vs merge
64 | - A bit of set theory: union, etc.
65 | - Long vs wide-form data
66 |
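A minimal sketch of the operations above on toy data (the column names are illustrative):

```{python}
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "age": [34, 51, 27]})
right = pd.DataFrame({"id": [1, 2, 4], "region": ["North", "South", "East"]})

# A function applied element-wise: round age down to the nearest decade
left["decade"] = left["age"].apply(lambda x: 10 * (x // 10))

# Grouped summary: mean age per decade
left.groupby("decade")["age"].mean()

# Combining datasets: concat stacks rows, merge joins on keys
stacked = pd.concat([left, left])                 # union of rows
joined = left.merge(right, on="id", how="inner")  # intersection on "id"

# Long vs wide-form data
long = joined.melt(id_vars="id")                  # wide -> long
wide = long.pivot(index="id", columns="variable", values="value")
```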
67 |
68 | # Computational Aspect
69 |
70 | - Functions:
71 | - Tools for control flow + generalizability
72 | - Namespaces
73 | - Apply:
74 | - Groupby-Apply:
75 | - Vectorization and performance
76 | - Append, concat and merge
77 | - Performance, accessibility over indexed data
79 |
80 |
81 | Scrap:
82 |
83 | $$
84 | f(X_{i, 1}) = \begin{bmatrix}
85 | f(x_{1, 1}) \\
86 | f(x_{2, 1}) \\
87 | \vdots \\
88 | f(x_{N, 1})
89 | \end{bmatrix}
90 | $$
91 |
--------------------------------------------------------------------------------
/Week4/crosstab_heatmap.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import re
3 | import seaborn as sns
4 | import matplotlib
5 | import matplotlib.pyplot as plt
6 |
7 | df = pd.read_feather("../Week2/data/bes_data_subset_week2.feather")
8 | matplotlib.rcParams["text.usetex"] = True
9 |
10 |
11 | def crosstab_heatmap(df, col1, col2, col_dict):
12 |     # Produce crosstab
13 |     xtab = pd.crosstab(df[col1], df[col2])
14 |     # Right-column should show total number of answers
15 |     if isinstance(xtab.columns, pd.CategoricalIndex):
16 |         xtab.columns = xtab.columns.add_categories("Total")
17 |     if isinstance(xtab.index, pd.CategoricalIndex):
18 |         xtab.index = xtab.index.add_categories("Avg Prop")
19 |     xtab.loc[:, "Total"] = xtab.sum(axis=1)
20 |     xtab = xtab.loc[xtab['Total'] >= xtab.shape[1]*5]
21 |     props = xtab.iloc[:, :-1].div(xtab['Total'], axis=0)
22 |     props.loc['Avg Prop', :] = props.mean(axis=0)
23 |     xtab.loc['Avg Prop', :] = ""
24 |     props.loc[:, 'Total'] = ""
25 | 
26 |     annot = xtab.applymap(
27 |         lambda x: "{:.4g}\n".format(int(x)) if isinstance(x, float) else ""
28 |     ).add(
29 |         props.applymap(
30 |             lambda x: "({:.3g}\%)".format(
31 |                 100*x) if isinstance(x, float) else ""
32 |         ))
33 | 
34 |     annot.loc['Avg Prop', 'Total'] = "{:.5g}".format(xtab.iloc[:-1, -1].sum())
35 |     annot.loc['Avg Prop', :] = annot.loc['Avg Prop', :].str.replace(
36 |         r"[\(\)]", "", regex=True)
37 |     annot.loc[:, 'Total'] = annot['Total'].str.replace("\n", "")
38 |     diffs = props.iloc[:, :-1] - props.iloc[-1, :-1]
39 |     diffs['Total'] = float(0)
40 |     diffs = diffs.applymap(lambda x: 100*x)
41 | 
42 |     fig_title = "Crosstab Heatmap of \\textbf{"+col_dict[col1]+"} by \\textbf{"+col_dict[col2]+"}"
43 | 
44 |     f, ax = plt.subplots(1, 1, figsize=annot.shape)
45 |     sns.heatmap(
46 |         diffs.T,
47 |         annot=annot.T,
48 |         fmt="s",
49 |         ax=ax,
50 |         cmap="RdBu_r",
51 |         center=0,
52 |         cbar_kws={"label": "Difference from Avg. Proportion"},
53 |     )
54 |     ax.set_title(fig_title)
55 |     ax.xaxis.tick_top()
56 |     ax.xaxis.set_label_position("top")
57 |     ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="left")
58 |     ax.set_yticklabels(ax.get_yticklabels())
59 |     f.axes[1].set_yticklabels(
60 |         [lab.get_text() + "\%" for lab in f.axes[1].get_yticklabels()]
61 |     )
62 |     ax.axvline(diffs.shape[0] - 1, color="k")
63 |     ax.axhline(diffs.shape[1] - 1, color="k")
64 |     ax.set_xlabel("")
65 |     ax.set_ylabel("")
66 |     ax.xaxis.set_label_position("bottom")
67 |     return f
68 |
69 |
70 |
71 | col_dict = {'a02': 'Best Party', 'y09': 'Gender'}
72 | f = crosstab_heatmap(df, 'a02', 'y09', col_dict)
73 | f.savefig("extra_challenge.png", bbox_inches="tight")
74 |
--------------------------------------------------------------------------------
/Week4/exercises.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Week 4 Exercises: Data Visualisation\n",
8 | "\n",
9 | "This week in particular does not have correct \"solutions\". However, I encourage you to attempt the following:\n",
10 | "\n",
11 | "- Make your figures as complete and professional as possible. This means adding titles, legends, axis labels, and making them aesthetically pleasing.\n",
12 | "- Write the solutions as generalisable functions. As much as possible you should be able to substitute the inputs and still get a complete and correctly-labelled figure.\n",
13 | "\n",
14 | "Additional code examples can be found in: https://github.com/muhark/dpir-intro-python/blob/master/Week4/figures.py"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "## Task 1: Simple Figure\n",
22 | "\n",
23 | "Using the BES data:\n",
24 | "\n",
25 | "- Create a figure with a single subplot.\n",
26 | "- Plot the answer to item 'a02' (party best suited to tackle the biggest issue in Britain) as a function of age.\n",
27 | "\n",
28 | "As an additional challenge, try using only functions from `matplotlib` for this figure."
29 | ]
30 | },
31 | {
32 | "cell_type": "markdown",
33 | "metadata": {},
34 | "source": [
35 | "## Task 2: Panelling\n",
36 | "\n",
37 | "Recreate the same figure as above, but with a separate subplot for each region.\n",
38 | "\n",
39 | "- Challenge 1: Use a for-loop\n",
40 | "- Challenge 2: Write this as a function where 'a02' and 'region' can be substituted for other categorical variables.\n",
41 | "- Challenge 3: Make the figure size dynamic, i.e. a function of the number of subplots.\n",
42 | "- Challenge 4: Limit the number of subplot columns to 4; if there are more than 4 categories, the function should add an additional row to fit them."
43 | ]
44 | },
45 | {
46 | "cell_type": "markdown",
47 | "metadata": {},
48 | "source": [
49 | "## Task 3: Color Palettes and Mapping\n",
50 | "\n",
51 | "Seaborn has a lot of resources for customising the color palettes in your figures. See: https://seaborn.pydata.org/tutorial/color_palettes.html\n",
52 | "\n",
53 | "A useful tool when creating figures is creating a column of color values. In other words, given some categorical column, you want to be able to create a column where each category is replaced by a unique color.\n",
54 | "\n",
55 | "Let's try doing this:\n",
56 | "\n",
57 | "- First, create a color palette with a number of colors equal to the number of categories in your column. To do this, you will need the `sns.color_palette()`, `pd.Series.unique()`, and `dict(zip())`.\n",
58 | " - Given two lists, `zip` will combine them into a list of pairs. e.g. `zip([1, 2, 3], ['a', 'b', 'c'])` will return `[(1, 'a'), (2, 'b'), (3, 'c')]`. Passing this to `dict`, i.e. `dict(zip([1, 2, 3], ['a', 'b', 'c']))`, will return `{1: 'a', 2: 'b', 3: 'c'}`.\n",
59 | "- Apply the dictionary to your column. You should get back a column of RGB values (a triplet of red, green and blue defining a color)."
60 | ]
61 | },
62 | {
63 | "cell_type": "markdown",
64 | "metadata": {},
65 | "source": [
66 | "## Extra Challenge: Crosstab Heatmap\n",
67 | "\n",
68 | "This one is extremely hard. See the following figure: https://github.com/muhark/dpir-intro-python/blob/master/Week4/extra_challenge.png\n",
69 | "\n",
70 | "Do your best to create a similar figure. You will need the `pd.crosstab` function, and the `sns.heatmap` function.\n",
71 | "\n",
72 | "(Solution is in the github, but give it a go!)"
73 | ]
74 | }
75 | ],
76 | "metadata": {
77 | "kernelspec": {
78 | "display_name": "teaching",
79 | "language": "python",
80 | "name": "teaching"
81 | },
82 | "language_info": {
83 | "codemirror_mode": {
84 | "name": "ipython",
85 | "version": 3
86 | },
87 | "file_extension": ".py",
88 | "mimetype": "text/x-python",
89 | "name": "python",
90 | "nbconvert_exporter": "python",
91 | "pygments_lexer": "ipython3",
92 | "version": "3.7.9"
93 | }
94 | },
95 | "nbformat": 4,
96 | "nbformat_minor": 4
97 | }
98 |
--------------------------------------------------------------------------------
/Week4/extra_challenge.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week4/extra_challenge.png
--------------------------------------------------------------------------------
/Week4/figures.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import matplotlib.pyplot as plt
4 | import matplotlib
5 | import seaborn as sns
6 |
7 | # sns.set_style("darkgrid")
8 |
9 | # Figure 1: Summary Statistics vs DistPlot
10 |
11 | x1 = list(np.random.normal(0, 1, 150))+list(np.random.normal(8, 1, 150))
12 | x2 = np.random.normal(4, 4, 300)
13 | x3 = np.random.multinomial(1, [1/6]*6, size=50).ravel()
14 | data = pd.DataFrame({
15 | "x1": x1,
16 | "x2": x2,
17 | "x3": x3
18 | })
19 | data.describe().to_html()
20 | df = data.melt()
21 |
22 | f, ax = plt.subplots(1, 1, figsize=(8, 4))
23 | sns.distplot(data['x1'], label='x1')
24 | sns.distplot(data['x2'], label='x2')
25 | ax.legend()
26 | f.savefig("figures/lecture_fig1.png", bbox_inches="tight")
27 |
28 | bes_df = pd.read_feather("../Week2/data/bes_data_subset_week2.feather")
29 |
30 | # Anatomy of a Figure
31 |
32 | matplotlib.rcdefaults()
33 | f = plt.figure(figsize=(8, 4))
34 |
35 | matplotlib.rc('axes', edgecolor='r', lw=5)
36 | f, ax = plt.subplots(1, 1, figsize=(8, 4))
37 | f.suptitle("This is a figure with a subplot")
38 | ax.set_title("This is a subplot", color="r")
39 | f.savefig("figures/lecture_emptysubplot1.png", bbox_inches="tight")
40 | matplotlib.rcdefaults()
41 |
42 |
43 | #matplotlib.rc('axes', edgecolor='r', lw=5)
44 | f, ax = plt.subplots(1, 2, figsize=(8, 4))
45 | f.suptitle("This is a figure with two subplots")
46 | ax[0].set_title("This is a subplot", color="r")
47 | ax[1].set_title("This is another subplot", color="r")
48 | f.savefig("figures/lecture_emptysubplot2.png", bbox_inches="tight")
49 | matplotlib.rcdefaults()
50 |
51 | #matplotlib.rc('axes', edgecolor='r', lw=5)
52 | f, ax = plt.subplots(2, 2, figsize=(8, 4))
53 | f.suptitle("This is a figure with four subplots")
54 | for i in range(2):
55 |     for j in range(2):
56 |         ax[i][j].set_title(f"Subplot [{i}][{j}]", color="r")
57 | f.savefig("figures/lecture_emptysubplot3.png", bbox_inches="tight")
58 | matplotlib.rcdefaults()
59 |
60 |
61 |
62 | f, ax = plt.subplots(1, 1, figsize=(8, 4))
63 | ax.scatter(data['x2'], data['x1'], color='r')
64 | # data['x3'].apply(lambda x: dict(zip(range(1, 6), sns.color_palette(n_colors=5)))[x])
65 | f.savefig("figures/lecture_scatter1.png", bbox_inches="tight")
66 |
67 | f, ax = plt.subplots(1, 1, figsize=(8, 4))
68 | ax.plot(np.linspace(0, 10, 100), np.linspace(0, 5, 100), color='r')
69 | f.savefig("figures/lecture_line1.png", bbox_inches="tight")
70 |
71 | f, ax = plt.subplots(1, 1, figsize=(8, 4))
72 | ax.scatter(data['x2'], data['x1'], color='r', s=3)
73 | ax.plot(np.linspace(-10, 20, 150), np.linspace(-3.5, 4, 150)**2)
74 | ax.axhline(0, color='k', alpha=0.5, ls="--")
75 | ax.axvline(0, color='k', alpha=0.5, ls="--")
76 | f.savefig("figures/lecture_linescatter1.png", bbox_inches="tight")
77 |
78 | f, ax = plt.subplots(1, 1, figsize=(8, 4))
79 | ax.scatter(data['x2'], data['x1'], color='g', s=3)
80 | ax.plot(np.linspace(-10, 20, 150), np.linspace(-3.5, 4, 150)**2)
81 | ax.axhline(0, color='k', alpha=0.5, ls="--")
82 | ax.axvline(0, color='k', alpha=0.5, ls="--")
83 | ax.xaxis.set_label_text("X-Axis Label", color='r')
84 | ax.yaxis.set_label_text("Y-Axis Label", color='r')
85 | f.savefig("figures/lecture_linescatter2.png", bbox_inches="tight")
86 |
87 | f, ax = plt.subplots(1, 1, figsize=(8, 3.5))
88 | ax.scatter(data['x2'], data['x1'], color='g', s=3)
89 | ax.plot(np.linspace(-10, 20, 150), np.linspace(-3.5, 4, 150)**2)
90 | ax.axhline(0, color='k', alpha=0.5, ls="--")
91 | ax.axvline(0, color='k', alpha=0.5, ls="--")
92 | ax.xaxis.set_label_text("X-Axis Label")
93 | ax.yaxis.set_label_text("Y-Axis Label")
94 | ax.xaxis.set_ticks(range(-10, 40, 10))
95 | ax.yaxis.set_ticks(range(-4, 25, 2))
96 | f.savefig("figures/lecture_linescatter3.png", bbox_inches="tight")
97 |
98 | f, ax = plt.subplots(1, 1, figsize=(8, 3.5))
99 | ax.scatter(data['x2'], data['x1'], color='g', s=3)
100 | ax.plot(np.linspace(-10, 20, 150), np.linspace(-3.5, 4, 150)**2)
101 | ax.axhline(0, color='k', alpha=0.5, ls="--")
102 | ax.axvline(0, color='k', alpha=0.5, ls="--")
103 | ax.xaxis.set_label_text("X-Axis Label")
104 | ax.yaxis.set_label_text("Y-Axis Label")
105 | ax.xaxis.set_major_locator(matplotlib.ticker.MultipleLocator(base=3))
106 | ax.yaxis.set_major_locator(matplotlib.ticker.MultipleLocator(base=2))
107 | f.savefig("figures/lecture_linescatter4.png", bbox_inches="tight")
108 |
109 |
110 | f, ax = plt.subplots(1, 1, figsize=(8, 4))
111 | sns.boxenplot(bes_df['region'], bes_df['Age'], ax=ax)
112 | ax.xaxis.set_ticklabels(ax.xaxis.get_ticklabels(), rotation=30)
113 | f.savefig("figures/lecture_rotated_labels1.png", bbox_inches="tight")
114 |
115 | # Gallery
116 |
117 | sns.set_style('darkgrid')
118 |
119 | # Histogram (One Category)
120 |
121 | f, ax = plt.subplots(1, 1, figsize=(8, 4))
122 | sns.distplot(bes_df['Age'].dropna(), kde=False, ax=ax)
123 | ax.set_title("Age Distribution of BES Respondents")
124 | f.savefig("figures/lecture_hist1.png")
125 |
126 | # Histogram (Two Categories)
127 |
128 | f, ax = plt.subplots(1, 1, figsize=(8, 6))
129 | sns.distplot(bes_df[bes_df['y09'] == 'Male']
130 |              ['Age'].dropna(), kde=False, label='Male')
131 | sns.distplot(bes_df[bes_df['y09'] == 'Female']
132 |              ['Age'].dropna(), kde=False, label='Female')
133 | ax.legend()
134 | ax.set_title("Age Distribution of BES by Gender")
135 | f.savefig("figures/lecture_hist2.png", bbox_inches="tight")
136 |
137 | # Box and Whisker Plot
138 |
139 | f, ax = plt.subplots(1, 1, figsize=(8, 4))
140 | sns.boxplot(bes_df['Age'].dropna(), bes_df['region'], ax=ax)
141 | ax.set_title("BES Age Distribution by Region")
142 | f.savefig("figures/lecture_box1.png", bbox_inches="tight")
143 |
144 | # Swarm Plot (One Category)
145 |
146 | f, ax = plt.subplots(1, 1, figsize=(8, 4))
147 | sns.swarmplot(bes_df['Age'], ax=ax)
148 | ax.set_title("BES Age Swarm Plot")
149 | f.savefig("figures/lecture_swarm1.png", bbox_inches="tight")
150 |
151 |
152 | # Swarm Plot (Multiple Categories)
153 |
154 | f, ax = plt.subplots(1, 1, figsize=(8, 4))
155 | sns.swarmplot(bes_df['Age'], bes_df['y01'], ax=ax)
156 | ax.set_title("BES Age Swarm Plot by Income Group")
157 | f.savefig("figures/lecture_swarm2.png", bbox_inches="tight")
158 |
159 |
160 | # Violin Plot (One Category)
161 |
162 | f, ax = plt.subplots(1, 1, figsize=(8, 4))
163 | sns.violinplot(bes_df['Age'], ax=ax)
164 | ax.set_title("BES Age Violin Plot")
165 | f.savefig("figures/lecture_violin1.png", bbox_inches="tight")
166 |
167 | # Heatmap
168 |
169 | f, ax = plt.subplots(1, 1, figsize=(15, 8))
170 | sns.heatmap(pd.crosstab(bes_df['y01'], bes_df['region']), cmap="RdBu_r", ax=ax)
171 | f.savefig("figures/lecture_heatmap1.png", bbox_inches="tight")
172 | # This is an issue with this particular version of matplotlib
173 |
174 | # Not implemented in beamer?
175 | data.describe().to_latex("latex_table.tex")
176 |
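177 | # A rounded variant of the export above (float_format is a standard
178 | # pandas to_latex argument; "%.3f" here is just an illustrative choice):
179 | # data.describe().to_latex("latex_table.tex", float_format="%.3f")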
--------------------------------------------------------------------------------
/Week4/figures/lecture_box1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week4/figures/lecture_box1.png
--------------------------------------------------------------------------------
/Week4/figures/lecture_emptysubplot1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week4/figures/lecture_emptysubplot1.png
--------------------------------------------------------------------------------
/Week4/figures/lecture_emptysubplot2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week4/figures/lecture_emptysubplot2.png
--------------------------------------------------------------------------------
/Week4/figures/lecture_emptysubplot3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week4/figures/lecture_emptysubplot3.png
--------------------------------------------------------------------------------
/Week4/figures/lecture_fig1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week4/figures/lecture_fig1.png
--------------------------------------------------------------------------------
/Week4/figures/lecture_heatmap1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week4/figures/lecture_heatmap1.png
--------------------------------------------------------------------------------
/Week4/figures/lecture_hist1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week4/figures/lecture_hist1.png
--------------------------------------------------------------------------------
/Week4/figures/lecture_hist2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week4/figures/lecture_hist2.png
--------------------------------------------------------------------------------
/Week4/figures/lecture_line1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week4/figures/lecture_line1.png
--------------------------------------------------------------------------------
/Week4/figures/lecture_linescatter1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week4/figures/lecture_linescatter1.png
--------------------------------------------------------------------------------
/Week4/figures/lecture_linescatter2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week4/figures/lecture_linescatter2.png
--------------------------------------------------------------------------------
/Week4/figures/lecture_linescatter3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week4/figures/lecture_linescatter3.png
--------------------------------------------------------------------------------
/Week4/figures/lecture_linescatter4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week4/figures/lecture_linescatter4.png
--------------------------------------------------------------------------------
/Week4/figures/lecture_rotated_labels1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week4/figures/lecture_rotated_labels1.png
--------------------------------------------------------------------------------
/Week4/figures/lecture_scatter1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week4/figures/lecture_scatter1.png
--------------------------------------------------------------------------------
/Week4/figures/lecture_swarm1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week4/figures/lecture_swarm1.png
--------------------------------------------------------------------------------
/Week4/figures/lecture_swarm2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week4/figures/lecture_swarm2.png
--------------------------------------------------------------------------------
/Week4/figures/lecture_violin1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week4/figures/lecture_violin1.png
--------------------------------------------------------------------------------
/Week4/latex_table.tex:
--------------------------------------------------------------------------------
1 | \begin{tabular}{lrrr}
2 | \toprule
3 | {} & x1 & x2 & x3 \\
4 | \midrule
5 | count & 300.000000 & 300.000000 & 300.000000 \\
6 | mean & 3.986588 & 4.256635 & 0.166667 \\
7 | std & 4.070890 & 4.146357 & 0.373301 \\
8 | min & -2.529208 & -9.047706 & 0.000000 \\
9 | 25\% & 0.010607 & 1.759387 & 0.000000 \\
10 | 50\% & 3.960120 & 4.333940 & 0.000000 \\
11 | 75\% & 7.860896 & 6.635001 & 0.000000 \\
12 | max & 10.472298 & 18.649113 & 1.000000 \\
13 | \bottomrule
14 | \end{tabular}
15 |
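16 | % Generated by `data.describe().to_latex("latex_table.tex")` in Week4/figures.py.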
--------------------------------------------------------------------------------
/Week4/lecture.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week4/lecture.pdf
--------------------------------------------------------------------------------
/Week4/planning.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: Week 4 Planning
3 | ---
4 |
5 | # Data Visualisation
6 |
7 | ## Teaching Aims
8 |
9 | - Methodological: systematising how we summarise and convey information graphically.
10 | - Key Questions:
11 | - How many variables (dimensions)?
12 | - What type of variables? Discrete vs. Continuous, Ordered?
13 | - What kind of comparison?
14 | - Implementation: Understanding the implicit model behind graphing software like matplotlib (and seaborn)
15 | - Figures
16 | - Title
17 | - Spacing
18 | - Axes (subplots)
19 | - Title
20 | - Graphical Objects
21 | - Lines
22 | - Dots
23 | - Shapes
24 | - Text
25 | - axes (xaxis and yaxis)
26 | - Ticks
27 | - Tick intervals
28 | - Tick labels
29 | - Label
30 |
31 |
32 | ## Methodological Aspect
33 |
34 | - Figures, by and large, serve a similar function to statistics. They convey a great deal of relevant information about a dataset without requiring one to inspect the individual values.
35 | - For instance, a histogram or KDE plot says a lot more about the shape of a distribution than most statistics can.
36 | - It's easier to understand the functional shape of a time trend by plotting it than just eyeballing the numbers.
37 | - Also, they look pretty.
38 | - Data visualisation is useful at two steps in the data analysis process:
39 | - Exploratory Analysis: here, being able to quickly construct a graph that shows what you need is key.
40 | - Presentation of Results: knowing a lot about how to customise a plot to exactly match your requirements matters more here.
41 | - There are dozens of types of figures:
42 | - Distribution:
43 | - Histogram, KDE, rugplot, swarmplot, violinplot, box-and-whiskers
44 | - Unorderable Frequencies:
45 | - Bar, Grouped bars
46 | - Use Cases:
47 | - 1 dimensional:
48 | - Orderable:
49 | - Histogram, kernel density estimate
50 | - Unorderable:
51 | - Pie chart (if proportions), bar chart (frequencies)
52 | - 2 dimensional:
53 | - Orderable * Unorderable:
54 | -
55 |
56 |
57 | ## The Anatomy of a Figure
58 |
59 | - The figure
60 | - The subplots (axes)
61 | - The axes (labels, ticks, etc)
62 | - The graphical elements
63 |
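64 | A minimal sketch of this anatomy (assuming only a standard matplotlib import, mirroring the lecture figures; the labels and data are illustrative):
65 |
66 | ```python
67 | import matplotlib.pyplot as plt
68 |
69 | f, ax = plt.subplots(1, 1, figsize=(6, 4))  # Figure containing one Axes (subplot)
70 | f.suptitle("Figure-level title")            # Title and spacing belong to the Figure
71 | ax.set_title("Subplot title")               # Each Axes has its own title
72 | ax.plot([0, 1, 2], [0, 1, 4])               # A graphical object (line)
73 | ax.xaxis.set_label_text("x label")          # Axis label on the xaxis object
74 | ax.yaxis.set_ticks(range(0, 5, 2))          # Tick locations on the yaxis object
75 | ```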
--------------------------------------------------------------------------------
/Week5/examples_student.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "slideshow": {
7 | "slide_type": "subslide"
8 | }
9 | },
10 | "source": [
11 | "# Coding Tutorial 5: Unsupervised Learning\n",
12 | "\n",
13 | "In this coding tutorial, we learn how to do the following for `k-means` clustering and principal components analysis:\n",
14 | "\n",
15 | "- Import models from `scikit-learn`\n",
16 | "- Prepare a pandas dataframe for analysis with `scikit-learn`\n",
17 | "- Instantiate and fit a model to data\n",
18 | "- Visualise the results of the model"
19 | ]
20 | },
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {
24 | "slideshow": {
25 | "slide_type": "subslide"
26 | }
27 | },
28 | "source": [
29 | "# Importing Models from Scikit-Learn\n",
30 | "\n",
31 | "`scikit-learn` is actually a collection of modules, so you will need to find which sub-module contains the model you want to use."
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": null,
37 | "metadata": {
38 | "slideshow": {
39 | "slide_type": "subslide"
40 | }
41 | },
42 | "outputs": [],
43 | "source": [
44 | "# standard imports\n",
45 | "import pandas as pd\n",
46 | "import numpy as np\n",
47 | "import matplotlib.pyplot as plt\n",
48 | "import seaborn as sns\n",
49 | "\n",
50 | "# scikit-learn imports\n",
51 | "from sklearn.preprocessing import StandardScaler\n",
52 | "from sklearn.cluster import KMeans\n",
53 | "from sklearn.decomposition import PCA"
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": null,
59 | "metadata": {
60 | "slideshow": {
61 | "slide_type": "subslide"
62 | }
63 | },
64 | "outputs": [],
65 | "source": [
66 | "# import the data\n",
67 | "link = 'http://github.com/muhark/dpir-intro-python/raw/master/Week2/data/bes_data_subset_week2.feather'\n",
68 | "df = pd.read_feather(link)"
69 | ]
70 | },
71 | {
72 | "cell_type": "markdown",
73 | "metadata": {
74 | "slideshow": {
75 | "slide_type": "subslide"
76 | }
77 | },
78 | "source": [
79 | "# Data Pre-Processing\n",
80 | "\n",
81 | "There are four steps for preparing data for analysis:\n",
82 | "\n",
83 | "1. Feature Selection\n",
84 | "2. Accounting for NAs\n",
85 | "3. One Hot Encoding\n",
86 | "4. Conversion to numpy ndarray"
87 | ]
88 | },
89 | {
90 | "cell_type": "markdown",
91 | "metadata": {
92 | "slideshow": {
93 | "slide_type": "subslide"
94 | }
95 | },
96 | "source": [
97 | "## Feature Selection\n",
98 | "\n",
99 | "Here we just choose which columns we are going to use. If your data has a lot of NAs, it may be worthwhile to prefer columns with fewer NAs."
100 | ]
101 | },
102 | {
103 | "cell_type": "code",
104 | "execution_count": null,
105 | "metadata": {
106 | "slideshow": {
107 | "slide_type": "fragment"
108 | }
109 | },
110 | "outputs": [],
111 | "source": [
112 | "features = ['region', 'Age', 'a02', 'a03', 'e01',\n",
113 | " 'k01', 'k02', 'k11', 'k13', 'k06', 'k08',\n",
114 | " 'y01', 'y03', 'y06', 'y08', 'y09', 'y11', 'y17']"
115 | ]
116 | },
117 | {
118 | "cell_type": "markdown",
119 | "metadata": {
120 | "slideshow": {
121 | "slide_type": "subslide"
122 | }
123 | },
124 | "source": [
125 | "## Accounting for NAs"
126 | ]
127 | },
128 | {
129 | "cell_type": "code",
130 | "execution_count": null,
131 | "metadata": {
132 | "slideshow": {
133 | "slide_type": "fragment"
134 | }
135 | },
136 | "outputs": [],
137 | "source": [
138 | "# Can check for na's with:\n",
139 | "# df[features].isna().sum()\n",
140 | "df = df[features].dropna()"
141 | ]
142 | },
143 | {
144 | "cell_type": "markdown",
145 | "metadata": {
146 | "slideshow": {
147 | "slide_type": "subslide"
148 | }
149 | },
150 | "source": [
151 | "## One-Hot Encoding\n",
152 | "\n",
153 | "We can do a one-hot encoding using the `pd.get_dummies()` function."
154 | ]
155 | },
156 | {
157 | "cell_type": "code",
158 | "execution_count": null,
159 | "metadata": {
160 | "slideshow": {
161 | "slide_type": "fragment"
162 | }
163 | },
164 | "outputs": [],
165 | "source": [
166 | "data = pd.get_dummies(df)\n",
167 | "print(df.shape, data.shape)"
168 | ]
169 | },
170 | {
171 | "cell_type": "markdown",
172 | "metadata": {
173 | "slideshow": {
174 | "slide_type": "subslide"
175 | }
176 | },
177 | "source": [
178 | "## Normalization and Conversion to `numpy`\n",
179 | "\n",
180 | "We call the `StandardScaler().fit_transform()` function on the `.values` argument of the dataframe"
181 | ]
182 | },
183 | {
184 | "cell_type": "code",
185 | "execution_count": null,
186 | "metadata": {
187 | "slideshow": {
188 | "slide_type": "fragment"
189 | }
190 | },
191 | "outputs": [],
192 | "source": [
193 | "X = data.values\n",
194 | "scaler = StandardScaler()\n",
195 | "X_norm = scaler.fit_transform(X)"
196 | ]
197 | },
198 | {
199 | "cell_type": "markdown",
200 | "metadata": {
201 | "slideshow": {
202 | "slide_type": "subslide"
203 | }
204 | },
205 | "source": [
206 | "# Instantiating and Fitting `k-means`\n",
207 | "\n",
208 | "We first create an instance of the model, where we provide parameters, and then we pass data to it."
209 | ]
210 | },
211 | {
212 | "cell_type": "code",
213 | "execution_count": null,
214 | "metadata": {
215 | "slideshow": {
216 | "slide_type": "fragment"
217 | }
218 | },
219 | "outputs": [],
220 | "source": [
221 | "kmeans = KMeans(n_clusters=5, random_state=634)"
222 | ]
223 | },
224 | {
225 | "cell_type": "code",
226 | "execution_count": null,
227 | "metadata": {
228 | "slideshow": {
229 | "slide_type": "fragment"
230 | }
231 | },
232 | "outputs": [],
233 | "source": [
234 | "kmeans.fit(X_norm)"
235 | ]
236 | },
237 | {
238 | "cell_type": "markdown",
239 | "metadata": {
240 | "slideshow": {
241 | "slide_type": "subslide"
242 | }
243 | },
244 | "source": [
245 | "We can extract the labels using the `.labels_` method, and then assign them to a column."
246 | ]
247 | },
248 | {
249 | "cell_type": "code",
250 | "execution_count": null,
251 | "metadata": {
252 | "slideshow": {
253 | "slide_type": "subslide"
254 | }
255 | },
256 | "outputs": [],
257 | "source": [
258 | "df['labels_'] = kmeans.labels_\n",
259 | "df['labels_'] = df['labels_'].astype(str)"
260 | ]
261 | },
262 | {
263 | "cell_type": "markdown",
264 | "metadata": {
265 | "slideshow": {
266 | "slide_type": "subslide"
267 | }
268 | },
269 | "source": [
270 | "# Visualising the Results\n",
271 | "\n",
272 | "This is a bit difficult with so many variables. Let's look at age."
273 | ]
274 | },
275 | {
276 | "cell_type": "code",
277 | "execution_count": null,
278 | "metadata": {
279 | "slideshow": {
280 | "slide_type": "subslide"
281 | }
282 | },
283 | "outputs": [],
284 | "source": [
285 | "f, ax = plt.subplots(1, 1, figsize=(15, 8))\n",
286 | "sns.histplot(df[['labels_', 'Age']].sort_values('labels_'),\n",
287 | " x='Age', ax=ax, kde=True, hue='labels_');"
288 | ]
289 | },
290 | {
291 | "cell_type": "code",
292 | "execution_count": null,
293 | "metadata": {
294 | "slideshow": {
295 | "slide_type": "subslide"
296 | }
297 | },
298 | "outputs": [],
299 | "source": [
300 | "# We can appropriate this function\n",
301 | "def grouped_barplot(data, var1, var2):\n",
302 | " \"\"\"\n",
303 | " Creates a grouped bar plot of the distribution of `var2` within each group of `var2`.\n",
304 | " \"\"\"\n",
305 | " temp = data.groupby([var1, var2]).apply(len).reset_index().rename({0: 'Count'}, axis=1)\n",
306 | " f, ax = plt.subplots(1, 1, figsize=(len(data[var1].unique())*len(data[var1].unique())/5, 10))\n",
307 | " sns.barplot(data=temp, x=var1, y='Count', hue=var2)\n",
308 | " ax.set_title(f\"BES Sample {var2} per {var1}\")\n",
309 | " ax.xaxis.set_ticklabels(ax.xaxis.get_ticklabels(), rotation=30)"
310 | ]
311 | },
312 | {
313 | "cell_type": "code",
314 | "execution_count": null,
315 | "metadata": {
316 | "slideshow": {
317 | "slide_type": "subslide"
318 | }
319 | },
320 | "outputs": [],
321 | "source": [
322 | "grouped_barplot(df, 'a02','labels_') "
323 | ]
324 | },
325 | {
326 | "cell_type": "code",
327 | "execution_count": null,
328 | "metadata": {
329 | "slideshow": {
330 | "slide_type": "subslide"
331 | }
332 | },
333 | "outputs": [],
334 | "source": [
335 | "grouped_barplot(df, 'region','labels_')"
336 | ]
337 | },
338 | {
339 | "cell_type": "markdown",
340 | "metadata": {
341 | "slideshow": {
342 | "slide_type": "subslide"
343 | }
344 | },
345 | "source": [
346 | "## Instantiating and Fitting PCA"
347 | ]
348 | },
349 | {
350 | "cell_type": "code",
351 | "execution_count": null,
352 | "metadata": {
353 | "slideshow": {
354 | "slide_type": "fragment"
355 | }
356 | },
357 | "outputs": [],
358 | "source": [
359 | "pca = PCA(n_components=2, random_state=634)\n",
360 | "pca = pca.fit(X_norm)\n",
361 | "reduced_X = pca.fit_transform(X_norm)"
362 | ]
363 | },
364 | {
365 | "cell_type": "code",
366 | "execution_count": null,
367 | "metadata": {
368 | "slideshow": {
369 | "slide_type": "fragment"
370 | }
371 | },
372 | "outputs": [],
373 | "source": [
374 | "sns.scatterplot(x=reduced_X[:, 0], y=reduced_X[:, 1]);"
375 | ]
376 | },
377 | {
378 | "cell_type": "markdown",
379 | "metadata": {
380 | "slideshow": {
381 | "slide_type": "subslide"
382 | }
383 | },
384 | "source": [
385 | "## Combining PCA and `k-means`\n",
386 | "\n",
387 | "We can fit k-means to PCA-reduced data:"
388 | ]
389 | },
390 | {
391 | "cell_type": "code",
392 | "execution_count": null,
393 | "metadata": {
394 | "slideshow": {
395 | "slide_type": "subslide"
396 | }
397 | },
398 | "outputs": [],
399 | "source": [
400 | "pcakmeans = KMeans(n_clusters=5, random_state=634)\n",
401 | "pcakmeans.fit(reduced_X)\n",
402 | "df['pcakmeans_labels'] = pcakmeans.labels_"
403 | ]
404 | },
405 | {
406 | "cell_type": "code",
407 | "execution_count": null,
408 | "metadata": {
409 | "slideshow": {
410 | "slide_type": "subslide"
411 | }
412 | },
413 | "outputs": [],
414 | "source": [
415 | "sns.set_style('darkgrid')\n",
416 | "f, ax = plt.subplots(1, 1, figsize=(15, 8))\n",
417 | "sns.scatterplot(x=reduced_X[:, 0], y=reduced_X[:, 1],\n",
418 | " hue=pcakmeans.labels_,\n",
419 | " palette=sns.color_palette(palette='colorblind', n_colors=5));"
420 | ]
421 | },
422 | {
423 | "cell_type": "code",
424 | "execution_count": null,
425 | "metadata": {
426 | "slideshow": {
427 | "slide_type": "subslide"
428 | }
429 | },
430 | "outputs": [],
431 | "source": [
432 | "grouped_barplot(df, 'a02', 'pcakmeans_labels')"
433 | ]
434 | },
435 | {
436 | "cell_type": "code",
437 | "execution_count": null,
438 | "metadata": {},
439 | "outputs": [],
440 | "source": [
441 | "pd.DataFrame(pca.components_, columns=data.columns)"
442 | ]
443 | }
444 | ],
445 | "metadata": {
446 | "kernelspec": {
447 | "display_name": "Python 3",
448 | "language": "python",
449 | "name": "python3"
450 | },
451 | "language_info": {
452 | "codemirror_mode": {
453 | "name": "ipython",
454 | "version": 3
455 | },
456 | "file_extension": ".py",
457 | "mimetype": "text/x-python",
458 | "name": "python",
459 | "nbconvert_exporter": "python",
460 | "pygments_lexer": "ipython3",
461 | "version": "3.7.9"
462 | }
463 | },
464 | "nbformat": 4,
465 | "nbformat_minor": 4
466 | }
467 |
--------------------------------------------------------------------------------
/Week5/exercises.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Coding Exercises Week 5: Unsupervised Learning\n",
8 | "\n",
9 | "The task for this week is a little bit different. Using the code in the examples as a guide, do the following:\n",
10 | "\n",
11 | "- Fit a kmeans, pca, and pca+kmeans model for a selection of the variables from the BES data.\n",
12 | "- Visualise the clusterings.\n",
13 | "- Interpret the principal components. To do this, you will need to inspect how each component \"weights\" each of the features. The higher the weight, the proportion of the variance of the component is being derived from this weight.\n",
14 | "- Interpret the clusters: what kind of people are assigned to each cluster? Does this make sense? How does this match up to your understanding of the division within British society?"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "## Bonus Challenge\n",
22 | "\n",
23 | "As mentioned, `sklearn.cluster` contains many other clustering algorithms. Read through the documentation at https://scikit-learn.org/stable/modules/clustering.html and choose one that you think you understand best (Agglomerative Clustering is the most intuitive, IMHO). Cluster the BES data with this algorithm, and compare results between the models."
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": null,
29 | "metadata": {},
30 | "outputs": [],
31 | "source": []
32 | }
33 | ],
34 | "metadata": {
35 | "kernelspec": {
36 | "display_name": "teaching",
37 | "language": "python",
38 | "name": "teaching"
39 | },
40 | "language_info": {
41 | "codemirror_mode": {
42 | "name": "ipython",
43 | "version": 3
44 | },
45 | "file_extension": ".py",
46 | "mimetype": "text/x-python",
47 | "name": "python",
48 | "nbconvert_exporter": "python",
49 | "pygments_lexer": "ipython3",
50 | "version": "3.7.9"
51 | }
52 | },
53 | "nbformat": 4,
54 | "nbformat_minor": 4
55 | }
56 |
--------------------------------------------------------------------------------
/Week5/lecture.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week5/lecture.pdf
--------------------------------------------------------------------------------
/Week5/local_plot_utils.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import re
3 | import seaborn as sns
4 | import matplotlib
5 | import matplotlib.pyplot as plt
6 |
7 | matplotlib.rcParams["text.usetex"] = True
8 |
9 |
10 | def crosstab_heatmap(df, col1, col2):
11 | # Produce crosstab
12 | xtab = pd.crosstab(df[col1], df[col2])
13 | # Right-column should show total number of answers
14 | if isinstance(xtab.columns, pd.CategoricalIndex):
15 | xtab.columns = xtab.columns.add_categories("Total")
16 | if isinstance(xtab.index, pd.CategoricalIndex):
17 | xtab.index = xtab.index.add_categories("Avg Prop")
18 | xtab.loc[:, "Total"] = xtab.sum(axis=1)
19 | xtab = xtab.loc[xtab['Total'] >= xtab.shape[1]*5]
20 | props = xtab.iloc[:, :-1].div(xtab['Total'], axis=0)
21 | props.loc['Avg Prop', :] = props.mean(axis=0)
22 | xtab.loc['Avg Prop', :] = ""
23 | props.loc[:, 'Total'] = ""
24 |
25 | annot = xtab.applymap(
26 | lambda x: "{:.4g}\n".format(int(x)) if isinstance(x, float) else ""
27 | ).add(
28 | props.applymap(
29 | lambda x: "({:.3g}\%)".format(
30 | 100*x) if isinstance(x, float) else ""
31 | ))
32 |
33 | annot.loc['Avg Prop', 'Total'] = "{:.5g}".format(xtab.iloc[:-1, -1].sum())
34 | annot.loc['Avg Prop', :] = annot.loc['Avg Prop',
35 | :].str.replace(re.compile(r"[\(\)]"), "")
36 | annot.loc[:, 'Total'] = annot['Total'].str.replace("\n", "")
37 | diffs = props.iloc[:, :-1] - props.iloc[-1, :-1]
38 | diffs['Total'] = float(0)
39 | diffs = diffs.applymap(lambda x: 100*x)
40 |
41 | f, ax = plt.subplots(1, 1, figsize=annot.shape)
42 | sns.heatmap(
43 | diffs.T,
44 | annot=annot.T,
45 | fmt="s",
46 | ax=ax,
47 | cmap="RdBu_r",
48 | center=0,
49 | cbar_kws={"label": "Difference from Avg. Proportion"},
50 | )
51 | ax.xaxis.tick_top()
52 | ax.xaxis.set_label_position("top")
53 | ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="left")
54 | ax.set_yticklabels(ax.get_yticklabels())
55 | f.axes[1].set_yticklabels(
56 | [lab.get_text() + "\%" for lab in f.axes[1].get_yticklabels()]
57 | )
58 | ax.axvline(diffs.shape[0] - 1, color="k")
59 | ax.axhline(diffs.shape[1] - 1, color="k")
60 | ax.set_xlabel("")
61 | ax.set_ylabel("")
62 | ax.xaxis.set_label_position("bottom")
63 | return f
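64 |
65 |
66 | # Example usage (a sketch: assumes a BES dataframe with the 'y01' and 'region'
67 | # columns used elsewhere in the course has already been loaded as `bes_df`):
68 | # fig = crosstab_heatmap(bes_df, 'y01', 'region')
69 | # fig.savefig("y01_by_region_heatmap.png", bbox_inches="tight")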
--------------------------------------------------------------------------------
/Week5/planning.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Week 5 Planning: Machine Learning"
3 | ---
4 |
5 | # Unsupervised Methods
6 |
7 | ## Introduction
8 |
9 | - Need a definition for ML. My understanding is that it's a collection of numeric/algorithmic methods.
10 | - First lesson looks at unsupervised methods (only X, no Y)
11 | - We focus specifically on two use cases:
12 | - Clustering
13 | - Dimensionality reduction
14 |
15 | ## Clustering
16 |
17 | - _Motivation_: what are some things that we may wish to cluster in political science?
18 | - Sometimes, we may have an intuition that certain groups exist, and are seeking to discover them. Other times, we have no a priori expectation of groupings, and are exploring how the data can cluster.
19 | - _Method_: Given a $j$-dimensional space, and matrix $X_{ij}$ of $i$ length-$j$ vectors, assign each vector $x_i$ to a cluster $k \in K$.
20 | - The usual intuition is that members of the same cluster will be similar, and members of different clusters will be dissimilar.
21 | - Question: What metrics of (dis)similarity exist?
22 | - Two examples: `k-means`, agglomerative pair-wise.
23 | - (_Implementation_): Review each algorithm, highlighting what makes one more efficient/scalable than the other.
24 | - Clustering diagnostic metrics.
25 | - Useful summary: https://www.cc.gatech.edu/~isbell/reading/papers/berkhin02survey.pdf
26 |
27 |
28 | ## Dimensionality Reduction
29 |
30 | - _Motivation_: when would you reduce dimensionality in political science?
31 | - You have too many variables in your model, and are seeking to drop some.
32 | - You are aiming to visualise/understand some high-dimensional space.
33 | - You are seeking to recover some latent dimensions within your data that are not captured by your existing variables.
34 | - _Method_: Given a matrix $X_{ij}$ of $i$ length-$j$ vectors, reduce $X_{ij}$ to $X_{ik}$ where $K \le J$.
35 | - In some variants, $\forall k \subset J$, in others, $\exists k \not\subset J$.
36 | - Again, parametric vs non-parametric methods.
37 |
--------------------------------------------------------------------------------
/Week6/exercises.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Coding Exercise Week 6\n",
8 | "\n",
9 | "My main challenge to you this week is to improve on the 0.5066162570888468 correct out-of-sample prediction rate the lecture example RF achieved.\n",
10 | "\n",
11 | "To ensure that you have the same train-test split, I've given you the beginning of the code. Your goal is to build a model that provides a better prediction of `y_test` using `X_test` as inputs than the one in the lecture.\n",
12 | "\n",
13 | "Some ideas for how you might improve beyond the model in the lecture:\n",
14 | "\n",
15 | "- Using a GridCVSearch instead of RandomCVSearch to further fine-tune the hyperparameters\n",
16 | "- Add additional features to the model (make sure you have the same splits!)\n",
17 | "- Using a different prediction algorithm, for example a Support Vector Machine or a Neural Network."
18 | ]
19 | },
20 | {
21 | "cell_type": "code",
22 | "execution_count": 1,
23 | "metadata": {},
24 | "outputs": [],
25 | "source": [
26 | "import pandas as pd\n",
27 | "import numpy as np\n",
28 | "\n",
29 | "from sklearn.model_selection import train_test_split\n",
30 | "\n",
31 | "# Set random seed\n",
32 | "np.random.seed(634)"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": 3,
38 | "metadata": {},
39 | "outputs": [
40 | {
41 | "name": "stdout",
42 | "output_type": "stream",
43 | "text": [
44 | "(1585, 128) (1585,)\n",
45 | "(529, 128) (529,)\n"
46 | ]
47 | }
48 | ],
49 | "source": [
50 | "link = 'http://github.com/muhark/dpir-intro-python/raw/master/Week2/data/bes_data_subset_week2.feather'\n",
51 | "df = pd.read_feather(link)\n",
52 | "# Refactoring e01: partisan self-id\n",
53 | "df.loc[:, 'e01'] = df['e01'].apply(\n",
54 | " lambda x: int(x.split(' ')[0]) if x[0] in ''.join(list(map(str, list(range(10))))) else None)\n",
55 | "# Let's predict 'a02' as a function of the rest\n",
56 | "features = ['region', 'Age', 'a03', 'e01', 'k01',\n",
57 | " 'k02', 'k11', 'k06', 'k08', 'y01',\n",
58 | " 'y03', 'y06', 'y08', 'y09', 'y11', 'y17']\n",
59 | "labels = 'a02'\n",
60 | "\n",
61 | "# Prep data\n",
62 | "df = df[features+[labels]].dropna()\n",
63 | "temp = pd.get_dummies(df[features])\n",
64 | "feature_names = temp.columns.tolist()\n",
65 | "X = temp.values\n",
66 | "y = df[labels].values\n",
67 | "\n",
68 | "# Train-test split\n",
69 | "X_train, X_test, y_train, y_test = train_test_split(\n",
70 | " X, y, test_size=0.25)\n",
71 | "\n",
72 | "print(X_train.shape, y_train.shape)\n",
73 | "print(X_test.shape, y_test.shape)"
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": null,
79 | "metadata": {},
80 | "outputs": [],
81 | "source": [
82 | "col_dict = {\n",
83 | " 'region': 'Region',\n",
84 | " 'Age': 'Age',\n",
85 | " 'a02': 'Which party is best able to handle this issue?',\n",
86 | " 'a03': 'How interested are you in politics?',\n",
87 | " 'e01': 'Left-Right Self-Placement',\n",
88 | " 'k01': 'Attention to Politics',\n",
89 | " 'k02': 'Reads politics news',\n",
90 | " 'k11': 'Contacted by canvasser',\n",
91 | " 'k06': 'Uses Twitter',\n",
92 | " 'k08': 'Uses Facebook',\n",
93 | " 'y01': 'Income bracket',\n",
94 | " 'y03': 'Housing type',\n",
95 | " 'y06': 'Religion',\n",
96 | " 'y08': 'Trade Union Membership',\n",
97 | " 'y09': 'Gender',\n",
98 | " 'y11': 'Ethnicity',\n",
99 | " 'y17': 'Employment type'\n",
100 | "}"
101 | ]
102 | }
103 | ],
104 | "metadata": {
105 | "kernelspec": {
106 | "display_name": "teaching",
107 | "language": "python",
108 | "name": "teaching"
109 | },
110 | "language_info": {
111 | "codemirror_mode": {
112 | "name": "ipython",
113 | "version": 3
114 | },
115 | "file_extension": ".py",
116 | "mimetype": "text/x-python",
117 | "name": "python",
118 | "nbconvert_exporter": "python",
119 | "pygments_lexer": "ipython3",
120 | "version": "3.7.9"
121 | }
122 | },
123 | "nbformat": 4,
124 | "nbformat_minor": 4
125 | }
126 |
--------------------------------------------------------------------------------
/Week6/lecture.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: Introduction to Python for Social Science
3 | subtitle: Lecture 6 - Machine Learning II
4 | author: Musashi Harukawa, DPIR
5 | date: 6th Week Hilary 2021
6 | ---
7 |
8 | # Recap
9 |
10 | ## Last Week
11 |
12 | - Unsupervised Machine Learning
13 | - Clustering with `k-means`
14 | - Dimensionality Reduction with `PCA`
15 |
16 | ## This Week
17 |
18 | - Supervised Machine Learning
19 | - In depth: Decision Trees
20 | - Ensemble Methods
21 | - Forests
22 | - Meta-Learners
23 | - Optimising Your Model
24 | - Cross Validation Methods
25 | - Hyperparameter Tuning
26 |
27 | # Supervised ML
28 |
29 | ## Supervised Learning: Use X to infer Y
30 |
31 | - Supervised Learning starts with a dataset containing both _features_ ($X$) and _labels_ ($y$).
32 | - They then construct a "rule" relating $X$ to $y$, so that given some combination of values for $X$, they can "predict" a value of $y$.
33 | - In other terms, supervised learning finds $f$ in $y=f(X)$.
34 | - If $y$ is discrete/categorical, then the task is called _classification_.
35 | - If $y$ is continuous, then the task is called _regression_.
36 |
37 |
38 | ## Supervised Learning Models
39 |
40 | Some general classes of supervised models include:
41 |
42 | - Linear Models
43 | - Support Vector Machines (SVMs)
44 | - Naive Bayes
45 | - Tree-based Estimators
46 | - (Supervised) Neural Networks
47 |
48 | ## Decision Trees
49 |
50 | Although radically distinct from linear estimators such as OLS, decision trees offer a simple and intuitive approach to estimating values of $y$ based on $X$.
51 |
52 | - If you have played the game twenty questions, then you should be familiar with the idea behind decision trees.
53 | - Constructs a series of binary questions (nodes) regarding your features, and eventually at the end of the resulting branches gives a prediction (leaf) of your label.
54 |
55 | ## Understanding Decision Trees
56 |
57 | A decision tree can be understood as a mapping from the multi-dimensional feature space, $X_{ij}$, to the label space $y_i$.
58 |
59 | - Each question partitions the $X_{ij}$-space.
60 | - Each leaf maps one of these partitions to a value (or range) in the $y$-space.
61 | - The algorithm sets some stopping criterion so that there are fewer leaves than observations.
62 |
63 | ## Impurity
64 |
65 | - Given that the algorithm knows the values of $y$:
66 | - Its goal is to split the $X$-space in such a way that each partition does not contain more than one distinct value of $y$.
67 | - In essence, it wants to split the $X$-space in way that increases the "purity" of each partition.
68 | - A partition containing more than one distinct value of $y$ will necessarily lead to at least one erroneous prediction.
69 | - There are various measures of impurity:
70 | - GINI: $H(X_m) = \sum_k p_{mk} (1 - p_{mk})$
71 | - Entropy: $H(X_m) = - \sum_k p_{mk} \log(p_{mk})$
72 |
73 | ## Visualising Trees
74 |
75 | 
76 |
77 | ## Tree Trade-offs
78 |
79 | Advantages:
80 |
81 | - Excels at capturing conditional dependencies
82 | - Arguably more intuitive than OLS.
83 | - Provides a metric of feature importance that has a substantive interpretation.
84 |
85 | ::: {.fragment}
86 | Disadvantages:
87 | :::
88 |
89 | - **Extremely** prone to _over-fitting_.
90 | - Does not provide a linear marginal effect estimate.
91 |
92 | ## Choosing Your Supervised Algorithm
93 |
94 | These are some of the criteria you may want to consider when choosing an algorithm:
95 |
96 | - _Prediction Accuracy_: Algorithms vary in their ability to predict unseen data. We will discuss this more during cross validation.
97 | - _Minimum Data_: Some models are able to do more with less. This is especially true if the model makes certain parametric assumptions about the nature or distribution of the data.
98 | - _Interpretability_: Not all methods provide insight into _how_ they formulate their predictions. Methods range from extremely intuitive, such as decision trees, to complete black boxes, such as neural networks. When seeking to _explain_ and not _predict_, one should take this into account.
99 |
100 | ## This brings me to...
101 |
102 | # Ensemble Methods
103 |
104 | ## Managing Shortcomings by Working Together
105 |
106 | - There is no single model or algorithm that performs best across all criteria in all scenarios.
107 | - Ensemble methods, which is really just a fancy way of saying "use more than one method", are often devised to address this issue.
108 | - I group ensemble methods into two types: _aggregating_ and _sequential_ ensembles.
109 | - _Aggregating Ensembles_ train on and estimate predicted values of the same data, and then use a meta-learner to aggregate these predictions.
110 | - _Sequential Ensembles_ use the output of one algorithm (often unsupervised) as features to train another. PCA+kmeans is an example of this.
111 |
112 | ## Aggregating Trees: Random Forests
113 |
114 | There are various algorithms that aggregate decision trees, but here I outline the logic behind the most straightforward and common one: Random Forests (RFs).
115 |
116 | - Construct $N$ decision trees.
117 | - For each split in each tree, randomly select a subset of features. This split can only be made over these features.
118 | - To predict, the same input array is passed to all the constituent trees, and the algorithm returns either the mean prediction (continuous data) or the modal prediction (categorical data).
119 | - A noteworthy improvement on this algorithm is Bayesian Additive Regression Trees (BART).
120 |
121 | ## Aggregating Learners: Meta-Learners
122 |
123 | A number of papers have been published recently that use ensemble methods to estimate heterogeneous treatment effects:
124 |
125 | - [Grimmer \& Westwood, _Political Analysis_ 2017](https://www.cambridge.org/core/journals/political-analysis/article/estimating-heterogeneous-treatment-effects-and-the-effects-of-heterogeneous-treatments-with-ensemble-methods/C7E3EA00D0AD83429CBE73F4F0C6652C)
126 | - [Kunzel et al, _PNAS_ 2019](https://arxiv.org/abs/1706.03461)
127 |
128 |
These papers both focus on innovating on the _meta-learner_.
129 | 130 | # Optimising Your Model 131 | 132 | ## Machine Learning is not just Algorithms 133 | 134 | - Another contribution of machine learning to econometrics, in my opinion, has been the development of strategies to test and evaluate models. 135 | - Epistemologically, machine learning frequently takes a more agnostic view on trying to find a specific functional specification of a theoretical model. 136 | - This means that the "correct" model is the one that does the best job of matching _empirics_, and not a particular theory. 137 | - The cost of this is the unsuitability of many machine learning algorithms to theory testing in the traditional econometric sense. 138 | 139 | ## Cross Validation 140 | 141 | Cross validation is one such of these strategies. It consists of dividing the data into _training_ and _test_ sets: 142 | 143 | 1. The model is fit using the _training_ data: $y_{train} = f(X_{train}) + \epsilon \rightarrow \hat{f}(X)$ 144 | 2. The fitted model is applied to the _test features_ to generate _predicted values_: $\hat{y} = \hat{f}(X_{test})$ 145 | 3. The difference between the _predicted values_ and the _test labels_ is used as a measure of the predictive accuracy of the model: $\hat{e} = y_{test} - \hat{y}$ 146 | 147 | ::: {.fragment} 148 | There are multiple aggregate measures of prediction error, but a common one is _mean squared (prediction) error_, calculated as the sum of squared differences between prediction and test label. 149 | ::: 150 | 151 | ## k-fold Cross Validation 152 | 153 | - There are some obvious shortcomings to dividing the data into a training at test set just once. 154 | - A slightly more advanced method for train-test splitting is known a k-fold CV, which consists of splitting the training data randomly into $k$ bins, and then iteratively using the $k$th bin as a test set for all bins not $k$. 155 | 156 | ## Cross Validation Visualised 157 | 158 |  159 | 160 | ## Choosing Parameters 161 | 162 | Another strategy for improving the predictive accuracy of algorithms relates to choosing the right _parameters_. 163 | 164 | Most, if not all algorithms have some parameters that affect predictions in very unobvious ways. For example: 165 | 166 | - `k-means`: number of clusters 167 | - Decision Tree: min/max number of splits 168 | - Random Forest: proportion of features to use in each subset 169 | - LASSO/Ridge/EN: $\beta$ 170 | 171 | ## Hyperparameter Tuning 172 | 173 | - Hyperparameter tuning is the practice of choosing model parameters by maximising an _objective function_. Some possible objective functions include: 174 | - _Mean Absolute Prediction Error_: Combine with train-test splits. 175 | - _Goodness-of-Fit_: Measures such as R-squared, AIC, etc. 176 | - _Coherence/Entropy Measures_: Most algorithms have a measure of the complexity/information tradeoff, which can be optimised. 177 | - Hyperparameter tuning is computationally costly, but also easily parallelisable. 178 | 179 | 180 | # Machine Learning Recap 181 | 182 | ## Key Terms 183 | 184 | - _Unsupervised Learning_: No $y$, explore $X$ 185 | - _Supervised Learning_: Learn relationship between features and labels. 186 | - _Clustering_: Split observations into groups. 187 | - _Dimensionality Redution_: Reduce $j$, the number of features. 188 | - _Classification vs Regression_: Depends on structure of $y$ 189 | - _Cross Validation_: Train-test split data to optimise supervised learner. 190 | - _Hyperparameter Tuning_: Systematically choose optimal parameters for algorithm. 
191 | - _Objective Function_: An optimisable aspect of the data used to measure goodness-of-fit. 192 | 193 | ## Trade-offs 194 | 195 | These trade-offs are not linear, but generally hold: 196 | 197 | - _Explanatory vs predictive power_ 198 | - _Flexibility vs efficiency_ 199 | - _Information vs time_ 200 | 201 | ## Readings 202 | 203 | Ensemble Methods: 204 | 205 | - [Grimmer \& Westwood, _Political Analysis_ 2017](https://www.cambridge.org/core/journals/political-analysis/article/estimating-heterogeneous-treatment-effects-and-the-effects-of-heterogeneous-treatments-with-ensemble-methods/C7E3EA00D0AD83429CBE73F4F0C6652C) 206 | - [Kunzel et al, _PNAS_ 2019](https://arxiv.org/abs/1706.03461) 207 | 208 | Elements of Statistical Learning: 209 | 210 | - 9.2: Tree-Based Methods 211 | - 15: Random Forests 212 | - 16: Ensemble Learning 213 | 214 | -------------------------------------------------------------------------------- /Week6/lecture.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week6/lecture.pdf -------------------------------------------------------------------------------- /Week7/exercises.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Week 7 Coding Exercises\n", 8 | "\n", 9 | "This week we learned three main skills:\n", 10 | "\n", 11 | "- Regular Expressions\n", 12 | "- Pulling Webpages\n", 13 | "- Scraping Webpages\n", 14 | "\n", 15 | "I want you to go further with the code we started developing in the coding tutorial.\n", 16 | "\n", 17 | "By the end of the tutorial, we had code to get all of the books, along with their price, rating, and a link to their dedicated page.\n", 18 | "\n", 19 | "My next challenge for you is to do the following\n", 20 | "\n", 21 | "1. On each of the dedicated pages, there is a description of the book. Write code that will scrape the description for each book and add it as a column on the product_info dataframe.\n", 22 | "2. Currently, this scraper can only access the first page of results. Modify the scraper so that it iterates through pages. (Hint: There is a link close to the bottom of each page, but the title will be different each time (page-3, page-4, etc.). You will need to work in a solution that finds the link each time, or automatically increments the page at each iteration." 
23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": null, 28 | "metadata": {}, 29 | "outputs": [], 30 | "source": [ 31 | "import requests\n", 32 | "import pandas as pd\n", 33 | "import re\n", 34 | "from bs4 import BeautifulSoup\n", 35 | "\n", 36 | "url = \"http://books.toscrape.com\"\n", 37 | "session = requests.Session()\n", 38 | "page = session.get(url)\n", 39 | "\n", 40 | "soup = BeautifulSoup(page.text, 'html.parser')\n", 41 | "product_pods = soup.find_all('article', class_=\"product_pod\")\n", 42 | "\n", 43 | "def get_product_info(product_pod):\n", 44 | " # Title can be accessed from img alt\n", 45 | " image_elem = product_pod.div.a.img\n", 46 | " title = image_elem['alt']\n", 47 | " # Rating can be accessed from class (css) on star-rating\n", 48 | " rating_elem = product_pod.find('p', class_=re.compile(r\"star-rating .*\"))\n", 49 | " rating = rating_elem['class'][1]+\"/Five\" # Second class attribute\n", 50 | " price_elem = product_pod.find('div', class_=\"product_price\")\n", 51 | " price = re.search(re.compile(\"[0-9\\.]+\"), price_elem.text)[0]\n", 52 | " link = product_pod.find('a', href=True)['href']\n", 53 | " return title, rating, price, link\n", 54 | "\n", 55 | "product_info = []\n", 56 | "for pod in product_pods:\n", 57 | " product_info.append(get_product_info(pod))\n", 58 | "\n", 59 | "product_info = pd.DataFrame(product_info, columns=[\"Title\", \"Rating\", \"Price\", 'Link'])\n", 60 | "product_info.loc[:, 'Link'] = product_info['Link'].apply(lambda x: url+\"/\"+x)" 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 | "source": [ 67 | "Regex Bonus Challenge:\n", 68 | "\n", 69 | "Full disclaimer: I wasn't able to figure out a solution to this, but I thought you might enjoy the challenge.\n", 70 | "\n", 71 | "Given the lyrics in the box below, find a way to match the full phrase after each \"million\".\n", 72 | "\n", 73 | "Therefore you should get back the matches:\n", 74 | "\n", 75 | "```\n", 76 | "['bags of the best Sligo rags',\n", 77 | " 'barrels of stone',\n", 78 | " 'sides of old blind horses hides',\n", 79 | " 'barrels of bones',\n", 80 | " 'hogs',\n", 81 | " 'dogs',\n", 82 | " 'barrels of porter',\n", 83 | " 'bails of old nanny goats']\n", 84 | " ```\n", 85 | "\n", 86 | "\n", 87 | "The closest I was able to come was: `million ([\\w ]+)`, but this matches the `and seven million dogs`, which I want separately." 
88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": null, 93 | "metadata": {}, 94 | "outputs": [], 95 | "source": [ 96 | "lyrics = \"\"\"\n", 97 | "On the Fourth of July, 1806\n", 98 | "We set sail from the sweet Cove of Cork\n", 99 | "We were sailing away with a cargo of bricks\n", 100 | "For the Grand City Hall in New York\n", 101 | "'Twas a wonderful craft, she was rigged fore and aft\n", 102 | "And oh, how the wild wind drove her\n", 103 | "She stood several blasts, she had twenty seven masts\n", 104 | "And they called her The Irish Rover\n", 105 | "We had one million bags of the best Sligo rags\n", 106 | "We had two million barrels of stone\n", 107 | "We had three million sides of old blind horses hides\n", 108 | "We had four million barrels of bones\n", 109 | "We had five million hogs and six million dogs\n", 110 | "Seven million barrels of porter\n", 111 | "We had eight million bails of old nanny goats' tails\n", 112 | "In the hold of the Irish Rover\n", 113 | "\"\"\"\n", 114 | "\n", 115 | "r = re.compile(r\"million ([\\w ]+)\") # This solution doesn't work\n", 116 | "\n", 117 | "re.findall(r, lyrics)" 118 | ] 119 | } 120 | ], 121 | "metadata": { 122 | "kernelspec": { 123 | "display_name": "teaching", 124 | "language": "python", 125 | "name": "teaching" 126 | }, 127 | "language_info": { 128 | "codemirror_mode": { 129 | "name": "ipython", 130 | "version": 3 131 | }, 132 | "file_extension": ".py", 133 | "mimetype": "text/x-python", 134 | "name": "python", 135 | "nbconvert_exporter": "python", 136 | "pygments_lexer": "ipython3", 137 | "version": "3.7.6" 138 | } 139 | }, 140 | "nbformat": 4, 141 | "nbformat_minor": 4 142 | } 143 | -------------------------------------------------------------------------------- /Week7/lecture.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/muhark/dpir-intro-python/e48dab681abe86dae75b1c21889974fbf42789ab/Week7/lecture.pdf -------------------------------------------------------------------------------- /Week7/planning.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Planning Week 7 - Web Scraping 3 | --- 4 | 5 | # Some Thoughts 6 | 7 | This lesson should cover: 8 | 9 | - the way that the Internet and webpages work 10 | - specific tools to navigate webpages (regex, requests, beautifulsoup) 11 | - a more general understanding of automation and "deployment" 12 | 13 | For good measure, I need to also include: 14 | 15 | - the legal grey area that is web scraping, and good etiquette 16 | - a discussion of time-saving/practicality 17 | 18 | In addition to the commands and libraries mentioned above, I'll need to cover: 19 | 20 | - Writing python scripts (instead of notebooks) 21 | - `try/except` loops and error handling 22 | - `while` loops 23 | 24 | # Structure 25 | 26 | - **Roadmap** 27 | - How does the Internet work? (short version) 28 | - Your computer sends a `GET` request with some routing information (url, ip address) to an intermediary server (DNS). 29 | - The DNS forwards the request to the desired destination. 30 | - The webpage host server receives the request, and sends back the requested information (via DNS, etc.). This information is usually a mixture of `html`, `css`, `javascript` and maybe `php`. 31 | - `html`: The skeleton and text. 32 | - `css`: The aesthetic styling elements. 33 | - `javascript`: Locally-executed interactive elements. 34 | - `php`: Host-side interactive elements. 
35 | - Your computer receives the information, and a specialised program known as a "browser" renders the information as a webpage. 36 | - What kind of information is stored in webpages that we, as social scientists, might want to use? 37 | - News articles (news websites) (*ALL* news articles since _x_ date.) 38 | - Press releases (corporate, political) (*ALL* press release by politician _x_) 39 | - Government resources/archives (all parliamentary transcripts, or of a specific sub committee) 40 | - Tweets? (This lesson discusses APIs last) 41 | - How can we use Python to assist us with conducting this collection on a large scale? 42 | - This is a task of locating, filtering and extracting information from a largely _unstructured_ dataset. For this we use: 43 | - `requests` to get webpages 44 | - `beautifulsoup` to clean up, structure and work with `html` 45 | - `regex` to apply flexible patterns to character-string data. 46 | - Added challenges are rate limiting, not knowing the format of webpages, and so on. 47 | - Some glaring omissions: 48 | - `scrapy`: A library for building deployable web crawlers. 49 | - `selenium`: For dealing with `javascript`/creating a scraper that behaves more like a human. 50 | - When is it (in)appropriate to scrape? 51 | - Web servers have limited resources for serving requests; if they try to send too much data, then they will slow down/break. 52 | - Most web servers have DDoS protection measures; if they see that they are receiving a large volume of requests from a particular IP, then they will block/throttle that address. 53 | - Even if the server does not have these measures, be considerate, and do not accidentally cyberattack somebody. 54 | - Scraping is not *usually* included in the ToS of a website, but may be prohibited by your ISP etc. In most cases, it is in a legal grey area. 55 | - Companies and governments have the option of sending a cease-and-desist, in which case scraping does become illegal. 56 | - Obviously you should not do anything illegal. 57 | -------------------------------------------------------------------------------- /Week8/examples_selenium.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "opponent-colorado", 6 | "metadata": { 7 | "slideshow": { 8 | "slide_type": "subslide" 9 | } 10 | }, 11 | "source": [ 12 | "# Browser Automation with Selenium\n", 13 | "\n", 14 | "This notebook contains a short tutorial for scraping with the Selenium toolkit.\n", 15 | "\n", 16 | "We will be scraping `quotes.toscrape.com`, a wonderful page for practicing more advanced scraping techniques." 
17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": null, 22 | "id": "julian-canon", 23 | "metadata": { 24 | "slideshow": { 25 | "slide_type": "subslide" 26 | } 27 | }, 28 | "outputs": [], 29 | "source": [ 30 | "# imports\n", 31 | "import requests\n", 32 | "from selenium import webdriver\n", 33 | "from selenium.webdriver.common.by import By" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "id": "finished-mixer", 39 | "metadata": { 40 | "slideshow": { 41 | "slide_type": "subslide" 42 | } 43 | }, 44 | "source": [ 45 | "## When static scraping fails:\n", 46 | "\n", 47 | "The following webpage is generated dynamically by `javascript`.\n", 48 | "We can see the script source in this page, but this is often not the case:" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": null, 54 | "id": "straight-columbia", 55 | "metadata": { 56 | "slideshow": { 57 | "slide_type": "subslide" 58 | } 59 | }, 60 | "outputs": [], 61 | "source": [ 62 | "from bs4 import BeautifulSoup\n", 63 | "\n", 64 | "url = \"https://quotes.toscrape.com/js/\"\n", 65 | "page = requests.get(url)\n", 66 | "print(BeautifulSoup(page.text).body.prettify())" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "id": "collect-finnish", 72 | "metadata": { 73 | "slideshow": { 74 | "slide_type": "subslide" 75 | } 76 | }, 77 | "source": [ 78 | "## Instantiating the WebDriver\n", 79 | "\n", 80 | "When we call the `webdriver.Chrome()` method, if we have the webdriver properly installed, an automated Chrome instance should appear!\n" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": null, 86 | "id": "julian-nightlife", 87 | "metadata": { 88 | "slideshow": { 89 | "slide_type": "subslide" 90 | } 91 | }, 92 | "outputs": [], 93 | "source": [ 94 | "driver = webdriver.Chrome()\n", 95 | "driver.get(url)" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "id": "intimate-edgar", 101 | "metadata": { 102 | "slideshow": { 103 | "slide_type": "subslide" 104 | } 105 | }, 106 | "source": [ 107 | "Let's select all of the quote-boxes that have the tag \"life\"." 
108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "id": "white-interstate", 114 | "metadata": { 115 | "slideshow": { 116 | "slide_type": "subslide" 117 | } 118 | }, 119 | "outputs": [], 120 | "source": [ 121 | "# This returns a list of elements that have the CSS class 'quote'\n", 122 | "quote_boxes = driver.find_elements(\n", 123 | " By.CLASS_NAME, 'quote')" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": null, 129 | "id": "living-bundle", 130 | "metadata": { 131 | "slideshow": { 132 | "slide_type": "subslide" 133 | } 134 | }, 135 | "outputs": [], 136 | "source": [ 137 | "# Let's navigate the first element to recognize a pattern\n", 138 | "# Selecting the first div\n", 139 | "quote_box = quote_boxes[0]\n", 140 | "# Selecting the container div for the tags\n", 141 | "tags = quote_box.find_element(By.CLASS_NAME, 'tags')\n", 142 | "# Getting the tag names\n", 143 | "[\n", 144 | " tag.text for tag\n", 145 | " in tags.find_elements(By.TAG_NAME, 'a')\n", 146 | "]" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": null, 152 | "id": "impressive-diesel", 153 | "metadata": { 154 | "slideshow": { 155 | "slide_type": "subslide" 156 | } 157 | }, 158 | "outputs": [], 159 | "source": [ 160 | "# Some crazy list filtering\n", 161 | "life_quotes = [\n", 162 | " quote for quote in quote_boxes if # unpack quote_boxes\n", 163 | " 'life' in [tag.text for tag in # check if 'life' is in\n", 164 | " quote.find_element(By.CLASS_NAME, 'tags'). # the list of tags\n", 165 | " find_elements(By.TAG_NAME, 'a')] # like we obtained before\n", 166 | "]\n", 167 | "life_quotes" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": null, 173 | "id": "young-emergency", 174 | "metadata": { 175 | "slideshow": { 176 | "slide_type": "subslide" 177 | } 178 | }, 179 | "outputs": [], 180 | "source": [ 181 | "# Let's put that into a function\n", 182 | "def filter_quotes_by_tag(driver, tag):\n", 183 | " quote_boxes = driver.find_elements(By.CLASS_NAME, 'quote')\n", 184 | " tagged_quotes = [\n", 185 | " quote for quote in quote_boxes if # unpack quote_boxes\n", 186 | " tag in [t.text for t in # check if tag is in\n", 187 | " quote.find_element(By.CLASS_NAME, 'tags'). # the list of tags\n", 188 | " find_elements(By.TAG_NAME, 'a')] # like we obtained before\n", 189 | " ]\n", 190 | " return tagged_quotes" 191 | ] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "id": "binding-wheel", 196 | "metadata": { 197 | "slideshow": { 198 | "slide_type": "subslide" 199 | } 200 | }, 201 | "source": [ 202 | "## Simulating Clicks\n", 203 | "\n", 204 | "We can use the `.click()` property of any element to 'click' on it.\n", 205 | "\n", 206 | "Let's proceed to the next page of quotes." 207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "execution_count": null, 212 | "id": "aging-voltage", 213 | "metadata": { 214 | "slideshow": { 215 | "slide_type": "subslide" 216 | } 217 | }, 218 | "outputs": [], 219 | "source": [ 220 | "# Get the \"next\" element\n", 221 | "next_button = driver.find_element(By.PARTIAL_LINK_TEXT, 'Next')\n", 222 | "print(driver.current_url)\n", 223 | "next_button.click()\n", 224 | "print(driver.current_url)" 225 | ] 226 | }, 227 | { 228 | "cell_type": "markdown", 229 | "id": "still-gasoline", 230 | "metadata": { 231 | "slideshow": { 232 | "slide_type": "subslide" 233 | } 234 | }, 235 | "source": [ 236 | "## Sending Keys\n", 237 | "\n", 238 | "Let's try to log in!" 
239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": null, 244 | "id": "controversial-jackson", 245 | "metadata": { 246 | "slideshow": { 247 | "slide_type": "subslide" 248 | } 249 | }, 250 | "outputs": [], 251 | "source": [ 252 | "login_box = driver.find_element(By.LINK_TEXT, 'Login')\n", 253 | "login_box.click()" 254 | ] 255 | }, 256 | { 257 | "cell_type": "code", 258 | "execution_count": null, 259 | "id": "august-purpose", 260 | "metadata": { 261 | "slideshow": { 262 | "slide_type": "subslide" 263 | } 264 | }, 265 | "outputs": [], 266 | "source": [ 267 | "# Locating the username and password fields\n", 268 | "username_box = driver.find_element(By.ID, 'username')\n", 269 | "password_box = driver.find_element(By.ID, 'password')" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": null, 275 | "id": "flush-minutes", 276 | "metadata": { 277 | "slideshow": { 278 | "slide_type": "subslide" 279 | } 280 | }, 281 | "outputs": [], 282 | "source": [ 283 | "username_box.send_keys('username')\n", 284 | "password_box.send_keys('password')" 285 | ] 286 | }, 287 | { 288 | "cell_type": "code", 289 | "execution_count": null, 290 | "id": "double-boring", 291 | "metadata": { 292 | "slideshow": { 293 | "slide_type": "subslide" 294 | } 295 | }, 296 | "outputs": [], 297 | "source": [ 298 | "# Using XPath to get the login button\n", 299 | "# https://www.w3schools.com/xml/xpath_syntax.asp\n", 300 | "login_button = driver.find_element(\n", 301 | " By.XPATH, r\"//input[(@type='submit')]\")\n", 302 | "login_button.click()" 303 | ] 304 | }, 305 | { 306 | "cell_type": "markdown", 307 | "id": "correct-speaking", 308 | "metadata": { 309 | "slideshow": { 310 | "slide_type": "subslide" 311 | } 312 | }, 313 | "source": [ 314 | "## Race Conditions\n", 315 | "\n", 316 | "Pages usually take time to load.\n", 317 | "\n", 318 | "A script, however, executes Selenium commands sequentially\n", 319 | "as fast as possible, so a command may fire before the content it targets exists. This causes problems." 320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": null, 325 | "id": "quality-israel", 326 | "metadata": { 327 | "slideshow": { 328 | "slide_type": "subslide" 329 | } 330 | }, 331 | "outputs": [], 332 | "source": [ 333 | "url = \"https://quotes.toscrape.com/js-delayed/\"\n", 334 | "driver.get(url)\n", 335 | "filter_quotes_by_tag(driver, 'life')  # probably returns []: the quotes haven't rendered yet" 336 | ] 337 | }, 338 | { 339 | "cell_type": "markdown", 340 | "id": "average-scottish", 341 | "metadata": { 342 | "slideshow": { 343 | "slide_type": "subslide" 344 | } 345 | }, 346 | "source": [ 347 | "Selenium provides more sophisticated \"wait\" functionality,\n", 348 | "where you define a condition that it checks repeatedly until\n", 349 | "it becomes true (or a timeout is reached); see the sketch below.\n", 350 | "\n", 351 | "I'll then demonstrate a simpler (and less reliable) solution, which\n", 352 | "is to just use a timed wait."
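For reference, here is roughly what that condition-based wait looks like, before we fall back on the timed version below. A sketch reusing `driver` and `filter_quotes_by_tag` from above; the 20-second timeout is an arbitrary choice:

```python
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://quotes.toscrape.com/js-delayed/"
driver.get(url)
# Poll the DOM (every 0.5s by default) until at least one quote box
# exists; raises TimeoutException if 20 seconds pass without one.
WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'quote'))
)
filter_quotes_by_tag(driver, 'life')
```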
353 | ] 354 | }, 355 | { 356 | "cell_type": "code", 357 | "execution_count": null, 358 | "id": "frozen-forest", 359 | "metadata": { 360 | "slideshow": { 361 | "slide_type": "fragment" 362 | } 363 | }, 364 | "outputs": [], 365 | "source": [ 366 | "from time import sleep\n", 367 | "url = \"https://quotes.toscrape.com/js-delayed/\"\n", 368 | "driver.get(url)\n", 369 | "sleep(10) # I happen to know the length of the delay\n", 370 | "filter_quotes_by_tag(driver, 'life')" 371 | ] 372 | }, 373 | { 374 | "cell_type": "code", 375 | "execution_count": null, 376 | "id": "weekly-kingdom", 377 | "metadata": {}, 378 | "outputs": [], 379 | "source": [ 380 | "driver.quit()" 381 | ] 382 | } 383 | ], 384 | "metadata": { 385 | "kernelspec": { 386 | "display_name": "scrape", 387 | "language": "python", 388 | "name": "scrape" 389 | }, 390 | "language_info": { 391 | "codemirror_mode": { 392 | "name": "ipython", 393 | "version": 3 394 | }, 395 | "file_extension": ".py", 396 | "mimetype": "text/x-python", 397 | "name": "python", 398 | "nbconvert_exporter": "python", 399 | "pygments_lexer": "ipython3", 400 | "version": "3.8.5" 401 | } 402 | }, 403 | "nbformat": 4, 404 | "nbformat_minor": 5 405 | } 406 | -------------------------------------------------------------------------------- /Week8/lecture.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Introduction to Python for Social Science 3 | subtitle: Lecture 8 - APIs and Selenium 4 | author: Musashi Harukawa, DPIR 5 | date: 8th Week Hilary 2021 6 | --- 7 | 8 | # Lecture Roadmap 9 | 10 | ## Last Week 11 | 12 | - HTTP requests and Internet fundamentals 13 | - Regular Expressions 14 | 15 | ## This Week 16 | 17 | - APIs 18 | - Twitter's Academic Track 19 | - Browser Automation 20 | 21 | # APIs 22 | 23 | ## What is an API? 24 | 25 | - _Application Programming Interface_ 26 | - _Interface_: Specialized endpoint 27 | - Specific query syntax 28 | - Returns defined data packets 29 | - We are interested in _Web APIs_ 30 | 31 | ## Web API Examples 32 | 33 | - Twitter 34 | - Reddit 35 | - NY Times 36 | - The Guardian 37 | - Spotify 38 | - Netflix 39 | 40 | ## API Mechanics 41 | 42 | - REST vs SOAP 43 | - RESTful APIs loosely based on HTTP methods 44 | - Accept HTTP-like requests to access server-side assets 45 | - Return the payload, usually as JSON or XML 46 | - _Stateless_: no server-side session information 47 | 48 | ::: notes 49 | - Most of the APIs I have come across are REST; all I know about SOAP is that it mandates XML payloads. 50 | - Loosely-based: depending on the API, may allow for header or body parameters that do not typically exist in HTTP requests. 51 | - Payload: the actual data packet. Sounds dramatic, it's just the thing you wanted (versus the header, which basically says what it is and where it should go). 52 | - Stateless: remember that the server does not remember who it's speaking to. That means your credentials need to be sent with each request, and so, importantly for paginated requests, does a "next page" token. We'll come back to that later. 53 | ::: 54 | 55 | ## Twitter's API 56 | 57 | - Many different Twitter APIs and endpoints (Standard, Premium, Enterprise, and **Academic**) 58 | - **Academic Research product track** has the following endpoints: 59 | - _Full-archive search_: (Almost) everything back to 2006!
60 | - _Recent search_: Last 7 days, higher volumes 61 | - _Filtered stream_: Real-time filtered stream, capped at 1% of total volume 62 | - _Sampled stream_: $\sim 1\%$ of all new Tweets in real-time 63 | - _Tweet and User Lookup_: Look up user/tweet by id 64 | - and [more](https://developer.twitter.com/en/solutions/academic-research/products-for-researchers) 65 | 66 | ## Applying for Access 67 | 68 | - The Academic Research track has the following criteria: 69 | - Master's student or above (doctoral candidate, post-doc, faculty, researcher, etc.) 70 | - Clearly defined research objective and specific plans for how you will use the Twitter data 71 | - Non-commercial use 72 | - You can apply [here](https://developer.twitter.com/en/portal/petition/academic/is-it-right-for-you) 73 | 74 | ## Using the API (with Python) 75 | 76 | - We can use Python to generate requests to interact with Twitter's API 77 | - Twitter provides a "wrapper" package: `searchtweets-v2` 78 | - Documentation provided [here](https://pypi.org/project/searchtweets-v2/) and [here](https://developer.twitter.com/en/docs/twitter-api/tweets/search/introduction) 79 | 80 | ## Managing Credentials 81 | 82 | - Once you are granted access, you will be given a set of credentials for your project/application. 83 | - Store these securely, i.e. do not post them anywhere public. 84 | - Place them in a credentials `yaml` file that looks like the following: 85 | 86 | :::{.fragment} 87 | ```{yaml} 88 | search_tweets_v2: 89 | endpoint: https://api.twitter.com/2/tweets/search/all 90 | consumer_key: