├── 2021 ├── day1 │ ├── .DS_Store │ ├── README.md │ ├── day1-afternoon.ipynb │ ├── day1-morning.pdf │ ├── day1-morning.tex │ ├── exercises │ │ ├── exercises.md │ │ ├── instructional-videos-exercises.md │ │ └── livecoding.ipynb │ ├── mockdata │ │ └── mockdata-your_posts_1.json │ └── solutions │ │ ├── ex2-1.py │ │ ├── ex2-2.py │ │ ├── ex2-3.py │ │ ├── ex2-4.py │ │ ├── ex3-1.py │ │ ├── ex3-2.py │ │ ├── gradechecker_function.py │ │ ├── gradechecker_robust.py │ │ └── gradechecker_simple.py ├── day2 │ ├── .DS_Store │ ├── Day 2_slides.pdf │ ├── Notebooks │ │ ├── .DS_Store │ │ ├── BasicStats.ipynb │ │ ├── Datasets │ │ │ ├── RIVM_1.csv │ │ │ ├── RIVM_2.csv │ │ │ ├── RIVM_sentiment.csv │ │ │ ├── Sentiment_YouTubeClimateChange.csv │ │ │ ├── Sentiment_YouTubeClimateChange.pkl │ │ │ ├── YouTube_climatechange.csv │ │ │ ├── YouTube_climatechange.tab │ │ │ └── websites.csv │ │ ├── ExcercisesPandas.ipynb │ │ ├── PandasIntroduction.ipynb │ │ ├── PandasIntroduction2.ipynb │ │ └── Visualisations.ipynb │ └── README.md ├── day3 │ ├── README.md │ ├── day3-afternoon.pdf │ ├── day3-afternoon.tex │ ├── day3-morning.pdf │ ├── day3-morning.tex │ └── exercises │ │ └── exercises.md ├── day4 │ ├── README.md │ ├── day4-afternoon.pdf │ ├── day4-afternoon.tex │ ├── day4.pdf │ ├── day4.tex │ ├── example-nltk.md │ ├── example-vectorizer-to-dense.md │ ├── exercises-1 │ │ ├── exercise-1.md │ │ └── possible-solution-exercise-1.md │ ├── exercises-2 │ │ ├── exercise-2.md │ │ ├── fix_example_book.md │ │ └── possible-solution-exercise-2.md │ └── literature-examples.md ├── day5 │ ├── 01-MachineLearning_Introduction.ipynb │ ├── 02-Unsupervised-Machine-Learning.ipynb │ ├── 03-Supervised-Machine-Learning.ipynb │ ├── README.md │ ├── WorkingNotebook.ipynb │ └── topic_model_example.ipynb ├── installation.md ├── media │ ├── boumanstrilling2016.eps │ ├── boumanstrilling2016.pdf │ ├── mannetje.png │ ├── pythoninterpreter.png │ └── sparse_dense.png ├── references.bib └── teachingtips.md ├── 2023 ├── .DS_Store ├── Installationinstruction.md ├── Teachingtips.md ├── day1 │ ├── introduction.ipynb │ └── introduction.slides.html ├── day2 │ ├── Day 2.pdf │ └── Notebooks │ │ ├── BasicStats.ipynb │ │ ├── ExcercisesPandas.ipynb │ │ ├── PandasIntroduction.ipynb │ │ ├── PandasIntroduction2.ipynb │ │ └── Visualisations.ipynb ├── day3 │ ├── API.ipynb │ ├── Data Formats.ipynb │ ├── Teaching Exercises.ipynb │ ├── Webscraping.ipynb │ ├── get_mails │ └── updated cell ├── day4 │ ├── README.md │ ├── example-ngrams.md │ ├── exercises-afternoon │ │ ├── 01tuesday-regex-exercise.md │ │ ├── 01tuesday-regex-solution.md │ │ ├── 02tuesday-exercise_nexis.md │ │ ├── 02tuesday-exercise_nexis_solution.md │ │ └── corona_news.tar.gz │ ├── exercises-morning │ │ ├── exercise-feature-engineering.md │ │ ├── possible-solution-exercise-day2-vectorizers.md │ │ ├── possible-solution-exercise-day2.md │ │ └── possible-solutions-ordered.ipynb │ ├── exercises-vectorizers │ │ ├── Understanding_vectorizers.ipynb │ │ ├── exercise-text-to-features.md │ │ └── possible-solution-exercise-day1.md │ ├── regex_examples.ipynb │ ├── slides-04-1.pdf │ ├── slides-04-1.tex │ ├── slides-04-2.pdf │ ├── slides-04-2.tex │ └── spacy-examples.ipynb └── day5 │ ├── Day 5 - Machine Learning - Afternoon.pdf │ ├── Day 5 - Machine Learning - Morning.pdf │ ├── Day 5 Take-aways.ipynb │ ├── Exercise 1 │ ├── exercise1.ipynb │ └── hatespeech_text_label_vote_RESTRICTED_100K.csv │ ├── Exercise 2 │ ├── exercise2.ipynb │ └── hatespeech_text_label_vote_RESTRICTED_100K.csv │ └── Exercise 3 │ ├── SeeFlex_data.csv │ └── 
exercise3.ipynb ├── .DS_Store ├── .gitignore └── README.md /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/.DS_Store -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | ## Core latex/pdflatex auxiliary files: 2 | *.aux 3 | *.lof 4 | *.log 5 | *.lot 6 | *.fls 7 | *.out 8 | *.toc 9 | *.fmt 10 | 11 | ## Intermediate documents: 12 | *.dvi 13 | *-converted-to.* 14 | # these rules might exclude image files for figures etc. 15 | # *.ps 16 | # *.eps 17 | # *.pdf 18 | 19 | ## Bibliography auxiliary files (bibtex/biblatex/biber): 20 | *.bbl 21 | *.bcf 22 | *.blg 23 | *-blx.aux 24 | *-blx.bib 25 | *.brf 26 | *.run.xml 27 | 28 | ## Build tool auxiliary files: 29 | *.fdb_latexmk 30 | *.synctex 31 | *.synctex.gz 32 | *.synctex.gz(busy) 33 | *.pdfsync 34 | 35 | ## Auxiliary and intermediate files from other packages: 36 | # algorithms 37 | *.alg 38 | *.loa 39 | 40 | # achemso 41 | acs-*.bib 42 | 43 | # amsthm 44 | *.thm 45 | 46 | # beamer 47 | *.nav 48 | *.snm 49 | *.vrb 50 | 51 | # cprotect 52 | *.cpt 53 | 54 | #(e)ledmac/(e)ledpar 55 | *.end 56 | *.[1-9] 57 | *.[1-9][0-9] 58 | *.[1-9][0-9][0-9] 59 | *.[1-9]R 60 | *.[1-9][0-9]R 61 | *.[1-9][0-9][0-9]R 62 | *.eledsec[1-9] 63 | *.eledsec[1-9]R 64 | *.eledsec[1-9][0-9] 65 | *.eledsec[1-9][0-9]R 66 | *.eledsec[1-9][0-9][0-9] 67 | *.eledsec[1-9][0-9][0-9]R 68 | 69 | # glossaries 70 | *.acn 71 | *.acr 72 | *.glg 73 | *.glo 74 | *.gls 75 | 76 | # gnuplottex 77 | *-gnuplottex-* 78 | 79 | # hyperref 80 | *.brf 81 | 82 | # knitr 83 | *-concordance.tex 84 | *.tikz 85 | *-tikzDictionary 86 | 87 | # listings 88 | *.lol 89 | 90 | # makeidx 91 | *.idx 92 | *.ilg 93 | *.ind 94 | *.ist 95 | 96 | # minitoc 97 | *.maf 98 | *.mtc 99 | *.mtc[0-9] 100 | *.mtc[1-9][0-9] 101 | 102 | # minted 103 | _minted* 104 | *.pyg 105 | 106 | # morewrites 107 | *.mw 108 | 109 | # mylatexformat 110 | *.fmt 111 | 112 | # nomencl 113 | *.nlo 114 | 115 | # sagetex 116 | *.sagetex.sage 117 | *.sagetex.py 118 | *.sagetex.scmd 119 | 120 | # sympy 121 | *.sout 122 | *.sympy 123 | sympy-plots-for-*.tex/ 124 | 125 | # pdfcomment 126 | *.upa 127 | *.upb 128 | 129 | #pythontex 130 | *.pytxcode 131 | pythontex-files-*/ 132 | 133 | # Texpad 134 | .texpadtmp 135 | 136 | # TikZ & PGF 137 | *.dpth 138 | *.md5 139 | *.auxlock 140 | 141 | # todonotes 142 | *.tdo 143 | 144 | # xindy 145 | *.xdy 146 | 147 | # xypic precompiled matrices 148 | *.xyc 149 | 150 | # WinEdt 151 | *.bak 152 | *.sav 153 | 154 | # endfloat 155 | *.ttt 156 | *.fff 157 | 158 | # Latexian 159 | TSWLatexianTemp* 160 | 161 | # Emacs 162 | *~ 163 | \#*\# 164 | 165 | # jupyter notebook 166 | .ipynb_checkpoints/ 167 | 168 | #DS_Store 169 | **/.DS_Store 170 | .DS_Store 171 | 2023/.DS_Store 172 | .DS_Store 173 | -------------------------------------------------------------------------------- /2021/day1/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2021/day1/.DS_Store -------------------------------------------------------------------------------- /2021/day1/README.md: -------------------------------------------------------------------------------- 1 | # Day 1: Python basics 2 | 
-------------------------------------------------------------------------------- /2021/day1/day1-morning.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2021/day1/day1-morning.pdf -------------------------------------------------------------------------------- /2021/day1/exercises/exercises.md: -------------------------------------------------------------------------------- 1 | # Exercise 1: Working with lists 2 | 3 | 4 | ## 1. Warming up 5 | 6 | - Create a list, loop over the list, and do something with each value (you're free to choose). 7 | 8 | ## 2. Did you pass? 9 | 10 | - Think of a way to determine for a list of grades whether they are a pass (>=5.5) or fail. 11 | - Can you make that program robust enough to handle invalid input (e.g., a grade entered as 'ewghjieh')? 12 | - How does your program deal with impossible grades (e.g., 12 or -3)? 13 | - Any other improvements? 14 | 15 | 16 | 17 | 18 | # Exercise 2: Working with dictionaries 19 | 20 | 21 | - Create a program that takes lists of corresponding data (a list of first names, a list of last names, a list of phone numbers) and converts them into a dictionary. You may assume that the lists are ordered correspondingly. To loop over two lists at the same time, you can do something like this (of course, later on you do not want to print the values but to put them in a dictionary): 22 | ``` 23 | for i, j in zip(list1, list2): 24 | print(i,j) 25 | ``` 26 | - Improve the program to control what should happen if the lists are (unexpectedly) of unequal length. 27 | - Create another program to handle a phone dictionary. The keys are names, and the value can either be a single phone number, a list of phone numbers, or another dict of the form {"office": "020123456", "mobile": "0699999999", ... ... ... }. Write a function that shows how many different phone numbers a given person has. 28 | - Write another function that prints only mobile numbers (and their owners) and omits the rest. (If you want to take it easy, you may assume that they are stored in a dict and use the key "mobile". If you like challenges, you can also support strings and lists of strings by parsing the numbers themselves and checking whether they start with 06. You can check whether a string starts with 06 by checking mystring[:2]=="06" (the double equal sign indicates a comparison that will return True or False). If you like even more challenges, you could support country codes.) 29 | 30 | 31 | 32 | # Exercise 3: Working with defaultdicts 33 | 34 | - Take the data from Exercise 2. Write a program that collects all office numbers, all mobile numbers, etc. Assume that there are potentially also other categories like "home", "second", maybe even "fax", and that they are unknown beforehand. 35 | - To do so, you can use the following approach: 36 | ```python 37 | from collections import defaultdict 38 | myresults = defaultdict(list) 39 | ``` 40 | Loop over the appropriate data. For all the key-value pairs (like "office": "020111111"), do ` myresults[key].append(value)`: This will append the current phone number (020111111) to the list of "office" numbers. 41 | - Do you see why this works only with a defaultdict but not with a "normal" dict? What would happen with a normal dict? 42 | - Take the function from Exercise 2 that prints how many phone numbers a given person has. Use a defaultdict instead to achieve the same result. What are the pros and cons? 
43 | -------------------------------------------------------------------------------- /2021/day1/exercises/instructional-videos-exercises.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | # Instructional video's 4 | #### The linked video's further explain the answers provided to today's [exercises](https://github.com/uvacw/teachteacher-python/blob/main/day1/exercises/exercises.md). 5 | 6 | - Instructional video explaining [Exercise 2](https://github.com/uvacw/teachteacher-python/blob/main/day1/exercises/exercises.md#exercise-2-working-with-dictionaries): *Working with dictionaries*: [Video here](https://www.youtube.com/watch?v=M_bkVPfQcgs) 7 | 8 | - Instructional video explaining [Exercise 3](https://github.com/uvacw/teachteacher-python/blob/main/day1/exercises/exercises.md#exercise-3-working-with-defaultdicts): *Working with defaultdicts:* [Video here](https://www.youtube.com/watch?v=2l9aRWcKVyA) 9 | -------------------------------------------------------------------------------- /2021/day1/solutions/ex2-1.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | names = ["Alice", "Bob", "Carol"] 4 | office = ["020222", "030111", "040444"] 5 | mobile = ["0666666", "0622222", "0644444"] 6 | 7 | mydict ={} 8 | for n, o, m in zip(names, office, mobile): 9 | mydict[n] = {"office":o, "mobile":m} 10 | 11 | print(mydict) 12 | -------------------------------------------------------------------------------- /2021/day1/solutions/ex2-2.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | names = ["Alice", "Bob", "Carol", "Damian"] 4 | office = ["020222", "030111", "040444"] 5 | mobile = ["0666666", "0622222", "0644444"] 6 | 7 | if len(names) == len(office) == len(mobile): 8 | mydict ={} 9 | for n, o, m in zip(names, office, mobile): 10 | mydict[n] = {"office":o, "mobile":m} 11 | print(mydict) 12 | else: 13 | print("Your data seems to be messed up - the lists do not have the same length") 14 | -------------------------------------------------------------------------------- /2021/day1/solutions/ex2-3.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | data = {'Alice': {'office': '020222', 'mobile': '0666666'}, 4 | 'Bob': {'office': '030111'}, 5 | 'Carol': {'office': '040444', 'mobile': '0644444'}, 6 | "Daan": "020222222", 7 | "Els": ["010111", "06222"]} 8 | 9 | def get_number_of_subscriptions(x): 10 | if type(x) is str: 11 | return 1 12 | else: 13 | return len(x) 14 | 15 | for k, v in data.items(): 16 | print(f"{k} has {get_number_of_subscriptions(v)} phone subscriptions") 17 | -------------------------------------------------------------------------------- /2021/day1/solutions/ex2-4.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | data = {'Alice': {'office': '020222', 'mobile': '0666666'}, 4 | 'Bob': {'office': '030111'}, 5 | 'Carol': {'office': '040444', 'mobile': '0644444'}, 6 | "Daan": "020222222", 7 | "Els": ["010111", "06222"]} 8 | 9 | def get_number_of_subscriptions(x): 10 | if type(x) is str: 11 | return 1 12 | else: 13 | return len(x) 14 | 15 | def get_mobile(x): 16 | if type(x) is str and x[:2]=="06": 17 | return x 18 | if type(x) is list: 19 | return [e for e in x if e[:2]=="06"] 20 | if type(x) is dict: 21 | return [v for k, v in x.items() if k=="mobile"] 22 | for k, v in 
data.items(): 23 | print(f"{k} has {get_number_of_subscriptions(v)} phone subscriptions. The mobile ones are {get_mobile(v)}") 24 | -------------------------------------------------------------------------------- /2021/day1/solutions/ex3-1.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | from collections import defaultdict 4 | 5 | data = {'Alice': {'office': '020222', 'mobile': '0666666'}, 6 | 'Bob': {'office': '030111'}, 7 | 'Carol': {'office': '040444', 'mobile': '0644444', 'fax': "02012354"}, 8 | "Daan": "020222222", 9 | "Els": ["010111", "06222"]} 10 | 11 | myresults = defaultdict(list) 12 | 13 | for name, entry in data.items(): 14 | try: 15 | for k, v in entry.items(): 16 | myresults[k].append(v) 17 | except: 18 | print(f"{name}'s numbers aren't stored in a dict, so I don't know what they are and will skip them") 19 | 20 | print(myresults) 21 | -------------------------------------------------------------------------------- /2021/day1/solutions/ex3-2.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | from collections import defaultdict 4 | 5 | data = {'Alice': {'office': '020222', 'mobile': '0666666'}, 6 | 'Bob': {'office': '030111'}, 7 | 'Carol': {'office': '040444', 'mobile': '0644444'}, 8 | "Daan": "020222222", 9 | "Els": ["010111", "06222"]} 10 | 11 | subscriptions = defaultdict(int) 12 | 13 | for name, entry in data.items(): 14 | if type(entry) is str: 15 | subscriptions[name]+=1 # this is short for subscriptions[name] = subscriptions[name]+1 16 | else: 17 | subscriptions[name] += len(entry) 18 | 19 | print(subscriptions) 20 | -------------------------------------------------------------------------------- /2021/day1/solutions/gradechecker_function.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | grades = [4, 7.8, -3, 3.6, 12, 9.1, "4.4", "KEGJKEG", 4.2, 7, 5.5] 4 | 5 | 6 | 7 | def check_grade(grade): 8 | try: 9 | grade_float = float(grade) 10 | except: 11 | return('INVALID') 12 | if grade_float >10 or grade_float <1: 13 | return('INVALID') 14 | elif grade_float >= 5.5: 15 | return('PASS') 16 | else: 17 | return('FAIL') 18 | 19 | 20 | for grade in grades: 21 | print(grade,'is',check_grade(grade)) 22 | -------------------------------------------------------------------------------- /2021/day1/solutions/gradechecker_robust.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | grades = [4, 7.8, -3, 3.6, 12, 9.1, "4.4", "KEGJKEG", 4.2, 7, 5.5] 4 | 5 | for grade in grades: 6 | try: 7 | grade_float = float(grade) 8 | if grade_float >10: 9 | print(grade_float,'is an invalid grade') 10 | elif grade_float <1: 11 | print(grade_float,'is an invalid grade') 12 | elif grade_float >= 5.5: 13 | print(grade,'is a PASS') 14 | else: 15 | print(grade,'is a FAIL') 16 | 17 | except: 18 | print('I do not understand what',grade,'means') 19 | 20 | -------------------------------------------------------------------------------- /2021/day1/solutions/gradechecker_simple.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | grades = [4, 7.8, 3.6, 9.1, 4.2, 7, 5.5] 4 | 5 | for grade in grades: 6 | if grade >= 5.5: 7 | print(grade,'is a PASS') 8 | else: 9 | print(grade,'is a FAIL') 10 | -------------------------------------------------------------------------------- 
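A possible extra for teaching, not part of the original solution files: a few `assert` statements that document the expected behaviour of `check_grade()` from `gradechecker_function.py` above (paste them below that function definition to run them):

```python
# quick sanity checks for check_grade() as defined in gradechecker_function.py
assert check_grade(7) == 'PASS'
assert check_grade(5.5) == 'PASS'
assert check_grade(4.4) == 'FAIL'
assert check_grade("4.4") == 'FAIL'
assert check_grade(12) == 'INVALID'
assert check_grade(-3) == 'INVALID'
assert check_grade("ewghjieh") == 'INVALID'
print("All checks passed")
```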
/2021/day2/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2021/day2/.DS_Store -------------------------------------------------------------------------------- /2021/day2/Day 2_slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2021/day2/Day 2_slides.pdf -------------------------------------------------------------------------------- /2021/day2/Notebooks/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2021/day2/Notebooks/.DS_Store -------------------------------------------------------------------------------- /2021/day2/Notebooks/Datasets/Sentiment_YouTubeClimateChange.csv: -------------------------------------------------------------------------------- 1 | ,videoId,negative,positive,neutral 2 | 0,sGHq_EwXDn8,-1,1,0 3 | 1,PRtn1W2RAVU,-1,1,0 4 | 2,2CQvBGSiDvw,-1,1,0 5 | 3,Cbwv1jg4gZU,-1,1,0 6 | 4,cWsCX_yxXqw,-1,1,0 7 | 5,pbiSuB3mzmo,-1,1,0 8 | 6,6d9ENk3NfBM,-2,1,-1 9 | 7,hzrFtZc9EkQ,-1,1,0 10 | 8,Je2l7Gw7uns,-1,1,0 11 | 9,8Rvl6z80baI,-1,1,0 12 | 10,IQpIVsxx014,-1,1,0 13 | 12,8lDwd5XM1HQ,-1,1,0 14 | 13,ga-RBuhcJ7w,-1,1,0 15 | 14,SUQQVKVHRUQ,-2,1,-1 16 | 15,72fwlCXy1bw,-1,1,0 17 | 16,lq0i6umUQzI,-4,1,-1 18 | 17,riVP3Jy3Orc,-1,1,0 19 | 18,76CLorQ9n5s,-1,1,0 20 | 19,I3qbXSf93bc,-2,1,-1 21 | 20,dretZCOvMzw,-1,1,0 22 | 21,N-EWoGUzLIc,-1,2,1 23 | 22,M6hA-MUFgXo,-1,1,0 24 | 23,ezAZ5WVAOyI,-1,1,0 25 | 24,pl1Rnz4zNkg,-1,1,0 26 | 25,_lpVPh6LeU8,-1,1,0 27 | 27,VOpmtCiAhX0,-1,1,0 28 | 28,Yglapc3SBmc,-2,1,-1 29 | 29,wLW4Tk8Pwdg,-3,1,-1 30 | 30,rAam4R1M5zE,-1,1,0 31 | 32,sw9pNVc9lw0,-1,1,0 32 | 33,iLGgILUqbcc,-1,1,0 33 | 34,cl4Uv9_7KJE,-2,1,-1 34 | 35,9m3dYb1m7cw,-3,1,-1 35 | 36,9t3ERx71ANA,-1,1,0 36 | 37,S7Lu-R4EwQw,-1,1,0 37 | 38,Qu0DNjp2OsY,-2,1,-1 38 | 39,L6sanoP6rY8,-1,1,0 39 | 40,mQuGnQEPJPs,-1,2,1 40 | 41,FVX60nlzglk,-2,1,-1 41 | 43,L0t_5RdQqb0,-1,1,0 42 | 44,IDOouqQGZY4,-2,1,-1 43 | 48,xsQgDwXmsyg,-1,1,0 44 | 50,NvVX3ILhVac,-2,1,-1 45 | 58,xnudgOC9D5Y,-3,1,-1 46 | 61,38Mw-t5zmyM,-1,1,0 47 | 62,K9MaGf-Su9I,-1,1,0 48 | 63,5cY7fSbUWA8,-1,1,0 49 | 66,yvDRQe2oCt4,-1,1,0 50 | 68,8vP00TP6-p0,-1,1,0 51 | 70,kl_i4MRYgv0,-1,1,0 52 | 76,EXkbdELr4EQ,-2,1,-1 53 | 78,1Uu9vCNH6Dk,-3,1,-1 54 | 82,PWvLLGcb96k,-1,1,0 55 | 87,aewK7Kzf43A,-1,1,0 56 | 88,-fkCo_trbT8,-2,1,-1 57 | 93,BXKUsTo_f1s,-2,1,-1 58 | 97,ZbZwwUEzNLY,-1,1,0 59 | 98,dcBXmj1nMTQ,-1,1,0 60 | 101,M3Iztt4D2UE,-1,1,0 61 | 102,4xkXjj6dalM,-1,1,0 62 | 107,KF7cvmXjSyw,-1,1,0 63 | 117,FtNTt_3PRoQ,-1,1,0 64 | 119,RoQRkmRjz38,-1,1,0 65 | 120,BYo7Mo1ncuM,-1,1,0 66 | 121,KzjiNDbGZPM,-1,1,0 67 | 122,TbW_1MtC2So,-1,1,0 68 | 123,fw-g0PVpW2E,-1,1,0 69 | 126,M0tENI3ef7Y,-1,1,0 70 | 128,ZQGGhtguHns,-1,1,0 71 | 129,SU6GMDDXFtw,-2,1,-1 72 | 136,WkfTeGcItA0,-1,2,1 73 | 140,mO4vtjfabm0,-1,1,0 74 | 141,fWH6VGFs2z4,-1,1,0 75 | 142,13t0tCV8hW8,-1,1,0 76 | 144,Xem9EvvkJSc,-1,1,0 77 | 147,YuMtSjq8W-g,-3,1,-1 78 | 148,vvhtoL2A8dU,-2,1,-1 79 | 149,R_cf5n3UgrU,-2,1,-1 80 | 150,P-3ZlFokHfM,-1,1,0 81 | 152,9M29ns1rUSE,-1,1,0 82 | 153,tAkvNHEnctg,-1,1,0 83 | 158,nZIOZwUPNnA,-1,1,0 84 | 161,3SD-Mrv7QLQ,-2,1,-1 85 | 162,y5cWazcakUw,-1,1,0 86 | 163,bVAyj9bYHMw,-1,1,0 87 | 165,gPVBDCDmrcU,-1,1,0 88 | 168,cQlQL5obqDs,-1,1,0 89 | 170,H2QxFM9y0tY,-1,1,0 90 | 
175,9A7_xCrgX1U,-1,1,0 91 | 177,p05YJ5if8Ew,-1,1,0 92 | 178,7WPsMsYCtjk,-1,1,0 93 | 179,ssuevV4eyqM,-3,1,-1 94 | 184,vWrF_ZHymoE,-1,1,0 95 | 186,yyAuWeoTm2s,-1,1,0 96 | 187,Av9SW1yw5lg,-3,1,-1 97 | 189,nxhEXwaDyxM,-2,1,-1 98 | 191,Ld47QsQHM7c,-1,1,0 99 | 192,6EFHZfISGp4,-1,1,0 100 | 193,5nGYkH9ifzM,-1,1,0 101 | 195,gU9hPfx12GA,-1,1,0 102 | 203,JtHHnBUmc0g,-1,1,0 103 | 205,rv3DPaMaS2g,-1,1,0 104 | 206,1hhzrormtP4,-1,1,0 105 | 210,0sMwKLkW4lI,-1,1,0 106 | 212,wzjVT07bcYA,-1,1,0 107 | 213,cvjbXYdh8x0,-2,1,-1 108 | 214,VX5ku0LbMMk,-1,1,0 109 | 217,j43XK0wzMd4,-4,1,-1 110 | 222,1yeANLOHnJ8,-3,3,-1 111 | 224,fKXg-SUP5P4,-2,1,-1 112 | 230,YE7kwNXqV30,-1,1,0 113 | 232,Ix5U2S8UXPA,-2,1,-1 114 | 234,2efYeNroXvg,-1,1,0 115 | 235,gSXOxrjCA40,-1,1,0 116 | 238,6_VJXHfMevM,-1,1,0 117 | 239,-4k3AzfYuJg,-1,1,0 118 | 240,EagrIPTCqrg,-1,1,0 119 | 249,7yNhCDMB0ls,-1,1,0 120 | 250,_jA8k4YDzlo,-1,1,0 121 | 251,lpGUzz-tjWs,-1,1,0 122 | 252,qCeBPeBjKcA,-1,1,0 123 | 253,WVc-Y-mJ_uY,-2,2,-1 124 | 255,nu0f86EkzS8,-2,1,-1 125 | 258,nkoRm9A7xr8,-1,1,0 126 | 262,DMbu9w4pDXE,-1,1,0 127 | 264,X7Wv0AZC_D4,-1,1,0 128 | 268,sGx6P2UR8Ig,-1,1,0 129 | 276,61hsoU0AIK4,-1,1,0 130 | 280,FHUHsBnpCj8,-1,1,0 131 | 285,Ez4qvsR-gHQ,-1,1,0 132 | 286,UDlHcxWtbvw,-1,1,0 133 | 288,STeynRkoU3s,-3,2,-1 134 | 292,BWJBOwSa4h8,-1,1,0 135 | 293,e3duOpZlD9E,-1,1,0 136 | 302,EyMAmakw1dU,-1,1,0 137 | 304,11FCyUB81rI,-1,1,0 138 | 306,FDWAEKQ0KkU,-1,1,0 139 | 307,sSF8uFoSm1M,-3,1,-1 140 | 308,5Z9OZE_TypE,-1,1,0 141 | 309,GaUi64HwUZg,-1,1,0 142 | 310,TMF9aMI-9ek,-1,1,0 143 | 311,IagqMq4wfCc,-1,1,0 144 | 313,2iC-G7KXTBU,-1,1,0 145 | 314,8paPxMzc0mo,-1,1,0 146 | 316,o_dshuJTxLI,-1,1,0 147 | 317,fqqPjRNXgdA,-1,1,0 148 | 319,-61c8EQ8qro,-1,1,0 149 | 321,Opy-a_oW3Bw,-1,1,0 150 | 322,7GTtCtXJJ0Y,-1,1,0 151 | 323,KEkmIErcgT4,-1,1,0 152 | 324,AFvTrdOqdXo,-1,1,0 153 | 325,oSqmCNNV2dQ,-1,1,0 154 | 326,bW3IQ-ke43w,-1,1,0 155 | 328,ikz5JHfPQ6k,-1,1,0 156 | 329,KArS5ArSYY4,-2,1,-1 157 | 331,c7lCRYf9rHo,-2,1,-1 158 | 332,u9KxE4Kv9A8,-1,1,0 159 | 333,ba1tND0B0xk,-1,1,0 160 | 334,vqKLTEQjew4,-1,1,0 161 | 338,KAJsdgTPJpU,-1,1,0 162 | 339,7OxgWkhozQU,-1,1,0 163 | 341,vZIC6hJ_fCE,-2,2,-1 164 | 344,DYqtXR8iPlE,-2,1,-1 165 | 345,N1cdCUZNh04,-1,1,0 166 | 347,_WY7FEYN3QI,-1,1,0 167 | 349,oDjuoBAtLWA,-4,1,-1 168 | 351,UvHMhZ1T964,-1,1,0 169 | 353,rYxt0BeTrT8,-2,1,-1 170 | 355,7f5NVJTqPaU,-3,1,-1 171 | 358,L0ryCJVAGZE,-1,1,0 172 | 360,-PSR_OutuIw,-1,1,0 173 | 362,RA4mIbQo52k,-1,1,0 174 | 364,9Pqp_8XLC6c,-1,1,0 175 | 370,0__6kx-vTO4,-2,2,-1 176 | 373,jOHuUeZzPh0,-1,1,0 177 | 376,1tRDnjl_gwY,-3,1,-1 178 | 379,Da5-n9pf6sM,-2,1,-1 179 | 380,z9ALFf6eQI0,-1,1,0 180 | 383,088j0n0XxQE,-1,1,0 181 | 385,WRgv4V1ZxN4,-1,1,0 182 | 388,WR6uSXW-8p4,-1,1,0 183 | 389,rhQVustYV24,-1,1,0 184 | 390,kQozp2xZ3Q0,-2,1,-1 185 | 391,jOYvuLIwWEQ,-1,1,0 186 | 394,pj5ZLwtoAmI,-1,1,0 187 | 396,G-YdVrhAoNU,-1,1,0 188 | 397,n0bqG1GzlHU,-1,1,0 189 | 400,LRfwnxQN1Lw,-1,1,0 190 | 402,oQbftR8pG78,-3,1,-1 191 | 404,J__4V0ujlaU,-2,1,-1 192 | 405,78iIQdKmodc,-1,1,0 193 | 406,zvOcuZ3-FO8,-1,1,0 194 | 407,-BvcToPZCLI,-3,1,-1 195 | 409,YrnlZXeC1nM,-1,1,0 196 | 410,tNwkY_V_BPI,-1,1,0 197 | 415,8l-dhwqd2UM,-1,1,0 198 | 416,WFV-rcaBG9g,-2,1,-1 199 | 417,_XugW-yg2XI,-1,1,0 200 | 418,uo8qXxnFuRQ,-1,3,1 201 | 421,mapriv3vWBA,-3,1,-1 202 | 422,f6URRc-0Z1o,-2,1,-1 203 | 426,pYtEukvKjLc,-3,1,-1 204 | 427,w3FBLKHG-9M,-1,1,0 205 | 428,oCVQdr9QFwY,-1,1,0 206 | 431,ZPMy2Yw8teM,-1,1,0 207 | 432,vtMHfFxwg3U,-1,1,0 208 | 434,OYtAGTe9MjY,-1,2,1 209 | 436,wRk1p8Lzwvo,-1,1,0 210 | 437,4sSJpKTdwFo,-1,1,0 211 | 439,AJkFuRzJNoQ,-1,1,0 
212 | 440,tffj_82IRsg,-3,1,-1 213 | 441,1CnyqLogH0Y,-2,1,-1 214 | 442,FP-9l6BeagE,-2,1,-1 215 | 443,MeKAdOySB_E,-1,1,0 216 | 445,VM4d66igm9w,-2,1,-1 217 | 448,9iAD_heE2kU,-2,1,-1 218 | 449,SQY7VOQF8sY,-1,1,0 219 | 450,cZwQN4JpJ8s,-1,1,0 220 | 451,DhhVr5iLF-c,-1,1,0 221 | 453,XL1rpFCBg5s,-1,1,0 222 | 454,BzPjWpkNWiU,-1,2,1 223 | 456,f2Wr7lDI-Hg,-1,1,0 224 | 457,3fQHpXkI-vc,-1,1,0 225 | 459,QpLdpjcHhqs,-1,1,0 226 | 463,DMTwbV9UqHA,-1,1,0 227 | 466,zIvjHSvzFLU,-1,1,0 228 | 467,9rkDTXEOpEM,-1,1,0 229 | 468,t-uRB26a-sg,-2,1,-1 230 | 469,9zcGrc2xcO0,-1,1,0 231 | 470,vapTJLUSvpQ,-1,1,0 232 | 471,zMQ0xQrgBms,-1,1,0 233 | 472,j8ZrRL2lbsA,-1,1,0 234 | 474,JYZpxRy5Mfg,-1,1,0 235 | 475,38dqOdQFdRI,-1,1,0 236 | 476,BFp3Q3WdVWI,-1,1,0 237 | 481,BQ4rBLCpEeM,-3,1,-1 238 | 482,00cKGt9v1as,-1,1,0 239 | 483,UgHNg-N-ENI,-1,1,0 240 | 484,dsyW3QjBQHU,-1,1,0 241 | 486,U3r-TzeSzrc,-1,1,0 242 | 487,8ISePLL1wcw,-1,1,0 243 | 490,w38fhmZkz64,-1,1,0 244 | 493,ZgdTHVcv1o8,-1,1,0 245 | 495,i-qBOyrD0-0,-1,1,0 246 | -------------------------------------------------------------------------------- /2021/day2/Notebooks/Datasets/Sentiment_YouTubeClimateChange.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2021/day2/Notebooks/Datasets/Sentiment_YouTubeClimateChange.pkl -------------------------------------------------------------------------------- /2021/day2/Notebooks/Datasets/websites.csv: -------------------------------------------------------------------------------- 1 | site,type,views,active_users 2 | Twitter,Social Media,10000,200000 3 | Facebook,Social Media,35000,500000 4 | NYT,News media,78000,156000 5 | YouTube,Video platform,18000,289000 6 | Vimeo,Video platform,300,1580 7 | USA Today,News media,4800,5608 8 | -------------------------------------------------------------------------------- /2021/day2/Notebooks/ExcercisesPandas.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "accompanied-inspector", 6 | "metadata": {}, 7 | "source": [ 8 | "## Excercises pandas\n", 9 | "\n", 10 | "Let's practice data exploration and wrangling in Pandas. \n", 11 | "\n", 12 | "We will work with data collected through the Twitter API (more on API's tomorrow ;)). Few months ago I collected tweets published by the RIVM twice. 
I have also already run a sentiment analysis on these tweets and have saved it in a separate file.\n", 13 | "We have three datasets:\n", 14 | "* Two datasets with tweets by the RIVM (public tweets by account)\n", 15 | "* One dataset with sentiment of those tweets (simulated dataset, with two sentiment scores)\n", 16 | "\n", 17 | "We want to see how sentiment changes over time (per month), compare number of positive and negative tweets and analyze the relation between sentiment and engagement with the tweets\n", 18 | "\n", 19 | "We want to prepare the dataset for analysis:\n", 20 | "\n", 21 | "**Morning**\n", 22 | "* Data exploration\n", 23 | " * Check columns, data types, missing values, descriptives for numeric variables measuring engagement and sentiment, value_counts for relevant categorical variables\n", 24 | "* Handling missing values and data types\n", 25 | " * Handle missing values in variables of interest: number of likes and retweets - what can nan's mean?\n", 26 | " * Make sure created_at has the right format (to use it for aggregation later)\n", 27 | "* Creating necessary variables (sentiment)\n", 28 | " * Overall measure of sentiment - create it from positive and negative\n", 29 | " * Binary variable (positive or negative tweet) - Tip: Write a function that \"recodes\" the sentiment column\n", 30 | " \n", 31 | "\n", 32 | "**Afternoon**\n", 33 | "\n", 34 | "Pandas continued\n", 35 | "* Concatenating the dataframes (tweets1 and tweets2)\n", 36 | "* Merging the files (tweets with sentiment)\n", 37 | " * Make sure the columns you merge match and check how to merge\n", 38 | "* Agrregating the files per month\n", 39 | " * Tip: Create a column for month by transforming the date column. Remember that the date column needs the right format first!\n", 40 | " \n", 41 | " `df['month'] = df['date_dt_column'].dt.strftime('%Y-%m')`\n", 42 | "\n", 43 | "\n", 44 | "\n", 45 | "\n", 46 | "\n", 47 | "Visualisations:\n", 48 | "* Visualise different columns of the tweet dataset (change of sentiment over time, sentiment, engagement, relation between sentiment and engagement)\n", 49 | "* But *more fun*: use your own data to play with visualisations" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 2, 55 | "id": "brazilian-giving", 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "import pandas as pd" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": 3, 65 | "id": "appreciated-cigarette", 66 | "metadata": {}, 67 | "outputs": [], 68 | "source": [ 69 | "tweets1 = pd.read_csv('Datasets/RIVM_1.csv')" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 4, 75 | "id": "hollow-honduras", 76 | "metadata": {}, 77 | "outputs": [], 78 | "source": [ 79 | "tweets2 = pd.read_csv('Datasets/RIVM_2.csv')" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 5, 85 | "id": "leading-adelaide", 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [ 89 | "sentiment = pd.read_csv('Datasets/RIVM_sentiment.csv')" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": null, 95 | "id": "antique-specific", 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [] 99 | } 100 | ], 101 | "metadata": { 102 | "kernelspec": { 103 | "display_name": "Python 3", 104 | "language": "python", 105 | "name": "python3" 106 | }, 107 | "language_info": { 108 | "codemirror_mode": { 109 | "name": "ipython", 110 | "version": 3 111 | }, 112 | "file_extension": ".py", 113 | "mimetype": "text/x-python", 114 | "name": "python", 115 | 
"nbconvert_exporter": "python", 116 | "pygments_lexer": "ipython3", 117 | "version": "3.9.1" 118 | } 119 | }, 120 | "nbformat": 4, 121 | "nbformat_minor": 5 122 | } 123 | -------------------------------------------------------------------------------- /2021/day2/README.md: -------------------------------------------------------------------------------- 1 | # Day 2: Pandas and statistics 2 | 3 | | Time (indication) | Topic | 4 | |-|-| 5 | | 9.30-11.00 | Pandas: We will start the day with a general introduction to Pandas and will learn working with dataframes. We will start with reading different types of data into Pandas dataframe and continue with data wrangling. We will also shortly discuss pro's and con's of using Pandas comopared to formats discussed on Monday.| 6 | | 11.00-12.00 | Exercises | 7 | | 13.00-14.00 | Basic statistics and plotting: We will continue working with dataframes focusing on basic analysis and visualisation steps. We will discuss descritpive startistics as well as most commonly used statistics tests. We will also work with univariate and bivariate plots. | 8 | | 14:00 - 15:30 | Exercises | 9 | | 15:30 - 16:30 | Teaching presentations & general Q&A | 10 | | 16:30 - 17:00 | Questions / discussion/ next steps | 11 | -------------------------------------------------------------------------------- /2021/day3/README.md: -------------------------------------------------------------------------------- 1 | # Day 3: Collecting and reading data 2 | 3 | | Time (indication) | Topic | 4 | |-|-| 5 | | 9.30-11.00 | Data Collection 1: We will dive into handling data beyond typical tabular datasets (such as the csv files from Tuesday) and get an introduction to the JSON format, which is the de-facto standard for (online) data exchange. We will also get to know our first API (which uses this format). | 6 | | 11.00-12.00 | Exercises | 7 | | 13.00-14.00 | Data Collection 2: APIs and scraping. We will look more in detail into different APIs and how they can be used for data collection. We will also briefly talk about scraping which is highly relevant as a data collection technique (for instance, for theses), but also a bit too complex to cover in detail in this workshop. | 8 | | 14:00 - 15:30 | Practice | 9 | | 15:30 - 16:30 | Teaching presentations & general Q&A | 10 | | 16:30 - 17:00 | Closure / next steps | 11 | 12 | 13 | -------------------------------------------------------------------------------- /2021/day3/day3-afternoon.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2021/day3/day3-afternoon.pdf -------------------------------------------------------------------------------- /2021/day3/day3-morning.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2021/day3/day3-morning.pdf -------------------------------------------------------------------------------- /2021/day3/exercises/exercises.md: -------------------------------------------------------------------------------- 1 | # Exercises Week 3 2 | 3 | ## 1. Working with CSV files 4 | 5 | 1. Take a dataset of your choice - something from your own work maybe. If it is not a CSV file (but, for instance, an Excel sheet or an SPSS file), export it as CSV. 
Inspect the file in a text editor of your choice (such as the ones available at https://www.sublimetext.com/, https://notepad-plus-plus.org, or https://atom.io) and check: 6 | - the encoding 7 | - the line ending style 8 | - the delimiter 9 | - whether it has a header row or not 10 | - take a quick look and check whether the file looks "ok", i.e. all rows have equal number of fields etc. 11 | 12 | 13 | 2. Open your file in Python and write it back (with a different file name). Do so both with the low-level (basic Python) and the high-level (pandas) approach. Inspect the result again in the editor and compare. (NB: Depending on the dialect, there may be small differences. If you observe some, which are they?) 14 | 15 | 16 | ## 2. Working with JSON files and APIs 17 | 18 | 1. Reproduce examples 12.1 (page 315), 12.2 (page 316) and 12.3 (page 334) from the book. Explain the code to a classmate. 19 | 20 | 2. Think of different ways of storing the data you collected. What would be the pros and cons? Discuss with a classmate. 21 | 22 | 3. What do you think of example 12.2 (or line 12 in example 12.3, for that matter)? Would you rather store your data *before* or *after* the `json_normalize()` function? Discuss with a classmate. (NB: there are arguments to be made for both) 23 | 24 | 4. What would happen if you would directly create a dataframe (e.g., via `pd.Dataframe(allitems)`, `pd.Dataframe(data['items'])`, or similar)? Based on this observation, can you describe what `json_normalize()` does? 25 | -------------------------------------------------------------------------------- /2021/day4/README.md: -------------------------------------------------------------------------------- 1 | # Day 4: Natural Language Processing 2 | 3 | | Time slot | Content | 4 | |---------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| 5 | | 09:30-11:00 | [NLP I](https://github.com/uvacw/teachteacher-python/blob/main/day4/day4.pdf): In a gentle introduction to NLP techniques, we will discuss the basics of bag-of-word (BAG) approaches, such as tokenization, stopword removal, stemming and lemmatization | 6 | | 11:00 - 12:00 | [Exercise 1](https://github.com/uvacw/teachteacher-python/blob/main/day4/exercises-1/exercise-1.md) | 7 | | 12:00-13:00 | Lunch | 8 | | 13:00-14:00 | [NLP II](https://github.com/uvacw/teachteacher-python/blob/main/day4/day4-afternoon.pdf): In the second lecture of the day, we will delve a bit deeper in NLP approaches. We discuss different types of vectorizers (i.e., count and tfidf) and discuss the possibilities NER in spacy. 
| 9 | | 14:00-15:00 | [Exercise 2](https://github.com/uvacw/teachteacher-python/blob/main/day4/exercises-2/exercise-2.md) | 10 | | 15:30-16:00 | [NLP & teaching; General Q&A](https://github.com/uvacw/teachteacher-python/blob/main/day4/day4-afternoon.pdf) | 11 | | 16:00-17:00 | Closure/next steps | 12 | -------------------------------------------------------------------------------- /2021/day4/day4-afternoon.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2021/day4/day4-afternoon.pdf -------------------------------------------------------------------------------- /2021/day4/day4-afternoon.tex: -------------------------------------------------------------------------------- 1 | % !TeX document-id = {f19fb972-db1f-447e-9d78-531139c30778} 2 | % !BIB program = biber 3 | 4 | \documentclass[handout]{beamer} 5 | %\documentclass[compress]{beamer} 6 | \usepackage[T1]{fontenc} 7 | \usetheme[block=fill,subsectionpage=progressbar,sectionpage=progressbar]{metropolis} 8 | \usepackage{graphicx} 9 | 10 | \usepackage{wasysym} 11 | \usepackage{etoolbox} 12 | \usepackage[utf8]{inputenc} 13 | 14 | \usepackage{threeparttable} 15 | \usepackage{subcaption} 16 | 17 | \usepackage{tikz-qtree} 18 | \setbeamercovered{still covered={\opaqueness<1->{5}},again covered={\opaqueness<1->{100}}} 19 | 20 | 21 | \usepackage{listings} 22 | 23 | \lstset{ 24 | basicstyle=\scriptsize\ttfamily, 25 | columns=flexible, 26 | breaklines=true, 27 | numbers=left, 28 | %stepsize=1, 29 | numberstyle=\tiny, 30 | backgroundcolor=\color[rgb]{0.85,0.90,1} 31 | } 32 | 33 | 34 | 35 | \lstnewenvironment{lstlistingoutput}{\lstset{basicstyle=\footnotesize\ttfamily, 36 | columns=flexible, 37 | breaklines=true, 38 | numbers=left, 39 | %stepsize=1, 40 | numberstyle=\tiny, 41 | backgroundcolor=\color[rgb]{.7,.7,.7}}}{} 42 | 43 | 44 | \lstnewenvironment{lstlistingoutputtiny}{\lstset{basicstyle=\tiny\ttfamily, 45 | columns=flexible, 46 | breaklines=true, 47 | numbers=left, 48 | %stepsize=1, 49 | numberstyle=\tiny, 50 | backgroundcolor=\color[rgb]{.7,.7,.7}}}{} 51 | 52 | 53 | 54 | \usepackage[american]{babel} 55 | \usepackage{csquotes} 56 | \usepackage[style=apa, backend = biber]{biblatex} 57 | \DeclareLanguageMapping{american}{american-UoN} 58 | \addbibresource{../../bdaca.bib} 59 | \renewcommand*{\bibfont}{\tiny} 60 | 61 | \usepackage{tikz} 62 | \usetikzlibrary{shapes,arrows,matrix} 63 | \usepackage{multicol} 64 | 65 | \usepackage{subcaption} 66 | 67 | \usepackage{booktabs} 68 | \usepackage{graphicx} 69 | 70 | 71 | 72 | \makeatletter 73 | \setbeamertemplate{headline}{% 74 | \begin{beamercolorbox}[colsep=1.5pt]{upper separation line head} 75 | \end{beamercolorbox} 76 | \begin{beamercolorbox}{section in head/foot} 77 | \vskip2pt\insertnavigation{\paperwidth}\vskip2pt 78 | \end{beamercolorbox}% 79 | \begin{beamercolorbox}[colsep=1.5pt]{lower separation line head} 80 | \end{beamercolorbox} 81 | } 82 | \makeatother 83 | 84 | 85 | 86 | \setbeamercolor{section in head/foot}{fg=normal text.bg, bg=structure.fg} 87 | 88 | 89 | 90 | \newcommand{\question}[1]{ 91 | \begin{frame}[plain] 92 | \begin{columns} 93 | \column{.3\textwidth} 94 | \makebox[\columnwidth]{ 95 | \includegraphics[width=\columnwidth,height=\paperheight,keepaspectratio]{../media/mannetje.png}} 96 | \column{.7\textwidth} 97 | \large 98 | \textcolor{orange}{\textbf{\emph{#1}}} 99 | \end{columns} 100 | \end{frame}} 101 | 102 | 103 | 104 | \title[Big Data and 
Automated Content Analysis]{\textbf{Teaching the Teacher} \\ Day 4 - Afternoon » From text to features: Natural Language Processing «} 105 | \author[Anne Kroon]{Anne Kroon \\ ~ \\ \footnotesize{a.c.kroon@uva.nl \\@annekroon} \\ } 106 | \date{July 1, 2021} 107 | \institute[UvA]{Afdeling Communicatiewetenschap \\Universiteit van Amsterdam} 108 | 109 | 110 | 111 | \begin{document} 112 | 113 | \begin{frame}{} 114 | \titlepage 115 | \end{frame} 116 | 117 | \begin{frame}{Today} 118 | \tableofcontents 119 | \end{frame} 120 | 121 | 122 | \section{From text to features: vectorizers} 123 | \begin{frame}[plain] 124 | From text to features: vectorizers 125 | \end{frame} 126 | 127 | 128 | 129 | \subsection{General idea} 130 | 131 | \begin{frame}[fragile]{A text as a collections of word} 132 | 133 | Let us represent a string 134 | \begin{lstlisting} 135 | t = "This this is is is a test test test" 136 | \end{lstlisting} 137 | like this:\\ 138 | \begin{lstlisting} 139 | from collections import Counter 140 | print(Counter(t.split())) 141 | \end{lstlisting} 142 | \begin{lstlistingoutput} 143 | Counter({'is': 3, 'test': 3, 'This': 1, 'this': 1, 'a': 1}) 144 | \end{lstlistingoutput} 145 | 146 | \pause 147 | Compared to the original string, this representation 148 | \begin{itemize} 149 | \item is less repetitive 150 | \item preserves word frequencies 151 | \item but does \emph{not} preserve word order 152 | \item can be interpreted as a vector to calculate with (!!!) 153 | \end{itemize} 154 | 155 | \tiny{\emph{Of course, still a lot of stuff to fine-tune\ldots} (for example, This/this)} 156 | \end{frame} 157 | 158 | 159 | 160 | \begin{frame}{From vector to matrix} 161 | If we do this for multiple texts, we can arrange the vectors in a table. 162 | 163 | t1 = "This this is is is a test test test" \newline 164 | t2 = "This is an example" 165 | 166 | \begin{tabular}{| c|c|c|c|c|c|c|c|} 167 | \hline 168 | & a & an & example & is & this & This & test \\ 169 | \hline 170 | \emph{t1} & 1 & 0 & 0 & 3 & 1 & 1 & 3 \\ 171 | \emph{t2} &0 & 1 & 1 & 1 & 0 & 1 & 0 \\ 172 | \hline 173 | \end{tabular} 174 | \end{frame} 175 | 176 | 177 | \question{What can you do with such a matrix? Why would you want to represent a collection of texts in such a way?} 178 | 179 | \begin{frame}{What is a vectorizer} 180 | \begin{itemize}[<+->] 181 | \item Transforms a list of texts into a sparse (!) matrix (of word frequencies) 182 | \item Vectorizer needs to be ``fitted'' to the training data (learn which words (features) exist in the dataset and assign them to columns in the matrix) 183 | \item Vectorizer can then be re-used to transform other datasets 184 | \end{itemize} 185 | \end{frame} 186 | 187 | 188 | \begin{frame}{The cell entries: raw counts versus tf$\cdot$idf scores} 189 | \begin{itemize} 190 | \item In the example, we entered simple counts (the ``term frequency'') 191 | \end{itemize} 192 | \end{frame} 193 | 194 | \question{But are all terms equally important?} 195 | 196 | 197 | \begin{frame}{The cell entries: raw counts versus tf$\cdot$idf scores} 198 | \begin{itemize} 199 | \item In the example, we entered simple counts (the ``term frequency'') 200 | \item But does a word that occurs in almost all documents contain much information? 201 | \item And isn't the presence of a word that occurs in very few documents a pretty strong hint? 
202 | \item<2-> \textbf{Solution: Weigh by \emph{the number of documents in which the term occurs at least once) (the ``document frequency'')}} 203 | \end{itemize} 204 | \onslide<3->{ 205 | $\Rightarrow$ we multiply the ``term frequency'' (tf) by the inverse document frequency (idf) 206 | 207 | \tiny{(usually with some additional logarithmic transformation and normalization applied, see \url{https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html})} 208 | } 209 | \end{frame} 210 | 211 | \begin{frame}{tf$\cdot$idf} 212 | \begin{array}{ccc} 213 | 214 | w_{i, j}=t f_{i, j} \times \log \left(\frac{N}{d f_{i}}\right) \\ \\ 215 | 216 | t f_{i, j}=\text { number of occurrences of } i \text { in } j \\ 217 | d f_{i}=\text { number of documents containing } i \\ 218 | N=\text {total number of documents } 219 | \end{array} 220 | \end{frame} 221 | 222 | \begin{frame}{Is tf$\cdot$idf always better?} 223 | It depends. 224 | 225 | \begin{itemize} 226 | \item Ultimately, it's an empirical question which works better ($\rightarrow$ machine learning) 227 | \item In many scenarios, ``discounting'' too frequent words and ``boosting'' rare words makes a lot of sense (most frequent words in a text can be highly un-informative) 228 | \item Beauty of raw tf counts, though: interpretability + describes document in itself, not in relation to other documents 229 | \end{itemize} 230 | \end{frame} 231 | 232 | 233 | \begin{frame}{Different vectorizers} 234 | \begin{enumerate}[<+->] 235 | \item CountVectorizer (=simple word counts) 236 | \item TfidfVectorizer (word counts (``term frequency'') weighted by number of documents in which the word occurs at all (``inverse document frequency'')) 237 | \end{enumerate} 238 | \end{frame} 239 | 240 | \begin{frame}{Internal representations} 241 | \begin{block}{Sparse vs dense matrices} 242 | \begin{itemize} 243 | \item $\rightarrow$ tens of thousands of columns (terms), and one row per document 244 | \item Filling all cells is inefficient \emph{and} can make the matrix too large to fit in memory (!!!) 245 | \item Solution: store only non-zero values with their coordinates! (sparse matrix) 246 | \item dense matrix (or dataframes) not advisable, only for toy examples 247 | \end{itemize} 248 | \end{block} 249 | \end{frame} 250 | 251 | 252 | {\setbeamercolor{background canvas}{bg=black} 253 | \begin{frame} 254 | \makebox[\linewidth]{ 255 | \includegraphics[width=\paperwidth,height=\paperheight,keepaspectratio]{../media/sparse_dense.png}} 256 | \url{https://matteding.github.io/2019/04/25/sparse-matrices/} 257 | \end{frame} 258 | } 259 | 260 | 261 | \begin{frame}[standout] 262 | This morning we learned how to tokenize with a list comprehension (and that's often a good idea!). But what if we want to \emph{directly} get a DTM instead of lists of tokens? 263 | \end{frame} 264 | 265 | 266 | \begin{frame}[fragile]{OK, good enough, perfect?} 267 | \begin{block}{scikit-learn's CountVectorizer (default settings)} 268 | \begin{itemize} 269 | \item applies lowercasing 270 | \item deals with punctuation etc. 
itself 271 | \item minimum word length $>1$ 272 | \item more technically, tokenizes using this regular expression: \texttt{r"(?u)\textbackslash b\textbackslash w\textbackslash w+\textbackslash b"} \footnote{?u = support unicode, \textbackslash b = word boundary} 273 | \end{itemize} 274 | \end{block} 275 | \begin{lstlisting} 276 | from sklearn.feature_extraction.text import CountVectorizer 277 | cv = CountVectorizer() 278 | dtm_sparse = cv.fit_transform(docs) 279 | \end{lstlisting} 280 | \end{frame} 281 | 282 | 283 | \begin{frame}{OK, good enough, perfect?} 284 | \begin{block}{CountVectorizer supports more} 285 | \begin{itemize} 286 | \item stopword removal 287 | \item custom regular expression 288 | \item or even using an external tokenizer 289 | \item ngrams instead of unigrams 290 | \end{itemize} 291 | \end{block} 292 | \tiny{see \url{https://scikit-learn.org/stable/modules/generated/sklearn.feature\_extraction.text.CountVectorizer.html}} 293 | 294 | \pause 295 | \begin{alertblock}{Best of both worlds} 296 | \textbf{Use the Count vectorizer with a NLTK-based external tokenizer! (see book)} 297 | \end{alertblock} 298 | \end{frame} 299 | 300 | 301 | \subsection{Pruning} 302 | 303 | \begin{frame}{General idea} 304 | \begin{itemize} 305 | \item Idea behind both stopword removal and tf$\cdot$idf: too frequent words are uninformative 306 | \item<2-> (possible) downside stopword removal: a priori list, does not take empirical frequencies in dataset into account 307 | \item<3-> (possible) downside tf$\cdot$idf: does not reduce number of features 308 | \end{itemize} 309 | 310 | \onslide<4->{Pruning: remove all features (tokens) that occur in less than X or more than X of the documents} 311 | \end{frame} 312 | 313 | \begin{frame}[fragile, plain] 314 | CountVectorizer, only stopword removal 315 | \begin{lstlisting} 316 | from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer 317 | myvectorizer = CountVectorizer(stop_words=mystopwords) 318 | \end{lstlisting} 319 | 320 | CountVectorizer, better tokenization, stopword removal (pay attention that stopword list uses same tokenization!): 321 | \begin{lstlisting} 322 | myvectorizer = CountVectorizer(tokenizer = TreebankWordTokenizer().tokenize, stop_words=mystopwords) 323 | \end{lstlisting} 324 | 325 | Additionally remove words that occur in more than 75\% or less than $n=2$ documents: 326 | \begin{lstlisting} 327 | myvectorizer = CountVectorizer(tokenizer = TreebankWordTokenizer().tokenize, stop_words=mystopwords, max_df=.75, min_df=2) 328 | \end{lstlisting} 329 | 330 | All togehter: tf$\cdot$idf, explicit stopword removal, pruning 331 | \begin{lstlisting} 332 | myvectorizer = TfidfVectorizer(tokenizer = TreebankWordTokenizer().tokenize, stop_words=mystopwords, max_df=.75, min_df=2) 333 | \end{lstlisting} 334 | 335 | 336 | \end{frame} 337 | 338 | 339 | \question{What is ``best''? 
Which (combination of) techniques to use, and how to decide?} 340 | 341 | 342 | \section{Teaching Q\&A} 343 | 344 | \begin{frame}{NLP and teaching} 345 | \begin{block}{Teaching experiences} 346 | \begin{itemize} 347 | \item Transparancy 348 | \item Students should be able to explain HOW they've preprocessed the data and WHY 349 | \item Arguments for preprocessing differ across unsupervised and supervised tasks 350 | \end{itemize} 351 | \end{block} 352 | \end{frame} 353 | 354 | 355 | \begin{frame}{NLP and teaching} 356 | \begin{block}{Teaching experiences} 357 | \begin{itemize} 358 | \item Rationale for using preprocessing: Why do you use specific techniques? 359 | \item For supervised learning: often an empirical question 360 | \item Thus: testing different setting and explain what works best. Systematically testing different techniques 361 | \end{itemize} 362 | \end{block} 363 | \end{frame} 364 | 365 | \begin{frame}{NLP and teaching} 366 | \begin{block}{Sentiment analysis} 367 | \begin{itemize} 368 | \item Dictionary-based approaches 369 | \item Keep in mind; what are best practices? Off-the-shelf do not necessarily generalize well. 370 | \end{itemize} 371 | \end{block} 372 | \end{frame} 373 | 374 | \begin{frame}{Thank you!!} 375 | \begin{block}{Thank you for your attention!} 376 | \begin{itemize} 377 | \item Questions? Comments? 378 | \end{itemize} 379 | \end{block} 380 | \end{frame} 381 | 382 | 383 | 384 | \end{document} 385 | 386 | 387 | -------------------------------------------------------------------------------- /2021/day4/day4.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2021/day4/day4.pdf -------------------------------------------------------------------------------- /2021/day4/example-nltk.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | #### Example using a single `str` object 4 | 5 | ```python 6 | import nltk 7 | words = "the quick brown fox jumps over the lazy dog".split() 8 | nltk.pos_tag(words, tagset='universal') 9 | ``` 10 | 11 | #### Example using a `list` of `str` 12 | 13 | ```python 14 | articles = ['the quick brown fox jumps over the lazy dog', 'a second sentence'] 15 | tokens = [nltk.word_tokenize(sentence) for sentence in articles] 16 | tagged = [nltk.pos_tag(sentence, tagset='universal') for sentence in tokens] 17 | print(tagged[0]) 18 | ``` 19 | 20 | ----- 21 | 22 | | Tag | Meaning | English Examples | 23 | |------|---------------------|----------------------------------------| 24 | | ADJ | adjective | new, good, high, special, big, local | 25 | | ADP | adposition | on, of, at, with, by, into, under | 26 | | ADV | adverb | really, already, still, early, now | 27 | | CONJ | conjunction | and, or, but, if, while, although | 28 | | DET | determiner, article | the, a, some, most, every, no, which | 29 | | NOUN | noun | year, home, costs, time, Africa | 30 | | NUM | numeral | twenty-four, fourth, 1991, 14:24 | 31 | | PRT | particle | at, on, out, over per, that, up, with | 32 | | PRON | pronoun | he, their, her, its, my, I, us | 33 | | VERB | verb | is, say, told, given, playing, would | 34 | | . | punctuation marks | . , ; ! 
| 35 | | X | other | ersatz, esprit, dunno, gr8, univeristy | 36 | 37 | [source](https://bond-lab.github.io/Corpus-Linguistics/ntumc_tag_u.html) 38 | -------------------------------------------------------------------------------- /2021/day4/example-vectorizer-to-dense.md: -------------------------------------------------------------------------------- 1 | 2 | ```python 3 | import pandas as pd 4 | from sklearn.feature_extraction.text import CountVectorizer 5 | texts = ["hello teachers!", "how are you today?", "what?", "hello hello everybody"] 6 | 7 | vect = CountVectorizer() 8 | 9 | X = vect.fit_transform(texts) 10 | print(pd.DataFrame(X.A, columns=vect.get_feature_names()).to_string()) 11 | df = pd.DataFrame(X.toarray().transpose(), index = vect.get_feature_names()) 12 | ``` 13 | -------------------------------------------------------------------------------- /2021/day4/exercises-1/exercise-1.md: -------------------------------------------------------------------------------- 1 | # Working with textual data 2 | 3 | ### 0. Get the data. 4 | 5 | - Download `articles.tar.gz` from 6 | https://dx.doi.org/10.7910/DVN/ULHLCB 7 | 8 | - Unpack it. On Linux and MacOS, you can do this with `tar -xzf articles.tar.gz` on the command line. On Windows, you may need an additional tool such as `7zip` for that (note that technically speaking, there is a `tar` archive within a `gz` archive, so unpacking may take *two* steps depending on your tool). 9 | 10 | 11 | ### 1. Inspect the structure of the dataset. 12 | What information do the following elements give you? 13 | 14 | - folder (directory) names 15 | - folder structure/hierarchy 16 | - file names 17 | - file contents 18 | 19 | ### 2. Discuss strategies for working with this dataset! 20 | 21 | - Which questions could you answer? 22 | - How could you deal with it, given the size and the structure? 23 | - How much memory1 (RAM) does your computer have? How large is the complete dataset? What does that mean? 24 | - Make a sketch (e.g., with pen & paper) of how you could handle your workflow and your data to answer your question. 25 | 26 | 1 *memory* (RAM), not *storage* (harddisk)! 27 | 28 | ### 3. Read some (or all?) data 29 | 30 | Here is some example code that you can modify. Assuming that the folder `articles` is in the same folder as the notebook you are currently working on, you could, for instance, do the following to read a *part* of your dataset. 31 | 32 | ```python 33 | from glob import glob 34 | infowarsfiles = glob('articles/*/Infowars/*') 35 | infowarsarticles = [] 36 | for filename in infowarsfiles: 37 | with open(filename) as f: 38 | infowarsarticles.append(f.read()) 39 | ``` 40 | 41 | - Can you explain what the `glob` function does? 42 | - What does `infowarsfiles` contain, and what does `infowarsarticles` contain? First make an educated guess based on the code snippet, then check it! Do *not* print the whole thing, but use `len`, `type` and slicing `[:10]` to get the info you need. 43 | 44 | - Tip: take a random sample of the articles for practice purposes (if your code works, you can scale up!) 45 | 46 | ``` 47 | # taking a random sample of the articles for practice purposes 48 | articles = random.sample(infowarsarticles, 10) 49 | ``` 50 | 51 | ### 4. Perform some analyses! 52 | 53 | - Perform some first analyses on the data using string methods and regular expressions! 54 | 55 | Techniques you can try out include: 56 | 57 | a. lowercasing 58 | 59 | b. tokenization 60 | 61 | c. stopword removal 62 | 63 | d. 
stemming and/or lemmatizing) 64 | 65 | 66 | 67 | If you want to tokenize and stem your data using `spacy`, you need to install `spacy` and the language model. Run the following in the your terminal environment: 68 | 69 | ```bash 70 | pip3 install spacy 71 | python3 -m spacy download en_core_web_sm 72 | ``` 73 | 74 | ### 5. extract Information 75 | 76 | Try to extract meaningful information from your texts. Depending on your interests and the nature of the data, you could: 77 | 78 | - use regular expressions to distinguish relevant from irrelevant texts, or to extract substrings 79 | - use NLP techniques such as Named Entity Recognition to extract entities that occur. 80 | 81 | 82 | ### BONUS: Inceasing efficiency + reusability 83 | The approach under (3) gets you very far. 84 | But for those of you who want to go the extra mile, here are some suggestions for further improvements in handling such a large dataset, consisting of thousands of files, and for deeper thinking about data handling: 85 | 86 | - Consider writing a function to read the data. Let your function take three parameters as input, `basepath` (where is the folder with articles located?), `month` and `outlet`, and return the articles that match this criterion. 87 | - Even better, make it a *generator* that yields the articles instead of returning a whole list. 88 | - Consider yielding a dict (with date, outlet, and the article itself) instead of yielding only the article text. 89 | - Think of the most memory-efficient way to get an overview of how often a given regular expression R is mentioned per outlet! 90 | - Under which circumstances would you consider having your function for reading the data return a pandas dataframe? 91 | -------------------------------------------------------------------------------- /2021/day4/exercises-1/possible-solution-exercise-1.md: -------------------------------------------------------------------------------- 1 | ## Exercise 2: Working with textual data - possible solutions 2 | 3 | ```python 4 | from glob import glob 5 | import random 6 | import nltk 7 | from nltk.stem.snowball import SnowballStemmer 8 | import spacy 9 | 10 | 11 | infowarsfiles = glob('articles/*/Infowars/*') 12 | infowarsarticles = [] 13 | for filename in infowarsfiles: 14 | with open(filename) as f: 15 | infowarsarticles.append(f.read()) 16 | 17 | 18 | # taking a random sample of the articles for practice purposes 19 | articles =random.sample(infowarsarticles, 10) 20 | 21 | ``` 22 | 23 | ### [Task 4](https://github.com/uvacw/teachteacher-python/blob/main/day4/exercises-1/exercise-1.md#4-perform-some-analyses): Preprocessing data 24 | 25 | ##### a. lowercasing articles 26 | 27 | ```python 28 | articles_lower_cased = [art.lower() for art in articles] 29 | ``` 30 | 31 | ##### b. tokenization 32 | 33 | Basic solution, using the `.str` method `.split()`. Not very sophisticated, though. 34 | 35 | ```python 36 | articles_split = [art.split() for art in articles] 37 | ``` 38 | 39 | A more sophisticated solution: 40 | 41 | ```python 42 | from nltk.tokenize import TreebankWordTokenizer 43 | articles_tokenized = [TreebankWordTokenizer().tokenize(art) for art in articles ] 44 | ``` 45 | 46 | ##### c. 
removing stopwords 47 | 48 | Define your stopword list: 49 | 50 | ```python 51 | from nltk.corpus import stopwords 52 | mystopwords = stopwords.words("english") 53 | mystopwords.extend(["add", "more", "words"]) # manually add more stopwords to your list if needed 54 | print(mystopwords) # let's see what's inside 55 | ``` 56 | 57 | Now, remove stopwords from the corpus: 58 | 59 | ```python 60 | articles_without_stopwords = [] 61 | for article in articles: 62 | articles_no_stop = "" 63 | for word in article.lower().split(): 64 | if word not in mystopwords: 65 | articles_no_stop = articles_no_stop + " " + word 66 | articles_without_stopwords.append(articles_no_stop) 67 | ``` 68 | 69 | Same solution, but with a list comprehension: 70 | 71 | ```python 72 | articles_without_stopwords = [" ".join([w for w in article.lower().split() if w not in mystopwords]) for article in articles] 73 | ``` 74 | 75 | A different--probably more sophisticated--solution: write a function and call it in a list comprehension: 76 | 77 | ```python 78 | def remove_stopwords(article, stopwordlist): 79 | cleantokens = [] 80 | for word in article: 81 | if word.lower() not in stopwordlist: 82 | cleantokens.append(word) 83 | return cleantokens 84 | 85 | articles_without_stopwords = [remove_stopwords(art, mystopwords) for art in articles_tokenized] 86 | ``` 87 | 88 | It's good practice to frequently inspect the results of your code, to make sure you are not making mistakes and the results make sense. For example, compare your results to some random articles from the original sample: 89 | 90 | ```python 91 | print(articles[8][:100]) 92 | print("-----------------") 93 | print(" ".join(articles_without_stopwords[8])[:100]) 94 | ``` 95 | 96 | ##### d. stemming and lemmatization 97 | 98 | ```python 99 | stemmer = SnowballStemmer("english") 100 | 101 | stemmed_text = [] 102 | for article in articles: 103 | stemmed_words = "" 104 | for word in article.lower().split(): 105 | stemmed_words = stemmed_words + " " + stemmer.stem(word) 106 | stemmed_text.append(stemmed_words.strip()) 107 | ``` 108 | 109 | Same solution, but with a list comprehension: 110 | 111 | ```python 112 | stemmed_text = [" ".join([stemmer.stem(w) for w in article.lower().split()]) for article in articles] 113 | ``` 114 | 115 | Compare tokenization and lemmatization using `spacy`: 116 | 117 | ```python 118 | import spacy 119 | nlp = spacy.load("en_core_web_sm") 120 | lemmatized_articles = [[token.lemma_ for token in nlp(art)] for art in articles] 121 | ``` 122 | 123 | 124 | Again, frequently inspect your code, and for example compare the results to the original articles: 125 | 126 | 127 | ```python 128 | print(articles[6][:100]) 129 | print("-----------------") 130 | print(stemmed_text[6][:100]) 131 | print("-----------------") 132 | print(" ".join(lemmatized_articles[6])[:100]) 133 | ``` 134 | 135 | ### [Task 5](https://github.com/uvacw/teachteacher-python/blob/main/day4/exercises-1/exercise-1.md#5-extract-information): Extract information 136 | 137 | ```python 138 | import nltk 139 | 140 | tokens = [nltk.word_tokenize(art) for art in articles] 141 | tagged = [nltk.pos_tag(tok) for tok in tokens] 142 | print(tagged[0]) 143 | ``` 144 | 145 | Playing around with `spacy`: 146 | 147 | ```python 148 | nlp = spacy.load("en_core_web_sm") 149 | 150 | docs = [nlp(art) for art in articles] 151 | for doc in docs: 152 | for ent in doc.ents: 153 | if ent.label_ == 'PERSON': 154 | print(ent.text, ent.label_) 155 | 156 | ``` 157 | 
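
A possible extension (not part of the original solution, and assuming the `articles` sample and the installed `en_core_web_sm` model from the snippets above): aggregate the extracted entities to see which persons are mentioned most often across the sample.

```python
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")

# count how often each PERSON entity occurs across all sampled articles
person_counts = Counter(
    ent.text
    for art in articles
    for ent in nlp(art).ents
    if ent.label_ == "PERSON"
)
print(person_counts.most_common(10))
```
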
-------------------------------------------------------------------------------- /2021/day4/exercises-2/exercise-2.md: -------------------------------------------------------------------------------- 1 | # Exercise 2: From text to features 2 | ---- 3 | 4 | Try to take some of the data from the [exercise of this morning](https://github.com/uvacw/teachteacher-python/blob/main/day4/exercises-1/exercise-1.md), and prepare this data for a supervised classification task. More specifically, imagine you want to train a classifier that will predict whether articles come from a fake news source (e.g., `Infowars`) or a quality news outlet (e.g., `bbc`). In other words, you want to predict `source` based on linguistic variations in the articles. 5 | 6 | To arrive at a model that will do just that, please consider taking the following steps: 7 | 8 | - Think about your **pre-processing steps**: what type of features will you feed your algorithm? Do you, for example, want to manually remove stopwords, or include ngrams? You can use the code you've written this morning as a starting point. 9 | 10 | - **Vectorize the data**: Try to fit different vectorizers to the data. You can use `count` vs. `tfidf` vectorizers, with or without pruning, stopword removal, etc. 11 | 12 | - Try out a simple supervised model. Find some inspiration [here](https://github.com/uvacw/teachteacher-python/blob/main/day4/exercises-2/possible-solution-exercise-2.md#build-a-simple-classifier). Can you predict the `source` using linguistic variations in the articles? 13 | 14 | - Which combination of pre-processing steps + vectorizer gives the best results? 15 | 16 | ## BONUS 17 | 18 | - Compare that bottom-up approach with a top-down (keyword or regular-expression based) approach. 19 | -------------------------------------------------------------------------------- /2021/day4/exercises-2/fix_example_book.md: -------------------------------------------------------------------------------- 1 | # Example in the book p. 231 2 | 3 | On page 231, there is an example that involves the line 4 | 5 | ```python3 6 | cv = CountVectorizer(tokenizer=mytokenizer.tokenize) 7 | ``` 8 | 9 | This example only works if `mytokenizer` has been "instantiated" before, and that instruction is missing. 10 | 11 | Essentially, it assumes that this example from page 230 has been run before 12 | 13 | ```python3 14 | from sklearn.feature_extraction.text import CountVectorizer 15 | import nltk 16 | from nltk.tokenize import TreebankWordTokenizer 17 | import regex 18 | 19 | class MyTokenizer: 20 | def tokenize(self, text): 21 | result = [] 22 | word = r"\p{letter}" 23 | for sent in nltk.sent_tokenize(text): 24 | tokens = TreebankWordTokenizer().tokenize(sent) 25 | tokens = [t for t in tokens if regex.search(word, t)] 26 | result += tokens 27 | return result 28 | ``` 29 | 30 | **and** that you create an "instance" of this class with the following command: 31 | ```python3 32 | mytokenizer = MyTokenizer() 33 | ``` 34 | 35 | Then, the command ```cv = CountVectorizer(tokenizer=mytokenizer.tokenize)``` will run as expected. 36 | 37 | -------------------------------------------------------------------------------- /2021/day4/exercises-2/possible-solution-exercise-2.md: -------------------------------------------------------------------------------- 1 | 2 | ## Exercise 2: From text to features - possible solutions 3 | 4 | ### Trying out different preprocessing steps 5 | 6 | Load the data... 
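
Before running the snippet below, it may help to check that the `articles` folder from the morning exercise can actually be found from the current working directory (a quick, hypothetical sanity check, not part of the original solution):

```python
from glob import glob

# should print a number > 0 if the articles folder is in place
print(len(glob('articles/*/Infowars/*')))
```
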
7 | 8 | ```python 9 | from glob import glob 10 | infowarsfiles = glob('articles/*/Infowars/*') 11 | documents = [] 12 | for filename in infowarsfiles: 13 | with open(filename) as f: 14 | documents.append(f.read()) 15 | ``` 16 | Let's inspect the data and start some pre-processing/cleaning steps: 17 | 18 | ```python 19 | ## From text to features. 20 | documents[17] # print a random article to inspect. 21 | ## Typical cleaning-up steps: 22 | from string import punctuation 23 | documents = [doc.replace('\n\n', '') for doc in documents] # remove (double) line breaks 24 | documents = ["".join([w for w in doc if w not in punctuation]) for doc in documents] # remove punctuation 25 | documents = [doc.lower() for doc in documents] # convert to lower case 26 | documents = [" ".join(doc.split()) for doc in documents] # remove double spaces by splitting the strings into words and joining these words again 27 | 28 | documents[17] # print the same article to see whether the changes are in line with what you want 29 | ``` 30 | 31 | Removing stopwords: 32 | 33 | ```python 34 | from nltk.corpus import stopwords 35 | mystopwords = set(stopwords.words('english')) # use default NLTK stopword list; alternatively: 36 | # mystopwords = set(open('mystopwordfile.txt').readlines()) # read stopword list from a text file with one stopword per line 37 | documents = [" ".join([w for w in doc.split() if w not in mystopwords]) for doc in documents] 38 | documents[7] 39 | ``` 40 | Using N-grams as features: 41 | 42 | ```python 43 | import nltk 44 | documents_bigrams = [["_".join(tup) for tup in nltk.ngrams(doc.split(),2)] for doc in documents] # creates bigrams 45 | documents_bigrams[7][:5] # inspect the results... 46 | 47 | # maybe we want both unigrams and bigrams in the feature set? 48 | 49 | assert len(documents)==len(documents_bigrams) 50 | 51 | documents_uniandbigrams = [] 52 | for a,b in zip([doc.split() for doc in documents],documents_bigrams): 53 | documents_uniandbigrams.append(a + b) 54 | 55 | # and let's inspect the outcomes again. 56 | documents_uniandbigrams[7] 57 | len(documents_uniandbigrams[7]),len(documents_bigrams[7]),len(documents[7].split()) 58 | ``` 59 | Or, if you want to inspect collocations: 60 | 61 | ```python 62 | text = [nltk.Text(tkn for tkn in doc.split()) for doc in documents] 63 | text[7].collocations(num=10) 64 | ``` 65 | 66 | ---------- 67 | 68 | ### Vectorize the data 69 | 70 | ```python 71 | from glob import glob 72 | import random 73 | 74 | def read_data(listofoutlets): 75 | texts = [] 76 | labels = [] 77 | for label in listofoutlets: 78 | for file in glob(f'articles/*/{label}/*'): 79 | with open(file) as f: 80 | texts.append(f.read()) 81 | labels.append(label) 82 | return texts, labels 83 | 84 | X, y = read_data(['Infowars', 'BBC']) # choose your own news outlets 85 | 86 | ``` 87 | 88 | 89 | ```python 90 | # split the dataset into a train and a test sample 91 | from sklearn.model_selection import train_test_split 92 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) 93 | ``` 94 | 95 | Define some vectorizers. 96 | You can try out different variations: 97 | - `count` versus `tfidf` 98 | - with / without a stopword list 99 | - with / without pruning 100 | 101 | 102 | ```python 103 | from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer 104 | 105 | myvectorizer = CountVectorizer(stop_words=mystopwords) # you can further modify this yourself. 106 | 107 | # Fit the vectorizer, and transform. 
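# Note: the vectorizer is fitted on the training texts only, so the vocabulary
# is learned from X_train; X_test is then transformed with that same vocabulary,
# which avoids leaking information from the test split into the features.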
108 | X_features_train = myvectorizer.fit_transform(X_train) 109 | X_features_test = myvectorizer.transform(X_test) 110 | 111 | ``` 112 | ### Build a simple classifier 113 | 114 | Now, lets build a simple classifier and predict outlet based on textual features: 115 | 116 | ```python 117 | from sklearn.naive_bayes import MultinomialNB 118 | from sklearn.metrics import accuracy_score 119 | from sklearn.metrics import classification_report 120 | 121 | model = MultinomialNB() 122 | model.fit(X_features_train, y_train) 123 | y_pred = model.predict(X_features_test) 124 | 125 | print(f"Accuracy : {accuracy_score(y_test, y_pred)}") 126 | print(classification_report(y_test, y_pred)) 127 | 128 | ``` 129 | 130 | Can you improve this classifier when using different vectorizers? 131 | 132 | ---- 133 | 134 | 135 | 136 | 137 | *hint: if you want to include n-grams as feature input, add the following argument to your vectorizer:* 138 | 139 | ```python 140 | myvectorizer= CountVectorizer(analyzer=lambda x:x) 141 | ``` 142 | -------------------------------------------------------------------------------- /2021/day4/literature-examples.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | # Literature with examples about preprocessing steps 4 | 5 | [Example unsupervised](http://vanatteveldt.com/p/jacobi2015_lda.pdf) 6 | 7 | [Example supervised](https://www.tandfonline.com/doi/full/10.1080/19312458.2018.1455817) 8 | -------------------------------------------------------------------------------- /2021/day5/01-MachineLearning_Introduction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Modeling & Machine Learning - an introduction\n", 8 | "\n", 9 | "\n", 10 | "## Where are we in the course?\n", 11 | "\n", 12 | "After making progress in data understanding and preparation we will discuss modelling and have a brief introduction to Machine Learning. \n", 13 | "\n", 14 | "\n", 15 | "## Where are we in the data analysis process?\n", 16 | "\n", 17 | "\"Source:*Source: Wikipedia*\n", 18 | "\n" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": null, 24 | "metadata": {}, 25 | "outputs": [], 26 | "source": [] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "# ML versus statistics\n", 33 | "\n", 34 | " Source: [sandserif](https://www.instagram.com/sandserifcomics/)" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "## Machine Learning in Python\n", 47 | "\n", 48 | "\n", 49 | "\n", 50 | "We will use the package [scikit-learn](http://scikit-learn.org/) to do Machine Learning in Python. This is one of the most widely used packages in Python for Machine Learning, and is quite flexible and complete when it comes to the types of models it can implement, or data that it can use.\n", 51 | "\n", 52 | "\n", 53 | "\n", 54 | "Scikit-learn is usually installed together with Python in your machine by anaconda. Before continuing, however, make sure that you have the latest version installed. To do so, go to Terminal on Mac, or Command Prompt/Line in Windows (run as an administrator), and type ```conda install scikit-learn``` . Conda will check if scikit-learn needs to be updated. 
\n", 55 | "\n", 56 | "* **Note:** [This video](https://www.youtube.com/watch?v=_wCs2vvBCTM) contains more information on how to update or install packages with conda.\n", 57 | "\n", 58 | "\n" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | "# The different types of ML\n", 66 | "\n", 67 | "Let's discuss the Sklearn Machine Learning Map:\n", 68 | "https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 | "metadata": {}, 75 | "outputs": [], 76 | "source": [] 77 | } 78 | ], 79 | "metadata": { 80 | "kernelspec": { 81 | "display_name": "Python 3", 82 | "language": "python", 83 | "name": "python3" 84 | }, 85 | "language_info": { 86 | "codemirror_mode": { 87 | "name": "ipython", 88 | "version": 3 89 | }, 90 | "file_extension": ".py", 91 | "mimetype": "text/x-python", 92 | "name": "python", 93 | "nbconvert_exporter": "python", 94 | "pygments_lexer": "ipython3", 95 | "version": "3.8.5" 96 | }, 97 | "latex_envs": { 98 | "bibliofile": "biblio.bib", 99 | "cite_by": "apalike", 100 | "current_citInitial": 1, 101 | "eqLabelWithNumbers": true, 102 | "eqNumInitial": 0 103 | } 104 | }, 105 | "nbformat": 4, 106 | "nbformat_minor": 1 107 | } 108 | -------------------------------------------------------------------------------- /2021/day5/02-Unsupervised-Machine-Learning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Unsupervised Machine Learning\n", 8 | "\n", 9 | "Some examples for our discussion:\n", 10 | "* *With \"numbers\"*: [Clustering with Scikit-Learn's KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans)\n", 11 | "* *With \"text\"*: [LDA topic models with Gensim](https://radimrehurek.com/gensim/models/ldamodel.html) and [PyLDAVis](https://github.com/bmabey/pyLDAvis)" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": null, 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [] 20 | } 21 | ], 22 | "metadata": { 23 | "kernelspec": { 24 | "display_name": "Python 3", 25 | "language": "python", 26 | "name": "python3" 27 | }, 28 | "language_info": { 29 | "codemirror_mode": { 30 | "name": "ipython", 31 | "version": 3 32 | }, 33 | "file_extension": ".py", 34 | "mimetype": "text/x-python", 35 | "name": "python", 36 | "nbconvert_exporter": "python", 37 | "pygments_lexer": "ipython3", 38 | "version": "3.8.5" 39 | } 40 | }, 41 | "nbformat": 4, 42 | "nbformat_minor": 4 43 | } 44 | -------------------------------------------------------------------------------- /2021/day5/03-Supervised-Machine-Learning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Supervised Machine Learning\n", 8 | "\n", 9 | "An agenda for our discussion:\n", 10 | "* SML vs. 
Statistical testing - same thing with different names?\n", 11 | "* From SPSS to Python via something R-like: statsmodels\n", 12 | "* Choosing and building a model\n", 13 | "* Using a model for predictions\n", 14 | "* A step back: how do we evaluate a model?\n", 15 | "* Understanding what the model is doing: Explainable AI" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": null, 21 | "metadata": {}, 22 | "outputs": [], 23 | "source": [] 24 | } 25 | ], 26 | "metadata": { 27 | "kernelspec": { 28 | "display_name": "Python 3", 29 | "language": "python", 30 | "name": "python3" 31 | }, 32 | "language_info": { 33 | "codemirror_mode": { 34 | "name": "ipython", 35 | "version": 3 36 | }, 37 | "file_extension": ".py", 38 | "mimetype": "text/x-python", 39 | "name": "python", 40 | "nbconvert_exporter": "python", 41 | "pygments_lexer": "ipython3", 42 | "version": "3.8.5" 43 | } 44 | }, 45 | "nbformat": 4, 46 | "nbformat_minor": 4 47 | } 48 | -------------------------------------------------------------------------------- /2021/day5/README.md: -------------------------------------------------------------------------------- 1 | # Day 5: Building and testing models 2 | 3 | ## Key topics: 4 | 5 | * What is machine learning? 6 | * Statistical testing vs. (?) Machine learning 7 | * Unsupervised machine learning examples: with numbers, and with text 8 | * Going deeper into supervised machine learning 9 | * Building models 10 | * Evaluating models 11 | * Explaining models 12 | 13 | 14 | ## (Preliminary) Agenda: 15 | 16 | | Time | Topic | 17 | |---------------|-------------------------------------------------------------------------------| 18 | | 09:30 - 11:00 | Basics of (supervised) machine learning & comparison with statistical testing | 19 | | 11:00 - 12:00 | Practice on (your own) data | 20 | | 13:00 - 14:00 | Model comparison & explainability | 21 | | 14:00 - 15:30 | Practice on (your own) data | 22 | |15:30 - 16:30 | Teaching presentations & general Q&A | 23 | |16:30 - 17:00 | Closure / next steps | -------------------------------------------------------------------------------- /2021/installation.md: -------------------------------------------------------------------------------- 1 | # Getting started 2 | 3 | ## Installing Python 4 | 5 | Python is a language and not a program and thus there are many different 6 | ways you can run code in Python. For the course, it is important that 7 | you have Python 3 installed and running on your machine and be able to 8 | run Jupyter Notebooks as well as install packages. There are different 9 | ways to install Python on your computer. We will provide instructions on 10 | two widely used solutions: 11 | 12 | Using Anaconda or using Python natively (we will discuss pros and cons 13 | of them in our course as well). Whatever solution you go for, make sure 14 | it works on your system. 15 | 16 | ### Option 1: installing Python via Anaconda 17 | 18 | #### Windows 19 | 20 | - Open [https://www.anaconda.com/download/ ](https://www.anaconda.com/download/)with 21 | your web browser. 22 | 23 | - Download the Python 3.7 (or later) installer for Windows. 24 | 25 | - Install Python 3.7 (or later) using all of the defaults for 26 | installation but **make sure to check Make Anaconda the default 27 | Python**. 28 | 29 | #### Mac OS X 30 | 31 | - Open [https://www.anaconda.com/download/](https://www.anaconda.com/download/) with 32 | your web browser. 33 | 34 | - Download the Python 3.7 (or later) installer for OS X. 
35 | 36 | - Install Python 3.7 (or later) using all of the defaults for 37 | installation  38 | 39 | - 40 | #### Linux 41 | 42 | - Open [https://www.anaconda.com/download/](https://www.anaconda.com/download/) with 43 | your web browser. 44 | 45 | - Download the Python 3.7 (or later) installer for Linux. 46 | 47 | - Install Python 3.7 (or later) using all of the defaults for 48 | installation. (Installation requires using the shell. If you aren't 49 | comfortable doing this, come to one of the consultation hours and we 50 | will help you) 51 | 52 | - Open a terminal window. 53 | 54 | - Type bash Anaconda- and then press tab. The name of the file you 55 | just downloaded should appear. 56 | 57 | - Press enter. You will follow the text-only prompts. When there is a 58 | colon at the bottom of the screen press the down arrow to move down 59 | through the text. Type yes and press enter to approve the license. 60 | Press enter to approve the default location for the files. Type yes 61 | and press enter to prepend Anaconda to your PATH (this makes the 62 | Anaconda distribution the default Python). 63 | 64 | #### How do I know if the installation worked? 65 | 66 | Open the **Terminal** (in a Mac or Linux computer) or **Anaconda 67 | Prompt** (in Windows), and type python. 68 | 69 | Python should start, and should say "3.7" (perhaps 3.8, 3.9... etc.) and 70 | "Continuum Analytics" or "Anaconda" somewhere in the header. 71 | 72 | To quit Python, just type exit() and press enter 73 | 74 | See example below:  75 | 76 | ![AnacondaPythonExample](./media/pythoninterpreter.png) 77 | 78 | *Not sure how to open Terminal/Anaconda Prompt?* 79 | 80 | - [Mac OSX instructions on YouTube - online 81 | tutorial](https://www.youtube.com/watch?v=zw7Nd67_aFw) 82 | 83 | - For Windows, please use Anaconda Prompt (search for it in your 84 | computer) 85 | 86 | Still unsure? Check the section below. 87 | 88 | #### Check out some tutorials online 89 | 90 | There are online tutorials offering specific advice 91 | for [Windows](https://www.youtube.com/watch?v=xxQ0mzZ8UvA) and [Mac 92 | OSX](https://www.youtube.com/watch?v=TcSAln46u9U). 93 | 94 | Please note that you need Python 3.9, and the Anaconda website may look 95 | a bit different from what you see in the video. 96 | 97 | Here is another [video 98 | tutorial](https://www.youtube.com/watch?v=YJC6ldI3hWk) with 99 | information on how to install and use Anaconda. It also covers a lot of 100 | additional information that you will not need in the course. For now, as 101 | long as you managed to get Anaconda installed for now - you're more than 102 | OK\! 103 | 104 | ### Option 2: using Python natively 105 | *based on Chapter 1 of Atteveldt, Trilling & Arcila Calderón (2021)* 106 | 107 | Oftentimes, you will have Python already installed on your computer. 108 | There are different ways to check if you already have it. For example, 109 | if you are using a Mac, you can open your system terminal[^1] and type 110 | python -V or python –version and you will get a message with the version 111 | that is installed by default on your computer. 112 | 113 | If you do not have already Python on your computer, the first thing will 114 | be to download it and install it from its official 115 | [webpage](https://www.python.org/downloads/), selecting the right 116 | software according to your operating system (Windows, Linux/UNIX, Mac OS 117 | X).  118 | 119 | During the installation, additional features will be installed. 
They 120 | include *pip*, a basic tool that you will need to install more 121 | packages for Python. In addition, you might be asked if you want to add 122 | Python to your path, which means that you set the path variable in order 123 | to call the executable software from your system terminal just by typing 124 | the word python. We recommend selecting this option. 125 | 126 | 127 | #### Installing Jupyter 128 | 129 | In the course, we will run our Python code using Jupyter Notebooks. They 130 | run as a web application that allows you to create documents that 131 | contain code and text (and also equations and visualizations). We will 132 | discuss other options to run Python in the course. 133 | 134 | Jupyter is already installed if you went for option 1 and installed 135 | Python via Anaconda. If you are using the native installation, you will 136 | need to install it by running pip install notebook on your system’s 137 | terminal. 138 | 139 | You can start Jupyter Notebook by typing jupyter notebook in your 140 | system’s terminal (or in Anaconda Prompt if you installed Python via 141 | Anaconda on a Windows computer). 142 | 143 | There is a fancier and more modern environment called JupyterLab -- using JupyterLab instead of plain Jupyter Notebooks is fine as well. 144 | 145 | [^1]: Not sure what it is and how to open the Terminal? Have a look at this short 146 | [video](https://www.youtube.com/watch?v=zw7Nd67_aFw). 147 | -------------------------------------------------------------------------------- /2021/media/boumanstrilling2016.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2021/media/boumanstrilling2016.pdf -------------------------------------------------------------------------------- /2021/media/mannetje.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2021/media/mannetje.png -------------------------------------------------------------------------------- /2021/media/pythoninterpreter.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2021/media/pythoninterpreter.png -------------------------------------------------------------------------------- /2021/media/sparse_dense.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2021/media/sparse_dense.png -------------------------------------------------------------------------------- /2021/references.bib: -------------------------------------------------------------------------------- 1 | @article{Boumans2016, 2 | author = {Boumans, Jelle W. 
and Trilling, Damian}, 3 | doi = {10.1080/21670811.2015.1096598}, 4 | file = {:Users/damian/Dropbox/uva/literatuur-mendeley/Boumans, Trilling{\_}2016.pdf:pdf}, 5 | issn = {2167-0811}, 6 | journal = {Digital Journalism}, 7 | number = {1}, 8 | pages = {8--23}, 9 | title = {Taking stock of the toolkit: An overview of relevant autmated content analysis approaches and techniques for digital journalism scholars}, 10 | volume = {4}, 11 | year = {2016} 12 | } 13 | -------------------------------------------------------------------------------- /2021/teachingtips.md: -------------------------------------------------------------------------------- 1 | # Teachingtips 2 | 3 | This document contains some tips from experience to help you avoiding the most common pitfalls when getting started to teach with Python. 4 | 5 | 6 | ## Avoiding technical problems 7 | - Make everyone install (and test!) the environment *before* class starts. Avoid having to deal with "I can't open the notebook!"- or "It says 'module not found'!"-questions during class. 8 | - **The first session is crucial.** Expect that during the first session, there will be students with technical problems. It is best to teach this session with two teachers, such that one can deal with individual problems and the other can continue with the rest of the group. 9 | - Be aware that even though Python is largely platform-independent (yeah!!!), there can be subtle differences between using it on typical unix-based systems (MacOS or Linux) versus Windows. Think of `/home/damian/pythoncourse` vs `C:\\Users\\damian\\pythoncourse` (note the double (!) backslash!), but also about some modules that may have different external requirements (recent experience: try `pip install geopandas` on different systems!) 10 | - You can consider pointing students to [Google Colab](https://colab.research.google.com/) as a fallback option if they cannot get things to work. 11 | 12 | 13 | ## Grading and rules for assignments 14 | - Make clear from the beginning that it is fine, even encouraged, to get ideas from sites like https://stackoverflow.com . Emphasize, though, that copy-pasting code and presenting it as own work is considered plagiarism, just like in written assignments. A simple comment line like `# the following cell is [copied from/adapted from/inspired by] URL` is enough to prevent this. Document this rule, for instance in the course manual. 15 | 16 | 17 | -------------------------------------------------------------------------------- /2023/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2023/.DS_Store -------------------------------------------------------------------------------- /2023/Installationinstruction.md: -------------------------------------------------------------------------------- 1 | # Installing Jupyter Lab 2 | 3 | To ensure a smooth start for everyone, we ask you to download and install the Jupyter-Lab Desktop program: https://github.com/jupyterlab/jupyterlab-desktop/tree/master 4 | 5 | Please scroll down and download the respective installer for your system (Windows/Mac/Linux). When opening JupyterLab for the first time, you will see a small message at the bottom (see screenshot below). Please then click on “install using the bundled installer” to start the installation process, restart JupyterLab, and we should be good to go! 
If you encounter errors, do not worry, we will have enough time on Monday to make sure everyone — regardless of their computer or operating system — will be operational! -------------------------------------------------------------------------------- /2023/Teachingtips.md: -------------------------------------------------------------------------------- 1 | # Teachingtips 2 | This document contains some tips from experience to help you avoiding the most common pitfalls when getting started to teach with Python. 3 | 4 | ## Avoiding technical problems 5 | 6 | Make everyone install (and test!) the environment before class starts. Avoid having to deal with "I can't open the notebook!"- or "It says 'module not found'!"-questions during class. 7 | 8 | The first session is crucial. Expect that during the first session, there will be students with technical problems. It is best to teach this session with two teachers, such that one can deal with individual problems and the other can continue with the rest of the group. 9 | 10 | Be aware that even though Python is largely platform-independent (yeah!!!), there can be subtle differences between using it on typical unix-based systems (MacOS or Linux) versus Windows. Think of /home/damian/pythoncourse vs C:\\Users\\damian\\pythoncourse (note the double (!) backslash!), but also about some modules that may have different external requirements (recent experience: try pip install geopandas on different systems!) 11 | 12 | You can consider pointing students to Google Colab as a fallback option if they cannot get things to work. 13 | 14 | ## Grading and rules for assignments 15 | 16 | Make clear from the beginning that it is fine, even encouraged, to get ideas from sites like https://stackoverflow.com . Emphasize, though, that copy-pasting code and presenting it as own work is considered plagiarism, just like in written assignments. A simple comment line like # the following cell is [copied from/adapted from/inspired by] URL is enough to prevent this. Document this rule, for instance in the course manual. -------------------------------------------------------------------------------- /2023/day2/Day 2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2023/day2/Day 2.pdf -------------------------------------------------------------------------------- /2023/day2/Notebooks/ExcercisesPandas.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "guilty-steam", 6 | "metadata": {}, 7 | "source": [ 8 | "## Excercises pandas\n", 9 | "\n", 10 | "Let's practice data exploration and wrangling in Pandas. \n", 11 | "\n", 12 | "We will work with data collected through the Twitter API (RIP, more on API's and how quickly we're loosing them next week ;)). Few months ago (last days when the API was alive) we collected different tweets (for teaching purposes). 
We have also already run a sentiment analysis on these tweets and have saved it in a separate file.\n", 13 | "\n", 14 | "We have two datasets per account/topic:\n", 15 | "* One datasets with tweets (public tweets by account or with a hashtag)\n", 16 | "* One dataset with sentiment of those tweets (with three sentiment scores using veeery basic snetiment tool - more on how bad it is in two weeks ;))\n", 17 | "\n", 18 | "We want to see how sentiment changes over time, compare number of positive and negative tweets and analyze the relation between sentiment and engagement with the tweets. You can selest an account/topic that seems interesting to you.\n", 19 | "\n", 20 | "We want to prepare the dataset for analysis:\n", 21 | "\n", 22 | "**Morning**\n", 23 | "\n", 24 | "* Data exploration\n", 25 | " * Check columns, data types, missing values, descriptives for numeric variables measuring engagement and sentiment, value_counts for relevant categorical variables\n", 26 | "* Handling missing values and data types\n", 27 | " * Handle missing values in variables of interest: number of likes and retweets - what can nan's mean?\n", 28 | " * Make sure created_at has the right format (to use it for aggregation later)\n", 29 | "* Creating necessary variables (sentiment)\n", 30 | " * Overall measure of sentiment - create it from positive and negative\n", 31 | " * Binary variable (positive or negative tweet) - Tip: Write a function that \"recodes\" the sentiment column\n", 32 | " \n", 33 | "\n", 34 | "**Afternoon**\n", 35 | "\n", 36 | "* Merging the files (tweets with sentiment)\n", 37 | " * Make sure the columns you merge match and check how to merge\n", 38 | "* Agrregating the files per month\n", 39 | " * Tip: Create a column for month by transforming the date column. 
Remember that the date column needs the right format first!\n", 40 | " `df['month'] = df['date_dt_column'].dt.strftime('%Y-%m')` \n", 41 | "* Visualisations:\n", 42 | " * Visualise different columns of the tweet dataset (change of sentiment over time, sentiment, engagement, relation between sentiment and engagement)\n", 43 | " * But *more fun*: use your own data to play with visualisations" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 4, 49 | "id": "accredited-adolescent", 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "import pandas as pd" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": null, 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [ 62 | "df_jsonl = pd.read_json('filename', lines=True) #put your filename as filename\n" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": null, 68 | "metadata": {}, 69 | "outputs": [], 70 | "source": [ 71 | "#run this cell \n", 72 | "\n", 73 | "def get_public_metrics(row):\n", 74 | " if 'public_metrics' in row.keys():\n", 75 | " if type(row['public_metrics']) == dict:\n", 76 | " for key, value in row['public_metrics'].items():\n", 77 | " row['metric_' + str(key)] = value\n", 78 | " return row\n", 79 | "\n", 80 | "def get_tweets(df):\n", 81 | " if 'data' not in df.columns:\n", 82 | " return None\n", 83 | " results = pd.DataFrame()\n", 84 | " for item in df['data'].values.tolist():\n", 85 | " results = pd.concat([results, pd.DataFrame(item)])\n", 86 | " \n", 87 | " results = results.apply(get_public_metrics, axis=1)\n", 88 | " \n", 89 | " results = results.reset_index()\n", 90 | " del results['index']\n", 91 | " \n", 92 | " return results" 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": null, 98 | "metadata": {}, 99 | "outputs": [], 100 | "source": [ 101 | "#unpack the tweets - this gives you dataframe with tweets\n", 102 | "tweets = get_tweets(df_jsonl)" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": null, 108 | "metadata": {}, 109 | "outputs": [], 110 | "source": [] 111 | } 112 | ], 113 | "metadata": { 114 | "kernelspec": { 115 | "display_name": "Python 3", 116 | "language": "python", 117 | "name": "python3" 118 | }, 119 | "language_info": { 120 | "codemirror_mode": { 121 | "name": "ipython", 122 | "version": 3 123 | }, 124 | "file_extension": ".py", 125 | "mimetype": "text/x-python", 126 | "name": "python", 127 | "nbconvert_exporter": "python", 128 | "pygments_lexer": "ipython3", 129 | "version": "3.9.1" 130 | } 131 | }, 132 | "nbformat": 4, 133 | "nbformat_minor": 5 134 | } 135 | -------------------------------------------------------------------------------- /2023/day2/Notebooks/PandasIntroduction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Introduction to Pandas\n", 8 | "\n", 9 | "*Based on DA and CCS1 materials*" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "Before using it, however, we need to import it." 
17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": null, 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "import pandas as pd" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": null, 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "pd.__version__" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": null, 40 | "metadata": {}, 41 | "outputs": [], 42 | "source": [] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "## Reading data into Pandas\n", 49 | "\n" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": null, 55 | "metadata": {}, 56 | "outputs": [], 57 | "source": [ 58 | "\n", 59 | "websites = [\n", 60 | " {'site': 'Twitter', 'type': 'Social Media', 'views': 10000, 'active_users': 200000},\n", 61 | " {'site': 'Facebook', 'type': 'Social Media', 'views': 35000, 'active_users': 500000},\n", 62 | " {'site': 'NYT', 'type': 'News media', 'views': 78000, 'active_users': 156000}, \n", 63 | " {'site': 'YouTube', 'type': 'Video platform', 'views': 18000, 'active_users': 289000},\n", 64 | " {'site': 'Vimeo', 'type': 'Video platform', 'views': 300, 'active_users': 1580},\n", 65 | " {'site': 'USA Today', 'type': 'News media', 'views': 4800, 'active_users': 5608},\n", 66 | "]" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": {}, 73 | "outputs": [], 74 | "source": [ 75 | "websites" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [ 84 | "websites=pd.DataFrame(websites)" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": null, 90 | "metadata": {}, 91 | "outputs": [], 92 | "source": [ 93 | "websites" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": null, 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [ 102 | "websites.to_csv('websites.csv', index=False)" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": null, 108 | "metadata": {}, 109 | "outputs": [], 110 | "source": [ 111 | "#Read from a csv\n", 112 | "df_websites = pd.read_csv('websites.csv')" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": null, 118 | "metadata": {}, 119 | "outputs": [], 120 | "source": [ 121 | "df_websites" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "## Exploring this dataset" 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "metadata": {}, 134 | "source": [ 135 | "Which columns are available?" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": null, 141 | "metadata": {}, 142 | "outputs": [], 143 | "source": [ 144 | "df_websites.columns" 145 | ] 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "metadata": {}, 150 | "source": [ 151 | "Are there missing values?" 
152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": null, 157 | "metadata": {}, 158 | "outputs": [], 159 | "source": [ 160 | "df_websites.dtypes" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": null, 166 | "metadata": {}, 167 | "outputs": [], 168 | "source": [ 169 | "df_websites.isna().sum()" 170 | ] 171 | }, 172 | { 173 | "cell_type": "markdown", 174 | "metadata": {}, 175 | "source": [ 176 | "Let's see the first few values" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": null, 182 | "metadata": {}, 183 | "outputs": [], 184 | "source": [ 185 | "df_websites.head()" 186 | ] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "metadata": {}, 191 | "source": [ 192 | "And now the last few values" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": null, 198 | "metadata": {}, 199 | "outputs": [], 200 | "source": [ 201 | "df_websites.tail()" 202 | ] 203 | }, 204 | { 205 | "cell_type": "markdown", 206 | "metadata": {}, 207 | "source": [ 208 | "Let's look at some descriptive statistics..." 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": null, 214 | "metadata": {}, 215 | "outputs": [], 216 | "source": [ 217 | "df_websites.describe()" 218 | ] 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "metadata": {}, 223 | "source": [ 224 | "Only numerical variables appear above... let's see the frequencies for the non-numerical variables" 225 | ] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "execution_count": null, 230 | "metadata": {}, 231 | "outputs": [], 232 | "source": [ 233 | "df_websites['type'].describe()" 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "metadata": {}, 239 | "source": [ 240 | "This is not very informative... let's try to get the counts per value of the column" 241 | ] 242 | }, 243 | { 244 | "cell_type": "code", 245 | "execution_count": null, 246 | "metadata": {}, 247 | "outputs": [], 248 | "source": [ 249 | "df_websites['type'].value_counts()" 250 | ] 251 | }, 252 | { 253 | "cell_type": "markdown", 254 | "metadata": {}, 255 | "source": [ 256 | "Now let's get descriptive statistics per group:" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": null, 262 | "metadata": {}, 263 | "outputs": [], 264 | "source": [ 265 | "df_websites.groupby('type').describe()" 266 | ] 267 | }, 268 | { 269 | "cell_type": "markdown", 270 | "metadata": {}, 271 | "source": [ 272 | "This doesn't look so easy to read. Let's transpose this output\n", 273 | "\n", 274 | "By transposing a dataframe we move the rows data to columns and the columns data to the rows. 
" 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": null, 280 | "metadata": {}, 281 | "outputs": [], 282 | "source": [ 283 | "df_websites.groupby('type').describe().transpose()" 284 | ] 285 | }, 286 | { 287 | "cell_type": "markdown", 288 | "metadata": {}, 289 | "source": [ 290 | "## Subsetting and slicing\n", 291 | "\n", 292 | "* Let's say I just want some of the **columns** that there are in the dataset\n", 293 | "* Or that I just want some of the **rows** that are in the dataset" 294 | ] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "metadata": {}, 299 | "source": [ 300 | "### Slicing by column" 301 | ] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "execution_count": null, 306 | "metadata": {}, 307 | "outputs": [], 308 | "source": [ 309 | "df_websites.columns" 310 | ] 311 | }, 312 | { 313 | "cell_type": "code", 314 | "execution_count": null, 315 | "metadata": {}, 316 | "outputs": [], 317 | "source": [ 318 | "df_websites[['type', \"views\"]]" 319 | ] 320 | }, 321 | { 322 | "cell_type": "code", 323 | "execution_count": null, 324 | "metadata": {}, 325 | "outputs": [], 326 | "source": [ 327 | "df_websites" 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": null, 333 | "metadata": {}, 334 | "outputs": [], 335 | "source": [ 336 | "type_views = df_websites[['site','views']]" 337 | ] 338 | }, 339 | { 340 | "cell_type": "code", 341 | "execution_count": null, 342 | "metadata": {}, 343 | "outputs": [], 344 | "source": [ 345 | "type_views" 346 | ] 347 | }, 348 | { 349 | "cell_type": "code", 350 | "execution_count": null, 351 | "metadata": {}, 352 | "outputs": [], 353 | "source": [ 354 | "df_websites" 355 | ] 356 | }, 357 | { 358 | "cell_type": "markdown", 359 | "metadata": {}, 360 | "source": [ 361 | "### Slicing by row (value)\n", 362 | "\n", 363 | "Filtering dataset based on values in columns" 364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "execution_count": null, 369 | "metadata": {}, 370 | "outputs": [], 371 | "source": [ 372 | "df_websites[df_websites['type']=='Social Media']" 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": null, 378 | "metadata": {}, 379 | "outputs": [], 380 | "source": [ 381 | "df_websites[df_websites['type']!='News media']" 382 | ] 383 | }, 384 | { 385 | "cell_type": "markdown", 386 | "metadata": {}, 387 | "source": [ 388 | "I want to have data that is not about News Media **and** with more than 12,000 views" 389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "execution_count": null, 394 | "metadata": {}, 395 | "outputs": [], 396 | "source": [ 397 | "df_websites[(df_websites['type']!='News media') & (df_websites['views'] > 12000)]" 398 | ] 399 | }, 400 | { 401 | "cell_type": "markdown", 402 | "metadata": {}, 403 | "source": [ 404 | "I want to have data that is **either** not about News Media **or** with more than 12,000 views" 405 | ] 406 | }, 407 | { 408 | "cell_type": "code", 409 | "execution_count": null, 410 | "metadata": {}, 411 | "outputs": [], 412 | "source": [ 413 | "df_websites[(df_websites['type']!='News media') | (df_websites['views'] > 12000)]" 414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": null, 419 | "metadata": {}, 420 | "outputs": [], 421 | "source": [ 422 | "social_media = df_websites[df_websites['type']=='Social Media']" 423 | ] 424 | }, 425 | { 426 | "cell_type": "code", 427 | "execution_count": null, 428 | "metadata": {}, 429 | "outputs": [], 430 | "source": [ 431 | "social_media" 432 | ] 433 | }, 434 | { 435 | 
"cell_type": "code", 436 | "execution_count": null, 437 | "metadata": {}, 438 | "outputs": [], 439 | "source": [ 440 | "social_media.describe()" 441 | ] 442 | }, 443 | { 444 | "cell_type": "code", 445 | "execution_count": null, 446 | "metadata": {}, 447 | "outputs": [], 448 | "source": [ 449 | "df_websites[df_websites['type']=='Social Media'].describe()" 450 | ] 451 | }, 452 | { 453 | "cell_type": "code", 454 | "execution_count": null, 455 | "metadata": {}, 456 | "outputs": [], 457 | "source": [ 458 | "socialmediaviews = df_websites[df_websites['type']=='Social Media'][['type', 'views']]" 459 | ] 460 | }, 461 | { 462 | "cell_type": "code", 463 | "execution_count": null, 464 | "metadata": {}, 465 | "outputs": [], 466 | "source": [ 467 | "socialmediaviews" 468 | ] 469 | }, 470 | { 471 | "cell_type": "markdown", 472 | "metadata": {}, 473 | "source": [ 474 | "## Saving the dataframe\n", 475 | "\n", 476 | "Formats you can use : see https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html" 477 | ] 478 | }, 479 | { 480 | "cell_type": "markdown", 481 | "metadata": {}, 482 | "source": [ 483 | "CSV:" 484 | ] 485 | }, 486 | { 487 | "cell_type": "code", 488 | "execution_count": null, 489 | "metadata": {}, 490 | "outputs": [], 491 | "source": [ 492 | "df_websites.to_csv('websites.csv')" 493 | ] 494 | }, 495 | { 496 | "cell_type": "markdown", 497 | "metadata": {}, 498 | "source": [ 499 | "Pickle:" 500 | ] 501 | }, 502 | { 503 | "cell_type": "code", 504 | "execution_count": null, 505 | "metadata": {}, 506 | "outputs": [], 507 | "source": [ 508 | "df_websites.to_pickle('websites.pkl')" 509 | ] 510 | } 511 | ], 512 | "metadata": { 513 | "anaconda-cloud": {}, 514 | "kernelspec": { 515 | "display_name": "Python 3", 516 | "language": "python", 517 | "name": "python3" 518 | }, 519 | "language_info": { 520 | "codemirror_mode": { 521 | "name": "ipython", 522 | "version": 3 523 | }, 524 | "file_extension": ".py", 525 | "mimetype": "text/x-python", 526 | "name": "python", 527 | "nbconvert_exporter": "python", 528 | "pygments_lexer": "ipython3", 529 | "version": "3.9.1" 530 | } 531 | }, 532 | "nbformat": 4, 533 | "nbformat_minor": 4 534 | } 535 | -------------------------------------------------------------------------------- /2023/day3/API.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "e9014de5", 6 | "metadata": { 7 | "slideshow": { 8 | "slide_type": "slide" 9 | } 10 | }, 11 | "source": [ 12 | "# API\n", 13 | "\n", 14 | "\n", 15 | "Author: Justin Chun-ting Ho\n", 16 | "\n", 17 | "Date: 27 Nov 2023\n", 18 | "\n", 19 | "Credit: Some sections are adopted from the slides prepared by Damian Trilling" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "id": "510e823a", 25 | "metadata": { 26 | "slideshow": { 27 | "slide_type": "slide" 28 | } 29 | }, 30 | "source": [ 31 | "### Beyond files\n", 32 | "\n", 33 | "- we can write anything to files\n", 34 | "- as long as we know the structure and encoding, we can unpack it into data\n", 35 | "- we don't even need files!\n", 36 | "- how about sending it directly through the internet?" 
37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "id": "11b982aa", 42 | "metadata": { 43 | "slideshow": { 44 | "slide_type": "slide" 45 | } 46 | }, 47 | "source": [ 48 | "### How does API work?\n", 49 | "\n", 50 | "![API](https://voyager.postman.com/illustration/diagram-what-is-an-api-postman-illustration.svg)" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "id": "2766306b", 56 | "metadata": { 57 | "slideshow": { 58 | "slide_type": "slide" 59 | } 60 | }, 61 | "source": [ 62 | "### Example: Google Books API\n", 63 | "\n", 64 | "You could try this in any browser: [https://www.googleapis.com/books/v1/volumes?q=isbn:9780261102217](https://www.googleapis.com/books/v1/volumes?q=isbn:9780261102217)\n", 65 | "\n", 66 | "But how do we know how to use it? Read the [documentation](https://developers.google.com/books/docs/v1/using#PerformingSearch)!" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "id": "b2a5bfec", 73 | "metadata": { 74 | "slideshow": { 75 | "slide_type": "slide" 76 | } 77 | }, 78 | "outputs": [], 79 | "source": [ 80 | "# A better way to do this\n", 81 | "\n", 82 | "import json\n", 83 | "from urllib.request import urlopen\n", 84 | "\n", 85 | "api = \"https://www.googleapis.com/books/v1/volumes?q=\"\n", 86 | "query = \"isbn:9780261102217\"\n", 87 | "\n", 88 | "# send a request and get a JSON response\n", 89 | "resp = urlopen(api + query)\n", 90 | "# parse JSON into Python as a dictionary\n", 91 | "book_data = json.load(resp)" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "id": "152d9883", 98 | "metadata": { 99 | "slideshow": { 100 | "slide_type": "slide" 101 | } 102 | }, 103 | "outputs": [], 104 | "source": [ 105 | "volume_info = book_data[\"items\"][0][\"volumeInfo\"]\n", 106 | "\n", 107 | "print('Title: ' + volume_info['title'])\n", 108 | "print('Author: ' + str(volume_info['authors']))\n", 109 | "print('Publication Date: ' + volume_info['publishedDate'])" 110 | ] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "id": "f7ce48d3", 115 | "metadata": { 116 | "slideshow": { 117 | "slide_type": "slide" 118 | } 119 | }, 120 | "source": [ 121 | "### Example: Youtube API\n", 122 | "\n", 123 | "#### Getting an API key\n", 124 | "\n", 125 | "- Go to [Google Cloud Platform](https://console.cloud.google.com/)\n", 126 | "\n", 127 | "- Create a project in the Google Developers Console\n", 128 | "\n", 129 | "- Enable YouTube Data API \n", 130 | "\n", 131 | "- Obtain your API key\n", 132 | "\n", 133 | "#### Step by Step Guide\n", 134 | "\n", 135 | "![](https://miro.medium.com/v2/resize:fit:2000/format:webp/1*bCsi9C7yC8U-dVdWW4Zqhg.png)\n", 136 | "\n", 137 | "![](https://miro.medium.com/v2/resize:fit:2000/format:webp/1*k86eiqdHf9HhhxWKKnO7Sg.png)\n", 138 | "\n", 139 | "![](https://miro.medium.com/v2/resize:fit:2000/format:webp/1*DgLkgzXA9YkzMJC7Dh7JZg.png)\n", 140 | "\n", 141 | "![](https://miro.medium.com/v2/resize:fit:2000/format:webp/1*KzdLen4agoUi33_H0MutcA.png)\n", 142 | "\n", 143 | "![](https://miro.medium.com/v2/resize:fit:2000/format:webp/1*3HjyBix-P1gop_CPLYNpiQ.png)\n", 144 | "\n", 145 | "![](https://miro.medium.com/v2/resize:fit:2000/format:webp/1*rzq6FpRfV0ujb_B6nUoGEA.png)\n", 146 | "\n", 147 | "![](https://miro.medium.com/v2/resize:fit:2000/format:webp/1*FOTj3rvn0hGmHxgNz0x1Gw.png)\n", 148 | "\n", 149 | "![](https://miro.medium.com/v2/resize:fit:2000/format:webp/1*cLWdO9siuQE-3v0kPz-SAA.png)\n", 150 | "\n", 151 | "Credit: [Pedro 
Hernández](https://medium.com/mcd-unison/youtube-data-api-v3-in-python-tutorial-with-examples-e829a25d2ebd)\n", 152 | "\n", 153 | "#### Install google api package\n", 154 | "\n", 155 | "- install the package with `conda install -c conda-forge google-api-python-client` or \n", 156 | "`pip install google-api-python-client`" 157 | ] 158 | }, 159 | { 160 | "cell_type": "markdown", 161 | "id": "f001ad51", 162 | "metadata": {}, 163 | "source": [ 164 | "### Simple video search" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": null, 170 | "id": "763cf34e", 171 | "metadata": {}, 172 | "outputs": [], 173 | "source": [ 174 | "# Setting Up\n", 175 | "import googleapiclient.discovery\n", 176 | "api_service_name = \"youtube\"\n", 177 | "api_version = \"v3\"\n", 178 | "DEVELOPER_KEY = \"#################\"\n", 179 | "youtube = googleapiclient.discovery.build(\n", 180 | " api_service_name, api_version, developerKey = DEVELOPER_KEY)" 181 | ] 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "id": "810614af", 186 | "metadata": {}, 187 | "source": [ 188 | "### Getting a list of videos" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": null, 194 | "id": "27c002da", 195 | "metadata": {}, 196 | "outputs": [], 197 | "source": [ 198 | "# The codes to send the request\n", 199 | "request = youtube.search().list(\n", 200 | " part=\"id,snippet\",\n", 201 | " type='video',\n", 202 | " q=\"Lord of the rings\",\n", 203 | " maxResults=1\n", 204 | ")\n", 205 | "# Request execution\n", 206 | "response = request.execute()\n", 207 | "print(response)" 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": null, 213 | "id": "c7d29894", 214 | "metadata": {}, 215 | "outputs": [], 216 | "source": [ 217 | "lotr_videos_ids = youtube.search().list(\n", 218 | " part=\"id\",\n", 219 | " type='video',\n", 220 | " order=\"viewCount\", # This can also be \"date\", \"rating\", \"relevance\" etc.\n", 221 | " q=\"Lord of the rings\", # The search query\n", 222 | " maxResults=50,\n", 223 | " fields=\"items(id(videoId))\"\n", 224 | ").execute()" 225 | ] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "execution_count": null, 230 | "id": "ca3a3c6e", 231 | "metadata": {}, 232 | "outputs": [], 233 | "source": [ 234 | "lotr_videos_ids" 235 | ] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": null, 240 | "id": "66a1711e", 241 | "metadata": {}, 242 | "outputs": [], 243 | "source": [ 244 | "info = {\n", 245 | " 'id':[],\n", 246 | " 'title':[],\n", 247 | " 'views':[]\n", 248 | "}\n", 249 | "\n", 250 | "for item in lotr_videos_ids['items']:\n", 251 | " vidId = item['id']['videoId']\n", 252 | " r = youtube.videos().list(\n", 253 | " part=\"statistics,snippet\",\n", 254 | " id=vidId,\n", 255 | " fields=\"items(statistics),snippet(title)\"\n", 256 | " ).execute()\n", 257 | "\n", 258 | " views = r['items'][0]['statistics']['viewCount']\n", 259 | " title = r['items'][0]['snippet']['title']\n", 260 | " info['id'].append(vidId)\n", 261 | " info['title'].append(title)\n", 262 | " info['views'].append(views)\n", 263 | "\n", 264 | "df = pd.DataFrame(data=info)" 265 | ] 266 | }, 267 | { 268 | "cell_type": "markdown", 269 | "id": "f85288f6", 270 | "metadata": {}, 271 | "source": [ 272 | "### How to search by channel id?\n", 273 | "\n", 274 | "First, you need to find the channel id, there are many tools for that, eg [this one](https://commentpicker.com/youtube-channel-id.php). 
While it is possible to search by username, sometimes it works, sometimes it doesn't.\n", 275 | "\n", 276 | "Example: Last Week Tonight by John Oliver (https://www.youtube.com/@LastWeekTonight)" 277 | ] 278 | }, 279 | { 280 | "cell_type": "code", 281 | "execution_count": null, 282 | "id": "8524c0f2", 283 | "metadata": {}, 284 | "outputs": [], 285 | "source": [ 286 | "# Some Channel Statistics\n", 287 | "response = youtube.channels().list( \n", 288 | "    part='statistics', \n", 289 | "    id='UC3XTzVzaHQEd30rQbuvCtTQ').execute()" 290 | ] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "execution_count": null, 295 | "id": "b53f73e8", 296 | "metadata": {}, 297 | "outputs": [], 298 | "source": [ 299 | "# You can search by channel id, but it will not return everything\n", 300 | "videos_ids = youtube.search().list(\n", 301 | "    part=\"id\",\n", 302 | "    type='video',\n", 303 | "    order=\"viewCount\", # This can also be \"date\", \"rating\", \"relevance\" etc.\n", 304 | "    channelId=\"UC3XTzVzaHQEd30rQbuvCtTQ\", # The channel id to search within\n", 305 | "    maxResults=50, # 50 is the maximum the API accepts per request\n", 306 | "    fields=\"items(id(videoId))\"\n", 307 | ").execute()" 308 | ] 309 | }, 310 | { 311 | "cell_type": "code", 312 | "execution_count": null, 313 | "id": "603f6d16", 314 | "metadata": {}, 315 | "outputs": [], 316 | "source": [ 317 | "# A more robust way is to search by playlists. First, you need to get the playlist ids.\n", 318 | "response = youtube.playlists().list( \n", 319 | "    part='contentDetails,snippet', \n", 320 | "    channelId='UC3XTzVzaHQEd30rQbuvCtTQ', \n", 321 | "    maxResults=50\n", 322 | "    ).execute() \n", 323 | "\n", 324 | "playlists = []\n", 325 | "for i in response['items']:\n", 326 | "    playlists.append(i['id'])" 327 | ] 328 | }, 329 | { 330 | "cell_type": "code", 331 | "execution_count": null, 332 | "id": "177c377c", 333 | "metadata": {}, 334 | "outputs": [], 335 | "source": [ 336 | "# Next, write a loop to page through all the results\n", 337 | "\n", 338 | "nextPageToken = None\n", 339 | "\n", 340 | "while True: \n", 341 | "\n", 342 | "    response = youtube.playlistItems().list( \n", 343 | "        part='snippet', \n", 344 | "        playlistId=playlists[0], \n", 345 | "        maxResults=50, # the API caps this at 50 per page \n", 346 | "        pageToken=nextPageToken \n", 347 | "    ).execute() \n", 348 | "\n", 349 | "    # Iterate through the response and print each video's title \n", 350 | "    for item in response['items']: \n", 351 | "        title = item['snippet']['title']\n", 352 | "        print(title) \n", 353 | "        print(\"\\n\")\n", 354 | "    nextPageToken = response.get('nextPageToken') \n", 355 | "    \n", 356 | "    if not nextPageToken: \n", 357 | "        break" 358 | ] 359 | }, 360 | { 361 | "cell_type": "markdown", 362 | "id": "375a3095", 363 | "metadata": {}, 364 | "source": [ 365 | "### Exercise\n", 366 | "\n", 367 | "Find the top 10 most viewed dog and cat videos on YouTube" 368 | ] 369 | } 370 | ], 371 | "metadata": { 372 | "celltoolbar": "Slideshow", 373 | "kernelspec": { 374 | "display_name": "Python 3 (ipykernel)", 375 | "language": "python", 376 | "name": "python3" 377 | }, 378 | "language_info": { 379 | "codemirror_mode": { 380 | "name": "ipython", 381 | "version": 3 382 | }, 383 | "file_extension": ".py", 384 | "mimetype": "text/x-python", 385 | "name": "python", 386 | "nbconvert_exporter": "python", 387 | "pygments_lexer": "ipython3", 388 | "version": "3.9.18" 389 | } 390 | }, 391 | "nbformat": 4, 392 | "nbformat_minor": 5 393 | } 394 | -------------------------------------------------------------------------------- /2023/day3/Teaching Exercises.ipynb: 
-------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "868c5ff0", 6 | "metadata": { 7 | "slideshow": { 8 | "slide_type": "slide" 9 | } 10 | }, 11 | "source": [ 12 | "### Question: \n", 13 | "### A student comes to you with a vague topic in mind. What data collection methods would you recommend?" 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "id": "4d03208f", 19 | "metadata": { 20 | "slideshow": { 21 | "slide_type": "slide" 22 | } 23 | }, 24 | "source": [ 25 | "Topics:\n", 26 | "\n", 27 | "- Analyzing Social Media Discourse: A Comparative Study of Political Communication Strategies during Elections.\n", 28 | "- The Impact of Online Influencers on Consumer Behavior: An Analysis of Product Endorsements.\n", 29 | "- Exploring Online Activism: Case Studies on the Effectiveness of Digital Campaigns in Social Change Movements.\n", 30 | "- Fake News and Public Perception: An Examination of the Role of Online Information in Shaping Public Opinion.\n", 31 | "- Digital Diplomacy: A Cross-Cultural Analysis of Nation Branding in International Relations.\n", 32 | "- User-Generated Content and Brand Loyalty: A Study of Customer Engagement on E-commerce Platforms.\n", 33 | "- Crisis Communication in the Age of Social Media: A Comparative Study of Organizational Responses to Online Controversies.\n", 34 | "- Analyzing Media Bias and Framing in Online News Articles.\n", 35 | "\n", 36 | "\n", 37 | "\n", 38 | "\n", 39 | "\n", 40 | "\n", 41 | "\n" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "id": "70e0b8cc", 47 | "metadata": { 48 | "slideshow": { 49 | "slide_type": "slide" 50 | } 51 | }, 52 | "source": [ 53 | "Considerations:\n", 54 | "- What kind of data?\n", 55 | "- Where can you get the data (state the actual URLs and/or API endpoint)?\n", 56 | "- How can you get the data?\n", 57 | "- How would you store the data?\n", 58 | "- Any limitations?"
59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "id": "1dff6acf", 64 | "metadata": {}, 65 | "source": [ 66 | "#### Make a short presentation and share with the class" 67 | ] 68 | } 69 | ], 70 | "metadata": { 71 | "celltoolbar": "Slideshow", 72 | "kernelspec": { 73 | "display_name": "Python 3 (ipykernel)", 74 | "language": "python", 75 | "name": "python3" 76 | }, 77 | "language_info": { 78 | "codemirror_mode": { 79 | "name": "ipython", 80 | "version": 3 81 | }, 82 | "file_extension": ".py", 83 | "mimetype": "text/x-python", 84 | "name": "python", 85 | "nbconvert_exporter": "python", 86 | "pygments_lexer": "ipython3", 87 | "version": "3.9.18" 88 | } 89 | }, 90 | "nbformat": 4, 91 | "nbformat_minor": 5 92 | } 93 | -------------------------------------------------------------------------------- /2023/day3/Webscraping.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "e9014de5", 6 | "metadata": { 7 | "slideshow": { 8 | "slide_type": "slide" 9 | } 10 | }, 11 | "source": [ 12 | "# Webscraping\n", 13 | "\n", 14 | "Author: Justin Chun-ting Ho\n", 15 | "\n", 16 | "Date: 27 Nov 2023\n", 17 | "\n", 18 | "Credit: Some sections are adopted from the slides prepared by Damian Trilling" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "id": "cb8d7db1", 24 | "metadata": {}, 25 | "source": [ 26 | "### What is an website?\n", 27 | "\n", 28 | "Let's take a look at [this](https://ascor.uva.nl/staff/ascor-faculty/ascor-staff---faculty.html)" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "id": "7c0d5028", 34 | "metadata": {}, 35 | "source": [ 36 | "### Typical Workflow\n", 37 | "\n", 38 | "- Download the source code (HTML)\n", 39 | "- Identify the pattern to isolate what we want\n", 40 | "- Write a script to extract" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "id": "76d0ecb3", 46 | "metadata": {}, 47 | "source": [ 48 | "## Approach 1: Regular Expression" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "id": "e34b3916", 54 | "metadata": {}, 55 | "source": [ 56 | "You probably need [this](https://images.datacamp.com/image/upload/v1665049611/Marketing/Blog/Regular_Expressions_Cheat_Sheet.pdf)." 
57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "id": "50a3825b", 63 | "metadata": {}, 64 | "outputs": [], 65 | "source": [ 66 | "import requests\n", 67 | "import re" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": null, 73 | "id": "67affb3c", 74 | "metadata": {}, 75 | "outputs": [], 76 | "source": [ 77 | "response = requests.get('https://ascor.uva.nl/staff/ascor-faculty/ascor-staff---faculty.html')" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "id": "58a38586", 84 | "metadata": {}, 85 | "outputs": [], 86 | "source": [ 87 | "text = response.text" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": null, 93 | "id": "4d4d6d58", 94 | "metadata": {}, 95 | "outputs": [], 96 | "source": [ 97 | "emails = re.findall(r'mailto:(.*?)\\\"',text)" 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "id": "36b6b243", 103 | "metadata": {}, 104 | "source": [ 105 | "## Approach 2: Modern Packages" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "id": "4d588c46", 111 | "metadata": {}, 112 | "source": [ 113 | "### Tools\n", 114 | "- Beautiful Soup: `pip install beautifulsoup4` or `conda install -c anaconda beautifulsoup4`\n", 115 | "- SelectorGadget: https://selectorgadget.com/" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "id": "ec502b65", 122 | "metadata": {}, 123 | "outputs": [], 124 | "source": [ 125 | "from bs4 import BeautifulSoup \n", 126 | "import csv\n", 127 | "import pandas as pd\n", 128 | "import numpy as np" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "id": "c91536be", 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [ 138 | "URL = 'https://ascor.uva.nl/staff/faculty.html'\n", 139 | "r = requests.get(URL) \n", 140 | "soup = BeautifulSoup(r.content) " 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": null, 146 | "id": "6750db9b", 147 | "metadata": {}, 148 | "outputs": [], 149 | "source": [ 150 | "r.content" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": null, 156 | "id": "541ee3fb", 157 | "metadata": {}, 158 | "outputs": [], 159 | "source": [ 160 | "emails = soup.find_all(class_=\"mail\")" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": null, 166 | "id": "cfb229ac", 167 | "metadata": {}, 168 | "outputs": [], 169 | "source": [ 170 | "emails[0:6]" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "id": "ecbb15ca", 177 | "metadata": {}, 178 | "outputs": [], 179 | "source": [ 180 | "for email in emails[0:6]:\n", 181 | " print(email['href'])" 182 | ] 183 | }, 184 | { 185 | "cell_type": "markdown", 186 | "id": "75db440f", 187 | "metadata": {}, 188 | "source": [ 189 | "### Another Way" 190 | ] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "execution_count": null, 195 | "id": "5021bfe5", 196 | "metadata": {}, 197 | "outputs": [], 198 | "source": [ 199 | "soup = BeautifulSoup(r.content) \n", 200 | "items = soup.find_all(class_=\"c-item__link\")" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": null, 206 | "id": "ab8091b9", 207 | "metadata": {}, 208 | "outputs": [], 209 | "source": [ 210 | "items[0]" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": null, 216 | "id": "01795f75", 217 | "metadata": {}, 218 | "outputs": [], 219 | "source": [ 220 | "links = []\n", 221 | "for i in items:\n", 222 | 
" link = i['href']\n", 223 | " links.append(link) " 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": null, 229 | "id": "ddad105f", 230 | "metadata": {}, 231 | "outputs": [], 232 | "source": [ 233 | "links[0:10]" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": null, 239 | "id": "244dca96", 240 | "metadata": {}, 241 | "outputs": [], 242 | "source": [ 243 | "link = '/profile/h/o/j.c.ho/j.c.ho.html?origin=%2BkELbJiCRnm%2F56cOYZSXzA'\n", 244 | "url = 'https://ascor.uva.nl/' + link\n", 245 | "r = requests.get(url)\n", 246 | "soup = BeautifulSoup(r.content)" 247 | ] 248 | }, 249 | { 250 | "cell_type": "code", 251 | "execution_count": null, 252 | "id": "627daaa9", 253 | "metadata": {}, 254 | "outputs": [], 255 | "source": [ 256 | "name = soup.find(class_=\"c-profile__name\").get_text()\n", 257 | "name" 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": null, 263 | "id": "13319474", 264 | "metadata": {}, 265 | "outputs": [], 266 | "source": [ 267 | "summary = soup.find(class_=\"c-profile__summary\").get_text()\n", 268 | "summary" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": null, 274 | "id": "1be51446", 275 | "metadata": {}, 276 | "outputs": [], 277 | "source": [ 278 | "profile = soup.find(id=\"Profile\").get_text()\n", 279 | "profile" 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": null, 285 | "id": "ceabdc89", 286 | "metadata": {}, 287 | "outputs": [], 288 | "source": [ 289 | "divs = soup.find_all('div', class_=\"c-profile__list\")\n", 290 | "divs" 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": null, 296 | "id": "059433f8", 297 | "metadata": {}, 298 | "outputs": [], 299 | "source": [ 300 | "divs[1].find_all('li')" 301 | ] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "execution_count": null, 306 | "id": "e1dea200", 307 | "metadata": {}, 308 | "outputs": [], 309 | "source": [ 310 | "divs[1].find_all('li')[0].get_text()" 311 | ] 312 | }, 313 | { 314 | "cell_type": "code", 315 | "execution_count": null, 316 | "id": "15e9caf6", 317 | "metadata": {}, 318 | "outputs": [], 319 | "source": [ 320 | "profiles = []\n", 321 | "\n", 322 | "for link in links:\n", 323 | " print(link)\n", 324 | " url = 'https://ascor.uva.nl' + link\n", 325 | " r = requests.get(url)\n", 326 | " soup = BeautifulSoup(r.content) \n", 327 | " name = soup.find(class_=\"c-profile__name\").get_text()\n", 328 | " summary = soup.find(class_=\"c-profile__summary\").get_text()\n", 329 | "# profile = soup.find(id=\"Profile\").get_text()\n", 330 | " divs = soup.find_all('div', class_=\"c-profile__list\")\n", 331 | " email = divs[1].find_all('li')[0].get_text()\n", 332 | " \n", 333 | " profile = {} \n", 334 | " profile['name'] = name\n", 335 | " profile['summary'] = summary\n", 336 | "# profile['profile'] = profile\n", 337 | " profile['email'] = email\n", 338 | " profiles.append(profile) " 339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "execution_count": null, 344 | "id": "122eae1c", 345 | "metadata": {}, 346 | "outputs": [], 347 | "source": [ 348 | "profiles = []\n", 349 | "\n", 350 | "for link in links:\n", 351 | " print(link)\n", 352 | " url = 'https://ascor.uva.nl' + link\n", 353 | " r = requests.get(url)\n", 354 | " soup = BeautifulSoup(r.content) \n", 355 | " name = soup.find(class_=\"c-profile__name\").get_text()\n", 356 | " summary = soup.find(class_=\"c-profile__summary\").get_text()\n", 357 | " try:\n", 358 | " profile_text = 
soup.find(id=\"Profile\").get_text()\n", 359 | " except:\n", 360 | " profile_text = np.nan\n", 361 | " divs = soup.find_all('div', class_=\"c-profile__list\")\n", 362 | " try:\n", 363 | " email = divs[1].find_all('li')[0].get_text()\n", 364 | " except:\n", 365 | " email = np.nan\n", 366 | " \n", 367 | " profile = {} \n", 368 | " profile['name'] = name\n", 369 | " profile['summary'] = summary\n", 370 | " profile['profile_text'] = profile_text\n", 371 | " profile['email'] = email\n", 372 | " profiles.append(profile) " 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": null, 378 | "id": "db0e8c1d", 379 | "metadata": {}, 380 | "outputs": [], 381 | "source": [ 382 | "df = pd.DataFrame(profiles)" 383 | ] 384 | }, 385 | { 386 | "cell_type": "code", 387 | "execution_count": null, 388 | "id": "4febe195", 389 | "metadata": {}, 390 | "outputs": [], 391 | "source": [ 392 | "df" 393 | ] 394 | }, 395 | { 396 | "cell_type": "code", 397 | "execution_count": null, 398 | "id": "a7ff0b4c", 399 | "metadata": {}, 400 | "outputs": [], 401 | "source": [ 402 | "df.to_csv('profiles.csv')" 403 | ] 404 | }, 405 | { 406 | "cell_type": "code", 407 | "execution_count": null, 408 | "id": "328cb381", 409 | "metadata": {}, 410 | "outputs": [], 411 | "source": [ 412 | "df.to_json('profiles.json', orient='records', lines=True)" 413 | ] 414 | }, 415 | { 416 | "cell_type": "markdown", 417 | "id": "431264f5", 418 | "metadata": {}, 419 | "source": [ 420 | "### Exercise\n", 421 | "\n", 422 | "Get the full text of all the news item here: https://ascor.uva.nl/news/newslist.html" 423 | ] 424 | }, 425 | { 426 | "cell_type": "code", 427 | "execution_count": null, 428 | "id": "4f11a935", 429 | "metadata": {}, 430 | "outputs": [], 431 | "source": [] 432 | } 433 | ], 434 | "metadata": { 435 | "kernelspec": { 436 | "display_name": "Python 3 (ipykernel)", 437 | "language": "python", 438 | "name": "python3" 439 | }, 440 | "language_info": { 441 | "codemirror_mode": { 442 | "name": "ipython", 443 | "version": 3 444 | }, 445 | "file_extension": ".py", 446 | "mimetype": "text/x-python", 447 | "name": "python", 448 | "nbconvert_exporter": "python", 449 | "pygments_lexer": "ipython3", 450 | "version": "3.9.18" 451 | } 452 | }, 453 | "nbformat": 4, 454 | "nbformat_minor": 5 455 | } 456 | -------------------------------------------------------------------------------- /2023/day3/get_mails: -------------------------------------------------------------------------------- 1 | all_mail = [] 2 | 3 | for email in emails[0:6]: 4 | new_mail = email['href'].replace('mailto:','') 5 | all_mail.append(new_mail) 6 | -------------------------------------------------------------------------------- /2023/day3/updated cell: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | 3 | info = { 4 | 'id':[], 5 | 'views':[] 6 | } 7 | 8 | for item in lotr_videos_ids['items']: 9 | vidId = item['id']['videoId'] 10 | r = youtube.videos().list( 11 | part="statistics,snippet", 12 | id=vidId, 13 | fields="items(statistics)" 14 | ).execute() 15 | 16 | views = r['items'][0]['statistics']['viewCount'] 17 | info['id'].append(vidId) 18 | info['views'].append(views) 19 | 20 | df = pd.DataFrame(data=info) 21 | 22 | 23 | 24 | 25 | all_mail = [] 26 | 27 | for email in emails[0:6]: 28 | new_mail = email['href'].replace('mailto:','') 29 | all_mail.append(new_mail) 30 | -------------------------------------------------------------------------------- /2023/day4/README.md: 
-------------------------------------------------------------------------------- 1 | # Day 4: Natural Language Processing 2 | 3 | | Time slot | Content | 4 | |---------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------| 5 | | 09:30-11:00 | [Introduction to NLP and text as data](slides-04-1.pdf): In a gentle introduction to NLP techniques, we will discuss the basics of bag-of-words (BOW) approaches, such as tokenization, stopword removal, stemming and lemmatization | 6 | | 11:00 - 12:30 | Exercise time! [practice with some NLP](/exercises-morning/) or experiment with [vectorizers](/exercises-vectorizers/) | 7 | | 12:30-13:30 | Lunch (tip: lunch lecture by Toni van der Meer!) | 8 | | 13:30-14:30 | [Advanced NLP and regular expressions](slides-04-2.pdf): In the second lecture of the day, we will delve a bit deeper into NLP approaches. We discuss the possibilities of NER in spaCy and introduce regular expressions | 9 | | 14:45-15:30 | Exercise time! [Play around with regular expressions](/exercises-afternoon/) or explore [spaCy](spacy-examples.ipynb) | 10 | | 15:30-16:00 | Wrap-up / final questions | 11 | -------------------------------------------------------------------------------- /2023/day4/example-ngrams.md: -------------------------------------------------------------------------------- 1 | ## N-grams 2 | 3 | ```python 4 | import nltk 5 | from gensim import corpora 6 | from gensim import models 7 | 8 | documents = ["In the train from Connecticut to New York", 9 | "He is a spokesman for New York City's Health Department", 10 | "New York has been one of the states hit hardest by Coronavirus"] 11 | 12 | documents_bigrams = [["_".join(tup) for tup in nltk.ngrams(doc.split(),2)] for doc in documents] 13 | 14 | len(documents) == len(documents_bigrams) 15 | # maybe we want both unigrams and bigrams in the feature set? 
16 | documents_uniandbigrams = [] 17 | for a,b in zip([doc.split() for doc in documents],documents_bigrams): 18 | documents_uniandbigrams.append(a + b) 19 | 20 | print(documents_uniandbigrams) 21 | ``` 22 | 23 | if you want to use this as input for a `sklearn` classifier, you can do the following: 24 | 25 | ```python 26 | myvectorizer = CountVectorizer(analyzer=lambda x:x) 27 | ``` 28 | 29 | And if you want to see what's happening, convert to a dense format (please only do this with a small toy sample, never on a large dataset): 30 | 31 | ```python 32 | X = myvectorizer.fit_transform(documents_uniandbigrams) 33 | df = pd.DataFrame(X.toarray().transpose(), index = myvectorizer.get_feature_names()) 34 | df 35 | documents_uniandbigrams 36 | ​ 37 | myvectorizer = CountVectorizer(analyzer=lambda x:x) 38 | X = myvectorizer.fit_transform(documents_uniandbigrams) 39 | df = pd.DataFrame(X.toarray().transpose(), index = myvectorizer.get_feature_names()) 40 | ``` 41 | 42 | ### Collocations with `NLTK` 43 | 44 | ```python 45 | import nltk 46 | documents = ["He travelled by train from New York to Connecticut and back to New York", 47 | "He is a spokesman for New York City's Health Department", 48 | "New York has been one of the states hit hardest by Coronavirus"] 49 | 50 | text = [nltk.Text(tkn for tkn in doc.split()) for doc in documents ] # this inspects frequencies WITHIN documents 51 | text[0].collocations(num=10) 52 | ``` 53 | 54 | ### Collocations with `Gensim` 55 | 56 | ```python 57 | from nltk.tokenize import TreebankWordTokenizer 58 | import pandas as pd 59 | import regex 60 | from sklearn.feature_extraction.text import CountVectorizer 61 | from gensim.models import KeyedVectors, Phrases 62 | from gensim.models.phrases import Phraser 63 | from glob import glob 64 | 65 | infowarsfiles = glob('articles/*/Infowars/*') 66 | documents = [] 67 | for filename in infowarsfiles: 68 | with open(filename) as f: 69 | documents.append(f.read()) 70 | 71 | mytokenizer = TreebankWordTokenizer() 72 | tokenized_texts = [mytokenizer.tokenize(t) for t in documents] 73 | 74 | phrases_model = Phrases(tokenized_texts, min_count=10, scoring="npmi", threshold=.5) 75 | score_dict = phrases_model.export_phrases() 76 | scores = pd.DataFrame(score_dict.items(), 77 | columns=["phrase", "score"]) 78 | scores.sort_values("score",ascending=False).head() 79 | ``` 80 | 81 | Using `Gensim`'s collocations in `sklearn`'s vectorizer 82 | 83 | ```python 84 | from gensim.models.phrases import Phraser 85 | import numpy as np 86 | 87 | phraser = Phraser(phrases_model) 88 | tokens_phrases = [phraser[doc] for doc in tokens] 89 | cv = CountVectorizer(tokenizer=lambda x: x, lowercase=False) # initiate a count or tfidf vectorizer 90 | ``` 91 | 92 | Inspecting the resulting dtm 93 | 94 | ```python 95 | from gensim.models.phrases import Phraser 96 | import numpy as np 97 | 98 | phraser = Phraser(phrases_model) 99 | tokens_phrases = [phraser[doc] for doc in tokens] 100 | cv = CountVectorizer(tokenizer=lambda x: x, lowercase=False) # initiate a count or tfidf vectorizer 101 | 102 | 103 | 104 | def termstats(dfm, vectorizer): 105 | """Helper function to calculate term and document frequency per term""" 106 | # Frequencies are the column sums of the DFM 107 | frequencies = dfm.sum(axis=0).tolist()[0] 108 | # Document frequencies are the binned count 109 | # of the column indices of DFM entries 110 | docfreqs = np.bincount(dfm.indices) 111 | freq_df=pd.DataFrame(dict(frequency=frequencies,docfreq=docfreqs), index=vectorizer.get_feature_names()) 
112 | return freq_df.sort_values("frequency", ascending=False) 113 | 114 | dtm = cv.fit_transform(tokens_phrases) 115 | termstats(dtm, cv).filter(like="hussein", axis=0) 116 | ``` 117 | -------------------------------------------------------------------------------- /2023/day4/exercises-afternoon/01tuesday-regex-exercise.md: -------------------------------------------------------------------------------- 1 | 2 | # Exercise with regular expressions 3 | 4 | Let’s take some time to write some regular expressions. Write a 5 | script that 6 | 7 | • extracts URLS form a list of strings 8 | • removes everything that is not a letter or number from a list of 9 | strings 10 | 11 | 12 | ```python 13 | list_w_urls = ["some text with a url http://www.youtube.com... ", 14 | "and another one!! https://www.facebook.com", 15 | "more urls www.baidu.com??", 16 | "And even more?!! %$##($^) https://www.yahoo.com and this one http://www.amazon.com and this one www.wikipedia.org" ] 17 | ``` 18 | -------------------------------------------------------------------------------- /2023/day4/exercises-afternoon/01tuesday-regex-solution.md: -------------------------------------------------------------------------------- 1 | 2 | # Possible solution to `regex` [exercise](01tuesday-regex-exercise.md) 3 | *Please note that alternative solutions may work just as well or even better* 4 | 5 | ## extracts URLS form a list of strings 6 | 7 | ```python 8 | import re 9 | 10 | for l in list_w_urls: 11 | m = re.findall('(?:(?:https?|ftp):\/\/)?[\w.]+\.[\w]+.', l) 12 | print(m) 13 | 14 | ``` 15 | 16 | `?` = matches either once or zero times 17 | `?:` = matches the group but does not captured it / save it. 18 | `\w` = matches a "word" character: a letter or digit or underbar [a-zA-Z0-9_]. 19 | 20 | 21 | ## remove everything that is not a letter or number from a list of strings 22 | 23 | ```python 24 | for e in list_w_urls: 25 | print(re.sub(r'[\W_]+', ' ', e)) 26 | ``` 27 | 28 | `[\W_]` = matches any non-word character (or underscore, which is weirdly enough considered a 'word character' - therefore, if we simply do `\W`, we miss the underscores) 29 | `+` = 1 or more times 30 | -------------------------------------------------------------------------------- /2023/day4/exercises-afternoon/02tuesday-exercise_nexis.md: -------------------------------------------------------------------------------- 1 | # A Practical Introduction to Machine Learning in Python 2 | Anne Kroon and Damian Trilling 3 | 4 | ## Day 2 (Tuesday Afternoon) 5 | 6 | ## Exercise: Parsing unstructured text files 7 | 8 | When working with text data, often we have to deal with unstructured files. Before we can start with our analysis, we have to transform such files to more structured forms of data. 9 | 10 | An example of such forms of unstructured data is the output of Nexis Uni, a large news database often used by social scientists. 11 | We will practise with some files downloaded from Nexis Uni. 12 | 13 | Download and unpack a set of .RTF files [here](corona_news.tar.gz). 14 | Windows users may need an additional program to unpack it, such as 7zip. 15 | 16 | Specific tasks 17 | 18 | 1. Write some code to read the data in. 19 | 3. Try to extract the newspaper title using regular expressions. 20 | 4. Do the same for the publication dates. 21 | 5. Finally, extract the full body of the text. 22 | 6. 
Think about a way to store the data 23 | 24 | 25 | Hints: 26 | 27 | In order to read .RTF files with python, we need to convert rtf files to strings, before we can start parsing and processing. 28 | This library can help: https://pypi.org/project/striprtf/ 29 | 30 | ```bash 31 | pip install striprtf 32 | ``` 33 | 34 | Afterwards, we can start converting our files: 35 | 36 | ```python 37 | from striprtf.striprtf import rtf_to_text 38 | 39 | rtf_string = open("exercises-afternoon/corona_news/news_corona_1.RTF").read() 40 | text = rtf_to_text(rtf_string) 41 | 42 | ``` 43 | 44 | This will return a string object. In order to split up the string by article, we can look at the structure of the data. 45 | As you might notice, all news articles went with 'End of Document '. We can use this information to split the string. 46 | 47 | ```python 48 | splitted_text = text.replace("\n", " ").split("End of Document ") 49 | ``` 50 | -------------------------------------------------------------------------------- /2023/day4/exercises-afternoon/02tuesday-exercise_nexis_solution.md: -------------------------------------------------------------------------------- 1 | # A Practical Introduction to Machine Learning in Python 2 | Anne Kroon and Damian Trilling 3 | 4 | This is just one solution. Maybe you came up with an even better one yourself! 5 | 6 | ### Reading the files in: 7 | 8 | ```python 9 | from striprtf.striprtf import rtf_to_text 10 | 11 | # read the files in 12 | filenames = ["news_corona_" + str(i) + ".RTF" for i in range(1, 4) ] 13 | rtf_string = [ open("exercises-afternoon/corona_news/" + f).read() for f in filenames ] 14 | 15 | # convert the files from rtf to string format 16 | text = [ rtf_to_text(i) for i in rtf_string ] 17 | 18 | # replace line breaks and split articles 19 | 20 | splitted_text = [ i.replace("\n", " ").split("End of Document ") for i in text ] 21 | 22 | ``` 23 | 24 | ### A function that parses the documents. 25 | 26 | ```python 27 | import re 28 | 29 | def parse_nexis_uni(news_string): 30 | ''' parses strings (nexis news articles), so that the title, date and full text are extracted. ''' 31 | 32 | parsed_results = [] 33 | for line in news_string: 34 | 35 | # newspaper title 36 | matchObj1=re.match(" +([a-zA-Z\s]+?) 
\d+",line) 37 | if matchObj1: 38 | newspaper = matchObj1.group(1) 39 | else: 40 | newspaper = "NaN" 41 | 42 | # date 43 | matchObj2 = re.match(r".*(\d{1,2}) ([jJ]anuari|[fF]ebruari|[mM]aart|[aA]pril|[mM]ei|[jJ]uni|[jJ]uli|[aA]ugustus|[sS]eptember|[Oo]ktober|[nN]ovember|[dD]ecember) (\d{4}).*", line) 44 | if matchObj2: 45 | day = matchObj2.group(1) 46 | month = matchObj2.group(2) 47 | year = matchObj2.group(3) 48 | date = (day, month, year ) 49 | else: 50 | date = "NaN" 51 | 52 | # full text 53 | matchObj3=re.match(".*Body(.*) Classification",line) 54 | if matchObj3: 55 | text = matchObj3.group(1).strip() 56 | else: 57 | text = "NaN" 58 | 59 | parsed_results.append( {'newspaper': newspaper, 60 | 'date' : date, 61 | 'text': text } ) 62 | 63 | return parsed_results 64 | 65 | ``` 66 | 67 | #### calling the function 68 | 69 | ```python 70 | results = [] 71 | for document in splitted_text: 72 | results.extend(parse_nexis_uni(document)) 73 | ``` 74 | -------------------------------------------------------------------------------- /2023/day4/exercises-afternoon/corona_news.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2023/day4/exercises-afternoon/corona_news.tar.gz -------------------------------------------------------------------------------- /2023/day4/exercises-morning/exercise-feature-engineering.md: -------------------------------------------------------------------------------- 1 | # Exercise 1: Working with textual data 2 | 3 | ### 0. Get the data. 4 | 5 | - Download `articles.tar.gz` from 6 | https://dx.doi.org/10.7910/DVN/ULHLCB 7 | 8 | If you experience difficulties downloading this (rather large) dataset, you can also download just a part of the data [here](https://surfdrive.surf.nl/files/index.php/s/bfNFkuUVoVtiyuk) 9 | 10 | - Unpack it. On Linux and MacOS, you can do this with `tar -xzf mydata.tar.gz` on the command line. On Windows, you may need an additional tool such as `7zip` for that (note that technically speaking, there is a `tar` archive within a `gz` archive, so unpacking may take *two* steps depending on your tool). 11 | 12 | 13 | ### 1. Inspect the structure of the dataset. 14 | What information do the following elements give you? 15 | 16 | - folder (directory) names 17 | - folder structure/hierarchy 18 | - file names 19 | - file contents 20 | 21 | ### 2. Discuss strategies for working with this dataset! 22 | 23 | - Which questions could you answer? 24 | - How could you deal with it, given the size and the structure? 25 | - How much memory1 (RAM) does your computer have? How large is the complete dataset? What does that mean? 26 | - Make a sketch (e.g., with pen&paper), how you could handle your workflow and your data to answer your question. 27 | 28 | 1 *memory* (RAM), not *storage* (harddisk)! 29 | 30 | ### 3. Read some (or all?) data 31 | 32 | Here is some example code that you can modify. Assuming that he folder `articles` is in the same folder as the notebook you are currently working on, you could, for instance, do the following to read a *part* of your dataset. 33 | 34 | ```python 35 | from glob import glob 36 | infowarsfiles = glob('articles/*/Infowars/*') 37 | infowarsarticles = [] 38 | for filename in infowarsfiles: 39 | with open(filename) as f: 40 | infowarsarticles.append(f.read()) 41 | 42 | ``` 43 | 44 | - Can you explain what the `glob` function does? 
45 | - What does `infowarsfiles` contain, and what does `infowarsarticles` contain? First make an educated guess based on the code snippet, then check it! Do *not* print the whole thing, but use `len`, `type` en slicing `[:10]` to get the info ou need. 46 | 47 | - Tip: take a random sample of the articles for practice purposes (if your code works, you can scale up!) 48 | 49 | ``` 50 | # taking a random sample of the articles for practice purposes 51 | articles =random.sample(infowarsarticles, 10) 52 | ``` 53 | 54 | ### 2. first analyses and pre-processing steps 55 | 56 | - Perform some first analyses on the data using string methods and regular expressions. 57 | Techniques you can try out include: 58 | 59 | a. lowercasing 60 | b. tokenization 61 | c. stopword removal 62 | d. stemming and/or lemmatizing) 63 | e. cleaning: removing punctuation, line breaks, double spaces 64 | 65 | 66 | ### 3. N-grams 67 | 68 | - Think about what type of n-grams you want to add to your feature set. Extract and inspect n-grams and/or collocations, and add them to your feature set if you think this is relevant. 69 | 70 | ### 4. Extract entities and other meaningful information 71 | 72 | Try to extract meaningful information from your texts. Depending on your interests and the nature of the data, you could: 73 | 74 | - use regular expressions to distinguish relevant from irrelevant texts, or to extract substrings 75 | - use NLP techniques such as Named Entity Recognition to extract entities that occur. 76 | 77 | ### 5. Train a supervised classifier 78 | 79 | Go back to your code belonging to yesterday's assignment. Perform the same classification task, but this time carefully consider which feature set you want to use. Reflect on the options listed above, and extract features that you think are relevant to include. Carefully consider **pre-processing steps**: what type of features will you feed your algorithm? Do you, for example, want to manually remove stopwords, or include ngrams? Use these features as input for your classifier, and investigate the effects hereof on performance of the classifier. Not that the purpose is not to build the perfect classifier, but to inspect the effects of different feature engineering decisions on the outcomes of your classification algorithm. 80 | 81 | 82 | ## BONUS 83 | 84 | - Compare that bottom-up approach with a top-down (keyword or regular-expression based) approach. 85 | -------------------------------------------------------------------------------- /2023/day4/exercises-morning/possible-solution-exercise-day2-vectorizers.md: -------------------------------------------------------------------------------- 1 | ### using manually crafted features as input for supervised machine learning with `sklearn` 2 | 3 | 4 | ```python 5 | import nltk 6 | from sklearn.model_selection import train_test_split 7 | 8 | from glob import glob 9 | import random 10 | 11 | 12 | def read_data(listofoutlets): 13 | texts = [] 14 | labels = [] 15 | for label in listofoutlets: 16 | for file in glob(f'../articles-small/*/{label}/*'): 17 | with open(file) as f: 18 | texts.append(f.read()) 19 | labels.append(label) 20 | return texts, labels 21 | 22 | documents, labels = read_data(['Infowars', 'BBC']) 23 | ``` 24 | 25 | Create bigrams and combine with unigrams 26 | 27 | ```python 28 | documents_bigrams = [["_".join(tup) for tup in nltk.ngrams(doc.split(),2)] for doc in documents] # creates bigrams 29 | documents_bigrams[7][:5] # inspect the results... 
30 | 31 | # maybe we want both unigrams and bigrams in the feature set? 32 | assert len(documents)==len(documents_bigrams) 33 | 34 | documents_uniandbigrams = [] 35 | for a,b in zip([doc.split() for doc in documents],documents_bigrams): 36 | documents_uniandbigrams.append(a + b) 37 | 38 | #and let's inspect the outcomes again. 39 | documents_uniandbigrams[7] 40 | ``` 41 | 42 | some sanity checks: 43 | 44 | ```python 45 | len(documents_uniandbigrams[7]),len(documents_bigrams[7]),len(documents[7].split()) 46 | assert len(documents_uniandbigrams) == len(labels) 47 | ``` 48 | 49 | Now lets fit a `sklearn` vectorizer on the manually crafted feature set: 50 | 51 | ```python 52 | from sklearn.feature_extraction.text import CountVectorizer 53 | X_train,X_test,y_train,y_test=train_test_split(documents_uniandbigrams, labels, test_size=0.3) 54 | # We do *not* want scikit-learn to tokenize a string into a list of tokens, 55 | # after all, we already *have* a list of tokens. lambda x:x is just a fancy way of saying: 56 | # do nothing! 57 | myvectorizer= CountVectorizer(analyzer=lambda x:x) 58 | ``` 59 | 60 | let's fit and transform 61 | 62 | ```python 63 | #Fit the vectorizer, and transform. 64 | X_features_train = myvectorizer.fit_transform(X_train) 65 | X_features_test = myvectorizer.transform(X_test) 66 | ``` 67 | 68 | Inspect the vocabulary and their id mappings 69 | 70 | ```python 71 | # inspect 72 | myvectorizer.vocabulary_ 73 | ``` 74 | 75 | Finally, run the model again 76 | 77 | ```python 78 | from sklearn.naive_bayes import MultinomialNB 79 | from sklearn.metrics import accuracy_score 80 | from sklearn.metrics import classification_report 81 | 82 | model = MultinomialNB() 83 | model.fit(X_features_train, y_train) 84 | y_pred = model.predict(X_features_test) 85 | 86 | print(f"Accuracy : {accuracy_score(y_test, y_pred)}") 87 | print(classification_report(y_test, y_pred)) 88 | ``` 89 | 90 | 91 | ### Final remark on ngrams in scikit learn 92 | 93 | Of course, you do not *have* to do all of this if you just want to use ngrams. Alternatively, you can simply use 94 | ``` 95 | myvectorizer = CountVectorizer(ngram_range=(1,2)) 96 | X_features_train = myvectorizer.fit_transform(X_train) 97 | ``` 98 | *if X_train are the **untokenized** texts.* 99 | 100 | What this little example illustrates, though, is that you can use *any* manually crafted feature set as input for scikit-learn. 101 | -------------------------------------------------------------------------------- /2023/day4/exercises-morning/possible-solution-exercise-day2.md: -------------------------------------------------------------------------------- 1 | 2 | ## Exercise 2: NLP and feature engineering 3 | 4 | ### 1. Read in the data 5 | 6 | Load the data... 7 | 8 | ```python 9 | from glob import glob 10 | import random 11 | import nltk 12 | from nltk.stem.snowball import SnowballStemmer 13 | import spacy 14 | 15 | 16 | infowarsfiles = glob('articles/*/Infowars/*') 17 | infowarsarticles = [] 18 | for filename in infowarsfiles: 19 | with open(filename) as f: 20 | infowarsarticles.append(f.read()) 21 | 22 | 23 | # taking a random sample of the articles for practice purposes 24 | articles =random.sample(infowarsarticles, 10) 25 | ``` 26 | 27 | Let's inspect the data, and start some pre-processing/ cleaning steps... 28 | 29 | ### 2. first analyses and pre-processing steps 30 | 31 | ##### a. lowercasing articles 32 | 33 | ```python 34 | articles_lower_cased = [art.lower() for art in articles] 35 | ``` 36 | ##### b. 
tokenization 37 | 38 | Basic solution, using the `.str` method `.split()`. Not very sophisticated, though. 39 | 40 | ```python 41 | articles_split = [art.split() for art in articles] 42 | ``` 43 | 44 | A more sophisticated solution: 45 | 46 | ```python 47 | from nltk.tokenize import TreebankWordTokenizer 48 | articles_tokenized = [TreebankWordTokenizer().tokenize(art) for art in articles ] 49 | ``` 50 | 51 | Even more sophisticated; create your own tokenizer that first split into sentences. In this way,`TreebankWordTokenizer` works better. 52 | 53 | ```python 54 | import regex 55 | 56 | nltk.download("punkt") 57 | class MyTokenizer: 58 | def tokenize(self, text): 59 | tokenizer = TreebankWordTokenizer() 60 | result = [] 61 | word = r"\p{letter}" 62 | for sent in nltk.sent_tokenize(text): 63 | tokens = tokenizer.tokenize(sent) 64 | tokens = [t for t in tokens 65 | if regex.search(word, t)] 66 | result += tokens 67 | return result 68 | 69 | mytokenizer = MyTokenizer() 70 | print(mytokenizer.tokenize(articles[0])) 71 | ``` 72 | 73 | ##### c. removing stopwords 74 | 75 | Define your stopwordlist: 76 | 77 | ```python 78 | from nltk.corpus import stopwords 79 | mystopwords = stopwords.words("english") 80 | mystopwords.extend(["add", "more", "words"]) # manually add more stopwords to your list if needed 81 | print(mystopwords) #let's see what's inside 82 | ``` 83 | 84 | Now, remove stopwords from the corpus: 85 | 86 | ```python 87 | articles_without_stopwords = [] 88 | for article in articles: 89 | articles_no_stop = "" 90 | for word in article.lower().split(): 91 | if word not in mystopwords: 92 | articles_no_stop = articles_no_stop + " " + word 93 | articles_without_stopwords.append(articles_no_stop) 94 | ``` 95 | 96 | Same solution, but with list comprehension: 97 | 98 | ```python 99 | articles_without_stopwords = [" ".join([w for w in article.lower().split() if w not in mystopwords]) for article in articles] 100 | ``` 101 | 102 | Different--probably more sophisticated--solution, by writing a function and calling it in a list comprehension: 103 | 104 | ```python 105 | def remove_stopwords(article, stopwordlist): 106 | cleantokens = [] 107 | for word in article: 108 | if word.lower() not in mystopwords: 109 | cleantokens.append(word) 110 | return cleantokens 111 | 112 | articles_without_stopwords = [remove_stopwords(art, mystopwords) for art in articles_tokenized] 113 | ``` 114 | 115 | It's good practice to frequently inspect the results of your code, to make sure you are not making mistakes, and the results make sense. For example, compare your results to some random articles from the original sample: 116 | 117 | ```python 118 | print(articles[8][:100]) 119 | print("-----------------") 120 | print(" ".join(articles_without_stopwords[8])[:100]) 121 | ``` 122 | 123 | ##### d. 
stemming and lemmatization 124 | 125 | ```python 126 | stemmer = SnowballStemmer("english") 127 | 128 | stemmed_text = [] 129 | for article in articles: 130 | stemmed_words = "" 131 | for word in article.lower().split(): 132 | stemmed_words = stemmed_words + " " + stemmer.stem(word) 133 | stemmed_text.append(stemmed_words.strip()) 134 | ``` 135 | 136 | Same solution, but with list comprehension: 137 | 138 | ```python 139 | stemmed_text = [" ".join([stemmer.stem(w) for w in article.lower().split()]) for article in articles] 140 | ``` 141 | 142 | Compare tokenization and lemmatization using `spaCy`: 143 | 144 | ```python 145 | import spacy 146 | nlp = spacy.load("en_core_web_sm") 147 | lemmatized_articles = [[token.lemma_ for token in nlp(art)] for art in articles] 148 | ``` 149 | 150 | Again, frequently inspect your code, and for example compare the results to the original articles: 151 | 152 | ```python 153 | print(articles[6][:100]) 154 | print("-----------------") 155 | print(stemmed_text[6][:100]) 156 | print("-----------------") 157 | print(" ".join(lemmatized_articles[6])[:100]) 158 | ``` 159 | 160 | 161 | ##### e. cleaning: removing punctuation, line breaks, double spaces 162 | 163 | ```python 164 | articles[7] # print an article to inspect (the practice sample above only contains 10 articles). 165 | ## Typical cleaning up steps: 166 | from string import punctuation 167 | articles = [art.replace('\n\n', '') for art in articles] # remove line breaks 168 | articles = ["".join([w for w in art if w not in punctuation]) for art in articles] # remove punctuation 169 | articles = [" ".join(art.split()) for art in articles] # remove double spaces by splitting the strings into words and joining these words again 170 | 171 | articles[7] # print the same article to see whether the changes are in line with what you want 172 | ``` 173 | 174 | ### 3. N-grams 175 | 176 | ```python 177 | articles_bigrams = [["_".join(tup) for tup in nltk.ngrams(art.split(),2)] for art in articles] # creates bigrams 178 | articles_bigrams[7][:5] # inspect the results... 179 | 180 | # maybe we want both unigrams and bigrams in the feature set? 181 | 182 | assert len(articles)==len(articles_bigrams) 183 | 184 | articles_uniandbigrams = [] 185 | for a,b in zip([art.split() for art in articles],articles_bigrams): 186 | articles_uniandbigrams.append(a + b) 187 | 188 | #and let's inspect the outcomes again. 189 | articles_uniandbigrams[7] 190 | len(articles_uniandbigrams[7]),len(articles_bigrams[7]),len(articles[7].split()) 191 | ``` 192 | 193 | Or, if you want to inspect collocations: 194 | 195 | ```python 196 | text = [nltk.Text(tkn for tkn in art.split()) for art in articles ] 197 | text[7].collocations(num=10) 198 | ``` 199 | 200 | ---------- 201 | 202 | ### 4. 
Extract entities and other meaningful information 203 | 204 | ```Python 205 | import nltk 206 | 207 | tokens = [nltk.word_tokenize(sentence) for sentence in articles] 208 | tagged = [nltk.pos_tag(sentence) for sentence in tokens] 209 | print(tagged[0]) 210 | ``` 211 | 212 | playing around with Spacy: 213 | 214 | ```python 215 | nlp = spacy.load('en') 216 | 217 | doc = [nlp(sentence) for sentence in articles] 218 | for i in doc: 219 | for ent in i.ents: 220 | if ent.label_ == 'PERSON': 221 | print(ent.text, ent.label_ ) 222 | 223 | ``` 224 | 225 | 226 | 227 | 228 | 229 | 230 | 231 | 232 | Removing stopwords: 233 | 234 | ```python 235 | mystopwords = set(stopwords.words('english')) # use default NLTK stopword list; alternatively: 236 | # mystopwords = set(open('mystopwordfile.txt').readlines()) #read stopword list from a textfile with one stopword per line 237 | documents = [" ".join([w for w in doc.split() if w not in mystopwords]) for doc in documents] 238 | documents[7] 239 | ``` 240 | 241 | Using N-grams as features: 242 | 243 | ```python 244 | documents_bigrams = [["_".join(tup) for tup in nltk.ngrams(doc.split(),2)] for doc in documents] # creates bigrams 245 | documents_bigrams[7][:5] # inspect the results... 246 | 247 | # maybe we want both unigrams and bigrams in the feature set? 248 | 249 | assert len(documents)==len(documents_bigrams) 250 | 251 | documents_uniandbigrams = [] 252 | for a,b in zip([doc.split() for doc in documents],documents_bigrams): 253 | documents_uniandbigrams.append(a + b) 254 | 255 | #and let's inspect the outcomes again. 256 | documents_uniandbigrams[7] 257 | len(documents_uniandbigrams[7]),len(documents_bigrams[7]),len(documents[7].split()) 258 | ``` 259 | 260 | Or, if you want to inspect collocations: 261 | 262 | ```python 263 | text = [nltk.Text(tkn for tkn in doc.split()) for doc in documents ] 264 | text[7].collocations(num=10) 265 | ``` 266 | 267 | ---- 268 | 269 | 270 | *hint: if you want to include n-grams as feature input, add the following argument to your vectorizer:* 271 | 272 | ```python 273 | myvectorizer= CountVectorizer(analyzer=lambda x:x) 274 | ``` 275 | -------------------------------------------------------------------------------- /2023/day4/exercises-vectorizers/Understanding_vectorizers.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "3903bc69", 6 | "metadata": {}, 7 | "source": [ 8 | "# Understanding vectorizers\n", 9 | "\n", 10 | "In the following code examples, we will experiment with vectorizers to understand a bit better how they work. Feel free to adjust the code, and try things out yourself.\n", 11 | "\n", 12 | "For now, we will practice with `sklearn`'s vectorizers. however, packages such as `gensim` offer their own build in functionality to vectorize the data. " 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "metadata": {}, 18 | "source": [ 19 | "Please keep in mind that we differentiate between `sparse` and `dense` matrixes. The following visualization may help you understand the difference. 
" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 2, 25 | "metadata": {}, 26 | "outputs": [ 27 | { 28 | "data": { 29 | "text/html": [ 30 | "" 31 | ], 32 | "text/plain": [ 33 | "" 34 | ] 35 | }, 36 | "metadata": {}, 37 | "output_type": "display_data" 38 | } 39 | ], 40 | "source": [ 41 | "from IPython.display import display, Image\n", 42 | "url = \"https://miro.medium.com/v2/resize:fit:4800/format:webp/1*1LLMA9VGH6x8mRKqT-Mhtw.gif\"\n", 43 | "# Display the GIF\n", 44 | "display(Image(url=url))" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 1, 50 | "id": "d6288fa8", 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "import pandas as pd\n", 55 | "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer" 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "id": "9efbfbd6", 61 | "metadata": {}, 62 | "source": [ 63 | "## Example 1: Inspect the output of a vectorizer in a dense format\n", 64 | "\n", 65 | "The following code cell will fit and transform three documents using a `Count`-based vectorizer. Next, the output is transformed to a *dense* matrix, and printed. \n", 66 | "\n", 67 | "1. Do you understand the output?\n", 68 | "2. Is it smart to transform output to a dense format? What will happen if you work with millions of documents, rather than 3 short sentences?\n", 69 | "3. what happens if you replace `CountVectorizer()` for `TfidfVectorizer()`?" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 2, 75 | "id": "49495cfd", 76 | "metadata": {}, 77 | "outputs": [ 78 | { 79 | "name": "stdout", 80 | "output_type": "stream", 81 | "text": [ 82 | " are everybody hello how students today what you\n", 83 | "0 0 0 1 0 1 0 0 0\n", 84 | "1 1 0 0 1 0 1 0 1\n", 85 | "2 0 0 0 0 0 0 1 0\n", 86 | "3 0 1 2 0 0 0 0 0\n" 87 | ] 88 | } 89 | ], 90 | "source": [ 91 | "texts = [\"hello students!\", \"how are you today?\", \"what?\", \"hello hello everybody\"]\n", 92 | "vect = CountVectorizer()# initialize the vectorizer\n", 93 | "\n", 94 | "X = vect.fit_transform(texts) #fit the vectorizer and transform the documents in one go\n", 95 | "print(pd.DataFrame(X.A, columns=vect.get_feature_names_out()).to_string())\n", 96 | "df = pd.DataFrame(X.toarray().transpose(), index = vect.get_feature_names_out())" 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "id": "72b8d55e", 102 | "metadata": {}, 103 | "source": [ 104 | "## Example 2: Inspect the output of a vectorizer in a sparse format\n", 105 | "\n", 106 | "Internally, `sklearn` represents the data in a *sparse* format, as this is computationally more efficient, and less memory is required.\n" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": 3, 112 | "id": "88bfaeba", 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [ 116 | "texts = [\"hello students!\", \"how are you today?\", \"what?\", \"hello hello everybody\"]\n", 117 | "count_vec = CountVectorizer() #initilize the vectorizer\n", 118 | "count_vec_fit = count_vec.fit_transform(texts) #fit the vectorizer and transform the documents in one go" 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "id": "e95a380b", 124 | "metadata": {}, 125 | "source": [ 126 | " 1.Inspect the shape of transformed texts. 
We can see that we have a 4x8 sparse matrix, meaning that we have 4 \n", 127 | " rows (=documents) and 8 unique tokens (=words, numbers)\n" 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": 4, 133 | "id": "d9363fb0", 134 | "metadata": {}, 135 | "outputs": [ 136 | { 137 | "data": { 138 | "text/plain": [ 139 | "<4x8 sparse matrix of type ''\n", 140 | "\twith 9 stored elements in Compressed Sparse Row format>" 141 | ] 142 | }, 143 | "execution_count": 4, 144 | "metadata": {}, 145 | "output_type": "execute_result" 146 | } 147 | ], 148 | "source": [ 149 | "count_vec_fit" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "id": "64e134c2", 155 | "metadata": {}, 156 | "source": [ 157 | " 2.Get the feature names. This will return the tokens that are in the vocabulary of the vectorizer" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": 5, 163 | "id": "b9c92ac2", 164 | "metadata": {}, 165 | "outputs": [ 166 | { 167 | "data": { 168 | "text/plain": [ 169 | "array(['are', 'everybody', 'hello', 'how', 'students', 'today', 'what',\n", 170 | " 'you'], dtype=object)" 171 | ] 172 | }, 173 | "execution_count": 5, 174 | "metadata": {}, 175 | "output_type": "execute_result" 176 | } 177 | ], 178 | "source": [ 179 | "count_vec.get_feature_names_out()" 180 | ] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "id": "14c6b9a0", 185 | "metadata": {}, 186 | "source": [ 187 | " 3. Represent the token's mapping to it's id values. The numbers do *not* represent the count of the words but the position of the words in the matrix" 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": 6, 193 | "id": "0cf16fdc", 194 | "metadata": {}, 195 | "outputs": [ 196 | { 197 | "data": { 198 | "text/plain": [ 199 | "{'hello': 2,\n", 200 | " 'students': 4,\n", 201 | " 'how': 3,\n", 202 | " 'are': 0,\n", 203 | " 'you': 7,\n", 204 | " 'today': 5,\n", 205 | " 'what': 6,\n", 206 | " 'everybody': 1}" 207 | ] 208 | }, 209 | "execution_count": 6, 210 | "metadata": {}, 211 | "output_type": "execute_result" 212 | } 213 | ], 214 | "source": [ 215 | "count_vec.vocabulary_ " 216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "id": "d4f3fb63", 221 | "metadata": {}, 222 | "source": [ 223 | " 4. Get sparse representation on document level" 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": 7, 229 | "id": "1a70295b", 230 | "metadata": {}, 231 | "outputs": [ 232 | { 233 | "name": "stdout", 234 | "output_type": "stream", 235 | "text": [ 236 | "hello students!\n", 237 | " (0, 2)\t1\n", 238 | " (0, 4)\t1\n", 239 | "\n", 240 | "how are you today?\n", 241 | " (0, 3)\t1\n", 242 | " (0, 0)\t1\n", 243 | " (0, 7)\t1\n", 244 | " (0, 5)\t1\n", 245 | "\n", 246 | "what?\n", 247 | " (0, 6)\t1\n", 248 | "\n", 249 | "hello hello everybody\n", 250 | " (0, 2)\t2\n", 251 | " (0, 1)\t1\n", 252 | "\n" 253 | ] 254 | } 255 | ], 256 | "source": [ 257 | "for i, document in zip(count_vec_fit, texts):\n", 258 | " print(document)\n", 259 | " print(i)\n", 260 | " print()" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "id": "52645b99", 266 | "metadata": {}, 267 | "source": [ 268 | "a. Do you understand the output printed above? \n", 269 | "b. What happens if you change the `count` to a `tfidf` vectorizer? 
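One way to check (a small sketch reusing the toy texts from above): swap in `TfidfVectorizer` and inspect the dense output; the cells then hold tf-idf weights instead of raw counts.

```python
# Sketch: the same toy corpus, vectorized with tf-idf weights instead of counts.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["hello students!", "how are you today?", "what?", "hello hello everybody"]
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(texts)

# Each cell is now a weight: terms that occur in fewer documents get boosted.
print(pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vec.get_feature_names_out()))
```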
" 270 | ] 271 | } 272 | ], 273 | "metadata": { 274 | "kernelspec": { 275 | "display_name": "Python 3 (ipykernel)", 276 | "language": "python", 277 | "name": "python3" 278 | }, 279 | "language_info": { 280 | "codemirror_mode": { 281 | "name": "ipython", 282 | "version": 3 283 | }, 284 | "file_extension": ".py", 285 | "mimetype": "text/x-python", 286 | "name": "python", 287 | "nbconvert_exporter": "python", 288 | "pygments_lexer": "ipython3", 289 | "version": "3.9.6" 290 | } 291 | }, 292 | "nbformat": 4, 293 | "nbformat_minor": 5 294 | } 295 | -------------------------------------------------------------------------------- /2023/day4/exercises-vectorizers/exercise-text-to-features.md: -------------------------------------------------------------------------------- 1 | # Exercise 1: Working with textual data 2 | 3 | ### 0. Get the data. 4 | 5 | - Download `articles.tar.gz` from 6 | https://dx.doi.org/10.7910/DVN/ULHLCB 7 | 8 | If you experience difficulties downloading this (rather large) dataset, you can also download just a part of the data [here](https://surfdrive.surf.nl/files/index.php/s/bfNFkuUVoVtiyuk) 9 | 10 | - Unpack it. On Linux and MacOS, you can do this with `tar -xzf mydata.tar.gz` on the command line. On Windows, you may need an additional tool such as `7zip` for that (note that technically speaking, there is a `tar` archive within a `gz` archive, so unpacking may take *two* steps depending on your tool). 11 | 12 | 13 | ### 1. Inspect the structure of the dataset. 14 | What information do the following elements give you? 15 | 16 | - folder (directory) names 17 | - folder structure/hierarchy 18 | - file names 19 | - file contents 20 | 21 | ### 2. Discuss strategies for working with this dataset! 22 | 23 | - Which questions could you answer? 24 | - How could you deal with it, given the size and the structure? 25 | - How much memory1 (RAM) does your computer have? How large is the complete dataset? What does that mean? 26 | - Make a sketch (e.g., with pen&paper), how you could handle your workflow and your data to answer your question. 27 | 28 | 1 *memory* (RAM), not *storage* (harddisk)! 29 | 30 | ### 3. Read some (or all?) data 31 | 32 | Here is some example code that you can modify. Assuming that he folder `articles` is in the same folder as the notebook you are currently working on, you could, for instance, do the following to read a *part* of your dataset. 33 | 34 | ```python 35 | from glob import glob 36 | infowarsfiles = glob('articles/*/Infowars/*') 37 | infowarsarticles = [] 38 | for filename in infowarsfiles: 39 | with open(filename) as f: 40 | infowarsarticles.append(f.read()) 41 | 42 | ``` 43 | 44 | - Can you explain what the `glob` function does? 45 | - What does `infowarsfiles` contain, and what does `infowarsarticles` contain? First make an educated guess based on the code snippet, then check it! Do *not* print the whole thing, but use `len`, `type` en slicing `[:10]` to get the info ou need. 46 | 47 | - Tip: take a random sample of the articles for practice purposes (if your code works, you can scale up!) 48 | 49 | ``` 50 | # taking a random sample of the articles for practice purposes 51 | articles =random.sample(infowarsarticles, 10) 52 | ``` 53 | 54 | ### 4. Vectorize the data 55 | 56 | Imagine you want to train a classifier that will predict whether articles come from a fake news source (e.g., `Infowars`) or a quality news outlet (e.g., `bbc`). In other words, you want to predict `source` based on linguistic variations in the articles. 
57 | 58 | To arrive at a model that will do just that, you have to transform 'text' to 'features'. 59 | 60 | - Can you vectorize the data? Try defining different vectorizers. Consider the following options: 61 | - `count` vs. `tfidf` vectorizers 62 | - with/ without pruning 63 | - with/ without stopword removal 64 | 65 | ### 5. Fit a classifier 66 | 67 | - Try out a simple supervised model. Find some inspiration [here](possible-solution-exercise-day1.md). Can you predict the `source` using linguistic variations in the articles? 68 | 69 | - Which combination of pre-processing steps + vectorizer gives the best results? 70 | 71 | ### BONUS: Inceasing efficiency + reusability 72 | The approach under (3) gets you very far. 73 | But for those of you who want to go the extra mile, here are some suggestions for further improvements in handling such a large dataset, consisting of thousands of files, and for deeper thinking about data handling: 74 | 75 | - Consider writing a function to read the data. Let your function take three parameters as input, `basepath` (where is the folder with articles located?), `month` and `outlet`, and return the articles that match this criterion. 76 | - Even better, make it a *generator* that yields the articles instead of returning a whole list. 77 | - Consider yielding a dict (with date, outlet, and the article itself) instead of yielding only the article text. 78 | - Think of the most memory-efficient way to get an overview of how often a given regular expression R is mentioned per outlet! 79 | - Under which circumstances would you consider having your function for reading the data return a pandas dataframe? 80 | -------------------------------------------------------------------------------- /2023/day4/exercises-vectorizers/possible-solution-exercise-day1.md: -------------------------------------------------------------------------------- 1 | ## Exercise 1 Working with textual data - possible solutions 2 | 3 | ---------- 4 | 5 | ### Vectorize the data 6 | 7 | ```python 8 | from glob import glob 9 | import random 10 | 11 | def read_data(listofoutlets): 12 | texts = [] 13 | labels = [] 14 | for label in listofoutlets: 15 | for file in glob(f'articles/*/{label}/*'): 16 | with open(file) as f: 17 | texts.append(f.read()) 18 | labels.append(label) 19 | return texts, labels 20 | 21 | X, y = read_data(['Infowars', 'BBC']) #choose your own newsoutlets 22 | 23 | ``` 24 | 25 | 26 | ```python 27 | #split the dataset in a train and test sample 28 | from sklearn.model_selection import train_test_split 29 | X_train,X_test,y_train,y_test=train_test_split(X, y, test_size=0.2) 30 | ``` 31 | 32 | Define some vectorizers. 33 | You can try out different variations: 34 | - `count` versus `tfidf` 35 | - with/ without a stopword list 36 | - with / without pruning 37 | 38 | 39 | ```python 40 | from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer 41 | 42 | myvectorizer= CountVectorizer(stop_words=mystopwords) # you can further modify this yourself. 43 | 44 | #Fit the vectorizer, and transform. 
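# note (sketch): `mystopwords` is assumed to be defined beforehand, e.g.
# mystopwords = ["the", "a", "and", "of", "in"]
# some variations you could swap in for comparison (TfidfVectorizer is already
# imported above):
#   myvectorizer = TfidfVectorizer(stop_words=mystopwords)   # tf-idf weighting
#   myvectorizer = CountVectorizer(min_df=5, max_df=0.75)    # with pruning
#   myvectorizer = TfidfVectorizer()                          # no stopwords, no pruning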
45 | X_features_train = myvectorizer.fit_transform(X_train) 46 | X_features_test = myvectorizer.transform(X_test) 47 | 48 | ``` 49 | ### Build a simple classifier 50 | 51 | Now, lets build a simple classifier and predict outlet based on textual features: 52 | 53 | ```python 54 | from sklearn.naive_bayes import MultinomialNB 55 | from sklearn.metrics import accuracy_score 56 | from sklearn.metrics import classification_report 57 | 58 | model = MultinomialNB() 59 | model.fit(X_features_train, y_train) 60 | y_pred = model.predict(X_features_test) 61 | 62 | print(f"Accuracy : {accuracy_score(y_test, y_pred)}") 63 | print(classification_report(y_test, y_pred)) 64 | 65 | ``` 66 | 67 | Can you improve this classifier when using different vectorizers? 68 | -------------------------------------------------------------------------------- /2023/day4/regex_examples.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "a4bf8020", 6 | "metadata": {}, 7 | "source": [ 8 | "`.` matches any character\n", 9 | "\n", 10 | "`*` the expression before occurs 0 or more times\n", 11 | "\n", 12 | "`+` the expression before occurs 1 or more times\n", 13 | " " 14 | ] 15 | }, 16 | { 17 | "cell_type": "code", 18 | "execution_count": 4, 19 | "id": "5830422c", 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "import re" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": 63, 29 | "id": "ac3fad21", 30 | "metadata": {}, 31 | "outputs": [ 32 | { 33 | "data": { 34 | "text/plain": [ 35 | "'** ** you hello'" 36 | ] 37 | }, 38 | "execution_count": 63, 39 | "metadata": {}, 40 | "output_type": "execute_result" 41 | } 42 | ], 43 | "source": [ 44 | "pattern = r\"[A-Z]+\"\n", 45 | "example_string = \"HOW ARE you hello\"\n", 46 | "subs = \"**\"\n", 47 | "\n", 48 | "re.sub(pattern, subs, example_string)" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": 72, 54 | "id": "444084cc", 55 | "metadata": {}, 56 | "outputs": [ 57 | { 58 | "data": { 59 | "text/plain": [ 60 | "[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '!!']" 61 | ] 62 | }, 63 | "execution_count": 72, 64 | "metadata": {}, 65 | "output_type": "execute_result" 66 | } 67 | ], 68 | "source": [ 69 | "#pattern= r\"[^a-zA-Z]+\" #other empty strings represent the sequences of characters that are not alphabetic characters but are separated by spaces\n", 70 | "#pattern = r\"\\d+\"\n", 71 | "pattern = r\"\\W+\"\n", 72 | "example_string = \"a sentence with stuff in 052953 and so forth!!\"\n", 73 | "\n", 74 | "re.findall(pattern, example_string)" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 13, 80 | "id": "b59362d5", 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [ 84 | "## r indicates that this is a raw string: backslashes are treated as literal characters and not escape characters" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 78, 90 | "id": "dd55b547", 91 | "metadata": {}, 92 | "outputs": [ 93 | { 94 | "data": { 95 | "text/plain": [ 96 | "['RT @TimSenders']" 97 | ] 98 | }, 99 | "execution_count": 78, 100 | "metadata": {}, 101 | "output_type": "execute_result" 102 | } 103 | ], 104 | "source": [ 105 | "pattern = 'RT ?:? 
@[a-zA-Z]*'\n", 106 | "example_string = 'iewjogejiwojg RT @TimSenders395 iegwjo'\n", 107 | "\n", 108 | "re.findall(pattern, example_string)" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 88, 114 | "id": "0ad26e5f", 115 | "metadata": {}, 116 | "outputs": [ 117 | { 118 | "data": { 119 | "text/plain": [ 120 | "['ABN', 'amro', 'ABNAMRO', 'abn amro']" 121 | ] 122 | }, 123 | "execution_count": 88, 124 | "metadata": {}, 125 | "output_type": "execute_result" 126 | } 127 | ], 128 | "source": [ 129 | "test_string = 'ABN and also amro. ABNAMRO and abn amro'\n", 130 | "pattern = r'\\b(ABN\\s+AMRO|ABNAMRO|ABN|AMRO)\\b'\n", 131 | "\n", 132 | "re.findall(pattern, test_string, re.IGNORECASE)" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": 80, 138 | "id": "d22dcd64", 139 | "metadata": {}, 140 | "outputs": [ 141 | { 142 | "data": { 143 | "text/plain": [ 144 | "'A DUTCH BANK and also A DUTCH BANK. A DUTCH BANK and A DUTCH BANK'" 145 | ] 146 | }, 147 | "execution_count": 80, 148 | "metadata": {}, 149 | "output_type": "execute_result" 150 | } 151 | ], 152 | "source": [ 153 | "replacement_string = 'A DUTCH BANK'\n", 154 | "re.sub(pattern, replacement_string, test_string, flags=re.IGNORECASE)" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "id": "dde41d36", 160 | "metadata": {}, 161 | "source": [ 162 | "`\\b:` Word boundary anchor.\n", 163 | "\n", 164 | "ABN`\\s+`AMRO: Matches \"ABN AMRO\" with one or more spaces in between." 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": null, 170 | "id": "15dd18de", 171 | "metadata": {}, 172 | "outputs": [], 173 | "source": [] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "id": "efbd0a77", 178 | "metadata": {}, 179 | "source": [ 180 | "## Making a custom regex application: printing matches in context" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": 17, 186 | "id": "5e04500f", 187 | "metadata": {}, 188 | "outputs": [], 189 | "source": [ 190 | "# make some example data\n", 191 | "texts = ['The top Republican on the House Intelligence Committee says he is prepared to impeach the head of the FBI and Deputy Attorney General if he doesnt get a two-page document he says prompted the Russia investigation.\\n\\nJust the fact that theyre not giving this to us tells me theres something wrong here, California Republican Rep. Devin Nunes told Fox News host Laura Ingraham on the The Ingraham Angle Tuesday night.\\n\\nI can tell you that were not just going to hold in contempt, we will have a plan to ',\n", 192 | " 'The Gulf Coast is preparing as Tropical Storm Michael developed in the Caribbean Sea and is expected to strengthen into a hurricane before making landfall around the middle of this week.\\n\\nFlorida Gov. Rick Scott ordered activation of the State Emergency Operations Center in Tallahassee to enhance coordination between federal, state and local agencies.\\n\\nOur state understands how serious tropical weather is and how devastating any hurricane or tropical storm can be, Scott said. As we continue to m',\n", 193 | " 'YouTube star Candace Owens says there is a card more valuable than VISA or AMERICAN EXPRESS called the black card.',\n", 194 | " 'Donald Trump can claim another victory after Mexican authorities agreed to disband the illegal alien caravans working their way through Mexico towards America.\\n\\nMexican immigration authorities said they plan on disbanding the Central American caravan by Wednesday in Oaxaca. 
The most vulnerable will get humanitarian visas, tweeted BuzzFeed reporter Adolfo Flores.\\n\\nEveryone else in the caravan, which has traveled through Mexico for days from Chiapas, will have to petition the Mexican government fo']" 195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": 19, 200 | "id": "d40df1dd", 201 | "metadata": {}, 202 | "outputs": [ 203 | { 204 | "name": "stdout", 205 | "output_type": "stream", 206 | "text": [ 207 | "\n" 208 | ] 209 | } 210 | ], 211 | "source": [ 212 | "# first show the principle:\n", 213 | "for r in re.finditer(r\"[A-Z][A-Z]+\", texts[0]): # words with two or more capital letters\n", 214 | " print(r)" 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": 21, 220 | "id": "f6b58a3e", 221 | "metadata": {}, 222 | "outputs": [ 223 | { 224 | "name": "stdout", 225 | "output_type": "stream", 226 | "text": [ 227 | "Now processing text number 0...\n", 228 | "ach the head of the FBI and Deputy Attorney\n", 229 | "\n", 230 | "**********************************\n", 231 | "\n", 232 | "Now processing text number 1...\n", 233 | "\n", 234 | "**********************************\n", 235 | "\n", 236 | "Now processing text number 2...\n", 237 | " more valuable than VISA or AMERICAN EXPRESS\n", 238 | "luable than VISA or AMERICAN EXPRESS called the \n", 239 | "an VISA or AMERICAN EXPRESS called the black ca\n", 240 | "\n", 241 | "**********************************\n", 242 | "\n", 243 | "Now processing text number 3...\n", 244 | "\n", 245 | "**********************************\n", 246 | "\n" 247 | ] 248 | } 249 | ], 250 | "source": [ 251 | "# let's exploit the fact that span() gives us first and last index (=position) within\n", 252 | "# the string\n", 253 | "# so we print the matched string +/- 20 characters\n", 254 | "for number, text in enumerate(texts):\n", 255 | " print(f\"Now processing text number {number}...\")\n", 256 | " for r in re.finditer(r\"[A-Z][A-Z]+\", text):\n", 257 | " print(text[r.span()[0]-20:r.span()[1]+20])\n", 258 | " print('\\n**********************************\\n')" 259 | ] 260 | } 261 | ], 262 | "metadata": { 263 | "kernelspec": { 264 | "display_name": "Python 3 (ipykernel)", 265 | "language": "python", 266 | "name": "python3" 267 | }, 268 | "language_info": { 269 | "codemirror_mode": { 270 | "name": "ipython", 271 | "version": 3 272 | }, 273 | "file_extension": ".py", 274 | "mimetype": "text/x-python", 275 | "name": "python", 276 | "nbconvert_exporter": "python", 277 | "pygments_lexer": "ipython3", 278 | "version": "3.8.10" 279 | } 280 | }, 281 | "nbformat": 4, 282 | "nbformat_minor": 5 283 | } 284 | -------------------------------------------------------------------------------- /2023/day4/slides-04-1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2023/day4/slides-04-1.pdf -------------------------------------------------------------------------------- /2023/day4/slides-04-1.tex: -------------------------------------------------------------------------------- 1 | % !TeX document-id = {f19fb972-db1f-447e-9d78-531139c30778} 2 | % !BIB program = biber 3 | 4 | %\documentclass[handout]{beamer} 5 | \documentclass[compress]{beamer} 6 | \usepackage[T1]{fontenc} 7 | \usetheme[block=fill,subsectionpage=progressbar,sectionpage=progressbar]{metropolis} 8 | \usepackage{graphicx} 9 | 10 | \usepackage{wasysym} 11 | \usepackage{etoolbox} 12 | \usepackage[utf8]{inputenc} 13 | 14 | 
\usepackage{threeparttable} 15 | \usepackage{subcaption} 16 | 17 | \usepackage{tikz-qtree} 18 | \setbeamercovered{still covered={\opaqueness<1->{5}},again covered={\opaqueness<1->{100}}} 19 | 20 | 21 | % color-coded listings; replace those above 22 | \usepackage{xcolor} 23 | \usepackage{minted} 24 | \definecolor{listingbg}{rgb}{0.87,0.93,1} 25 | \setminted[python]{ 26 | frame=none, 27 | framesep=1mm, 28 | baselinestretch=1, 29 | bgcolor=listingbg, 30 | fontsize=\scriptsize, 31 | linenos, 32 | breaklines 33 | } 34 | 35 | 36 | 37 | \usepackage{listings} 38 | 39 | \lstset{ 40 | basicstyle=\scriptsize\ttfamily, 41 | columns=flexible, 42 | breaklines=true, 43 | numbers=left, 44 | %stepsize=1, 45 | numberstyle=\tiny, 46 | backgroundcolor=\color[rgb]{0.85,0.90,1} 47 | } 48 | 49 | 50 | \lstnewenvironment{lstlistingoutput}{\lstset{basicstyle=\footnotesize\ttfamily, 51 | columns=flexible, 52 | breaklines=true, 53 | numbers=left, 54 | %stepsize=1, 55 | numberstyle=\tiny, 56 | backgroundcolor=\color[rgb]{.7,.7,.7}}}{} 57 | 58 | 59 | \lstnewenvironment{lstlistingoutputtiny}{\lstset{basicstyle=\tiny\ttfamily, 60 | columns=flexible, 61 | breaklines=true, 62 | numbers=left, 63 | %stepsize=1, 64 | numberstyle=\tiny, 65 | backgroundcolor=\color[rgb]{.7,.7,.7}}}{} 66 | 67 | 68 | 69 | \usepackage[american]{babel} 70 | \usepackage{csquotes} 71 | \usepackage[style=apa, backend = biber]{biblatex} 72 | \DeclareLanguageMapping{american}{american-UoN} 73 | \addbibresource{../../literature.bib} 74 | \renewcommand*{\bibfont}{\tiny} 75 | 76 | \usepackage{tikz} 77 | \usetikzlibrary{shapes,arrows,matrix} 78 | \usepackage{multicol} 79 | 80 | \usepackage{subcaption} 81 | 82 | \usepackage{booktabs} 83 | \usepackage{graphicx} 84 | 85 | 86 | 87 | \makeatletter 88 | \setbeamertemplate{headline}{% 89 | \begin{beamercolorbox}[colsep=1.5pt]{upper separation line head} 90 | \end{beamercolorbox} 91 | \begin{beamercolorbox}{section in head/foot} 92 | \vskip2pt\insertnavigation{\paperwidth}\vskip2pt 93 | \end{beamercolorbox}% 94 | \begin{beamercolorbox}[colsep=1.5pt]{lower separation line head} 95 | \end{beamercolorbox} 96 | } 97 | \makeatother 98 | 99 | 100 | 101 | \setbeamercolor{section in head/foot}{fg=normal text.bg, bg=structure.fg} 102 | 103 | 104 | 105 | \newcommand{\question}[1]{ 106 | \begin{frame}[plain] 107 | \begin{columns} 108 | \column{.3\textwidth} 109 | \makebox[\columnwidth]{ 110 | \includegraphics[width=\columnwidth,height=\paperheight,keepaspectratio]{../../pictures/mannetje.png}} 111 | \column{.7\textwidth} 112 | \large 113 | \textcolor{orange}{\textbf{\emph{#1}}} 114 | \end{columns} 115 | \end{frame}} 116 | 117 | 118 | 119 | \title[Teach-the-teacher: Python]{\textbf{Teach-the-teacher: Python} 120 | \\Day 4: »Processing textual data // NLP« } 121 | \author[Anne Kroon]{Anne Kroon\\ \footnotesize{a.c.kroon@uva.nl}} 122 | \date{December 4, 2023} 123 | \institute[UvA CW]{UvA RM Communication Science} 124 | 125 | 126 | \begin{document} 127 | 128 | \begin{frame}{} 129 | \titlepage{\tiny } 130 | \end{frame} 131 | 132 | \begin{frame}{Today} 133 | \tableofcontents 134 | \end{frame} 135 | 136 | 137 | \section{Bottom-up vs. top-down} 138 | 139 | \begin{frame}[standout] 140 | Automated content analysis can be either \textcolor{red}{bottom-up} (inductive, explorative, pattern recognition, \ldots) or \textcolor{red}{top-down} (deductive, based on a-priori developed rules, \ldots). Or in between. 
141 | \end{frame} 142 | 143 | 144 | \begin{frame}{The ACA toolbox} 145 | \makebox[\columnwidth]{ 146 | \includegraphics[width=\columnwidth,height=\paperheight,keepaspectratio]{../../media/boumanstrilling2016}} 147 | \\ 148 | \cite{Boumans2016} 149 | \end{frame} 150 | 151 | 152 | \begin{frame}{Bottom-up vs. top-down} 153 | \begin{block}{Bottom-up} 154 | \begin{itemize} 155 | \item Count most frequently occurring words 156 | \item Maybe better: Count combinations of words $\Rightarrow$ Which words co-occur together? 157 | \end{itemize} 158 | We \emph{don't} specify what to look for in advance 159 | \end{block} 160 | 161 | \onslide<2>{ 162 | \begin{block}{Top-down} 163 | \begin{itemize} 164 | \item Count frequencies of pre-defined words 165 | \item Maybe better: patterns instead of words 166 | \end{itemize} 167 | We \emph{do} specify what to look for in advance 168 | \end{block} 169 | } 170 | \end{frame} 171 | 172 | 173 | \begin{frame}[fragile]{A simple bottom-up approach} 174 | \begin{lstlisting} 175 | from collections import Counter 176 | 177 | texts = ["I really really really love him, I do", "I hate him"] 178 | 179 | for t in texts: 180 | print(Counter(t.split()).most_common(3)) 181 | \end{lstlisting} 182 | \begin{lstlistingoutput} 183 | [('really', 3), ('I', 2), ('love', 1)] 184 | [('I', 1), ('hate', 1), ('him', 1)] 185 | \end{lstlistingoutput} 186 | \end{frame} 187 | 188 | 189 | \begin{frame}[fragile]{A simple top-down approach} 190 | \begin{lstlisting} 191 | texts = ["I really really really love him, I do", "I hate him"] 192 | features = ['really', 'love', 'hate'] 193 | 194 | for t in texts: 195 | print(f"\nAnalyzing '{t}':") 196 | for f in features: 197 | print(f"{f} occurs {t.count(f)} times") 198 | \end{lstlisting} 199 | \begin{lstlistingoutput} 200 | Analyzing 'I really really really love him, I do': 201 | really occurs 3 times 202 | love occurs 1 times 203 | hate occurs 0 times 204 | 205 | Analyzing 'I hate him': 206 | really occurs 0 times 207 | love occurs 0 times 208 | hate occurs 1 times 209 | 210 | \end{lstlistingoutput} 211 | \end{frame} 212 | 213 | \question{When would you use which approach?} 214 | 215 | 216 | \begin{frame}{Some considerations} 217 | \begin{itemize}[<+->] 218 | \item Both can have a place in your workflow (e.g., bottom-up as first exploratory step) 219 | \item You have a clear theoretical expectation? Bottom-up makes little sense. 220 | \item But in any case: you need to transform your text into something ``countable''. 
221 | \end{itemize} 222 | \end{frame} 223 | 224 | 225 | \input{../../modules/working-with-text/basic-string-operations.tex} 226 | \input{../../modules/working-with-text/bow.tex} 227 | 228 | \begin{frame}[fragile]{General approach} 229 | \Large 230 | 231 | \textcolor{red}{Test on a single string, then make a for loop or list comprehension!} 232 | 233 | \pause 234 | 235 | \normalsize 236 | 237 | \begin{alertblock}{Own functions} 238 | If it gets more complex, you can write your ow= function and then use it in the list comprehension: 239 | \begin{lstlisting} 240 | def mycleanup(t): 241 | # do sth with string t here, create new string t2 242 | return t2 243 | 244 | results = [mycleanup(t) for t in allmytexts] 245 | \end{lstlisting} 246 | \end{alertblock} 247 | \end{frame} 248 | 249 | 250 | \begin{frame}[fragile]{Pandas string methods as alternative} 251 | If you select column with strings from a pandas dataframe, pandas offers a collection of string methods (via \texttt{.str.}) that largely mirror standard Python string methods: 252 | 253 | \begin{lstlisting} 254 | df['newcoloumnwithresults'] = df['columnwithtext'].str.count("bla") 255 | \end{lstlisting} 256 | 257 | 258 | \pause 259 | 260 | \begin{alertblock}{To pandas or not to pandas for text?} 261 | Partly a matter of taste. 262 | 263 | Not-too-large dataset with a lot of extra columns? Advanced statistical analysis planned? Sounds like pandas. 264 | 265 | It's mainly a lot of text? Wanna do some machine learning later on anyway? It's large and (potentially) messy? Doesn't sound like pandas is a good idea. 266 | \end{alertblock} 267 | 268 | \end{frame} 269 | 270 | 271 | 272 | %\begin{frame}[plain] 273 | % \printbibliography 274 | %\end{frame} 275 | 276 | 277 | 278 | \end{document} 279 | 280 | 281 | 282 | -------------------------------------------------------------------------------- /2023/day4/slides-04-2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2023/day4/slides-04-2.pdf -------------------------------------------------------------------------------- /2023/day4/slides-04-2.tex: -------------------------------------------------------------------------------- 1 | % !TeX document-id = {f19fb972-db1f-447e-9d78-531139c30778} 2 | % !BIB program = biber 3 | 4 | %\documentclass[handout]{beamer} 5 | \documentclass[compress]{beamer} 6 | \usepackage[T1]{fontenc} 7 | \usetheme[block=fill,subsectionpage=progressbar,sectionpage=progressbar]{metropolis} 8 | \usepackage{graphicx} 9 | 10 | \usepackage{wasysym} 11 | \usepackage{etoolbox} 12 | \usepackage[utf8]{inputenc} 13 | 14 | \usepackage{threeparttable} 15 | \usepackage{subcaption} 16 | 17 | \usepackage{tikz-qtree} 18 | \setbeamercovered{still covered={\opaqueness<1->{5}},again covered={\opaqueness<1->{100}}} 19 | 20 | 21 | \usepackage{listings} 22 | 23 | \lstset{ 24 | basicstyle=\scriptsize\ttfamily, 25 | columns=flexible, 26 | breaklines=true, 27 | numbers=left, 28 | %stepsize=1, 29 | numberstyle=\tiny, 30 | backgroundcolor=\color[rgb]{0.85,0.90,1} 31 | } 32 | 33 | 34 | 35 | \lstnewenvironment{lstlistingoutput}{\lstset{basicstyle=\footnotesize\ttfamily, 36 | columns=flexible, 37 | breaklines=true, 38 | numbers=left, 39 | %stepsize=1, 40 | numberstyle=\tiny, 41 | backgroundcolor=\color[rgb]{.7,.7,.7}}}{} 42 | 43 | 44 | \lstnewenvironment{lstlistingoutputtiny}{\lstset{basicstyle=\tiny\ttfamily, 45 | columns=flexible, 46 | breaklines=true, 47 | numbers=left, 48 
| %stepsize=1, 49 | numberstyle=\tiny, 50 | backgroundcolor=\color[rgb]{.7,.7,.7}}}{} 51 | 52 | 53 | 54 | \usepackage[american]{babel} 55 | \usepackage{csquotes} 56 | \usepackage[style=apa, backend = biber]{biblatex} 57 | \DeclareLanguageMapping{american}{american-UoN} 58 | \addbibresource{../references.bib} 59 | \renewcommand*{\bibfont}{\tiny} 60 | 61 | \usepackage{tikz} 62 | \usetikzlibrary{shapes,arrows,matrix} 63 | \usepackage{multicol} 64 | 65 | \usepackage{subcaption} 66 | 67 | \usepackage{booktabs} 68 | \usepackage{graphicx} 69 | 70 | 71 | 72 | \makeatletter 73 | \setbeamertemplate{headline}{% 74 | \begin{beamercolorbox}[colsep=1.5pt]{upper separation line head} 75 | \end{beamercolorbox} 76 | \begin{beamercolorbox}{section in head/foot} 77 | \vskip2pt\insertnavigation{\paperwidth}\vskip2pt 78 | \end{beamercolorbox}% 79 | \begin{beamercolorbox}[colsep=1.5pt]{lower separation line head} 80 | \end{beamercolorbox} 81 | } 82 | \makeatother 83 | 84 | 85 | 86 | \setbeamercolor{section in head/foot}{fg=normal text.bg, bg=structure.fg} 87 | 88 | 89 | 90 | \newcommand{\question}[1]{ 91 | \begin{frame}[plain] 92 | \begin{columns} 93 | \column{.3\textwidth} 94 | \makebox[\columnwidth]{ 95 | \includegraphics[width=\columnwidth,height=\paperheight,keepaspectratio]{../pictures/mannetje.png}} 96 | \column{.7\textwidth} 97 | \large 98 | \textcolor{orange}{\textbf{\emph{#1}}} 99 | \end{columns} 100 | \end{frame}} 101 | 102 | 103 | \title[Teach-the-teacher: Python]{\textbf{Teach-the-teacher: Python} 104 | \\Day 4: » Advanced NLP \& Regular Expressions « } 105 | \author[Anne Kroon]{Anne Kroon\\ \footnotesize{a.c.kroon@uva.nl}} 106 | \date{December 4, 2023} 107 | \institute[UvA CW]{UvA RM Communication Science} 108 | 109 | 110 | \begin{document} 111 | 112 | \begin{frame}{} 113 | \titlepage 114 | \end{frame} 115 | 116 | \begin{frame}{Today} 117 | \tableofcontents 118 | \end{frame} 119 | 120 | 121 | \section{Advanced NLP} 122 | 123 | \subsection{Parsing sentences} 124 | \begin{frame}{NLP: What and why?} 125 | \begin{block}{Why parse sentences?} 126 | \begin{itemize} 127 | \item To find out what grammatical function words have 128 | \item and to get closer to the meaning. 129 | \end{itemize} 130 | \end{block} 131 | \end{frame} 132 | 133 | \begin{frame}[fragile]{Parsing a sentence using NLTK} 134 | Tokenize a sentence, and ``tag'' the tokenized sentence: 135 | \begin{lstlisting} 136 | tokens = nltk.word_tokenize(sentence) 137 | tagged = nltk.pos_tag(tokens) 138 | print (tagged[0:6]) 139 | \end{lstlisting} 140 | gives you the following: 141 | \begin{lstlisting} 142 | [('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'), 143 | ('Thursday', 'NNP'), ('morning', 'NN')] 144 | \end{lstlisting} 145 | 146 | \onslide<2->{ 147 | And you could get the word type of "morning" with \texttt{tagged[5][1]}! 148 | } 149 | 150 | \end{frame} 151 | 152 | 153 | \begin{frame}[fragile]{Named Entity Recognition with spacy} 154 | Terminal: 155 | 156 | \begin{lstlisting} 157 | sudo pip3 install spacy 158 | sudo python3 -m spacy download nl # or en, de, fr .... 
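# note: in newer spaCy versions (v3+), models are addressed by their full
# pipeline name rather than the bare language code, e.g.
# sudo python3 -m spacy download nl_core_news_sm
# (and then spacy.load("nl_core_news_sm") in the Python code below)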
159 | \end{lstlisting} 160 | 161 | Python: 162 | 163 | \begin{lstlisting} 164 | import spacy 165 | nlp = spacy.load('nl') 166 | doc = nlp('Een 38-jarige vrouw uit Zeist en twee mannen moeten 24 maanden de cel in voor de gecoordineerde oplichting van Rabobank-klanten.') 167 | for ent in doc.ents: 168 | print(ent.text,ent.label_) 169 | \end{lstlisting} 170 | 171 | returns: 172 | 173 | \begin{lstlisting} 174 | Zeist LOC 175 | Rabobank ORG 176 | \end{lstlisting} 177 | 178 | \end{frame} 179 | 180 | 181 | 182 | \begin{frame}{More NLP} 183 | \url{http://nlp.stanford.edu} 184 | \url{http://spacy.io} 185 | \url{http://nltk.org} 186 | \url{https://www.clips.uantwerpen.be/pattern} 187 | \end{frame} 188 | 189 | 190 | 191 | \begin{frame}{Main takeaway} 192 | 193 | \begin{itemize} 194 | % \item It matters how you transform your text into numbers (``vectorization''). 195 | \item Preprocessing matters, be able to make informed choices. 196 | \item Keep this in mind when moving to Machine Learning. 197 | \end{itemize} 198 | \end{frame} 199 | 200 | 201 | \section[Regular expressions]{ACA using regular expressions} 202 | 203 | \begin{frame} 204 | Automated content analysis using regular expressions 205 | \end{frame} 206 | 207 | 208 | \subsection{What is a regexp?} 209 | \begin{frame}{Regular Expressions: What and why?} 210 | \begin{block}{What is a regexp?} 211 | \begin{itemize} 212 | \item<1-> a \emph{very} widespread way to describe patterns in strings 213 | \item<2-> Think of wildcards like {\tt{*}} or operators like {\tt{OR}}, {\tt{AND}} or {\tt{NOT}} in search strings: a regexp does the same, but is \emph{much} more powerful 214 | \item<3-> You can use them in many editors (!), in the Terminal, in STATA \ldots and in Python 215 | \end{itemize} 216 | \end{block} 217 | \end{frame} 218 | 219 | \begin{frame}{An example} 220 | \begin{block}{Regex example} 221 | \begin{itemize} 222 | \item Let's say we wanted to remove everything but words from a tweet 223 | \item We could do so by calling the \texttt{.replace()} method 224 | \item We could do this with a regular expression as well: \\ 225 | {\tt{ \lbrack \^{}a-zA-Z\rbrack}} would match anything that is not a letter 226 | \end{itemize} 227 | \end{block} 228 | \end{frame} 229 | 230 | \begin{frame}{Basic regexp elements} 231 | \begin{block}{Alternatives} 232 | \begin{description} 233 | \item[{\tt{\lbrack TtFf\rbrack}}] matches either T or t or F or f 234 | \item[{\tt{Twitter|Facebook}}] matches either Twitter or Facebook 235 | \item[{\tt{.}}] matches any character 236 | \end{description} 237 | \end{block} 238 | \begin{block}{Repetition}<2-> 239 | \begin{description} 240 | \item[{\tt{*}}] the expression before occurs 0 or more times 241 | \item[{\tt{+}}] the expression before occurs 1 or more times 242 | \end{description} 243 | \end{block} 244 | \end{frame} 245 | 246 | \begin{frame}{regexp quizz} 247 | \begin{block}{Which words would be matched?} 248 | \tt 249 | \begin{enumerate} 250 | \item<1-> \lbrack Pp\rbrack ython 251 | \item<2-> \lbrack A-Z\rbrack + 252 | \item<3-> RT ?:? @\lbrack a-zA-Z0-9\rbrack * 253 | \end{enumerate} 254 | \end{block} 255 | \end{frame} 256 | 257 | \begin{frame}{What else is possible?} 258 | See the table in the book! 
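A few common building blocks you will run into (not exhaustive):
\begin{description}
\item[{\tt{?}}] the expression before occurs 0 or 1 times
\item[{\tt{\{2,5\}}}] the expression before occurs 2 to 5 times
\item[{\tt{\lbrack \^{}a-z\rbrack}}] matches anything that is \emph{not} a lowercase letter
\item[{\tt{( )}}] groups (and captures) part of a pattern
\end{description}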
259 | \end{frame} 260 | 261 | \subsection{Using a regexp in Python} 262 | \begin{frame}{How to use regular expressions in Python} 263 | \begin{block}{The module \texttt{re}*} 264 | \begin{description} 265 | \item<1->[{\tt{re.findall("\lbrack Tt\rbrack witter|\lbrack Ff\rbrack acebook",testo)}}] returns a list with all occurrences of Twitter or Facebook in the string called {\tt{testo}} 266 | \item<1->[{\tt{re.findall("\lbrack 0-9\rbrack +\lbrack a-zA-Z\rbrack +",testo)}}] returns a list with all words that start with one or more numbers followed by one or more letters in the string called {\tt{testo}} 267 | \item<2->[{\tt{re.sub("\lbrack Tt\rbrack witter|\lbrack Ff\rbrack acebook","a social medium",testo)}}] returns a string in which all occurrences of Twitter or Facebook are replaced by "a social medium" 268 | \end{description} 269 | \end{block} 270 | 271 | \tiny{Use the less-known but more powerful module \texttt{regex} instead to support all dialects used in the book} 272 | \end{frame} 273 | 274 | 275 | \begin{frame}[fragile]{How to use regular expressions in Python} 276 | \begin{block}{The module re} 277 | \begin{description} 278 | \item<1->[{\tt{re.match(" +(\lbrack 0-9\rbrack +) of (\lbrack 0-9\rbrack +) points",line)}}] returns \texttt{None} unless it \emph{exactly} matches the string \texttt{line}. If it does, you can access the part between () with the \texttt{.group()} method. 279 | \end{description} 280 | \end{block} 281 | 282 | Example: 283 | \begin{lstlisting} 284 | line=" 2 of 25 points" 285 | result=re.match(" +([0-9]+) of ([0-9]+) points",line) 286 | if result: 287 | print(f"Your points: {result.group(1)}, Maximum points: {result.group(2)}") 288 | \end{lstlisting} 289 | Your points: 2, Maximum points: 25 290 | \end{frame} 291 | 292 | 293 | 294 | \begin{frame}{Possible applications} 295 | \begin{block}{Data preprocessing} 296 | \begin{itemize} 297 | \item Remove unwanted characters, words, \ldots 298 | \item Identify \emph{meaningful} bits of text: usernames, headlines, where an article starts, \ldots 299 | \item Filter (distinguish relevant from irrelevant cases) 300 | \end{itemize} 301 | \end{block} 302 | \end{frame} 303 | 304 | 305 | \begin{frame}{Possible applications} 306 | \begin{block}{Data analysis: Automated coding} 307 | \begin{itemize} 308 | \item Actors 309 | \item Brands 310 | \item Links or other markers that follow a regular pattern 311 | \item Numbers (!) 
312 | \end{itemize} 313 | \end{block} 314 | \end{frame} 315 | 316 | \begin{frame}[fragile,plain]{Example 1: Counting actors} 317 | \begin{lstlisting} 318 | import re, csv 319 | from glob import glob 320 | count1_list=[] 321 | count2_list=[] 322 | filename_list = glob("/home/damian/articles/*.txt") 323 | 324 | for fn in filename_list: 325 | with open(fn) as fi: 326 | artikel = fi.read() 327 | artikel = artikel.replace('\n',' ') 328 | 329 | count1 = len(re.findall('Israel.*(minister|politician.*|[Aa]uthorit)',artikel)) 330 | count2 = len(re.findall('[Pp]alest',artikel)) 331 | 332 | count1_list.append(count1) 333 | count2_list.append(count2) 334 | 335 | output=zip(filename_list,count1_list, count2_list) 336 | with open("results.csv", mode='w',encoding="utf-8") as fo: 337 | writer = csv.writer(fo) 338 | writer.writerows(output) 339 | \end{lstlisting} 340 | \end{frame} 341 | 342 | 343 | 344 | 345 | \begin{frame}[fragile]{Example 2: Which number has this Lexis Nexis article?} 346 | \begin{lstlisting} 347 | All Rights Reserved 348 | 349 | 2 of 200 DOCUMENTS 350 | 351 | De Telegraaf 352 | 353 | 21 maart 2014 vrijdag 354 | 355 | Brussel bereikt akkoord aanpak probleembanken; 356 | ECB krijgt meer in melk te brokkelen 357 | 358 | SECTION: Finance; Blz. 24 359 | LENGTH: 660 woorden 360 | 361 | BRUSSEL Europa heeft gisteren op de valreep een akkoord bereikt 362 | over een saneringsfonds voor banken. Daarmee staat de laatste 363 | \end{lstlisting} 364 | 365 | \end{frame} 366 | 367 | \begin{frame}[fragile]{Example 2: Check the number of a lexis nexis article} 368 | \begin{lstlisting} 369 | All Rights Reserved 370 | 371 | 2 of 200 DOCUMENTS 372 | 373 | De Telegraaf 374 | 375 | 21 maart 2014 vrijdag 376 | 377 | Brussel bereikt akkoord aanpak probleembanken; 378 | ECB krijgt meer in melk te brokkelen 379 | 380 | SECTION: Finance; Blz. 24 381 | LENGTH: 660 woorden 382 | 383 | BRUSSEL Europa heeft gisteren op de valreep een akkoord bereikt 384 | over een saneringsfonds voor banken. Daarmee staat de laatste 385 | \end{lstlisting} 386 | 387 | \begin{lstlisting} 388 | for line in tekst: 389 | matchObj=re.match(r" +([0-9]+) of ([0-9]+) DOCUMENTS",line) 390 | if matchObj: 391 | numberofarticle= int(matchObj.group(1)) 392 | totalnumberofarticles= int(matchObj.group(2)) 393 | \end{lstlisting} 394 | \end{frame} 395 | 396 | 397 | \begin{frame}{Practice yourself!} 398 | Let's take some time to write some regular expressions. 
399 | Write a script that 400 | \begin{itemize} 401 | \item extracts URLS form a list of strings 402 | \item removes everything that is not a letter or number from a list of strings 403 | \end{itemize} 404 | (first develop it for a single string, then scale up) 405 | 406 | More tips: 407 | \huge{\url{http://www.pyregex.com/}} 408 | \end{frame} 409 | 410 | 411 | 412 | %\begin{frame}[plain] 413 | % \printbibliography 414 | %\end{frame} 415 | 416 | 417 | 418 | \end{document} 419 | 420 | 421 | 422 | -------------------------------------------------------------------------------- /2023/day5/Day 5 - Machine Learning - Afternoon.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2023/day5/Day 5 - Machine Learning - Afternoon.pdf -------------------------------------------------------------------------------- /2023/day5/Day 5 - Machine Learning - Morning.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2023/day5/Day 5 - Machine Learning - Morning.pdf -------------------------------------------------------------------------------- /2023/day5/Day 5 Take-aways.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "80deb237-7fb5-4ecd-8186-05b1dfde3014", 6 | "metadata": {}, 7 | "source": [ 8 | "# Supervised Machine Learning \n", 9 | "\n", 10 | "## Main take aways:\n", 11 | "* Congrats on obtaining your driver's license! Now, please get out on the road and learn how to drive :) \n", 12 | "* Don't be impressed - you can certainly do it.\n", 13 | "* Just because you can do it does not mean you should do it.\n", 14 | "* All decisions regarding the SML process are arbritrary. The right choice is the one you can argue for best.\n", 15 | "* Don't reinvent the wheel\n", 16 | "* Google the error messages\n", 17 | " \n", 18 | "\n", 19 | "## More resources\n", 20 | "\n", 21 | "#### Overfitting\n", 22 | "https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/\n", 23 | "\n", 24 | "#### Hyperparameter tuning \n", 25 | "https://machinelearningmastery.com/hyperparameter-optimization-with-random-search-and-grid-search/\n", 26 | "\n", 27 | "#### Datasets and challenges\n", 28 | "https://www.kaggle.com/\n", 29 | "\n", 30 | "\n", 31 | "## Recommended readings\n", 32 | "\n", 33 | "#### Van Atteveldt et al. (book for Python learners)\n", 34 | "Van Atteveldt, W., Trilling, D., & Calderón, C. A. (2022). Computational Analysis of Communication. Wiley Blackwell.\n", 35 | "https://cssbook.net/\n", 36 | "\n", 37 | "#### Zhang et al. (Paper about shooting victims and thoughts and prayers in Tweets)\n", 38 | "Zhang, Y., Shah, D., Foley, J., Abhishek, A., Lukito, J., Suk, J., ... & Garlough, C. (2019). Whose lives matter? Mass shootings and social media discourses of sympathy and policy, 2012–2014. Journal of Computer-Mediated Communication, 24(4), 182-202.\n", 39 | "https://doi.org/10.1093/jcmc/zmz009\n", 40 | "\n", 41 | "#### Meppelink et al. (Paper about online health info and reliability)\n", 42 | "Meppelink, C. S., Hendriks, H., Trilling, D., van Weert, J. C., Shao, A., & Smit, E. S. (2021). Reliable or not? An automated classification of webpages about early childhood vaccination using supervised machine learning. 
Patient Education and Counseling, 104(6), 1460-1466.\n", 43 | "https://doi.org/10.1016/j.pec.2020.11.013\n", 44 | "\n" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": null, 50 | "id": "1dfd8fe2-d0eb-470f-966e-fb70393edf9d", 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [] 54 | } 55 | ], 56 | "metadata": { 57 | "kernelspec": { 58 | "display_name": "Python 3 (ipykernel)", 59 | "language": "python", 60 | "name": "python3" 61 | }, 62 | "language_info": { 63 | "codemirror_mode": { 64 | "name": "ipython", 65 | "version": 3 66 | }, 67 | "file_extension": ".py", 68 | "mimetype": "text/x-python", 69 | "name": "python", 70 | "nbconvert_exporter": "python", 71 | "pygments_lexer": "ipython3", 72 | "version": "3.9.6" 73 | } 74 | }, 75 | "nbformat": 4, 76 | "nbformat_minor": 5 77 | } 78 | -------------------------------------------------------------------------------- /2023/day5/Exercise 3/exercise3.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "54c12790-b072-4582-9d1f-f10af87a2fcb", 6 | "metadata": {}, 7 | "source": [ 8 | "### Exercise 3\n", 9 | "\n", 10 | "In this exercise, you will practice with both applying SML and also evaluating it. When doing the latter, use the materials and slides discussed in today's workshop. Work together with your neighbour on this exercise!" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "id": "d03327ea-5806-4011-9523-31bebfd30227", 16 | "metadata": {}, 17 | "source": [ 18 | "### Q1: Describing Supervised Machine Learning (SML)\n", 19 | "\n", 20 | "a. SeeFlex is a video streaming platform that initially focused solely on television shows aimed at children. However, the CEO recently decided to expand the content available on SeeFlex to content aimed at adults as well. New availble genres on SeeFlex are, for example, horror shows or dating shows. To help customers select content and mostly, to help parents to keep selecting only content that is suitable for their children, the CEO wants to employ Supervised Machine Learning (SML) to automatically indicate the genre that a specific piece of content belongs to based on its description. She can do this because she got her hands on a large dataset with pre-labeled content descriptions which she can use to train and validate machines. \n", 21 | "\n", 22 | "With your neighbour, discuss the suitability of SML. Provide one argument in favor of and one argument against using SML in this case.\n" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "id": "be973d47-0572-40e3-a1fc-990c49a7befd", 28 | "metadata": {}, 29 | "source": [ 30 | "### Q2: Executing Supervised Machine Learning (SML)\n", 31 | " \n", 32 | "a. Read in the dataset you need for this assignment ('SeeFlex_data.csv') and conduct some explorative analyses on it. Your analysis needs to result in an overview of how many pieces of content there are per genre. \n", 33 | "\n", 34 | "Hint: Are you getting a 'list index out of range' error when reading in the data? Check what delimiter you are using for this *comma*-seperated values file!\n" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "id": "9c374389-382a-40bf-abf1-5c4545ed94e5", 40 | "metadata": {}, 41 | "source": [ 42 | "b. Write a script for the CEO of SeeFlex using SML to categorize the content descriptions into genres. While doing so, keep in mind the following:\n", 43 | "* The CEO's goal is to automatically label content for all viewers. 
But because SeeFlex is mainly used by parents and their children, correctly identifying kids' content takes priority over correctly detecting other genres. \n", 44 | "* In your code, compare at least two different models (e.g., Logistic Regression, Decision Tree).\n", 45 | "* Your code needs to produce at least one metric that evaluates the classifiers - think about what metric is most important in the current situation.\n" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "id": "42ba478c-9301-4800-a502-9140797740bf", 51 | "metadata": {}, 52 | "source": [ 53 | "### Q3: Reflecting on your script and findings\n", 54 | "\n", 55 | "a. Discuss the results of Q2a. Your answer needs to:\n", 56 | "* Discuss how the content descriptions are distributed across genres\n", 57 | "* Discuss about why it is (not) relevant to inspect the distribution of content descriptions across genres\n", 58 | "* Discuss what the above means for the classifier you developed for the CEO of SeeFlex\n" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "id": "7e4f448a-188b-4a38-9c48-7c5e086ee843", 64 | "metadata": {}, 65 | "source": [ 66 | "b. How do your classifiers work: do they distinguish between the four different genres that content belongs to, or did you decide to merge some genres into one or more categories? Why did you decide to do it in this way? In your answer, reflect on the advantages and on the disadvantages of your approach for the CEO of SeeFlex.\n" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "id": "2b97e603-ae45-424c-96ac-1f535a6007f4", 72 | "metadata": {}, 73 | "source": [ 74 | "c. Based on the results of your validation metrics, what classifier would you recommend to the CEO of SeeFlex? Why? \n" 75 | ] 76 | } 77 | ], 78 | "metadata": { 79 | "kernelspec": { 80 | "display_name": "Python 3 (ipykernel)", 81 | "language": "python", 82 | "name": "python3" 83 | }, 84 | "language_info": { 85 | "codemirror_mode": { 86 | "name": "ipython", 87 | "version": 3 88 | }, 89 | "file_extension": ".py", 90 | "mimetype": "text/x-python", 91 | "name": "python", 92 | "nbconvert_exporter": "python", 93 | "pygments_lexer": "ipython3", 94 | "version": "3.9.6" 95 | } 96 | }, 97 | "nbformat": 4, 98 | "nbformat_minor": 5 99 | } 100 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # teachteacher-python 2 | "Teaching the Teacher" resources for colleagues that want to get started using computational methods in their teaching (using Python) 3 | 4 | ## Purpose 5 | 6 | This repository contains materials for colleagues who are new to teaching computational methods but want to do so in the future. 
In particular, this holds true for the following courses, but the tips and resources also apply to future yet-to-be-developed courses: 7 | 8 | 9 | | Course name | Resource link 1 | 10 | |-----------------------------------------------------|------------------------------------------------------| 11 | | Gesis Course Introduction to Machine Learning | https://github.com/annekroon/gesis-machine-learning | 12 | | Big Data and Automated Content Analysis | https://github.com/uvacw/teaching-bdaca | 13 | | Computational Communication Science I | https://github.com/uva-cw-ccs1/2223s2/ | 14 | | Computational Communication Science II | https://github.com/uva-cw-ccs2/2223s2/ | 15 | | Data Journalism | https://github.com/uvacw/datajournalism | 16 | | Digital Analytics* | https://github.com/uva-cw-digitalanalytics/2021s2 | 17 | 18 | *Ask Joanna or Theo for access 19 | 20 | ## Requirements 21 | 22 | You need to have a working Python environment and you need to be able to install Python packages on your system. There are several ways of achieving this, and it is important to note that not all of your students may have the same type of environment. In particular, one can either opt for the so-called Anaconda distribution or a native Python installation. There are pros and cons for both approaches. Currently, students in Data Journalism as well as in Digital Analytics are advised to install Anaconda; students in Big Data and Automated Content Analysis are explicitly given the choice. Please read [our Installation Guide](installation.md) for detailed instructions. 23 | 24 | 25 | ## Structure of the ``Teaching the teacher`` course 26 | 27 | As a pilot, we are holding a five-day course in which we combine: 28 | - teaching the necessary Python skills 29 | - teaching how to teach these skills 30 | - exercising and reflecting on best teaching practices. 31 | 32 | 33 | ## Additional resources 34 | 35 | A list of additional resources that could be of interest: 36 | 37 | - A 5-day workshop by Anne and Damian on Machine Learning in Python (for social scientists with no or minimal previous Python knowledge): https://github.com/annekroon/gesis-ml-learning/ 38 | 39 | - "The new book" (forthcoming open-access on https://cssbook.net and in print with Wiley): Van Atteveldt, W., Trilling, D., Arcila, C. (in press): Computational Analysis of Communication: A practical introduction to the analysis of texts, networks, and images with code examples in Python and R 40 | 41 | - "The old book" (the book used between 2015 and 2020 in the Big Data courses). Less focus on Pandas than in more modern approaches, slightly outdated coding style in some examples, and less depth than the "new" book. The Twitter API chapter is outdated and sentiment analysis as described in Chapter 6 should not be taught like this any more. Apart from that, it can still be a good resource to get started and/or to look things up. Trilling, D. (2020): Doing Computational Social Science with Python: An Introduction. Version 1.3.2. https://github.com/damian0604/bdaca/blob/master/book/bd-aca_book.pdf 42 | --------------------------------------------------------------------------------