├── 2021 ├── day1 │ ├── .DS_Store │ ├── README.md │ ├── day1-afternoon.ipynb │ ├── day1-morning.pdf │ ├── day1-morning.tex │ ├── exercises │ │ ├── exercises.md │ │ ├── instructional-videos-exercises.md │ │ └── livecoding.ipynb │ ├── mockdata │ │ └── mockdata-your_posts_1.json │ └── solutions │ │ ├── ex2-1.py │ │ ├── ex2-2.py │ │ ├── ex2-3.py │ │ ├── ex2-4.py │ │ ├── ex3-1.py │ │ ├── ex3-2.py │ │ ├── gradechecker_function.py │ │ ├── gradechecker_robust.py │ │ └── gradechecker_simple.py ├── day2 │ ├── .DS_Store │ ├── Day 2_slides.pdf │ ├── Notebooks │ │ ├── .DS_Store │ │ ├── BasicStats.ipynb │ │ ├── Datasets │ │ │ ├── RIVM_1.csv │ │ │ ├── RIVM_2.csv │ │ │ ├── RIVM_sentiment.csv │ │ │ ├── Sentiment_YouTubeClimateChange.csv │ │ │ ├── Sentiment_YouTubeClimateChange.pkl │ │ │ ├── YouTube_climatechange.csv │ │ │ ├── YouTube_climatechange.tab │ │ │ └── websites.csv │ │ ├── ExcercisesPandas.ipynb │ │ ├── PandasIntroduction.ipynb │ │ ├── PandasIntroduction2.ipynb │ │ └── Visualisations.ipynb │ └── README.md ├── day3 │ ├── README.md │ ├── day3-afternoon.pdf │ ├── day3-afternoon.tex │ ├── day3-morning.pdf │ ├── day3-morning.tex │ └── exercises │ │ └── exercises.md ├── day4 │ ├── README.md │ ├── day4-afternoon.pdf │ ├── day4-afternoon.tex │ ├── day4.pdf │ ├── day4.tex │ ├── example-nltk.md │ ├── example-vectorizer-to-dense.md │ ├── exercises-1 │ │ ├── exercise-1.md │ │ └── possible-solution-exercise-1.md │ ├── exercises-2 │ │ ├── exercise-2.md │ │ ├── fix_example_book.md │ │ └── possible-solution-exercise-2.md │ └── literature-examples.md ├── day5 │ ├── 01-MachineLearning_Introduction.ipynb │ ├── 02-Unsupervised-Machine-Learning.ipynb │ ├── 03-Supervised-Machine-Learning.ipynb │ ├── README.md │ ├── WorkingNotebook.ipynb │ └── topic_model_example.ipynb ├── installation.md ├── media │ ├── boumanstrilling2016.eps │ ├── boumanstrilling2016.pdf │ ├── mannetje.png │ ├── pythoninterpreter.png │ └── sparse_dense.png ├── references.bib └── teachingtips.md ├── 2023 ├── .DS_Store ├── Installationinstruction.md ├── Teachingtips.md ├── day1 │ ├── introduction.ipynb │ └── introduction.slides.html ├── day2 │ ├── Day 2.pdf │ └── Notebooks │ │ ├── BasicStats.ipynb │ │ ├── ExcercisesPandas.ipynb │ │ ├── PandasIntroduction.ipynb │ │ ├── PandasIntroduction2.ipynb │ │ └── Visualisations.ipynb ├── day3 │ ├── API.ipynb │ ├── Data Formats.ipynb │ ├── Teaching Exercises.ipynb │ ├── Webscraping.ipynb │ ├── get_mails │ └── updated cell ├── day4 │ ├── README.md │ ├── example-ngrams.md │ ├── exercises-afternoon │ │ ├── 01tuesday-regex-exercise.md │ │ ├── 01tuesday-regex-solution.md │ │ ├── 02tuesday-exercise_nexis.md │ │ ├── 02tuesday-exercise_nexis_solution.md │ │ └── corona_news.tar.gz │ ├── exercises-morning │ │ ├── exercise-feature-engineering.md │ │ ├── possible-solution-exercise-day2-vectorizers.md │ │ ├── possible-solution-exercise-day2.md │ │ └── possible-solutions-ordered.ipynb │ ├── exercises-vectorizers │ │ ├── Understanding_vectorizers.ipynb │ │ ├── exercise-text-to-features.md │ │ └── possible-solution-exercise-day1.md │ ├── regex_examples.ipynb │ ├── slides-04-1.pdf │ ├── slides-04-1.tex │ ├── slides-04-2.pdf │ ├── slides-04-2.tex │ └── spacy-examples.ipynb └── day5 │ ├── Day 5 - Machine Learning - Afternoon.pdf │ ├── Day 5 - Machine Learning - Morning.pdf │ ├── Day 5 Take-aways.ipynb │ ├── Exercise 1 │ ├── exercise1.ipynb │ └── hatespeech_text_label_vote_RESTRICTED_100K.csv │ ├── Exercise 2 │ ├── exercise2.ipynb │ └── hatespeech_text_label_vote_RESTRICTED_100K.csv │ └── Exercise 3 │ ├── SeeFlex_data.csv │ └── 
exercise3.ipynb ├── .DS_Store ├── .gitignore └── README.md /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/.DS_Store -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | ## Core latex/pdflatex auxiliary files: 2 | *.aux 3 | *.lof 4 | *.log 5 | *.lot 6 | *.fls 7 | *.out 8 | *.toc 9 | *.fmt 10 | 11 | ## Intermediate documents: 12 | *.dvi 13 | *-converted-to.* 14 | # these rules might exclude image files for figures etc. 15 | # *.ps 16 | # *.eps 17 | # *.pdf 18 | 19 | ## Bibliography auxiliary files (bibtex/biblatex/biber): 20 | *.bbl 21 | *.bcf 22 | *.blg 23 | *-blx.aux 24 | *-blx.bib 25 | *.brf 26 | *.run.xml 27 | 28 | ## Build tool auxiliary files: 29 | *.fdb_latexmk 30 | *.synctex 31 | *.synctex.gz 32 | *.synctex.gz(busy) 33 | *.pdfsync 34 | 35 | ## Auxiliary and intermediate files from other packages: 36 | # algorithms 37 | *.alg 38 | *.loa 39 | 40 | # achemso 41 | acs-*.bib 42 | 43 | # amsthm 44 | *.thm 45 | 46 | # beamer 47 | *.nav 48 | *.snm 49 | *.vrb 50 | 51 | # cprotect 52 | *.cpt 53 | 54 | #(e)ledmac/(e)ledpar 55 | *.end 56 | *.[1-9] 57 | *.[1-9][0-9] 58 | *.[1-9][0-9][0-9] 59 | *.[1-9]R 60 | *.[1-9][0-9]R 61 | *.[1-9][0-9][0-9]R 62 | *.eledsec[1-9] 63 | *.eledsec[1-9]R 64 | *.eledsec[1-9][0-9] 65 | *.eledsec[1-9][0-9]R 66 | *.eledsec[1-9][0-9][0-9] 67 | *.eledsec[1-9][0-9][0-9]R 68 | 69 | # glossaries 70 | *.acn 71 | *.acr 72 | *.glg 73 | *.glo 74 | *.gls 75 | 76 | # gnuplottex 77 | *-gnuplottex-* 78 | 79 | # hyperref 80 | *.brf 81 | 82 | # knitr 83 | *-concordance.tex 84 | *.tikz 85 | *-tikzDictionary 86 | 87 | # listings 88 | *.lol 89 | 90 | # makeidx 91 | *.idx 92 | *.ilg 93 | *.ind 94 | *.ist 95 | 96 | # minitoc 97 | *.maf 98 | *.mtc 99 | *.mtc[0-9] 100 | *.mtc[1-9][0-9] 101 | 102 | # minted 103 | _minted* 104 | *.pyg 105 | 106 | # morewrites 107 | *.mw 108 | 109 | # mylatexformat 110 | *.fmt 111 | 112 | # nomencl 113 | *.nlo 114 | 115 | # sagetex 116 | *.sagetex.sage 117 | *.sagetex.py 118 | *.sagetex.scmd 119 | 120 | # sympy 121 | *.sout 122 | *.sympy 123 | sympy-plots-for-*.tex/ 124 | 125 | # pdfcomment 126 | *.upa 127 | *.upb 128 | 129 | #pythontex 130 | *.pytxcode 131 | pythontex-files-*/ 132 | 133 | # Texpad 134 | .texpadtmp 135 | 136 | # TikZ & PGF 137 | *.dpth 138 | *.md5 139 | *.auxlock 140 | 141 | # todonotes 142 | *.tdo 143 | 144 | # xindy 145 | *.xdy 146 | 147 | # xypic precompiled matrices 148 | *.xyc 149 | 150 | # WinEdt 151 | *.bak 152 | *.sav 153 | 154 | # endfloat 155 | *.ttt 156 | *.fff 157 | 158 | # Latexian 159 | TSWLatexianTemp* 160 | 161 | # Emacs 162 | *~ 163 | \#*\# 164 | 165 | # jupyter notebook 166 | .ipynb_checkpoints/ 167 | 168 | #DS_Store 169 | **/.DS_Store 170 | .DS_Store 171 | 2023/.DS_Store 172 | .DS_Store 173 | -------------------------------------------------------------------------------- /2021/day1/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2021/day1/.DS_Store -------------------------------------------------------------------------------- /2021/day1/README.md: -------------------------------------------------------------------------------- 1 | # Day 1: Python basics 2 | 
-------------------------------------------------------------------------------- /2021/day1/day1-morning.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2021/day1/day1-morning.pdf -------------------------------------------------------------------------------- /2021/day1/exercises/exercises.md: -------------------------------------------------------------------------------- 1 | # Exercise 1: Working with lists 2 | 3 | 4 | ## 1. Warming up 5 | 6 | - Create a list, loop over the list, and do something with each value (you're free to choose). 7 | 8 | ## 2. Did you pass? 9 | 10 | - Think of a way to determine for a list of grades whether they are a pass (>=5.5) or fail. 11 | - Can you make that program robust enough to handle invalid input (e.g., a grade entered as 'ewghjieh')? 12 | - How does your program deal with impossible grades (e.g., 12 or -3)? 13 | - Any other improvements? 14 | 15 | 16 | 17 | 18 | # Exercise 2: Working with dictionaries 19 | 20 | 21 | - Create a program that takes lists of corresponding data (a list of first names, a list of last names, a list of phone numbers) and converts them into a dictionary. You may assume that the lists are ordered correspondingly. To loop over two lists at the same time, you can do something like this (of course, later on you do not want to print the values but to put them in a dictionary): 22 | ``` 23 | for i, j in zip(list1, list2): 24 | print(i,j) 25 | ``` 26 | - Improve the program to control what should happen if the lists are (unexpectedly) of unequal length. 27 | - Create another program to handle a phone dictionary. The keys are names, and the value can either be a single phone number, a list of phone numbers, or another dict of the form {"office": "020123456", "mobile": "0699999999", ... ... ... }. Write a function that shows how many different phone numbers a given person has. 28 | - Write another function that prints only mobile numbers (and their owners) and omits the rest. (If you want to take it easy, you may assume that they are stored in a dict and use the key "mobile". If you like challenges, you can also support strings and lists of strings by parsing the numbers themselves and checking whether they start with 06. You can check whether a string starts with 06 by checking mystring[:2]=="06" (the double equal sign indicates a comparison that will return True or False). If you like even more challenges, you could support country codes.) 29 | 30 | 31 | 32 | # Exercise 3: Working with defaultdicts 33 | 34 | - Take the data from Exercise 2. Write a program that collects all office numbers, all mobile numbers, etc. Assume that there are potentially also other categories like "home", "second", maybe even "fax", and that they are unknown beforehand. 35 | - To do so, you can use the following approach: 36 | ```python 37 | from collections import defaultdict 38 | myresults = defaultdict(list) 39 | ``` 40 | Loop over the appropriate data. For all the key-value pairs (like "office": "020111111"), do ` myresults[key].append(value)`: This will append the current phone number (020111111) to the list of "office" numbers. 41 | - Do you see why this works only with a defaultdict but not with a "normal" dict? What would happen with a normal dict? 42 | - Take the function from Exercise 2 that prints how many phone numbers a given person has. Use a defaultdict instead to achieve the same result. What are the pros and cons? 
43 | -------------------------------------------------------------------------------- /2021/day1/exercises/instructional-videos-exercises.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | # Instructional video's 4 | #### The linked video's further explain the answers provided to today's [exercises](https://github.com/uvacw/teachteacher-python/blob/main/day1/exercises/exercises.md). 5 | 6 | - Instructional video explaining [Exercise 2](https://github.com/uvacw/teachteacher-python/blob/main/day1/exercises/exercises.md#exercise-2-working-with-dictionaries): *Working with dictionaries*: [Video here](https://www.youtube.com/watch?v=M_bkVPfQcgs) 7 | 8 | - Instructional video explaining [Exercise 3](https://github.com/uvacw/teachteacher-python/blob/main/day1/exercises/exercises.md#exercise-3-working-with-defaultdicts): *Working with defaultdicts:* [Video here](https://www.youtube.com/watch?v=2l9aRWcKVyA) 9 | -------------------------------------------------------------------------------- /2021/day1/solutions/ex2-1.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | names = ["Alice", "Bob", "Carol"] 4 | office = ["020222", "030111", "040444"] 5 | mobile = ["0666666", "0622222", "0644444"] 6 | 7 | mydict ={} 8 | for n, o, m in zip(names, office, mobile): 9 | mydict[n] = {"office":o, "mobile":m} 10 | 11 | print(mydict) 12 | -------------------------------------------------------------------------------- /2021/day1/solutions/ex2-2.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | names = ["Alice", "Bob", "Carol", "Damian"] 4 | office = ["020222", "030111", "040444"] 5 | mobile = ["0666666", "0622222", "0644444"] 6 | 7 | if len(names) == len(office) == len(mobile): 8 | mydict ={} 9 | for n, o, m in zip(names, office, mobile): 10 | mydict[n] = {"office":o, "mobile":m} 11 | print(mydict) 12 | else: 13 | print("Your data seems to be messed up - the lists do not have the same length") 14 | -------------------------------------------------------------------------------- /2021/day1/solutions/ex2-3.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | data = {'Alice': {'office': '020222', 'mobile': '0666666'}, 4 | 'Bob': {'office': '030111'}, 5 | 'Carol': {'office': '040444', 'mobile': '0644444'}, 6 | "Daan": "020222222", 7 | "Els": ["010111", "06222"]} 8 | 9 | def get_number_of_subscriptions(x): 10 | if type(x) is str: 11 | return 1 12 | else: 13 | return len(x) 14 | 15 | for k, v in data.items(): 16 | print(f"{k} has {get_number_of_subscriptions(v)} phone subscriptions") 17 | -------------------------------------------------------------------------------- /2021/day1/solutions/ex2-4.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | data = {'Alice': {'office': '020222', 'mobile': '0666666'}, 4 | 'Bob': {'office': '030111'}, 5 | 'Carol': {'office': '040444', 'mobile': '0644444'}, 6 | "Daan": "020222222", 7 | "Els": ["010111", "06222"]} 8 | 9 | def get_number_of_subscriptions(x): 10 | if type(x) is str: 11 | return 1 12 | else: 13 | return len(x) 14 | 15 | def get_mobile(x): 16 | if type(x) is str and x[:2]=="06": 17 | return x 18 | if type(x) is list: 19 | return [e for e in x if e[:2]=="06"] 20 | if type(x) is dict: 21 | return [v for k, v in x.items() if k=="mobile"] 22 | for k, v in 
data.items(): 23 | print(f"{k} has {get_number_of_subscriptions(v)} phone subscriptions. The mobile ones are {get_mobile(v)}") 24 | -------------------------------------------------------------------------------- /2021/day1/solutions/ex3-1.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | from collections import defaultdict 4 | 5 | data = {'Alice': {'office': '020222', 'mobile': '0666666'}, 6 | 'Bob': {'office': '030111'}, 7 | 'Carol': {'office': '040444', 'mobile': '0644444', 'fax': "02012354"}, 8 | "Daan": "020222222", 9 | "Els": ["010111", "06222"]} 10 | 11 | myresults = defaultdict(list) 12 | 13 | for name, entry in data.items(): 14 | try: 15 | for k, v in entry.items(): 16 | myresults[k].append(v) 17 | except: 18 | print(f"{name}'s numbers aren't stored in a dict, so I don't know what they are and will skip them") 19 | 20 | print(myresults) 21 | -------------------------------------------------------------------------------- /2021/day1/solutions/ex3-2.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | from collections import defaultdict 4 | 5 | data = {'Alice': {'office': '020222', 'mobile': '0666666'}, 6 | 'Bob': {'office': '030111'}, 7 | 'Carol': {'office': '040444', 'mobile': '0644444'}, 8 | "Daan": "020222222", 9 | "Els": ["010111", "06222"]} 10 | 11 | subscriptions = defaultdict(int) 12 | 13 | for name, entry in data.items(): 14 | if type(entry) is str: 15 | subscriptions[name]+=1 # this is short for subscriptions[name] = subscriptions[name]+1 16 | else: 17 | subscriptions[name] += len(entry) 18 | 19 | print(subscriptions) 20 | -------------------------------------------------------------------------------- /2021/day1/solutions/gradechecker_function.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | grades = [4, 7.8, -3, 3.6, 12, 9.1, "4.4", "KEGJKEG", 4.2, 7, 5.5] 4 | 5 | 6 | 7 | def check_grade(grade): 8 | try: 9 | grade_float = float(grade) 10 | except: 11 | return('INVALID') 12 | if grade_float >10 or grade_float <1: 13 | return('INVALID') 14 | elif grade_float >= 5.5: 15 | return('PASS') 16 | else: 17 | return('FAIL') 18 | 19 | 20 | for grade in grades: 21 | print(grade,'is',check_grade(grade)) 22 | -------------------------------------------------------------------------------- /2021/day1/solutions/gradechecker_robust.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | grades = [4, 7.8, -3, 3.6, 12, 9.1, "4.4", "KEGJKEG", 4.2, 7, 5.5] 4 | 5 | for grade in grades: 6 | try: 7 | grade_float = float(grade) 8 | if grade_float >10: 9 | print(grade_float,'is an invalid grade') 10 | elif grade_float <1: 11 | print(grade_float,'is an invalid grade') 12 | elif grade_float >= 5.5: 13 | print(grade,'is a PASS') 14 | else: 15 | print(grade,'is a FAIL') 16 | 17 | except: 18 | print('I do not understand what',grade,'means') 19 | 20 | -------------------------------------------------------------------------------- /2021/day1/solutions/gradechecker_simple.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | grades = [4, 7.8, 3.6, 9.1, 4.2, 7, 5.5] 4 | 5 | for grade in grades: 6 | if grade >= 5.5: 7 | print(grade,'is a PASS') 8 | else: 9 | print(grade,'is a FAIL') 10 | -------------------------------------------------------------------------------- 
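A possible extra for teaching, not part of the original solution files: a few `assert` statements that document the expected behaviour of `check_grade()` from `gradechecker_function.py` above (paste them below that function definition to run them):

```python
# quick sanity checks for check_grade() as defined in gradechecker_function.py
assert check_grade(7) == 'PASS'
assert check_grade(5.5) == 'PASS'
assert check_grade(4.4) == 'FAIL'
assert check_grade("4.4") == 'FAIL'
assert check_grade(12) == 'INVALID'
assert check_grade(-3) == 'INVALID'
assert check_grade("ewghjieh") == 'INVALID'
print("All checks passed")
```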
/2021/day2/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2021/day2/.DS_Store -------------------------------------------------------------------------------- /2021/day2/Day 2_slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2021/day2/Day 2_slides.pdf -------------------------------------------------------------------------------- /2021/day2/Notebooks/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2021/day2/Notebooks/.DS_Store -------------------------------------------------------------------------------- /2021/day2/Notebooks/Datasets/Sentiment_YouTubeClimateChange.csv: -------------------------------------------------------------------------------- 1 | ,videoId,negative,positive,neutral 2 | 0,sGHq_EwXDn8,-1,1,0 3 | 1,PRtn1W2RAVU,-1,1,0 4 | 2,2CQvBGSiDvw,-1,1,0 5 | 3,Cbwv1jg4gZU,-1,1,0 6 | 4,cWsCX_yxXqw,-1,1,0 7 | 5,pbiSuB3mzmo,-1,1,0 8 | 6,6d9ENk3NfBM,-2,1,-1 9 | 7,hzrFtZc9EkQ,-1,1,0 10 | 8,Je2l7Gw7uns,-1,1,0 11 | 9,8Rvl6z80baI,-1,1,0 12 | 10,IQpIVsxx014,-1,1,0 13 | 12,8lDwd5XM1HQ,-1,1,0 14 | 13,ga-RBuhcJ7w,-1,1,0 15 | 14,SUQQVKVHRUQ,-2,1,-1 16 | 15,72fwlCXy1bw,-1,1,0 17 | 16,lq0i6umUQzI,-4,1,-1 18 | 17,riVP3Jy3Orc,-1,1,0 19 | 18,76CLorQ9n5s,-1,1,0 20 | 19,I3qbXSf93bc,-2,1,-1 21 | 20,dretZCOvMzw,-1,1,0 22 | 21,N-EWoGUzLIc,-1,2,1 23 | 22,M6hA-MUFgXo,-1,1,0 24 | 23,ezAZ5WVAOyI,-1,1,0 25 | 24,pl1Rnz4zNkg,-1,1,0 26 | 25,_lpVPh6LeU8,-1,1,0 27 | 27,VOpmtCiAhX0,-1,1,0 28 | 28,Yglapc3SBmc,-2,1,-1 29 | 29,wLW4Tk8Pwdg,-3,1,-1 30 | 30,rAam4R1M5zE,-1,1,0 31 | 32,sw9pNVc9lw0,-1,1,0 32 | 33,iLGgILUqbcc,-1,1,0 33 | 34,cl4Uv9_7KJE,-2,1,-1 34 | 35,9m3dYb1m7cw,-3,1,-1 35 | 36,9t3ERx71ANA,-1,1,0 36 | 37,S7Lu-R4EwQw,-1,1,0 37 | 38,Qu0DNjp2OsY,-2,1,-1 38 | 39,L6sanoP6rY8,-1,1,0 39 | 40,mQuGnQEPJPs,-1,2,1 40 | 41,FVX60nlzglk,-2,1,-1 41 | 43,L0t_5RdQqb0,-1,1,0 42 | 44,IDOouqQGZY4,-2,1,-1 43 | 48,xsQgDwXmsyg,-1,1,0 44 | 50,NvVX3ILhVac,-2,1,-1 45 | 58,xnudgOC9D5Y,-3,1,-1 46 | 61,38Mw-t5zmyM,-1,1,0 47 | 62,K9MaGf-Su9I,-1,1,0 48 | 63,5cY7fSbUWA8,-1,1,0 49 | 66,yvDRQe2oCt4,-1,1,0 50 | 68,8vP00TP6-p0,-1,1,0 51 | 70,kl_i4MRYgv0,-1,1,0 52 | 76,EXkbdELr4EQ,-2,1,-1 53 | 78,1Uu9vCNH6Dk,-3,1,-1 54 | 82,PWvLLGcb96k,-1,1,0 55 | 87,aewK7Kzf43A,-1,1,0 56 | 88,-fkCo_trbT8,-2,1,-1 57 | 93,BXKUsTo_f1s,-2,1,-1 58 | 97,ZbZwwUEzNLY,-1,1,0 59 | 98,dcBXmj1nMTQ,-1,1,0 60 | 101,M3Iztt4D2UE,-1,1,0 61 | 102,4xkXjj6dalM,-1,1,0 62 | 107,KF7cvmXjSyw,-1,1,0 63 | 117,FtNTt_3PRoQ,-1,1,0 64 | 119,RoQRkmRjz38,-1,1,0 65 | 120,BYo7Mo1ncuM,-1,1,0 66 | 121,KzjiNDbGZPM,-1,1,0 67 | 122,TbW_1MtC2So,-1,1,0 68 | 123,fw-g0PVpW2E,-1,1,0 69 | 126,M0tENI3ef7Y,-1,1,0 70 | 128,ZQGGhtguHns,-1,1,0 71 | 129,SU6GMDDXFtw,-2,1,-1 72 | 136,WkfTeGcItA0,-1,2,1 73 | 140,mO4vtjfabm0,-1,1,0 74 | 141,fWH6VGFs2z4,-1,1,0 75 | 142,13t0tCV8hW8,-1,1,0 76 | 144,Xem9EvvkJSc,-1,1,0 77 | 147,YuMtSjq8W-g,-3,1,-1 78 | 148,vvhtoL2A8dU,-2,1,-1 79 | 149,R_cf5n3UgrU,-2,1,-1 80 | 150,P-3ZlFokHfM,-1,1,0 81 | 152,9M29ns1rUSE,-1,1,0 82 | 153,tAkvNHEnctg,-1,1,0 83 | 158,nZIOZwUPNnA,-1,1,0 84 | 161,3SD-Mrv7QLQ,-2,1,-1 85 | 162,y5cWazcakUw,-1,1,0 86 | 163,bVAyj9bYHMw,-1,1,0 87 | 165,gPVBDCDmrcU,-1,1,0 88 | 168,cQlQL5obqDs,-1,1,0 89 | 170,H2QxFM9y0tY,-1,1,0 90 | 
175,9A7_xCrgX1U,-1,1,0 91 | 177,p05YJ5if8Ew,-1,1,0 92 | 178,7WPsMsYCtjk,-1,1,0 93 | 179,ssuevV4eyqM,-3,1,-1 94 | 184,vWrF_ZHymoE,-1,1,0 95 | 186,yyAuWeoTm2s,-1,1,0 96 | 187,Av9SW1yw5lg,-3,1,-1 97 | 189,nxhEXwaDyxM,-2,1,-1 98 | 191,Ld47QsQHM7c,-1,1,0 99 | 192,6EFHZfISGp4,-1,1,0 100 | 193,5nGYkH9ifzM,-1,1,0 101 | 195,gU9hPfx12GA,-1,1,0 102 | 203,JtHHnBUmc0g,-1,1,0 103 | 205,rv3DPaMaS2g,-1,1,0 104 | 206,1hhzrormtP4,-1,1,0 105 | 210,0sMwKLkW4lI,-1,1,0 106 | 212,wzjVT07bcYA,-1,1,0 107 | 213,cvjbXYdh8x0,-2,1,-1 108 | 214,VX5ku0LbMMk,-1,1,0 109 | 217,j43XK0wzMd4,-4,1,-1 110 | 222,1yeANLOHnJ8,-3,3,-1 111 | 224,fKXg-SUP5P4,-2,1,-1 112 | 230,YE7kwNXqV30,-1,1,0 113 | 232,Ix5U2S8UXPA,-2,1,-1 114 | 234,2efYeNroXvg,-1,1,0 115 | 235,gSXOxrjCA40,-1,1,0 116 | 238,6_VJXHfMevM,-1,1,0 117 | 239,-4k3AzfYuJg,-1,1,0 118 | 240,EagrIPTCqrg,-1,1,0 119 | 249,7yNhCDMB0ls,-1,1,0 120 | 250,_jA8k4YDzlo,-1,1,0 121 | 251,lpGUzz-tjWs,-1,1,0 122 | 252,qCeBPeBjKcA,-1,1,0 123 | 253,WVc-Y-mJ_uY,-2,2,-1 124 | 255,nu0f86EkzS8,-2,1,-1 125 | 258,nkoRm9A7xr8,-1,1,0 126 | 262,DMbu9w4pDXE,-1,1,0 127 | 264,X7Wv0AZC_D4,-1,1,0 128 | 268,sGx6P2UR8Ig,-1,1,0 129 | 276,61hsoU0AIK4,-1,1,0 130 | 280,FHUHsBnpCj8,-1,1,0 131 | 285,Ez4qvsR-gHQ,-1,1,0 132 | 286,UDlHcxWtbvw,-1,1,0 133 | 288,STeynRkoU3s,-3,2,-1 134 | 292,BWJBOwSa4h8,-1,1,0 135 | 293,e3duOpZlD9E,-1,1,0 136 | 302,EyMAmakw1dU,-1,1,0 137 | 304,11FCyUB81rI,-1,1,0 138 | 306,FDWAEKQ0KkU,-1,1,0 139 | 307,sSF8uFoSm1M,-3,1,-1 140 | 308,5Z9OZE_TypE,-1,1,0 141 | 309,GaUi64HwUZg,-1,1,0 142 | 310,TMF9aMI-9ek,-1,1,0 143 | 311,IagqMq4wfCc,-1,1,0 144 | 313,2iC-G7KXTBU,-1,1,0 145 | 314,8paPxMzc0mo,-1,1,0 146 | 316,o_dshuJTxLI,-1,1,0 147 | 317,fqqPjRNXgdA,-1,1,0 148 | 319,-61c8EQ8qro,-1,1,0 149 | 321,Opy-a_oW3Bw,-1,1,0 150 | 322,7GTtCtXJJ0Y,-1,1,0 151 | 323,KEkmIErcgT4,-1,1,0 152 | 324,AFvTrdOqdXo,-1,1,0 153 | 325,oSqmCNNV2dQ,-1,1,0 154 | 326,bW3IQ-ke43w,-1,1,0 155 | 328,ikz5JHfPQ6k,-1,1,0 156 | 329,KArS5ArSYY4,-2,1,-1 157 | 331,c7lCRYf9rHo,-2,1,-1 158 | 332,u9KxE4Kv9A8,-1,1,0 159 | 333,ba1tND0B0xk,-1,1,0 160 | 334,vqKLTEQjew4,-1,1,0 161 | 338,KAJsdgTPJpU,-1,1,0 162 | 339,7OxgWkhozQU,-1,1,0 163 | 341,vZIC6hJ_fCE,-2,2,-1 164 | 344,DYqtXR8iPlE,-2,1,-1 165 | 345,N1cdCUZNh04,-1,1,0 166 | 347,_WY7FEYN3QI,-1,1,0 167 | 349,oDjuoBAtLWA,-4,1,-1 168 | 351,UvHMhZ1T964,-1,1,0 169 | 353,rYxt0BeTrT8,-2,1,-1 170 | 355,7f5NVJTqPaU,-3,1,-1 171 | 358,L0ryCJVAGZE,-1,1,0 172 | 360,-PSR_OutuIw,-1,1,0 173 | 362,RA4mIbQo52k,-1,1,0 174 | 364,9Pqp_8XLC6c,-1,1,0 175 | 370,0__6kx-vTO4,-2,2,-1 176 | 373,jOHuUeZzPh0,-1,1,0 177 | 376,1tRDnjl_gwY,-3,1,-1 178 | 379,Da5-n9pf6sM,-2,1,-1 179 | 380,z9ALFf6eQI0,-1,1,0 180 | 383,088j0n0XxQE,-1,1,0 181 | 385,WRgv4V1ZxN4,-1,1,0 182 | 388,WR6uSXW-8p4,-1,1,0 183 | 389,rhQVustYV24,-1,1,0 184 | 390,kQozp2xZ3Q0,-2,1,-1 185 | 391,jOYvuLIwWEQ,-1,1,0 186 | 394,pj5ZLwtoAmI,-1,1,0 187 | 396,G-YdVrhAoNU,-1,1,0 188 | 397,n0bqG1GzlHU,-1,1,0 189 | 400,LRfwnxQN1Lw,-1,1,0 190 | 402,oQbftR8pG78,-3,1,-1 191 | 404,J__4V0ujlaU,-2,1,-1 192 | 405,78iIQdKmodc,-1,1,0 193 | 406,zvOcuZ3-FO8,-1,1,0 194 | 407,-BvcToPZCLI,-3,1,-1 195 | 409,YrnlZXeC1nM,-1,1,0 196 | 410,tNwkY_V_BPI,-1,1,0 197 | 415,8l-dhwqd2UM,-1,1,0 198 | 416,WFV-rcaBG9g,-2,1,-1 199 | 417,_XugW-yg2XI,-1,1,0 200 | 418,uo8qXxnFuRQ,-1,3,1 201 | 421,mapriv3vWBA,-3,1,-1 202 | 422,f6URRc-0Z1o,-2,1,-1 203 | 426,pYtEukvKjLc,-3,1,-1 204 | 427,w3FBLKHG-9M,-1,1,0 205 | 428,oCVQdr9QFwY,-1,1,0 206 | 431,ZPMy2Yw8teM,-1,1,0 207 | 432,vtMHfFxwg3U,-1,1,0 208 | 434,OYtAGTe9MjY,-1,2,1 209 | 436,wRk1p8Lzwvo,-1,1,0 210 | 437,4sSJpKTdwFo,-1,1,0 211 | 439,AJkFuRzJNoQ,-1,1,0 
212 | 440,tffj_82IRsg,-3,1,-1 213 | 441,1CnyqLogH0Y,-2,1,-1 214 | 442,FP-9l6BeagE,-2,1,-1 215 | 443,MeKAdOySB_E,-1,1,0 216 | 445,VM4d66igm9w,-2,1,-1 217 | 448,9iAD_heE2kU,-2,1,-1 218 | 449,SQY7VOQF8sY,-1,1,0 219 | 450,cZwQN4JpJ8s,-1,1,0 220 | 451,DhhVr5iLF-c,-1,1,0 221 | 453,XL1rpFCBg5s,-1,1,0 222 | 454,BzPjWpkNWiU,-1,2,1 223 | 456,f2Wr7lDI-Hg,-1,1,0 224 | 457,3fQHpXkI-vc,-1,1,0 225 | 459,QpLdpjcHhqs,-1,1,0 226 | 463,DMTwbV9UqHA,-1,1,0 227 | 466,zIvjHSvzFLU,-1,1,0 228 | 467,9rkDTXEOpEM,-1,1,0 229 | 468,t-uRB26a-sg,-2,1,-1 230 | 469,9zcGrc2xcO0,-1,1,0 231 | 470,vapTJLUSvpQ,-1,1,0 232 | 471,zMQ0xQrgBms,-1,1,0 233 | 472,j8ZrRL2lbsA,-1,1,0 234 | 474,JYZpxRy5Mfg,-1,1,0 235 | 475,38dqOdQFdRI,-1,1,0 236 | 476,BFp3Q3WdVWI,-1,1,0 237 | 481,BQ4rBLCpEeM,-3,1,-1 238 | 482,00cKGt9v1as,-1,1,0 239 | 483,UgHNg-N-ENI,-1,1,0 240 | 484,dsyW3QjBQHU,-1,1,0 241 | 486,U3r-TzeSzrc,-1,1,0 242 | 487,8ISePLL1wcw,-1,1,0 243 | 490,w38fhmZkz64,-1,1,0 244 | 493,ZgdTHVcv1o8,-1,1,0 245 | 495,i-qBOyrD0-0,-1,1,0 246 | -------------------------------------------------------------------------------- /2021/day2/Notebooks/Datasets/Sentiment_YouTubeClimateChange.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2021/day2/Notebooks/Datasets/Sentiment_YouTubeClimateChange.pkl -------------------------------------------------------------------------------- /2021/day2/Notebooks/Datasets/websites.csv: -------------------------------------------------------------------------------- 1 | site,type,views,active_users 2 | Twitter,Social Media,10000,200000 3 | Facebook,Social Media,35000,500000 4 | NYT,News media,78000,156000 5 | YouTube,Video platform,18000,289000 6 | Vimeo,Video platform,300,1580 7 | USA Today,News media,4800,5608 8 | -------------------------------------------------------------------------------- /2021/day2/Notebooks/ExcercisesPandas.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "accompanied-inspector", 6 | "metadata": {}, 7 | "source": [ 8 | "## Excercises pandas\n", 9 | "\n", 10 | "Let's practice data exploration and wrangling in Pandas. \n", 11 | "\n", 12 | "We will work with data collected through the Twitter API (more on API's tomorrow ;)). Few months ago I collected tweets published by the RIVM twice. 
I have also already run a sentiment analysis on these tweets and have saved it in a separate file.\n", 13 | "We have three datasets:\n", 14 | "* Two datasets with tweets by the RIVM (public tweets by account)\n", 15 | "* One dataset with sentiment of those tweets (simulated dataset, with two sentiment scores)\n", 16 | "\n", 17 | "We want to see how sentiment changes over time (per month), compare number of positive and negative tweets and analyze the relation between sentiment and engagement with the tweets\n", 18 | "\n", 19 | "We want to prepare the dataset for analysis:\n", 20 | "\n", 21 | "**Morning**\n", 22 | "* Data exploration\n", 23 | " * Check columns, data types, missing values, descriptives for numeric variables measuring engagement and sentiment, value_counts for relevant categorical variables\n", 24 | "* Handling missing values and data types\n", 25 | " * Handle missing values in variables of interest: number of likes and retweets - what can nan's mean?\n", 26 | " * Make sure created_at has the right format (to use it for aggregation later)\n", 27 | "* Creating necessary variables (sentiment)\n", 28 | " * Overall measure of sentiment - create it from positive and negative\n", 29 | " * Binary variable (positive or negative tweet) - Tip: Write a function that \"recodes\" the sentiment column\n", 30 | " \n", 31 | "\n", 32 | "**Afternoon**\n", 33 | "\n", 34 | "Pandas continued\n", 35 | "* Concatenating the dataframes (tweets1 and tweets2)\n", 36 | "* Merging the files (tweets with sentiment)\n", 37 | " * Make sure the columns you merge match and check how to merge\n", 38 | "* Agrregating the files per month\n", 39 | " * Tip: Create a column for month by transforming the date column. Remember that the date column needs the right format first!\n", 40 | " \n", 41 | " `df['month'] = df['date_dt_column'].dt.strftime('%Y-%m')`\n", 42 | "\n", 43 | "\n", 44 | "\n", 45 | "\n", 46 | "\n", 47 | "Visualisations:\n", 48 | "* Visualise different columns of the tweet dataset (change of sentiment over time, sentiment, engagement, relation between sentiment and engagement)\n", 49 | "* But *more fun*: use your own data to play with visualisations" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 2, 55 | "id": "brazilian-giving", 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "import pandas as pd" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": 3, 65 | "id": "appreciated-cigarette", 66 | "metadata": {}, 67 | "outputs": [], 68 | "source": [ 69 | "tweets1 = pd.read_csv('Datasets/RIVM_1.csv')" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 4, 75 | "id": "hollow-honduras", 76 | "metadata": {}, 77 | "outputs": [], 78 | "source": [ 79 | "tweets2 = pd.read_csv('Datasets/RIVM_2.csv')" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 5, 85 | "id": "leading-adelaide", 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [ 89 | "sentiment = pd.read_csv('Datasets/RIVM_sentiment.csv')" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": null, 95 | "id": "antique-specific", 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [] 99 | } 100 | ], 101 | "metadata": { 102 | "kernelspec": { 103 | "display_name": "Python 3", 104 | "language": "python", 105 | "name": "python3" 106 | }, 107 | "language_info": { 108 | "codemirror_mode": { 109 | "name": "ipython", 110 | "version": 3 111 | }, 112 | "file_extension": ".py", 113 | "mimetype": "text/x-python", 114 | "name": "python", 115 | 
"nbconvert_exporter": "python", 116 | "pygments_lexer": "ipython3", 117 | "version": "3.9.1" 118 | } 119 | }, 120 | "nbformat": 4, 121 | "nbformat_minor": 5 122 | } 123 | -------------------------------------------------------------------------------- /2021/day2/README.md: -------------------------------------------------------------------------------- 1 | # Day 2: Pandas and statistics 2 | 3 | | Time (indication) | Topic | 4 | |-|-| 5 | | 9.30-11.00 | Pandas: We will start the day with a general introduction to Pandas and will learn working with dataframes. We will start with reading different types of data into Pandas dataframe and continue with data wrangling. We will also shortly discuss pro's and con's of using Pandas comopared to formats discussed on Monday.| 6 | | 11.00-12.00 | Exercises | 7 | | 13.00-14.00 | Basic statistics and plotting: We will continue working with dataframes focusing on basic analysis and visualisation steps. We will discuss descritpive startistics as well as most commonly used statistics tests. We will also work with univariate and bivariate plots. | 8 | | 14:00 - 15:30 | Exercises | 9 | | 15:30 - 16:30 | Teaching presentations & general Q&A | 10 | | 16:30 - 17:00 | Questions / discussion/ next steps | 11 | -------------------------------------------------------------------------------- /2021/day3/README.md: -------------------------------------------------------------------------------- 1 | # Day 3: Collecting and reading data 2 | 3 | | Time (indication) | Topic | 4 | |-|-| 5 | | 9.30-11.00 | Data Collection 1: We will dive into handling data beyond typical tabular datasets (such as the csv files from Tuesday) and get an introduction to the JSON format, which is the de-facto standard for (online) data exchange. We will also get to know our first API (which uses this format). | 6 | | 11.00-12.00 | Exercises | 7 | | 13.00-14.00 | Data Collection 2: APIs and scraping. We will look more in detail into different APIs and how they can be used for data collection. We will also briefly talk about scraping which is highly relevant as a data collection technique (for instance, for theses), but also a bit too complex to cover in detail in this workshop. | 8 | | 14:00 - 15:30 | Practice | 9 | | 15:30 - 16:30 | Teaching presentations & general Q&A | 10 | | 16:30 - 17:00 | Closure / next steps | 11 | 12 | 13 | -------------------------------------------------------------------------------- /2021/day3/day3-afternoon.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2021/day3/day3-afternoon.pdf -------------------------------------------------------------------------------- /2021/day3/day3-morning.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2021/day3/day3-morning.pdf -------------------------------------------------------------------------------- /2021/day3/exercises/exercises.md: -------------------------------------------------------------------------------- 1 | # Exercises Week 3 2 | 3 | ## 1. Working with CSV files 4 | 5 | 1. Take a dataset of your choice - something from your own work maybe. If it is not a CSV file (but, for instance, an Excel sheet or an SPSS file), export it as CSV. 
Inspect the file in a text editor of your choice (such as the ones available at https://www.sublimetext.com/, https://notepad-plus-plus.org, or https://atom.io) and check: 6 | - the encoding 7 | - the line ending style 8 | - the delimiter 9 | - whether it has a header row or not 10 | - take a quick look and check whether the file looks "ok", i.e. all rows have equal number of fields etc. 11 | 12 | 13 | 2. Open your file in Python and write it back (with a different file name). Do so both with the low-level (basic Python) and the high-level (pandas) approach. Inspect the result again in the editor and compare. (NB: Depending on the dialect, there may be small differences. If you observe some, which are they?) 14 | 15 | 16 | ## 2. Working with JSON files and APIs 17 | 18 | 1. Reproduce examples 12.1 (page 315), 12.2 (page 316) and 12.3 (page 334) from the book. Explain the code to a classmate. 19 | 20 | 2. Think of different ways of storing the data you collected. What would be the pros and cons? Discuss with a classmate. 21 | 22 | 3. What do you think of example 12.2 (or line 12 in example 12.3, for that matter)? Would you rather store your data *before* or *after* the `json_normalize()` function? Discuss with a classmate. (NB: there are arguments to be made for both) 23 | 24 | 4. What would happen if you would directly create a dataframe (e.g., via `pd.Dataframe(allitems)`, `pd.Dataframe(data['items'])`, or similar)? Based on this observation, can you describe what `json_normalize()` does? 25 | -------------------------------------------------------------------------------- /2021/day4/README.md: -------------------------------------------------------------------------------- 1 | # Day 4: Natural Language Processing 2 | 3 | | Time slot | Content | 4 | |---------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| 5 | | 09:30-11:00 | [NLP I](https://github.com/uvacw/teachteacher-python/blob/main/day4/day4.pdf): In a gentle introduction to NLP techniques, we will discuss the basics of bag-of-word (BAG) approaches, such as tokenization, stopword removal, stemming and lemmatization | 6 | | 11:00 - 12:00 | [Exercise 1](https://github.com/uvacw/teachteacher-python/blob/main/day4/exercises-1/exercise-1.md) | 7 | | 12:00-13:00 | Lunch | 8 | | 13:00-14:00 | [NLP II](https://github.com/uvacw/teachteacher-python/blob/main/day4/day4-afternoon.pdf): In the second lecture of the day, we will delve a bit deeper in NLP approaches. We discuss different types of vectorizers (i.e., count and tfidf) and discuss the possibilities NER in spacy. 
| 9 | | 14:00-15:00 | [Exercise 2](https://github.com/uvacw/teachteacher-python/blob/main/day4/exercises-2/exercise-2.md) | 10 | | 15:30-16:00 | [NLP & teaching; General Q&A](https://github.com/uvacw/teachteacher-python/blob/main/day4/day4-afternoon.pdf) | 11 | | 16:00-17:00 | Closure/next steps | 12 | -------------------------------------------------------------------------------- /2021/day4/day4-afternoon.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2021/day4/day4-afternoon.pdf -------------------------------------------------------------------------------- /2021/day4/day4-afternoon.tex: -------------------------------------------------------------------------------- 1 | % !TeX document-id = {f19fb972-db1f-447e-9d78-531139c30778} 2 | % !BIB program = biber 3 | 4 | \documentclass[handout]{beamer} 5 | %\documentclass[compress]{beamer} 6 | \usepackage[T1]{fontenc} 7 | \usetheme[block=fill,subsectionpage=progressbar,sectionpage=progressbar]{metropolis} 8 | \usepackage{graphicx} 9 | 10 | \usepackage{wasysym} 11 | \usepackage{etoolbox} 12 | \usepackage[utf8]{inputenc} 13 | 14 | \usepackage{threeparttable} 15 | \usepackage{subcaption} 16 | 17 | \usepackage{tikz-qtree} 18 | \setbeamercovered{still covered={\opaqueness<1->{5}},again covered={\opaqueness<1->{100}}} 19 | 20 | 21 | \usepackage{listings} 22 | 23 | \lstset{ 24 | basicstyle=\scriptsize\ttfamily, 25 | columns=flexible, 26 | breaklines=true, 27 | numbers=left, 28 | %stepsize=1, 29 | numberstyle=\tiny, 30 | backgroundcolor=\color[rgb]{0.85,0.90,1} 31 | } 32 | 33 | 34 | 35 | \lstnewenvironment{lstlistingoutput}{\lstset{basicstyle=\footnotesize\ttfamily, 36 | columns=flexible, 37 | breaklines=true, 38 | numbers=left, 39 | %stepsize=1, 40 | numberstyle=\tiny, 41 | backgroundcolor=\color[rgb]{.7,.7,.7}}}{} 42 | 43 | 44 | \lstnewenvironment{lstlistingoutputtiny}{\lstset{basicstyle=\tiny\ttfamily, 45 | columns=flexible, 46 | breaklines=true, 47 | numbers=left, 48 | %stepsize=1, 49 | numberstyle=\tiny, 50 | backgroundcolor=\color[rgb]{.7,.7,.7}}}{} 51 | 52 | 53 | 54 | \usepackage[american]{babel} 55 | \usepackage{csquotes} 56 | \usepackage[style=apa, backend = biber]{biblatex} 57 | \DeclareLanguageMapping{american}{american-UoN} 58 | \addbibresource{../../bdaca.bib} 59 | \renewcommand*{\bibfont}{\tiny} 60 | 61 | \usepackage{tikz} 62 | \usetikzlibrary{shapes,arrows,matrix} 63 | \usepackage{multicol} 64 | 65 | \usepackage{subcaption} 66 | 67 | \usepackage{booktabs} 68 | \usepackage{graphicx} 69 | 70 | 71 | 72 | \makeatletter 73 | \setbeamertemplate{headline}{% 74 | \begin{beamercolorbox}[colsep=1.5pt]{upper separation line head} 75 | \end{beamercolorbox} 76 | \begin{beamercolorbox}{section in head/foot} 77 | \vskip2pt\insertnavigation{\paperwidth}\vskip2pt 78 | \end{beamercolorbox}% 79 | \begin{beamercolorbox}[colsep=1.5pt]{lower separation line head} 80 | \end{beamercolorbox} 81 | } 82 | \makeatother 83 | 84 | 85 | 86 | \setbeamercolor{section in head/foot}{fg=normal text.bg, bg=structure.fg} 87 | 88 | 89 | 90 | \newcommand{\question}[1]{ 91 | \begin{frame}[plain] 92 | \begin{columns} 93 | \column{.3\textwidth} 94 | \makebox[\columnwidth]{ 95 | \includegraphics[width=\columnwidth,height=\paperheight,keepaspectratio]{../media/mannetje.png}} 96 | \column{.7\textwidth} 97 | \large 98 | \textcolor{orange}{\textbf{\emph{#1}}} 99 | \end{columns} 100 | \end{frame}} 101 | 102 | 103 | 104 | \title[Big Data and 
Automated Content Analysis]{\textbf{Teaching the Teacher} \\ Day 4 - Afternoon » From text to features: Natural Language Processing «} 105 | \author[Anne Kroon]{Anne Kroon \\ ~ \\ \footnotesize{a.c.kroon@uva.nl \\@annekroon} \\ } 106 | \date{July 1, 2021} 107 | \institute[UvA]{Afdeling Communicatiewetenschap \\Universiteit van Amsterdam} 108 | 109 | 110 | 111 | \begin{document} 112 | 113 | \begin{frame}{} 114 | \titlepage 115 | \end{frame} 116 | 117 | \begin{frame}{Today} 118 | \tableofcontents 119 | \end{frame} 120 | 121 | 122 | \section{From text to features: vectorizers} 123 | \begin{frame}[plain] 124 | From text to features: vectorizers 125 | \end{frame} 126 | 127 | 128 | 129 | \subsection{General idea} 130 | 131 | \begin{frame}[fragile]{A text as a collections of word} 132 | 133 | Let us represent a string 134 | \begin{lstlisting} 135 | t = "This this is is is a test test test" 136 | \end{lstlisting} 137 | like this:\\ 138 | \begin{lstlisting} 139 | from collections import Counter 140 | print(Counter(t.split())) 141 | \end{lstlisting} 142 | \begin{lstlistingoutput} 143 | Counter({'is': 3, 'test': 3, 'This': 1, 'this': 1, 'a': 1}) 144 | \end{lstlistingoutput} 145 | 146 | \pause 147 | Compared to the original string, this representation 148 | \begin{itemize} 149 | \item is less repetitive 150 | \item preserves word frequencies 151 | \item but does \emph{not} preserve word order 152 | \item can be interpreted as a vector to calculate with (!!!) 153 | \end{itemize} 154 | 155 | \tiny{\emph{Of course, still a lot of stuff to fine-tune\ldots} (for example, This/this)} 156 | \end{frame} 157 | 158 | 159 | 160 | \begin{frame}{From vector to matrix} 161 | If we do this for multiple texts, we can arrange the vectors in a table. 162 | 163 | t1 = "This this is is is a test test test" \newline 164 | t2 = "This is an example" 165 | 166 | \begin{tabular}{| c|c|c|c|c|c|c|c|} 167 | \hline 168 | & a & an & example & is & this & This & test \\ 169 | \hline 170 | \emph{t1} & 1 & 0 & 0 & 3 & 1 & 1 & 3 \\ 171 | \emph{t2} &0 & 1 & 1 & 1 & 0 & 1 & 0 \\ 172 | \hline 173 | \end{tabular} 174 | \end{frame} 175 | 176 | 177 | \question{What can you do with such a matrix? Why would you want to represent a collection of texts in such a way?} 178 | 179 | \begin{frame}{What is a vectorizer} 180 | \begin{itemize}[<+->] 181 | \item Transforms a list of texts into a sparse (!) matrix (of word frequencies) 182 | \item Vectorizer needs to be ``fitted'' to the training data (learn which words (features) exist in the dataset and assign them to columns in the matrix) 183 | \item Vectorizer can then be re-used to transform other datasets 184 | \end{itemize} 185 | \end{frame} 186 | 187 | 188 | \begin{frame}{The cell entries: raw counts versus tf$\cdot$idf scores} 189 | \begin{itemize} 190 | \item In the example, we entered simple counts (the ``term frequency'') 191 | \end{itemize} 192 | \end{frame} 193 | 194 | \question{But are all terms equally important?} 195 | 196 | 197 | \begin{frame}{The cell entries: raw counts versus tf$\cdot$idf scores} 198 | \begin{itemize} 199 | \item In the example, we entered simple counts (the ``term frequency'') 200 | \item But does a word that occurs in almost all documents contain much information? 201 | \item And isn't the presence of a word that occurs in very few documents a pretty strong hint? 
202 | \item<2-> \textbf{Solution: Weigh by \emph{the number of documents in which the term occurs at least once) (the ``document frequency'')}} 203 | \end{itemize} 204 | \onslide<3->{ 205 | $\Rightarrow$ we multiply the ``term frequency'' (tf) by the inverse document frequency (idf) 206 | 207 | \tiny{(usually with some additional logarithmic transformation and normalization applied, see \url{https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html})} 208 | } 209 | \end{frame} 210 | 211 | \begin{frame}{tf$\cdot$idf} 212 | \begin{array}{ccc} 213 | 214 | w_{i, j}=t f_{i, j} \times \log \left(\frac{N}{d f_{i}}\right) \\ \\ 215 | 216 | t f_{i, j}=\text { number of occurrences of } i \text { in } j \\ 217 | d f_{i}=\text { number of documents containing } i \\ 218 | N=\text {total number of documents } 219 | \end{array} 220 | \end{frame} 221 | 222 | \begin{frame}{Is tf$\cdot$idf always better?} 223 | It depends. 224 | 225 | \begin{itemize} 226 | \item Ultimately, it's an empirical question which works better ($\rightarrow$ machine learning) 227 | \item In many scenarios, ``discounting'' too frequent words and ``boosting'' rare words makes a lot of sense (most frequent words in a text can be highly un-informative) 228 | \item Beauty of raw tf counts, though: interpretability + describes document in itself, not in relation to other documents 229 | \end{itemize} 230 | \end{frame} 231 | 232 | 233 | \begin{frame}{Different vectorizers} 234 | \begin{enumerate}[<+->] 235 | \item CountVectorizer (=simple word counts) 236 | \item TfidfVectorizer (word counts (``term frequency'') weighted by number of documents in which the word occurs at all (``inverse document frequency'')) 237 | \end{enumerate} 238 | \end{frame} 239 | 240 | \begin{frame}{Internal representations} 241 | \begin{block}{Sparse vs dense matrices} 242 | \begin{itemize} 243 | \item $\rightarrow$ tens of thousands of columns (terms), and one row per document 244 | \item Filling all cells is inefficient \emph{and} can make the matrix too large to fit in memory (!!!) 245 | \item Solution: store only non-zero values with their coordinates! (sparse matrix) 246 | \item dense matrix (or dataframes) not advisable, only for toy examples 247 | \end{itemize} 248 | \end{block} 249 | \end{frame} 250 | 251 | 252 | {\setbeamercolor{background canvas}{bg=black} 253 | \begin{frame} 254 | \makebox[\linewidth]{ 255 | \includegraphics[width=\paperwidth,height=\paperheight,keepaspectratio]{../media/sparse_dense.png}} 256 | \url{https://matteding.github.io/2019/04/25/sparse-matrices/} 257 | \end{frame} 258 | } 259 | 260 | 261 | \begin{frame}[standout] 262 | This morning we learned how to tokenize with a list comprehension (and that's often a good idea!). But what if we want to \emph{directly} get a DTM instead of lists of tokens? 263 | \end{frame} 264 | 265 | 266 | \begin{frame}[fragile]{OK, good enough, perfect?} 267 | \begin{block}{scikit-learn's CountVectorizer (default settings)} 268 | \begin{itemize} 269 | \item applies lowercasing 270 | \item deals with punctuation etc. 
itself 271 | \item minimum word length $>1$ 272 | \item more technically, tokenizes using this regular expression: \texttt{r"(?u)\textbackslash b\textbackslash w\textbackslash w+\textbackslash b"} \footnote{?u = support unicode, \textbackslash b = word boundary} 273 | \end{itemize} 274 | \end{block} 275 | \begin{lstlisting} 276 | from sklearn.feature_extraction.text import CountVectorizer 277 | cv = CountVectorizer() 278 | dtm_sparse = cv.fit_transform(docs) 279 | \end{lstlisting} 280 | \end{frame} 281 | 282 | 283 | \begin{frame}{OK, good enough, perfect?} 284 | \begin{block}{CountVectorizer supports more} 285 | \begin{itemize} 286 | \item stopword removal 287 | \item custom regular expression 288 | \item or even using an external tokenizer 289 | \item ngrams instead of unigrams 290 | \end{itemize} 291 | \end{block} 292 | \tiny{see \url{https://scikit-learn.org/stable/modules/generated/sklearn.feature\_extraction.text.CountVectorizer.html}} 293 | 294 | \pause 295 | \begin{alertblock}{Best of both worlds} 296 | \textbf{Use the Count vectorizer with a NLTK-based external tokenizer! (see book)} 297 | \end{alertblock} 298 | \end{frame} 299 | 300 | 301 | \subsection{Pruning} 302 | 303 | \begin{frame}{General idea} 304 | \begin{itemize} 305 | \item Idea behind both stopword removal and tf$\cdot$idf: too frequent words are uninformative 306 | \item<2-> (possible) downside stopword removal: a priori list, does not take empirical frequencies in dataset into account 307 | \item<3-> (possible) downside tf$\cdot$idf: does not reduce number of features 308 | \end{itemize} 309 | 310 | \onslide<4->{Pruning: remove all features (tokens) that occur in less than X or more than X of the documents} 311 | \end{frame} 312 | 313 | \begin{frame}[fragile, plain] 314 | CountVectorizer, only stopword removal 315 | \begin{lstlisting} 316 | from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer 317 | myvectorizer = CountVectorizer(stop_words=mystopwords) 318 | \end{lstlisting} 319 | 320 | CountVectorizer, better tokenization, stopword removal (pay attention that stopword list uses same tokenization!): 321 | \begin{lstlisting} 322 | myvectorizer = CountVectorizer(tokenizer = TreebankWordTokenizer().tokenize, stop_words=mystopwords) 323 | \end{lstlisting} 324 | 325 | Additionally remove words that occur in more than 75\% or less than $n=2$ documents: 326 | \begin{lstlisting} 327 | myvectorizer = CountVectorizer(tokenizer = TreebankWordTokenizer().tokenize, stop_words=mystopwords, max_df=.75, min_df=2) 328 | \end{lstlisting} 329 | 330 | All togehter: tf$\cdot$idf, explicit stopword removal, pruning 331 | \begin{lstlisting} 332 | myvectorizer = TfidfVectorizer(tokenizer = TreebankWordTokenizer().tokenize, stop_words=mystopwords, max_df=.75, min_df=2) 333 | \end{lstlisting} 334 | 335 | 336 | \end{frame} 337 | 338 | 339 | \question{What is ``best''? 
Which (combination of) techniques to use, and how to decide?} 340 | 341 | 342 | \section{Teaching Q\&A} 343 | 344 | \begin{frame}{NLP and teaching} 345 | \begin{block}{Teaching experiences} 346 | \begin{itemize} 347 | \item Transparancy 348 | \item Students should be able to explain HOW they've preprocessed the data and WHY 349 | \item Arguments for preprocessing differ across unsupervised and supervised tasks 350 | \end{itemize} 351 | \end{block} 352 | \end{frame} 353 | 354 | 355 | \begin{frame}{NLP and teaching} 356 | \begin{block}{Teaching experiences} 357 | \begin{itemize} 358 | \item Rationale for using preprocessing: Why do you use specific techniques? 359 | \item For supervised learning: often an empirical question 360 | \item Thus: testing different setting and explain what works best. Systematically testing different techniques 361 | \end{itemize} 362 | \end{block} 363 | \end{frame} 364 | 365 | \begin{frame}{NLP and teaching} 366 | \begin{block}{Sentiment analysis} 367 | \begin{itemize} 368 | \item Dictionary-based approaches 369 | \item Keep in mind; what are best practices? Off-the-shelf do not necessarily generalize well. 370 | \end{itemize} 371 | \end{block} 372 | \end{frame} 373 | 374 | \begin{frame}{Thank you!!} 375 | \begin{block}{Thank you for your attention!} 376 | \begin{itemize} 377 | \item Questions? Comments? 378 | \end{itemize} 379 | \end{block} 380 | \end{frame} 381 | 382 | 383 | 384 | \end{document} 385 | 386 | 387 | -------------------------------------------------------------------------------- /2021/day4/day4.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2021/day4/day4.pdf -------------------------------------------------------------------------------- /2021/day4/example-nltk.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | #### Example using a single `str` object 4 | 5 | ```python 6 | import nltk 7 | words = "the quick brown fox jumps over the lazy dog".split() 8 | nltk.pos_tag(words, tagset='universal') 9 | ``` 10 | 11 | #### Example using a `list` of `str` 12 | 13 | ```python 14 | articles = ['the quick brown fox jumps over the lazy dog', 'a second sentence'] 15 | tokens = [nltk.word_tokenize(sentence) for sentence in articles] 16 | tagged = [nltk.pos_tag(sentence, tagset='universal') for sentence in tokens] 17 | print(tagged[0]) 18 | ``` 19 | 20 | ----- 21 | 22 | | Tag | Meaning | English Examples | 23 | |------|---------------------|----------------------------------------| 24 | | ADJ | adjective | new, good, high, special, big, local | 25 | | ADP | adposition | on, of, at, with, by, into, under | 26 | | ADV | adverb | really, already, still, early, now | 27 | | CONJ | conjunction | and, or, but, if, while, although | 28 | | DET | determiner, article | the, a, some, most, every, no, which | 29 | | NOUN | noun | year, home, costs, time, Africa | 30 | | NUM | numeral | twenty-four, fourth, 1991, 14:24 | 31 | | PRT | particle | at, on, out, over per, that, up, with | 32 | | PRON | pronoun | he, their, her, its, my, I, us | 33 | | VERB | verb | is, say, told, given, playing, would | 34 | | . | punctuation marks | . , ; ! 
| 35 | | X | other | ersatz, esprit, dunno, gr8, univeristy | 36 | 37 | [source](https://bond-lab.github.io/Corpus-Linguistics/ntumc_tag_u.html) 38 | -------------------------------------------------------------------------------- /2021/day4/example-vectorizer-to-dense.md: -------------------------------------------------------------------------------- 1 | 2 | ```python 3 | import pandas as pd 4 | from sklearn.feature_extraction.text import CountVectorizer 5 | texts = ["hello teachers!", "how are you today?", "what?", "hello hello everybody"] 6 | 7 | vect = CountVectorizer() 8 | 9 | X = vect.fit_transform(texts) 10 | print(pd.DataFrame(X.A, columns=vect.get_feature_names()).to_string()) 11 | df = pd.DataFrame(X.toarray().transpose(), index = vect.get_feature_names()) 12 | ``` 13 | -------------------------------------------------------------------------------- /2021/day4/exercises-1/exercise-1.md: -------------------------------------------------------------------------------- 1 | # Working with textual data 2 | 3 | ### 0. Get the data. 4 | 5 | - Download `articles.tar.gz` from 6 | https://dx.doi.org/10.7910/DVN/ULHLCB 7 | 8 | - Unpack it. On Linux and MacOS, you can do this with `tar -xzf articles.tar.gz` on the command line. On Windows, you may need an additional tool such as `7zip` for that (note that technically speaking, there is a `tar` archive within a `gz` archive, so unpacking may take *two* steps depending on your tool). 9 | 10 | 11 | ### 1. Inspect the structure of the dataset. 12 | What information do the following elements give you? 13 | 14 | - folder (directory) names 15 | - folder structure/hierarchy 16 | - file names 17 | - file contents 18 | 19 | ### 2. Discuss strategies for working with this dataset! 20 | 21 | - Which questions could you answer? 22 | - How could you deal with it, given the size and the structure? 23 | - How much memory1 (RAM) does your computer have? How large is the complete dataset? What does that mean? 24 | - Make a sketch (e.g., with pen & paper) of how you could handle your workflow and your data to answer your question. 25 | 26 | 1 *memory* (RAM), not *storage* (harddisk)! 27 | 28 | ### 3. Read some (or all?) data 29 | 30 | Here is some example code that you can modify. Assuming that the folder `articles` is in the same folder as the notebook you are currently working on, you could, for instance, do the following to read a *part* of your dataset. 31 | 32 | ```python 33 | from glob import glob 34 | infowarsfiles = glob('articles/*/Infowars/*') 35 | infowarsarticles = [] 36 | for filename in infowarsfiles: 37 | with open(filename) as f: 38 | infowarsarticles.append(f.read()) 39 | ``` 40 | 41 | - Can you explain what the `glob` function does? 42 | - What does `infowarsfiles` contain, and what does `infowarsarticles` contain? First make an educated guess based on the code snippet, then check it! Do *not* print the whole thing, but use `len`, `type` and slicing `[:10]` to get the info you need. 43 | 44 | - Tip: take a random sample of the articles for practice purposes (if your code works, you can scale up!) 45 | 46 | ``` 47 | # taking a random sample of the articles for practice purposes 48 | articles = random.sample(infowarsarticles, 10) 49 | ``` 50 | 51 | ### 4. Perform some analyses! 52 | 53 | - Perform some first analyses on the data using string methods and regular expressions! 54 | 55 | Techniques you can try out include: 56 | 57 | a. lowercasing 58 | 59 | b. tokenization 60 | 61 | c. stopword removal 62 | 63 | d. 
stemming and/or lemmatizing) 64 | 65 | 66 | 67 | If you want to tokenize and stem your data using `spacy`, you need to install `spacy` and the language model. Run the following in the your terminal environment: 68 | 69 | ```bash 70 | pip3 install spacy 71 | python3 -m spacy download en_core_web_sm 72 | ``` 73 | 74 | ### 5. extract Information 75 | 76 | Try to extract meaningful information from your texts. Depending on your interests and the nature of the data, you could: 77 | 78 | - use regular expressions to distinguish relevant from irrelevant texts, or to extract substrings 79 | - use NLP techniques such as Named Entity Recognition to extract entities that occur. 80 | 81 | 82 | ### BONUS: Inceasing efficiency + reusability 83 | The approach under (3) gets you very far. 84 | But for those of you who want to go the extra mile, here are some suggestions for further improvements in handling such a large dataset, consisting of thousands of files, and for deeper thinking about data handling: 85 | 86 | - Consider writing a function to read the data. Let your function take three parameters as input, `basepath` (where is the folder with articles located?), `month` and `outlet`, and return the articles that match this criterion. 87 | - Even better, make it a *generator* that yields the articles instead of returning a whole list. 88 | - Consider yielding a dict (with date, outlet, and the article itself) instead of yielding only the article text. 89 | - Think of the most memory-efficient way to get an overview of how often a given regular expression R is mentioned per outlet! 90 | - Under which circumstances would you consider having your function for reading the data return a pandas dataframe? 91 | -------------------------------------------------------------------------------- /2021/day4/exercises-1/possible-solution-exercise-1.md: -------------------------------------------------------------------------------- 1 | ## Exercise 2: Working with textual data - possible solutions 2 | 3 | ```python 4 | from glob import glob 5 | import random 6 | import nltk 7 | from nltk.stem.snowball import SnowballStemmer 8 | import spacy 9 | 10 | 11 | infowarsfiles = glob('articles/*/Infowars/*') 12 | infowarsarticles = [] 13 | for filename in infowarsfiles: 14 | with open(filename) as f: 15 | infowarsarticles.append(f.read()) 16 | 17 | 18 | # taking a random sample of the articles for practice purposes 19 | articles =random.sample(infowarsarticles, 10) 20 | 21 | ``` 22 | 23 | ### [Task 4](https://github.com/uvacw/teachteacher-python/blob/main/day4/exercises-1/exercise-1.md#4-perform-some-analyses): Preprocessing data 24 | 25 | ##### a. lowercasing articles 26 | 27 | ```python 28 | articles_lower_cased = [art.lower() for art in articles] 29 | ``` 30 | 31 | ##### b. tokenization 32 | 33 | Basic solution, using the `.str` method `.split()`. Not very sophisticated, though. 34 | 35 | ```python 36 | articles_split = [art.split() for art in articles] 37 | ``` 38 | 39 | A more sophisticated solution: 40 | 41 | ```python 42 | from nltk.tokenize import TreebankWordTokenizer 43 | articles_tokenized = [TreebankWordTokenizer().tokenize(art) for art in articles ] 44 | ``` 45 | 46 | ##### c. 
removing stopwords 47 | 48 | Define your stopword list: 49 | 50 | ```python 51 | from nltk.corpus import stopwords 52 | mystopwords = stopwords.words("english") 53 | mystopwords.extend(["add", "more", "words"]) # manually add more stopwords to your list if needed 54 | print(mystopwords) # let's see what's inside 55 | ``` 56 | 57 | Now, remove stopwords from the corpus: 58 | 59 | ```python 60 | articles_without_stopwords = [] 61 | for article in articles: 62 | articles_no_stop = "" 63 | for word in article.lower().split(): 64 | if word not in mystopwords: 65 | articles_no_stop = articles_no_stop + " " + word 66 | articles_without_stopwords.append(articles_no_stop) 67 | ``` 68 | 69 | Same solution, but with a list comprehension: 70 | 71 | ```python 72 | articles_without_stopwords = [" ".join([w for w in article.lower().split() if w not in mystopwords]) for article in articles] 73 | ``` 74 | 75 | A different--probably more sophisticated--solution: write a function and call it in a list comprehension: 76 | 77 | ```python 78 | def remove_stopwords(article, stopwordlist): 79 | cleantokens = [] 80 | for word in article: 81 | if word.lower() not in stopwordlist: 82 | cleantokens.append(word) 83 | return cleantokens 84 | 85 | articles_without_stopwords = [remove_stopwords(art, mystopwords) for art in articles_tokenized] 86 | ``` 87 | 88 | It's good practice to frequently inspect the results of your code, to make sure you are not making mistakes and the results make sense. For example, compare your results to some random articles from the original sample: 89 | 90 | ```python 91 | print(articles[8][:100]) 92 | print("-----------------") 93 | print(" ".join(articles_without_stopwords[8])[:100]) 94 | ``` 95 | 96 | ##### d. stemming and lemmatization 97 | 98 | ```python 99 | stemmer = SnowballStemmer("english") 100 | 101 | stemmed_text = [] 102 | for article in articles: 103 | stemmed_words = "" 104 | for word in article.lower().split(): 105 | stemmed_words = stemmed_words + " " + stemmer.stem(word) 106 | stemmed_text.append(stemmed_words.strip()) 107 | ``` 108 | 109 | Same solution, but with a list comprehension: 110 | 111 | ```python 112 | stemmed_text = [" ".join([stemmer.stem(w) for w in article.lower().split()]) for article in articles] 113 | ``` 114 | 115 | Compare tokenization and lemmatization using `spacy`: 116 | 117 | ```python 118 | import spacy 119 | nlp = spacy.load("en_core_web_sm") 120 | lemmatized_articles = [[token.lemma_ for token in nlp(art)] for art in articles] 121 | ``` 122 | 123 | 124 | Again, frequently inspect your code, and for example compare the results to the original articles: 125 | 126 | 127 | ```python 128 | print(articles[6][:100]) 129 | print("-----------------") 130 | print(stemmed_text[6][:100]) 131 | print("-----------------") 132 | print(" ".join(lemmatized_articles[6])[:100]) 133 | ``` 134 | 135 | ### [Task 5](https://github.com/uvacw/teachteacher-python/blob/main/day4/exercises-1/exercise-1.md#5-extract-information): Extract information 136 | 137 | ```python 138 | import nltk 139 | 140 | tokens = [nltk.word_tokenize(art) for art in articles] 141 | tagged = [nltk.pos_tag(tok) for tok in tokens] 142 | print(tagged[0]) 143 | ``` 144 | 145 | Playing around with `spacy`: 146 | 147 | ```python 148 | nlp = spacy.load("en_core_web_sm") 149 | 150 | docs = [nlp(art) for art in articles] 151 | for doc in docs: 152 | for ent in doc.ents: 153 | if ent.label_ == 'PERSON': 154 | print(ent.text, ent.label_) 155 | 156 | ``` 157 | 
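
A possible extension (not part of the original solution, and assuming the `articles` sample and the installed `en_core_web_sm` model from the snippets above): aggregate the extracted entities to see which persons are mentioned most often across the sample.

```python
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")

# count how often each PERSON entity occurs across all sampled articles
person_counts = Counter(
    ent.text
    for art in articles
    for ent in nlp(art).ents
    if ent.label_ == "PERSON"
)
print(person_counts.most_common(10))
```
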
-------------------------------------------------------------------------------- /2021/day4/exercises-2/exercise-2.md: -------------------------------------------------------------------------------- 1 | # Exercise 2: From text to features 2 | ---- 3 | 4 | Try to take some of the data from the [exercise of this morning](https://github.com/uvacw/teachteacher-python/blob/main/day4/exercises-1/exercise-1.md), and prepare this data for a supervised classification task. More specifically, imagine you want to train a classifier that will predict whether articles come from a fake news source (e.g., `Infowars`) or a quality news outlet (e.g., `bbc`). In other words, you want to predict `source` based on linguistic variations in the articles. 5 | 6 | To arrive at a model that will do just that, please consider taking the following steps: 7 | 8 | - Think about your **pre-processing steps**: what type of features will you feed your algorithm? Do you, for example, want to manually remove stopwords, or include ngrams? You can use the code you've written this morning as a starting point. 9 | 10 | - **Vectorize the data**: Try to fit different vectorizers to the data. You can use `count` vs. `tfidf` vectorizers, with or without pruning, stopword removal, etc. 11 | 12 | - Try out a simple supervised model. Find some inspiration [here](https://github.com/uvacw/teachteacher-python/blob/main/day4/exercises-2/possible-solution-exercise-2.md#build-a-simple-classifier). Can you predict the `source` using linguistic variations in the articles? 13 | 14 | - Which combination of pre-processing steps + vectorizer gives the best results? 15 | 16 | ## BONUS 17 | 18 | - Compare that bottom-up approach with a top-down (keyword or regular-expression based) approach. 19 | -------------------------------------------------------------------------------- /2021/day4/exercises-2/fix_example_book.md: -------------------------------------------------------------------------------- 1 | # Example in the book p. 231 2 | 3 | On page 231, there is an example that involves the line 4 | 5 | ```python3 6 | cv = CountVectorizer(tokenizer=mytokenizer.tokenize) 7 | ``` 8 | 9 | This example only works if `mytokenizer` has been "instantiated" before, and that instruction is missing. 10 | 11 | Essentially, it assumes that this example from page 230 has been run before 12 | 13 | ```python3 14 | from sklearn.feature_extraction.text import CountVectorizer 15 | import nltk 16 | from nltk.tokenize import TreebankWordTokenizer 17 | import regex 18 | 19 | class MyTokenizer: 20 | def tokenize(self, text): 21 | result = [] 22 | word = r"\p{letter}" 23 | for sent in nltk.sent_tokenize(text): 24 | tokens = TreebankWordTokenizer().tokenize(sent) 25 | tokens = [t for t in tokens if regex.search(word, t)] 26 | result += tokens 27 | return result 28 | ``` 29 | 30 | **and** that you create an "instance" of this class with the following command: 31 | ```python3 32 | mytokenizer = MyTokenizer() 33 | ``` 34 | 35 | Then, the command ```cv = CountVectorizer(tokenizer=mytokenizer.tokenize)``` will run as expected. 36 | 37 | -------------------------------------------------------------------------------- /2021/day4/exercises-2/possible-solution-exercise-2.md: -------------------------------------------------------------------------------- 1 | 2 | ## Exercise 2: From text to features - possible solutions 3 | 4 | ### Trying out different preprocessing steps 5 | 6 | Load the data... 
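
Before running the snippet below, it may help to check that the `articles` folder from the morning exercise can actually be found from the current working directory (a quick, hypothetical sanity check, not part of the original solution):

```python
from glob import glob

# should print a number > 0 if the articles folder is in place
print(len(glob('articles/*/Infowars/*')))
```
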
7 | 8 | ```python 9 | from glob import glob 10 | infowarsfiles = glob('articles/*/Infowars/*') 11 | documents = [] 12 | for filename in infowarsfiles: 13 | with open(filename) as f: 14 | documents.append(f.read()) 15 | ``` 16 | Let's inspect the data and start some pre-processing/cleaning steps: 17 | 18 | ```python 19 | ## From text to features. 20 | documents[17] # print a random article to inspect. 21 | ## Typical cleaning-up steps: 22 | from string import punctuation 23 | documents = [doc.replace('\n\n', '') for doc in documents] # remove (double) line breaks 24 | documents = ["".join([w for w in doc if w not in punctuation]) for doc in documents] # remove punctuation 25 | documents = [doc.lower() for doc in documents] # convert to lower case 26 | documents = [" ".join(doc.split()) for doc in documents] # remove double spaces by splitting the strings into words and joining these words again 27 | 28 | documents[17] # print the same article to see whether the changes are in line with what you want 29 | ``` 30 | 31 | Removing stopwords: 32 | 33 | ```python 34 | from nltk.corpus import stopwords 35 | mystopwords = set(stopwords.words('english')) # use default NLTK stopword list; alternatively: 36 | # mystopwords = set(open('mystopwordfile.txt').readlines()) # read stopword list from a text file with one stopword per line 37 | documents = [" ".join([w for w in doc.split() if w not in mystopwords]) for doc in documents] 38 | documents[7] 39 | ``` 40 | Using N-grams as features: 41 | 42 | ```python 43 | import nltk 44 | documents_bigrams = [["_".join(tup) for tup in nltk.ngrams(doc.split(),2)] for doc in documents] # creates bigrams 45 | documents_bigrams[7][:5] # inspect the results... 46 | 47 | # maybe we want both unigrams and bigrams in the feature set? 48 | 49 | assert len(documents)==len(documents_bigrams) 50 | 51 | documents_uniandbigrams = [] 52 | for a,b in zip([doc.split() for doc in documents],documents_bigrams): 53 | documents_uniandbigrams.append(a + b) 54 | 55 | # and let's inspect the outcomes again. 56 | documents_uniandbigrams[7] 57 | len(documents_uniandbigrams[7]),len(documents_bigrams[7]),len(documents[7].split()) 58 | ``` 59 | Or, if you want to inspect collocations: 60 | 61 | ```python 62 | text = [nltk.Text(tkn for tkn in doc.split()) for doc in documents] 63 | text[7].collocations(num=10) 64 | ``` 65 | 66 | ---------- 67 | 68 | ### Vectorize the data 69 | 70 | ```python 71 | from glob import glob 72 | import random 73 | 74 | def read_data(listofoutlets): 75 | texts = [] 76 | labels = [] 77 | for label in listofoutlets: 78 | for file in glob(f'articles/*/{label}/*'): 79 | with open(file) as f: 80 | texts.append(f.read()) 81 | labels.append(label) 82 | return texts, labels 83 | 84 | X, y = read_data(['Infowars', 'BBC']) # choose your own news outlets 85 | 86 | ``` 87 | 88 | 89 | ```python 90 | # split the dataset into a train and a test sample 91 | from sklearn.model_selection import train_test_split 92 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) 93 | ``` 94 | 95 | Define some vectorizers. 96 | You can try out different variations: 97 | - `count` versus `tfidf` 98 | - with / without a stopword list 99 | - with / without pruning 100 | 101 | 102 | ```python 103 | from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer 104 | 105 | myvectorizer = CountVectorizer(stop_words=mystopwords) # you can further modify this yourself. 106 | 107 | # Fit the vectorizer, and transform. 
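# Note: the vectorizer is fitted on the training texts only, so the vocabulary
# is learned from X_train; X_test is then transformed with that same vocabulary,
# which avoids leaking information from the test split into the features.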
108 | X_features_train = myvectorizer.fit_transform(X_train) 109 | X_features_test = myvectorizer.transform(X_test) 110 | 111 | ``` 112 | ### Build a simple classifier 113 | 114 | Now, lets build a simple classifier and predict outlet based on textual features: 115 | 116 | ```python 117 | from sklearn.naive_bayes import MultinomialNB 118 | from sklearn.metrics import accuracy_score 119 | from sklearn.metrics import classification_report 120 | 121 | model = MultinomialNB() 122 | model.fit(X_features_train, y_train) 123 | y_pred = model.predict(X_features_test) 124 | 125 | print(f"Accuracy : {accuracy_score(y_test, y_pred)}") 126 | print(classification_report(y_test, y_pred)) 127 | 128 | ``` 129 | 130 | Can you improve this classifier when using different vectorizers? 131 | 132 | ---- 133 | 134 | 135 | 136 | 137 | *hint: if you want to include n-grams as feature input, add the following argument to your vectorizer:* 138 | 139 | ```python 140 | myvectorizer= CountVectorizer(analyzer=lambda x:x) 141 | ``` 142 | -------------------------------------------------------------------------------- /2021/day4/literature-examples.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | # Literature with examples about preprocessing steps 4 | 5 | [Example unsupervised](http://vanatteveldt.com/p/jacobi2015_lda.pdf) 6 | 7 | [Example supervised](https://www.tandfonline.com/doi/full/10.1080/19312458.2018.1455817) 8 | -------------------------------------------------------------------------------- /2021/day5/01-MachineLearning_Introduction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Modeling & Machine Learning - an introduction\n", 8 | "\n", 9 | "\n", 10 | "## Where are we in the course?\n", 11 | "\n", 12 | "After making progress in data understanding and preparation we will discuss modelling and have a brief introduction to Machine Learning. \n", 13 | "\n", 14 | "\n", 15 | "## Where are we in the data analysis process?\n", 16 | "\n", 17 | "\"Source:*Source: Wikipedia*\n", 18 | "\n" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": null, 24 | "metadata": {}, 25 | "outputs": [], 26 | "source": [] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "# ML versus statistics\n", 33 | "\n", 34 | " Source: [sandserif](https://www.instagram.com/sandserifcomics/)" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "## Machine Learning in Python\n", 47 | "\n", 48 | "\n", 49 | "\n", 50 | "We will use the package [scikit-learn](http://scikit-learn.org/) to do Machine Learning in Python. This is one of the most widely used packages in Python for Machine Learning, and is quite flexible and complete when it comes to the types of models it can implement, or data that it can use.\n", 51 | "\n", 52 | "\n", 53 | "\n", 54 | "Scikit-learn is usually installed together with Python in your machine by anaconda. Before continuing, however, make sure that you have the latest version installed. To do so, go to Terminal on Mac, or Command Prompt/Line in Windows (run as an administrator), and type ```conda install scikit-learn``` . Conda will check if scikit-learn needs to be updated. 
\n", 55 | "\n", 56 | "* **Note:** [This video](https://www.youtube.com/watch?v=_wCs2vvBCTM) contains more information on how to update or install packages with conda.\n", 57 | "\n", 58 | "\n" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | "# The different types of ML\n", 66 | "\n", 67 | "Let's discuss the Sklearn Machine Learning Map:\n", 68 | "https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 | "metadata": {}, 75 | "outputs": [], 76 | "source": [] 77 | } 78 | ], 79 | "metadata": { 80 | "kernelspec": { 81 | "display_name": "Python 3", 82 | "language": "python", 83 | "name": "python3" 84 | }, 85 | "language_info": { 86 | "codemirror_mode": { 87 | "name": "ipython", 88 | "version": 3 89 | }, 90 | "file_extension": ".py", 91 | "mimetype": "text/x-python", 92 | "name": "python", 93 | "nbconvert_exporter": "python", 94 | "pygments_lexer": "ipython3", 95 | "version": "3.8.5" 96 | }, 97 | "latex_envs": { 98 | "bibliofile": "biblio.bib", 99 | "cite_by": "apalike", 100 | "current_citInitial": 1, 101 | "eqLabelWithNumbers": true, 102 | "eqNumInitial": 0 103 | } 104 | }, 105 | "nbformat": 4, 106 | "nbformat_minor": 1 107 | } 108 | -------------------------------------------------------------------------------- /2021/day5/02-Unsupervised-Machine-Learning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Unsupervised Machine Learning\n", 8 | "\n", 9 | "Some examples for our discussion:\n", 10 | "* *With \"numbers\"*: [Clustering with Scikit-Learn's KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans)\n", 11 | "* *With \"text\"*: [LDA topic models with Gensim](https://radimrehurek.com/gensim/models/ldamodel.html) and [PyLDAVis](https://github.com/bmabey/pyLDAvis)" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": null, 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [] 20 | } 21 | ], 22 | "metadata": { 23 | "kernelspec": { 24 | "display_name": "Python 3", 25 | "language": "python", 26 | "name": "python3" 27 | }, 28 | "language_info": { 29 | "codemirror_mode": { 30 | "name": "ipython", 31 | "version": 3 32 | }, 33 | "file_extension": ".py", 34 | "mimetype": "text/x-python", 35 | "name": "python", 36 | "nbconvert_exporter": "python", 37 | "pygments_lexer": "ipython3", 38 | "version": "3.8.5" 39 | } 40 | }, 41 | "nbformat": 4, 42 | "nbformat_minor": 4 43 | } 44 | -------------------------------------------------------------------------------- /2021/day5/03-Supervised-Machine-Learning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Supervised Machine Learning\n", 8 | "\n", 9 | "An agenda for our discussion:\n", 10 | "* SML vs. 
Statistical testing - same thing with different names?\n", 11 | "* From SPSS to Python via something R-like: statsmodels\n", 12 | "* Choosing and building a model\n", 13 | "* Using a model for predictions\n", 14 | "* A step back: how do we evaluate a model?\n", 15 | "* Understanding what the model is doing: Explainable AI" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": null, 21 | "metadata": {}, 22 | "outputs": [], 23 | "source": [] 24 | } 25 | ], 26 | "metadata": { 27 | "kernelspec": { 28 | "display_name": "Python 3", 29 | "language": "python", 30 | "name": "python3" 31 | }, 32 | "language_info": { 33 | "codemirror_mode": { 34 | "name": "ipython", 35 | "version": 3 36 | }, 37 | "file_extension": ".py", 38 | "mimetype": "text/x-python", 39 | "name": "python", 40 | "nbconvert_exporter": "python", 41 | "pygments_lexer": "ipython3", 42 | "version": "3.8.5" 43 | } 44 | }, 45 | "nbformat": 4, 46 | "nbformat_minor": 4 47 | } 48 | -------------------------------------------------------------------------------- /2021/day5/README.md: -------------------------------------------------------------------------------- 1 | # Day 5: Building and testing models 2 | 3 | ## Key topics: 4 | 5 | * What is machine learning? 6 | * Statistical testing vs. (?) Machine learning 7 | * Unsupervised machine learning examples: with numbers, and with text 8 | * Going deeper into supervised machine learning 9 | * Building models 10 | * Evaluating models 11 | * Explaining models 12 | 13 | 14 | ## (Preliminary) Agenda: 15 | 16 | | Time | Topic | 17 | |---------------|-------------------------------------------------------------------------------| 18 | | 09:30 - 11:00 | Basics of (supervised) machine learning & comparison with statistical testing | 19 | | 11:00 - 12:00 | Practice on (your own) data | 20 | | 13:00 - 14:00 | Model comparison & explainability | 21 | | 14:00 - 15:30 | Practice on (your own) data | 22 | |15:30 - 16:30 | Teaching presentations & general Q&A | 23 | |16:30 - 17:00 | Closure / next steps | -------------------------------------------------------------------------------- /2021/installation.md: -------------------------------------------------------------------------------- 1 | # Getting started 2 | 3 | ## Installing Python 4 | 5 | Python is a language and not a program and thus there are many different 6 | ways you can run code in Python. For the course, it is important that 7 | you have Python 3 installed and running on your machine and be able to 8 | run Jupyter Notebooks as well as install packages. There are different 9 | ways to install Python on your computer. We will provide instructions on 10 | two widely used solutions: 11 | 12 | Using Anaconda or using Python natively (we will discuss pros and cons 13 | of them in our course as well). Whatever solution you go for, make sure 14 | it works on your system. 15 | 16 | ### Option 1: installing Python via Anaconda 17 | 18 | #### Windows 19 | 20 | - Open [https://www.anaconda.com/download/ ](https://www.anaconda.com/download/)with 21 | your web browser. 22 | 23 | - Download the Python 3.7 (or later) installer for Windows. 24 | 25 | - Install Python 3.7 (or later) using all of the defaults for 26 | installation but **make sure to check Make Anaconda the default 27 | Python**. 28 | 29 | #### Mac OS X 30 | 31 | - Open [https://www.anaconda.com/download/](https://www.anaconda.com/download/) with 32 | your web browser. 33 | 34 | - Download the Python 3.7 (or later) installer for OS X. 
35 | 36 | - Install Python 3.7 (or later) using all of the defaults for 37 | installation  38 | 39 | - 40 | #### Linux 41 | 42 | - Open [https://www.anaconda.com/download/](https://www.anaconda.com/download/) with 43 | your web browser. 44 | 45 | - Download the Python 3.7 (or later) installer for Linux. 46 | 47 | - Install Python 3.7 (or later) using all of the defaults for 48 | installation. (Installation requires using the shell. If you aren't 49 | comfortable doing this, come to one of the consultation hours and we 50 | will help you) 51 | 52 | - Open a terminal window. 53 | 54 | - Type bash Anaconda- and then press tab. The name of the file you 55 | just downloaded should appear. 56 | 57 | - Press enter. You will follow the text-only prompts. When there is a 58 | colon at the bottom of the screen press the down arrow to move down 59 | through the text. Type yes and press enter to approve the license. 60 | Press enter to approve the default location for the files. Type yes 61 | and press enter to prepend Anaconda to your PATH (this makes the 62 | Anaconda distribution the default Python). 63 | 64 | #### How do I know if the installation worked? 65 | 66 | Open the **Terminal** (in a Mac or Linux computer) or **Anaconda 67 | Prompt** (in Windows), and type python. 68 | 69 | Python should start, and should say "3.7" (perhaps 3.8, 3.9... etc.) and 70 | "Continuum Analytics" or "Anaconda" somewhere in the header. 71 | 72 | To quit Python, just type exit() and press enter 73 | 74 | See example below:  75 | 76 | ![AnacondaPythonExample](./media/pythoninterpreter.png) 77 | 78 | *Not sure how to open Terminal/Anaconda Prompt?* 79 | 80 | - [Mac OSX instructions on YouTube - online 81 | tutorial](https://www.youtube.com/watch?v=zw7Nd67_aFw) 82 | 83 | - For Windows, please use Anaconda Prompt (search for it in your 84 | computer) 85 | 86 | Still unsure? Check the section below. 87 | 88 | #### Check out some tutorials online 89 | 90 | There are online tutorials offering specific advice 91 | for [Windows](https://www.youtube.com/watch?v=xxQ0mzZ8UvA) and [Mac 92 | OSX](https://www.youtube.com/watch?v=TcSAln46u9U). 93 | 94 | Please note that you need Python 3.9, and the Anaconda website may look 95 | a bit different from what you see in the video. 96 | 97 | Here is another [video 98 | tutorial](https://www.youtube.com/watch?v=YJC6ldI3hWk) with 99 | information on how to install and use Anaconda. It also covers a lot of 100 | additional information that you will not need in the course. For now, as 101 | long as you managed to get Anaconda installed for now - you're more than 102 | OK\! 103 | 104 | ### Option 2: using Python natively 105 | *based on Chapter 1 of Atteveldt, Trilling & Arcila Calderón (2021)* 106 | 107 | Oftentimes, you will have Python already installed on your computer. 108 | There are different ways to check if you already have it. For example, 109 | if you are using a Mac, you can open your system terminal[^1] and type 110 | python -V or python –version and you will get a message with the version 111 | that is installed by default on your computer. 112 | 113 | If you do not have already Python on your computer, the first thing will 114 | be to download it and install it from its official 115 | [webpage](https://www.python.org/downloads/), selecting the right 116 | software according to your operating system (Windows, Linux/UNIX, Mac OS 117 | X).  118 | 119 | During the installation, additional features will be installed. 
They 120 | include *pip*, a basic tool that you will need to install more 121 | packages for Python. In addition, you might be asked if you want to add 122 | Python to your path, which means that you set the path variable in order 123 | to call the executable software from your system terminal just by typing 124 | the word python. We recommend selecting this option. 125 | 126 | 127 | #### Installing Jupyter 128 | 129 | In the course, we will run our Python code using Jupyter Notebooks. They 130 | run as a web application that allows you to create documents that 131 | contain code and text (and also equations and visualizations). We will 132 | discuss other options to run Python in the course. 133 | 134 | Jupyter is already installed if you went for option 1 and installed 135 | Python via Anaconda. If you are using the native installation, you will 136 | need to install it by running pip install notebook on your system’s 137 | terminal. 138 | 139 | You can start Jupyter Notebook by typing jupyter notebook in your 140 | system’s terminal (or in Anaconda Prompt if you installed Python via 141 | Anaconda on a Windows computer). 142 | 143 | There is a fancier and more modern environment called JupyterLab -- using JupyterLab instead of plain Jupyter Notebooks is fine as well. 144 | 145 | [^1]: Not sure what it is and how to open the Terminal? Have a look at this short 146 | [video](https://www.youtube.com/watch?v=zw7Nd67_aFw). 147 | -------------------------------------------------------------------------------- /2021/media/boumanstrilling2016.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2021/media/boumanstrilling2016.pdf -------------------------------------------------------------------------------- /2021/media/mannetje.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2021/media/mannetje.png -------------------------------------------------------------------------------- /2021/media/pythoninterpreter.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2021/media/pythoninterpreter.png -------------------------------------------------------------------------------- /2021/media/sparse_dense.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2021/media/sparse_dense.png -------------------------------------------------------------------------------- /2021/references.bib: -------------------------------------------------------------------------------- 1 | @article{Boumans2016, 2 | author = {Boumans, Jelle W. 
and Trilling, Damian}, 3 | doi = {10.1080/21670811.2015.1096598}, 4 | file = {:Users/damian/Dropbox/uva/literatuur-mendeley/Boumans, Trilling{\_}2016.pdf:pdf}, 5 | issn = {2167-0811}, 6 | journal = {Digital Journalism}, 7 | number = {1}, 8 | pages = {8--23}, 9 | title = {Taking stock of the toolkit: An overview of relevant autmated content analysis approaches and techniques for digital journalism scholars}, 10 | volume = {4}, 11 | year = {2016} 12 | } 13 | -------------------------------------------------------------------------------- /2021/teachingtips.md: -------------------------------------------------------------------------------- 1 | # Teachingtips 2 | 3 | This document contains some tips from experience to help you avoiding the most common pitfalls when getting started to teach with Python. 4 | 5 | 6 | ## Avoiding technical problems 7 | - Make everyone install (and test!) the environment *before* class starts. Avoid having to deal with "I can't open the notebook!"- or "It says 'module not found'!"-questions during class. 8 | - **The first session is crucial.** Expect that during the first session, there will be students with technical problems. It is best to teach this session with two teachers, such that one can deal with individual problems and the other can continue with the rest of the group. 9 | - Be aware that even though Python is largely platform-independent (yeah!!!), there can be subtle differences between using it on typical unix-based systems (MacOS or Linux) versus Windows. Think of `/home/damian/pythoncourse` vs `C:\\Users\\damian\\pythoncourse` (note the double (!) backslash!), but also about some modules that may have different external requirements (recent experience: try `pip install geopandas` on different systems!) 10 | - You can consider pointing students to [Google Colab](https://colab.research.google.com/) as a fallback option if they cannot get things to work. 11 | 12 | 13 | ## Grading and rules for assignments 14 | - Make clear from the beginning that it is fine, even encouraged, to get ideas from sites like https://stackoverflow.com . Emphasize, though, that copy-pasting code and presenting it as own work is considered plagiarism, just like in written assignments. A simple comment line like `# the following cell is [copied from/adapted from/inspired by] URL` is enough to prevent this. Document this rule, for instance in the course manual. 15 | 16 | 17 | -------------------------------------------------------------------------------- /2023/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2023/.DS_Store -------------------------------------------------------------------------------- /2023/Installationinstruction.md: -------------------------------------------------------------------------------- 1 | # Installing Jupyter Lab 2 | 3 | To ensure a smooth start for everyone, we ask you to download and install the Jupyter-Lab Desktop program: https://github.com/jupyterlab/jupyterlab-desktop/tree/master 4 | 5 | Please scroll down and download the respective installer for your system (Windows/Mac/Linux). When opening JupyterLab for the first time, you will see a small message at the bottom (see screenshot below). Please then click on “install using the bundled installer” to start the installation process, restart JupyterLab, and we should be good to go! 
If you encounter errors, do not worry, we will have enough time on Monday to make sure everyone — regardless of their computer or operating system — will be operational! -------------------------------------------------------------------------------- /2023/Teachingtips.md: -------------------------------------------------------------------------------- 1 | # Teachingtips 2 | This document contains some tips from experience to help you avoiding the most common pitfalls when getting started to teach with Python. 3 | 4 | ## Avoiding technical problems 5 | 6 | Make everyone install (and test!) the environment before class starts. Avoid having to deal with "I can't open the notebook!"- or "It says 'module not found'!"-questions during class. 7 | 8 | The first session is crucial. Expect that during the first session, there will be students with technical problems. It is best to teach this session with two teachers, such that one can deal with individual problems and the other can continue with the rest of the group. 9 | 10 | Be aware that even though Python is largely platform-independent (yeah!!!), there can be subtle differences between using it on typical unix-based systems (MacOS or Linux) versus Windows. Think of /home/damian/pythoncourse vs C:\\Users\\damian\\pythoncourse (note the double (!) backslash!), but also about some modules that may have different external requirements (recent experience: try pip install geopandas on different systems!) 11 | 12 | You can consider pointing students to Google Colab as a fallback option if they cannot get things to work. 13 | 14 | ## Grading and rules for assignments 15 | 16 | Make clear from the beginning that it is fine, even encouraged, to get ideas from sites like https://stackoverflow.com . Emphasize, though, that copy-pasting code and presenting it as own work is considered plagiarism, just like in written assignments. A simple comment line like # the following cell is [copied from/adapted from/inspired by] URL is enough to prevent this. Document this rule, for instance in the course manual. -------------------------------------------------------------------------------- /2023/day2/Day 2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2023/day2/Day 2.pdf -------------------------------------------------------------------------------- /2023/day2/Notebooks/ExcercisesPandas.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "guilty-steam", 6 | "metadata": {}, 7 | "source": [ 8 | "## Excercises pandas\n", 9 | "\n", 10 | "Let's practice data exploration and wrangling in Pandas. \n", 11 | "\n", 12 | "We will work with data collected through the Twitter API (RIP, more on API's and how quickly we're loosing them next week ;)). Few months ago (last days when the API was alive) we collected different tweets (for teaching purposes). 
We have also already run a sentiment analysis on these tweets and have saved it in a separate file.\n", 13 | "\n", 14 | "We have two datasets per account/topic:\n", 15 | "* One datasets with tweets (public tweets by account or with a hashtag)\n", 16 | "* One dataset with sentiment of those tweets (with three sentiment scores using veeery basic snetiment tool - more on how bad it is in two weeks ;))\n", 17 | "\n", 18 | "We want to see how sentiment changes over time, compare number of positive and negative tweets and analyze the relation between sentiment and engagement with the tweets. You can selest an account/topic that seems interesting to you.\n", 19 | "\n", 20 | "We want to prepare the dataset for analysis:\n", 21 | "\n", 22 | "**Morning**\n", 23 | "\n", 24 | "* Data exploration\n", 25 | " * Check columns, data types, missing values, descriptives for numeric variables measuring engagement and sentiment, value_counts for relevant categorical variables\n", 26 | "* Handling missing values and data types\n", 27 | " * Handle missing values in variables of interest: number of likes and retweets - what can nan's mean?\n", 28 | " * Make sure created_at has the right format (to use it for aggregation later)\n", 29 | "* Creating necessary variables (sentiment)\n", 30 | " * Overall measure of sentiment - create it from positive and negative\n", 31 | " * Binary variable (positive or negative tweet) - Tip: Write a function that \"recodes\" the sentiment column\n", 32 | " \n", 33 | "\n", 34 | "**Afternoon**\n", 35 | "\n", 36 | "* Merging the files (tweets with sentiment)\n", 37 | " * Make sure the columns you merge match and check how to merge\n", 38 | "* Agrregating the files per month\n", 39 | " * Tip: Create a column for month by transforming the date column. 
Remember that the date column needs the right format first!\n", 40 | " `df['month'] = df['date_dt_column'].dt.strftime('%Y-%m')` \n", 41 | "* Visualisations:\n", 42 | " * Visualise different columns of the tweet dataset (change of sentiment over time, sentiment, engagement, relation between sentiment and engagement)\n", 43 | " * But *more fun*: use your own data to play with visualisations" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 4, 49 | "id": "accredited-adolescent", 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "import pandas as pd" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": null, 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [ 62 | "df_jsonl = pd.read_json('filename', lines=True) #put your filename as filename\n" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": null, 68 | "metadata": {}, 69 | "outputs": [], 70 | "source": [ 71 | "#run this cell \n", 72 | "\n", 73 | "def get_public_metrics(row):\n", 74 | " if 'public_metrics' in row.keys():\n", 75 | " if type(row['public_metrics']) == dict:\n", 76 | " for key, value in row['public_metrics'].items():\n", 77 | " row['metric_' + str(key)] = value\n", 78 | " return row\n", 79 | "\n", 80 | "def get_tweets(df):\n", 81 | " if 'data' not in df.columns:\n", 82 | " return None\n", 83 | " results = pd.DataFrame()\n", 84 | " for item in df['data'].values.tolist():\n", 85 | " results = pd.concat([results, pd.DataFrame(item)])\n", 86 | " \n", 87 | " results = results.apply(get_public_metrics, axis=1)\n", 88 | " \n", 89 | " results = results.reset_index()\n", 90 | " del results['index']\n", 91 | " \n", 92 | " return results" 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": null, 98 | "metadata": {}, 99 | "outputs": [], 100 | "source": [ 101 | "#unpack the tweets - this gives you dataframe with tweets\n", 102 | "tweets = get_tweets(df_jsonl)" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": null, 108 | "metadata": {}, 109 | "outputs": [], 110 | "source": [] 111 | } 112 | ], 113 | "metadata": { 114 | "kernelspec": { 115 | "display_name": "Python 3", 116 | "language": "python", 117 | "name": "python3" 118 | }, 119 | "language_info": { 120 | "codemirror_mode": { 121 | "name": "ipython", 122 | "version": 3 123 | }, 124 | "file_extension": ".py", 125 | "mimetype": "text/x-python", 126 | "name": "python", 127 | "nbconvert_exporter": "python", 128 | "pygments_lexer": "ipython3", 129 | "version": "3.9.1" 130 | } 131 | }, 132 | "nbformat": 4, 133 | "nbformat_minor": 5 134 | } 135 | -------------------------------------------------------------------------------- /2023/day2/Notebooks/PandasIntroduction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Introduction to Pandas\n", 8 | "\n", 9 | "*Based on DA and CCS1 materials*" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "Before using it, however, we need to import it." 
17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": null, 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "import pandas as pd" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": null, 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "pd.__version__" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": null, 40 | "metadata": {}, 41 | "outputs": [], 42 | "source": [] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "## Reading data into Pandas\n", 49 | "\n" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": null, 55 | "metadata": {}, 56 | "outputs": [], 57 | "source": [ 58 | "\n", 59 | "websites = [\n", 60 | " {'site': 'Twitter', 'type': 'Social Media', 'views': 10000, 'active_users': 200000},\n", 61 | " {'site': 'Facebook', 'type': 'Social Media', 'views': 35000, 'active_users': 500000},\n", 62 | " {'site': 'NYT', 'type': 'News media', 'views': 78000, 'active_users': 156000}, \n", 63 | " {'site': 'YouTube', 'type': 'Video platform', 'views': 18000, 'active_users': 289000},\n", 64 | " {'site': 'Vimeo', 'type': 'Video platform', 'views': 300, 'active_users': 1580},\n", 65 | " {'site': 'USA Today', 'type': 'News media', 'views': 4800, 'active_users': 5608},\n", 66 | "]" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": {}, 73 | "outputs": [], 74 | "source": [ 75 | "websites" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [ 84 | "websites=pd.DataFrame(websites)" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": null, 90 | "metadata": {}, 91 | "outputs": [], 92 | "source": [ 93 | "websites" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": null, 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [ 102 | "websites.to_csv('websites.csv', index=False)" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": null, 108 | "metadata": {}, 109 | "outputs": [], 110 | "source": [ 111 | "#Read from a csv\n", 112 | "df_websites = pd.read_csv('websites.csv')" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": null, 118 | "metadata": {}, 119 | "outputs": [], 120 | "source": [ 121 | "df_websites" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "## Exploring this dataset" 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "metadata": {}, 134 | "source": [ 135 | "Which columns are available?" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": null, 141 | "metadata": {}, 142 | "outputs": [], 143 | "source": [ 144 | "df_websites.columns" 145 | ] 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "metadata": {}, 150 | "source": [ 151 | "Are there missing values?" 
152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": null, 157 | "metadata": {}, 158 | "outputs": [], 159 | "source": [ 160 | "df_websites.dtypes" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": null, 166 | "metadata": {}, 167 | "outputs": [], 168 | "source": [ 169 | "df_websites.isna().sum()" 170 | ] 171 | }, 172 | { 173 | "cell_type": "markdown", 174 | "metadata": {}, 175 | "source": [ 176 | "Let's see the first few values" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": null, 182 | "metadata": {}, 183 | "outputs": [], 184 | "source": [ 185 | "df_websites.head()" 186 | ] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "metadata": {}, 191 | "source": [ 192 | "And now the last few values" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": null, 198 | "metadata": {}, 199 | "outputs": [], 200 | "source": [ 201 | "df_websites.tail()" 202 | ] 203 | }, 204 | { 205 | "cell_type": "markdown", 206 | "metadata": {}, 207 | "source": [ 208 | "Let's look at some descriptive statistics..." 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": null, 214 | "metadata": {}, 215 | "outputs": [], 216 | "source": [ 217 | "df_websites.describe()" 218 | ] 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "metadata": {}, 223 | "source": [ 224 | "Only numerical variables appear above... let's see the frequencies for the non-numerical variables" 225 | ] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "execution_count": null, 230 | "metadata": {}, 231 | "outputs": [], 232 | "source": [ 233 | "df_websites['type'].describe()" 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "metadata": {}, 239 | "source": [ 240 | "This is not very informative... let's try to get the counts per value of the column" 241 | ] 242 | }, 243 | { 244 | "cell_type": "code", 245 | "execution_count": null, 246 | "metadata": {}, 247 | "outputs": [], 248 | "source": [ 249 | "df_websites['type'].value_counts()" 250 | ] 251 | }, 252 | { 253 | "cell_type": "markdown", 254 | "metadata": {}, 255 | "source": [ 256 | "Now let's get descriptive statistics per group:" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": null, 262 | "metadata": {}, 263 | "outputs": [], 264 | "source": [ 265 | "df_websites.groupby('type').describe()" 266 | ] 267 | }, 268 | { 269 | "cell_type": "markdown", 270 | "metadata": {}, 271 | "source": [ 272 | "This doesn't look so easy to read. Let's transpose this output\n", 273 | "\n", 274 | "By transposing a dataframe we move the rows data to columns and the columns data to the rows. 
" 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": null, 280 | "metadata": {}, 281 | "outputs": [], 282 | "source": [ 283 | "df_websites.groupby('type').describe().transpose()" 284 | ] 285 | }, 286 | { 287 | "cell_type": "markdown", 288 | "metadata": {}, 289 | "source": [ 290 | "## Subsetting and slicing\n", 291 | "\n", 292 | "* Let's say I just want some of the **columns** that there are in the dataset\n", 293 | "* Or that I just want some of the **rows** that are in the dataset" 294 | ] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "metadata": {}, 299 | "source": [ 300 | "### Slicing by column" 301 | ] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "execution_count": null, 306 | "metadata": {}, 307 | "outputs": [], 308 | "source": [ 309 | "df_websites.columns" 310 | ] 311 | }, 312 | { 313 | "cell_type": "code", 314 | "execution_count": null, 315 | "metadata": {}, 316 | "outputs": [], 317 | "source": [ 318 | "df_websites[['type', \"views\"]]" 319 | ] 320 | }, 321 | { 322 | "cell_type": "code", 323 | "execution_count": null, 324 | "metadata": {}, 325 | "outputs": [], 326 | "source": [ 327 | "df_websites" 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": null, 333 | "metadata": {}, 334 | "outputs": [], 335 | "source": [ 336 | "type_views = df_websites[['site','views']]" 337 | ] 338 | }, 339 | { 340 | "cell_type": "code", 341 | "execution_count": null, 342 | "metadata": {}, 343 | "outputs": [], 344 | "source": [ 345 | "type_views" 346 | ] 347 | }, 348 | { 349 | "cell_type": "code", 350 | "execution_count": null, 351 | "metadata": {}, 352 | "outputs": [], 353 | "source": [ 354 | "df_websites" 355 | ] 356 | }, 357 | { 358 | "cell_type": "markdown", 359 | "metadata": {}, 360 | "source": [ 361 | "### Slicing by row (value)\n", 362 | "\n", 363 | "Filtering dataset based on values in columns" 364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "execution_count": null, 369 | "metadata": {}, 370 | "outputs": [], 371 | "source": [ 372 | "df_websites[df_websites['type']=='Social Media']" 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": null, 378 | "metadata": {}, 379 | "outputs": [], 380 | "source": [ 381 | "df_websites[df_websites['type']!='News media']" 382 | ] 383 | }, 384 | { 385 | "cell_type": "markdown", 386 | "metadata": {}, 387 | "source": [ 388 | "I want to have data that is not about News Media **and** with more than 12,000 views" 389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "execution_count": null, 394 | "metadata": {}, 395 | "outputs": [], 396 | "source": [ 397 | "df_websites[(df_websites['type']!='News media') & (df_websites['views'] > 12000)]" 398 | ] 399 | }, 400 | { 401 | "cell_type": "markdown", 402 | "metadata": {}, 403 | "source": [ 404 | "I want to have data that is **either** not about News Media **or** with more than 12,000 views" 405 | ] 406 | }, 407 | { 408 | "cell_type": "code", 409 | "execution_count": null, 410 | "metadata": {}, 411 | "outputs": [], 412 | "source": [ 413 | "df_websites[(df_websites['type']!='News media') | (df_websites['views'] > 12000)]" 414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": null, 419 | "metadata": {}, 420 | "outputs": [], 421 | "source": [ 422 | "social_media = df_websites[df_websites['type']=='Social Media']" 423 | ] 424 | }, 425 | { 426 | "cell_type": "code", 427 | "execution_count": null, 428 | "metadata": {}, 429 | "outputs": [], 430 | "source": [ 431 | "social_media" 432 | ] 433 | }, 434 | { 435 | 
"cell_type": "code", 436 | "execution_count": null, 437 | "metadata": {}, 438 | "outputs": [], 439 | "source": [ 440 | "social_media.describe()" 441 | ] 442 | }, 443 | { 444 | "cell_type": "code", 445 | "execution_count": null, 446 | "metadata": {}, 447 | "outputs": [], 448 | "source": [ 449 | "df_websites[df_websites['type']=='Social Media'].describe()" 450 | ] 451 | }, 452 | { 453 | "cell_type": "code", 454 | "execution_count": null, 455 | "metadata": {}, 456 | "outputs": [], 457 | "source": [ 458 | "socialmediaviews = df_websites[df_websites['type']=='Social Media'][['type', 'views']]" 459 | ] 460 | }, 461 | { 462 | "cell_type": "code", 463 | "execution_count": null, 464 | "metadata": {}, 465 | "outputs": [], 466 | "source": [ 467 | "socialmediaviews" 468 | ] 469 | }, 470 | { 471 | "cell_type": "markdown", 472 | "metadata": {}, 473 | "source": [ 474 | "## Saving the dataframe\n", 475 | "\n", 476 | "Formats you can use : see https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html" 477 | ] 478 | }, 479 | { 480 | "cell_type": "markdown", 481 | "metadata": {}, 482 | "source": [ 483 | "CSV:" 484 | ] 485 | }, 486 | { 487 | "cell_type": "code", 488 | "execution_count": null, 489 | "metadata": {}, 490 | "outputs": [], 491 | "source": [ 492 | "df_websites.to_csv('websites.csv')" 493 | ] 494 | }, 495 | { 496 | "cell_type": "markdown", 497 | "metadata": {}, 498 | "source": [ 499 | "Pickle:" 500 | ] 501 | }, 502 | { 503 | "cell_type": "code", 504 | "execution_count": null, 505 | "metadata": {}, 506 | "outputs": [], 507 | "source": [ 508 | "df_websites.to_pickle('websites.pkl')" 509 | ] 510 | } 511 | ], 512 | "metadata": { 513 | "anaconda-cloud": {}, 514 | "kernelspec": { 515 | "display_name": "Python 3", 516 | "language": "python", 517 | "name": "python3" 518 | }, 519 | "language_info": { 520 | "codemirror_mode": { 521 | "name": "ipython", 522 | "version": 3 523 | }, 524 | "file_extension": ".py", 525 | "mimetype": "text/x-python", 526 | "name": "python", 527 | "nbconvert_exporter": "python", 528 | "pygments_lexer": "ipython3", 529 | "version": "3.9.1" 530 | } 531 | }, 532 | "nbformat": 4, 533 | "nbformat_minor": 4 534 | } 535 | -------------------------------------------------------------------------------- /2023/day3/API.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "e9014de5", 6 | "metadata": { 7 | "slideshow": { 8 | "slide_type": "slide" 9 | } 10 | }, 11 | "source": [ 12 | "# API\n", 13 | "\n", 14 | "\n", 15 | "Author: Justin Chun-ting Ho\n", 16 | "\n", 17 | "Date: 27 Nov 2023\n", 18 | "\n", 19 | "Credit: Some sections are adopted from the slides prepared by Damian Trilling" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "id": "510e823a", 25 | "metadata": { 26 | "slideshow": { 27 | "slide_type": "slide" 28 | } 29 | }, 30 | "source": [ 31 | "### Beyond files\n", 32 | "\n", 33 | "- we can write anything to files\n", 34 | "- as long as we know the structure and encoding, we can unpack it into data\n", 35 | "- we don't even need files!\n", 36 | "- how about sending it directly through the internet?" 
37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "id": "11b982aa", 42 | "metadata": { 43 | "slideshow": { 44 | "slide_type": "slide" 45 | } 46 | }, 47 | "source": [ 48 | "### How does API work?\n", 49 | "\n", 50 | "![API](https://voyager.postman.com/illustration/diagram-what-is-an-api-postman-illustration.svg)" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "id": "2766306b", 56 | "metadata": { 57 | "slideshow": { 58 | "slide_type": "slide" 59 | } 60 | }, 61 | "source": [ 62 | "### Example: Google Books API\n", 63 | "\n", 64 | "You could try this in any browser: [https://www.googleapis.com/books/v1/volumes?q=isbn:9780261102217](https://www.googleapis.com/books/v1/volumes?q=isbn:9780261102217)\n", 65 | "\n", 66 | "But how do we know how to use it? Read the [documentation](https://developers.google.com/books/docs/v1/using#PerformingSearch)!" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "id": "b2a5bfec", 73 | "metadata": { 74 | "slideshow": { 75 | "slide_type": "slide" 76 | } 77 | }, 78 | "outputs": [], 79 | "source": [ 80 | "# A better way to do this\n", 81 | "\n", 82 | "import json\n", 83 | "from urllib.request import urlopen\n", 84 | "\n", 85 | "api = \"https://www.googleapis.com/books/v1/volumes?q=\"\n", 86 | "query = \"isbn:9780261102217\"\n", 87 | "\n", 88 | "# send a request and get a JSON response\n", 89 | "resp = urlopen(api + query)\n", 90 | "# parse JSON into Python as a dictionary\n", 91 | "book_data = json.load(resp)" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "id": "152d9883", 98 | "metadata": { 99 | "slideshow": { 100 | "slide_type": "slide" 101 | } 102 | }, 103 | "outputs": [], 104 | "source": [ 105 | "volume_info = book_data[\"items\"][0][\"volumeInfo\"]\n", 106 | "\n", 107 | "print('Title: ' + volume_info['title'])\n", 108 | "print('Author: ' + str(volume_info['authors']))\n", 109 | "print('Publication Date: ' + volume_info['publishedDate'])" 110 | ] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "id": "f7ce48d3", 115 | "metadata": { 116 | "slideshow": { 117 | "slide_type": "slide" 118 | } 119 | }, 120 | "source": [ 121 | "### Example: Youtube API\n", 122 | "\n", 123 | "#### Getting an API key\n", 124 | "\n", 125 | "- Go to [Google Cloud Platform](https://console.cloud.google.com/)\n", 126 | "\n", 127 | "- Create a project in the Google Developers Console\n", 128 | "\n", 129 | "- Enable YouTube Data API \n", 130 | "\n", 131 | "- Obtain your API key\n", 132 | "\n", 133 | "#### Step by Step Guide\n", 134 | "\n", 135 | "![](https://miro.medium.com/v2/resize:fit:2000/format:webp/1*bCsi9C7yC8U-dVdWW4Zqhg.png)\n", 136 | "\n", 137 | "![](https://miro.medium.com/v2/resize:fit:2000/format:webp/1*k86eiqdHf9HhhxWKKnO7Sg.png)\n", 138 | "\n", 139 | "![](https://miro.medium.com/v2/resize:fit:2000/format:webp/1*DgLkgzXA9YkzMJC7Dh7JZg.png)\n", 140 | "\n", 141 | "![](https://miro.medium.com/v2/resize:fit:2000/format:webp/1*KzdLen4agoUi33_H0MutcA.png)\n", 142 | "\n", 143 | "![](https://miro.medium.com/v2/resize:fit:2000/format:webp/1*3HjyBix-P1gop_CPLYNpiQ.png)\n", 144 | "\n", 145 | "![](https://miro.medium.com/v2/resize:fit:2000/format:webp/1*rzq6FpRfV0ujb_B6nUoGEA.png)\n", 146 | "\n", 147 | "![](https://miro.medium.com/v2/resize:fit:2000/format:webp/1*FOTj3rvn0hGmHxgNz0x1Gw.png)\n", 148 | "\n", 149 | "![](https://miro.medium.com/v2/resize:fit:2000/format:webp/1*cLWdO9siuQE-3v0kPz-SAA.png)\n", 150 | "\n", 151 | "Credit: [Pedro 
Hernández](https://medium.com/mcd-unison/youtube-data-api-v3-in-python-tutorial-with-examples-e829a25d2ebd)\n", 152 | "\n", 153 | "#### Install google api package\n", 154 | "\n", 155 | "- install the package with `conda install -c conda-forge google-api-python-client` or \n", 156 | "`pip install google-api-python-client`" 157 | ] 158 | }, 159 | { 160 | "cell_type": "markdown", 161 | "id": "f001ad51", 162 | "metadata": {}, 163 | "source": [ 164 | "### Simple video search" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": null, 170 | "id": "763cf34e", 171 | "metadata": {}, 172 | "outputs": [], 173 | "source": [ 174 | "# Setting Up\n", 175 | "import googleapiclient.discovery\n", 176 | "api_service_name = \"youtube\"\n", 177 | "api_version = \"v3\"\n", 178 | "DEVELOPER_KEY = \"#################\"\n", 179 | "youtube = googleapiclient.discovery.build(\n", 180 | " api_service_name, api_version, developerKey = DEVELOPER_KEY)" 181 | ] 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "id": "810614af", 186 | "metadata": {}, 187 | "source": [ 188 | "### Getting a list of videos" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": null, 194 | "id": "27c002da", 195 | "metadata": {}, 196 | "outputs": [], 197 | "source": [ 198 | "# The codes to send the request\n", 199 | "request = youtube.search().list(\n", 200 | " part=\"id,snippet\",\n", 201 | " type='video',\n", 202 | " q=\"Lord of the rings\",\n", 203 | " maxResults=1\n", 204 | ")\n", 205 | "# Request execution\n", 206 | "response = request.execute()\n", 207 | "print(response)" 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": null, 213 | "id": "c7d29894", 214 | "metadata": {}, 215 | "outputs": [], 216 | "source": [ 217 | "lotr_videos_ids = youtube.search().list(\n", 218 | " part=\"id\",\n", 219 | " type='video',\n", 220 | " order=\"viewCount\", # This can also be \"date\", \"rating\", \"relevance\" etc.\n", 221 | " q=\"Lord of the rings\", # The search query\n", 222 | " maxResults=50,\n", 223 | " fields=\"items(id(videoId))\"\n", 224 | ").execute()" 225 | ] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "execution_count": null, 230 | "id": "ca3a3c6e", 231 | "metadata": {}, 232 | "outputs": [], 233 | "source": [ 234 | "lotr_videos_ids" 235 | ] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": null, 240 | "id": "66a1711e", 241 | "metadata": {}, 242 | "outputs": [], 243 | "source": [ 244 | "info = {\n", 245 | " 'id':[],\n", 246 | " 'title':[],\n", 247 | " 'views':[]\n", 248 | "}\n", 249 | "\n", 250 | "for item in lotr_videos_ids['items']:\n", 251 | " vidId = item['id']['videoId']\n", 252 | " r = youtube.videos().list(\n", 253 | " part=\"statistics,snippet\",\n", 254 | " id=vidId,\n", 255 | " fields=\"items(statistics),snippet(title)\"\n", 256 | " ).execute()\n", 257 | "\n", 258 | " views = r['items'][0]['statistics']['viewCount']\n", 259 | " title = r['items'][0]['snippet']['title']\n", 260 | " info['id'].append(vidId)\n", 261 | " info['title'].append(title)\n", 262 | " info['views'].append(views)\n", 263 | "\n", 264 | "df = pd.DataFrame(data=info)" 265 | ] 266 | }, 267 | { 268 | "cell_type": "markdown", 269 | "id": "f85288f6", 270 | "metadata": {}, 271 | "source": [ 272 | "### How to search by channel id?\n", 273 | "\n", 274 | "First, you need to find the channel id, there are many tools for that, eg [this one](https://commentpicker.com/youtube-channel-id.php). 
While it is possible to search by username, sometimes it works, sometimes it doesn't.\n", 275 | "\n", 276 | "Example: Last Week Tonight by John Oliver (https://www.youtube.com/@LastWeekTonight)" 277 | ] 278 | }, 279 | { 280 | "cell_type": "code", 281 | "execution_count": null, 282 | "id": "8524c0f2", 283 | "metadata": {}, 284 | "outputs": [], 285 | "source": [ 286 | "# Some Channel Statistics\n", 287 | "response = youtube.channels().list( \n", 288 | "    part='statistics', \n", 289 | "    id='UC3XTzVzaHQEd30rQbuvCtTQ').execute()" 290 | ] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "execution_count": null, 295 | "id": "b53f73e8", 296 | "metadata": {}, 297 | "outputs": [], 298 | "source": [ 299 | "# You can search by channel id, but it will not return everything\n", 300 | "videos_ids = youtube.search().list(\n", 301 | "    part=\"id\",\n", 302 | "    type='video',\n", 303 | "    order=\"viewCount\", # This can also be \"date\", \"rating\", \"relevance\" etc.\n", 304 | "    channelId=\"UC3XTzVzaHQEd30rQbuvCtTQ\", # The channel id to search within\n", 305 | "    maxResults=50, # 50 is the maximum the API accepts per request\n", 306 | "    fields=\"items(id(videoId))\"\n", 307 | ").execute()" 308 | ] 309 | }, 310 | { 311 | "cell_type": "code", 312 | "execution_count": null, 313 | "id": "603f6d16", 314 | "metadata": {}, 315 | "outputs": [], 316 | "source": [ 317 | "# A more robust way is to search by playlists. First, you need to get the playlist ids.\n", 318 | "response = youtube.playlists().list( \n", 319 | "    part='contentDetails,snippet', \n", 320 | "    channelId='UC3XTzVzaHQEd30rQbuvCtTQ', \n", 321 | "    maxResults=50\n", 322 | "    ).execute() \n", 323 | "\n", 324 | "playlists = []\n", 325 | "for i in response['items']:\n", 326 | "    playlists.append(i['id'])" 327 | ] 328 | }, 329 | { 330 | "cell_type": "code", 331 | "execution_count": null, 332 | "id": "177c377c", 333 | "metadata": {}, 334 | "outputs": [], 335 | "source": [ 336 | "# Next, write a loop to page through all the results\n", 337 | "\n", 338 | "nextPageToken = None\n", 339 | "\n", 340 | "while True: \n", 341 | "\n", 342 | "    response = youtube.playlistItems().list( \n", 343 | "        part='snippet', \n", 344 | "        playlistId=playlists[0], \n", 345 | "        maxResults=50, # the API caps this at 50 per page \n", 346 | "        pageToken=nextPageToken \n", 347 | "    ).execute() \n", 348 | "\n", 349 | "    # Iterate through the response and print each video's title \n", 350 | "    for item in response['items']: \n", 351 | "        title = item['snippet']['title']\n", 352 | "        print(title) \n", 353 | "        print(\"\\n\")\n", 354 | "    nextPageToken = response.get('nextPageToken') \n", 355 | "    \n", 356 | "    if not nextPageToken: \n", 357 | "        break" 358 | ] 359 | }, 360 | { 361 | "cell_type": "markdown", 362 | "id": "375a3095", 363 | "metadata": {}, 364 | "source": [ 365 | "### Exercise\n", 366 | "\n", 367 | "Find the top 10 most viewed dog and cat videos on YouTube" 368 | ] 369 | } 370 | ], 371 | "metadata": { 372 | "celltoolbar": "Slideshow", 373 | "kernelspec": { 374 | "display_name": "Python 3 (ipykernel)", 375 | "language": "python", 376 | "name": "python3" 377 | }, 378 | "language_info": { 379 | "codemirror_mode": { 380 | "name": "ipython", 381 | "version": 3 382 | }, 383 | "file_extension": ".py", 384 | "mimetype": "text/x-python", 385 | "name": "python", 386 | "nbconvert_exporter": "python", 387 | "pygments_lexer": "ipython3", 388 | "version": "3.9.18" 389 | } 390 | }, 391 | "nbformat": 4, 392 | "nbformat_minor": 5 393 | } 394 | -------------------------------------------------------------------------------- /2023/day3/Teaching Exercises.ipynb: 
-------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "868c5ff0", 6 | "metadata": { 7 | "slideshow": { 8 | "slide_type": "slide" 9 | } 10 | }, 11 | "source": [ 12 | "### Question: \n", 13 | "### A student comes to you with a vague topic in mind. What data collection methods would you recommend?" 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "id": "4d03208f", 19 | "metadata": { 20 | "slideshow": { 21 | "slide_type": "slide" 22 | } 23 | }, 24 | "source": [ 25 | "Topics:\n", 26 | "\n", 27 | "- Analyzing Social Media Discourse: A Comparative Study of Political Communication Strategies during Elections.\n", 28 | "- The Impact of Online Influencers on Consumer Behavior: An Analysis of Product Endorsements.\n", 29 | "- Exploring Online Activism: Case Studies on the Effectiveness of Digital Campaigns in Social Change Movements.\n", 30 | "- Fake News and Public Perception: An Examination of the Role of Online Information in Shaping Public Opinion.\n", 31 | "- Digital Diplomacy: A Cross-Cultural Analysis of Nation Branding in International Relations.\n", 32 | "- User-Generated Content and Brand Loyalty: A Study of Customer Engagement on E-commerce Platforms.\n", 33 | "- Crisis Communication in the Age of Social Media: A Comparative Study of Organizational Responses to Online Controversies.\n", 34 | "- Analyzing Media Bias and Framing in Online News Articles.\n", 35 | "\n", 36 | "\n", 37 | "\n", 38 | "\n", 39 | "\n", 40 | "\n", 41 | "\n" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "id": "70e0b8cc", 47 | "metadata": { 48 | "slideshow": { 49 | "slide_type": "slide" 50 | } 51 | }, 52 | "source": [ 53 | "Considerations:\n", 54 | "- What kind of data?\n", 55 | "- Where can you get the data (state the actual URLs and/or API endpoint)?\n", 56 | "- How can you get the data?\n", 57 | "- How would you store the data?\n", 58 | "- Any limitations?"
59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "id": "1dff6acf", 64 | "metadata": {}, 65 | "source": [ 66 | "#### Make a short presentation and share with the class" 67 | ] 68 | } 69 | ], 70 | "metadata": { 71 | "celltoolbar": "Slideshow", 72 | "kernelspec": { 73 | "display_name": "Python 3 (ipykernel)", 74 | "language": "python", 75 | "name": "python3" 76 | }, 77 | "language_info": { 78 | "codemirror_mode": { 79 | "name": "ipython", 80 | "version": 3 81 | }, 82 | "file_extension": ".py", 83 | "mimetype": "text/x-python", 84 | "name": "python", 85 | "nbconvert_exporter": "python", 86 | "pygments_lexer": "ipython3", 87 | "version": "3.9.18" 88 | } 89 | }, 90 | "nbformat": 4, 91 | "nbformat_minor": 5 92 | } 93 | -------------------------------------------------------------------------------- /2023/day3/Webscraping.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "e9014de5", 6 | "metadata": { 7 | "slideshow": { 8 | "slide_type": "slide" 9 | } 10 | }, 11 | "source": [ 12 | "# Webscraping\n", 13 | "\n", 14 | "Author: Justin Chun-ting Ho\n", 15 | "\n", 16 | "Date: 27 Nov 2023\n", 17 | "\n", 18 | "Credit: Some sections are adopted from the slides prepared by Damian Trilling" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "id": "cb8d7db1", 24 | "metadata": {}, 25 | "source": [ 26 | "### What is an website?\n", 27 | "\n", 28 | "Let's take a look at [this](https://ascor.uva.nl/staff/ascor-faculty/ascor-staff---faculty.html)" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "id": "7c0d5028", 34 | "metadata": {}, 35 | "source": [ 36 | "### Typical Workflow\n", 37 | "\n", 38 | "- Download the source code (HTML)\n", 39 | "- Identify the pattern to isolate what we want\n", 40 | "- Write a script to extract" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "id": "76d0ecb3", 46 | "metadata": {}, 47 | "source": [ 48 | "## Approach 1: Regular Expression" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "id": "e34b3916", 54 | "metadata": {}, 55 | "source": [ 56 | "You probably need [this](https://images.datacamp.com/image/upload/v1665049611/Marketing/Blog/Regular_Expressions_Cheat_Sheet.pdf)." 
57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "id": "50a3825b", 63 | "metadata": {}, 64 | "outputs": [], 65 | "source": [ 66 | "import requests\n", 67 | "import re" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": null, 73 | "id": "67affb3c", 74 | "metadata": {}, 75 | "outputs": [], 76 | "source": [ 77 | "response = requests.get('https://ascor.uva.nl/staff/ascor-faculty/ascor-staff---faculty.html')" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "id": "58a38586", 84 | "metadata": {}, 85 | "outputs": [], 86 | "source": [ 87 | "text = response.text" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": null, 93 | "id": "4d4d6d58", 94 | "metadata": {}, 95 | "outputs": [], 96 | "source": [ 97 | "emails = re.findall(r'mailto:(.*?)\\\"',text)" 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "id": "36b6b243", 103 | "metadata": {}, 104 | "source": [ 105 | "## Approach 2: Modern Packages" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "id": "4d588c46", 111 | "metadata": {}, 112 | "source": [ 113 | "### Tools\n", 114 | "- Beautiful Soup: `pip install beautifulsoup4` or `conda install -c anaconda beautifulsoup4`\n", 115 | "- SelectorGadget: https://selectorgadget.com/" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "id": "ec502b65", 122 | "metadata": {}, 123 | "outputs": [], 124 | "source": [ 125 | "from bs4 import BeautifulSoup \n", 126 | "import csv\n", 127 | "import pandas as pd\n", 128 | "import numpy as np" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "id": "c91536be", 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [ 138 | "URL = 'https://ascor.uva.nl/staff/faculty.html'\n", 139 | "r = requests.get(URL) \n", 140 | "soup = BeautifulSoup(r.content) " 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": null, 146 | "id": "6750db9b", 147 | "metadata": {}, 148 | "outputs": [], 149 | "source": [ 150 | "r.content" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": null, 156 | "id": "541ee3fb", 157 | "metadata": {}, 158 | "outputs": [], 159 | "source": [ 160 | "emails = soup.find_all(class_=\"mail\")" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": null, 166 | "id": "cfb229ac", 167 | "metadata": {}, 168 | "outputs": [], 169 | "source": [ 170 | "emails[0:6]" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "id": "ecbb15ca", 177 | "metadata": {}, 178 | "outputs": [], 179 | "source": [ 180 | "for email in emails[0:6]:\n", 181 | " print(email['href'])" 182 | ] 183 | }, 184 | { 185 | "cell_type": "markdown", 186 | "id": "75db440f", 187 | "metadata": {}, 188 | "source": [ 189 | "### Another Way" 190 | ] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "execution_count": null, 195 | "id": "5021bfe5", 196 | "metadata": {}, 197 | "outputs": [], 198 | "source": [ 199 | "soup = BeautifulSoup(r.content) \n", 200 | "items = soup.find_all(class_=\"c-item__link\")" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": null, 206 | "id": "ab8091b9", 207 | "metadata": {}, 208 | "outputs": [], 209 | "source": [ 210 | "items[0]" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": null, 216 | "id": "01795f75", 217 | "metadata": {}, 218 | "outputs": [], 219 | "source": [ 220 | "links = []\n", 221 | "for i in items:\n", 222 | 
" link = i['href']\n", 223 | " links.append(link) " 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": null, 229 | "id": "ddad105f", 230 | "metadata": {}, 231 | "outputs": [], 232 | "source": [ 233 | "links[0:10]" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": null, 239 | "id": "244dca96", 240 | "metadata": {}, 241 | "outputs": [], 242 | "source": [ 243 | "link = '/profile/h/o/j.c.ho/j.c.ho.html?origin=%2BkELbJiCRnm%2F56cOYZSXzA'\n", 244 | "url = 'https://ascor.uva.nl/' + link\n", 245 | "r = requests.get(url)\n", 246 | "soup = BeautifulSoup(r.content)" 247 | ] 248 | }, 249 | { 250 | "cell_type": "code", 251 | "execution_count": null, 252 | "id": "627daaa9", 253 | "metadata": {}, 254 | "outputs": [], 255 | "source": [ 256 | "name = soup.find(class_=\"c-profile__name\").get_text()\n", 257 | "name" 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": null, 263 | "id": "13319474", 264 | "metadata": {}, 265 | "outputs": [], 266 | "source": [ 267 | "summary = soup.find(class_=\"c-profile__summary\").get_text()\n", 268 | "summary" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": null, 274 | "id": "1be51446", 275 | "metadata": {}, 276 | "outputs": [], 277 | "source": [ 278 | "profile = soup.find(id=\"Profile\").get_text()\n", 279 | "profile" 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": null, 285 | "id": "ceabdc89", 286 | "metadata": {}, 287 | "outputs": [], 288 | "source": [ 289 | "divs = soup.find_all('div', class_=\"c-profile__list\")\n", 290 | "divs" 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": null, 296 | "id": "059433f8", 297 | "metadata": {}, 298 | "outputs": [], 299 | "source": [ 300 | "divs[1].find_all('li')" 301 | ] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "execution_count": null, 306 | "id": "e1dea200", 307 | "metadata": {}, 308 | "outputs": [], 309 | "source": [ 310 | "divs[1].find_all('li')[0].get_text()" 311 | ] 312 | }, 313 | { 314 | "cell_type": "code", 315 | "execution_count": null, 316 | "id": "15e9caf6", 317 | "metadata": {}, 318 | "outputs": [], 319 | "source": [ 320 | "profiles = []\n", 321 | "\n", 322 | "for link in links:\n", 323 | " print(link)\n", 324 | " url = 'https://ascor.uva.nl' + link\n", 325 | " r = requests.get(url)\n", 326 | " soup = BeautifulSoup(r.content) \n", 327 | " name = soup.find(class_=\"c-profile__name\").get_text()\n", 328 | " summary = soup.find(class_=\"c-profile__summary\").get_text()\n", 329 | "# profile = soup.find(id=\"Profile\").get_text()\n", 330 | " divs = soup.find_all('div', class_=\"c-profile__list\")\n", 331 | " email = divs[1].find_all('li')[0].get_text()\n", 332 | " \n", 333 | " profile = {} \n", 334 | " profile['name'] = name\n", 335 | " profile['summary'] = summary\n", 336 | "# profile['profile'] = profile\n", 337 | " profile['email'] = email\n", 338 | " profiles.append(profile) " 339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "execution_count": null, 344 | "id": "122eae1c", 345 | "metadata": {}, 346 | "outputs": [], 347 | "source": [ 348 | "profiles = []\n", 349 | "\n", 350 | "for link in links:\n", 351 | " print(link)\n", 352 | " url = 'https://ascor.uva.nl' + link\n", 353 | " r = requests.get(url)\n", 354 | " soup = BeautifulSoup(r.content) \n", 355 | " name = soup.find(class_=\"c-profile__name\").get_text()\n", 356 | " summary = soup.find(class_=\"c-profile__summary\").get_text()\n", 357 | " try:\n", 358 | " profile_text = 
soup.find(id=\"Profile\").get_text()\n", 359 | " except:\n", 360 | " profile_text = np.nan\n", 361 | " divs = soup.find_all('div', class_=\"c-profile__list\")\n", 362 | " try:\n", 363 | " email = divs[1].find_all('li')[0].get_text()\n", 364 | " except:\n", 365 | " email = np.nan\n", 366 | " \n", 367 | " profile = {} \n", 368 | " profile['name'] = name\n", 369 | " profile['summary'] = summary\n", 370 | " profile['profile_text'] = profile_text\n", 371 | " profile['email'] = email\n", 372 | " profiles.append(profile) " 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": null, 378 | "id": "db0e8c1d", 379 | "metadata": {}, 380 | "outputs": [], 381 | "source": [ 382 | "df = pd.DataFrame(profiles)" 383 | ] 384 | }, 385 | { 386 | "cell_type": "code", 387 | "execution_count": null, 388 | "id": "4febe195", 389 | "metadata": {}, 390 | "outputs": [], 391 | "source": [ 392 | "df" 393 | ] 394 | }, 395 | { 396 | "cell_type": "code", 397 | "execution_count": null, 398 | "id": "a7ff0b4c", 399 | "metadata": {}, 400 | "outputs": [], 401 | "source": [ 402 | "df.to_csv('profiles.csv')" 403 | ] 404 | }, 405 | { 406 | "cell_type": "code", 407 | "execution_count": null, 408 | "id": "328cb381", 409 | "metadata": {}, 410 | "outputs": [], 411 | "source": [ 412 | "df.to_json('profiles.json', orient='records', lines=True)" 413 | ] 414 | }, 415 | { 416 | "cell_type": "markdown", 417 | "id": "431264f5", 418 | "metadata": {}, 419 | "source": [ 420 | "### Exercise\n", 421 | "\n", 422 | "Get the full text of all the news item here: https://ascor.uva.nl/news/newslist.html" 423 | ] 424 | }, 425 | { 426 | "cell_type": "code", 427 | "execution_count": null, 428 | "id": "4f11a935", 429 | "metadata": {}, 430 | "outputs": [], 431 | "source": [] 432 | } 433 | ], 434 | "metadata": { 435 | "kernelspec": { 436 | "display_name": "Python 3 (ipykernel)", 437 | "language": "python", 438 | "name": "python3" 439 | }, 440 | "language_info": { 441 | "codemirror_mode": { 442 | "name": "ipython", 443 | "version": 3 444 | }, 445 | "file_extension": ".py", 446 | "mimetype": "text/x-python", 447 | "name": "python", 448 | "nbconvert_exporter": "python", 449 | "pygments_lexer": "ipython3", 450 | "version": "3.9.18" 451 | } 452 | }, 453 | "nbformat": 4, 454 | "nbformat_minor": 5 455 | } 456 | -------------------------------------------------------------------------------- /2023/day3/get_mails: -------------------------------------------------------------------------------- 1 | all_mail = [] 2 | 3 | for email in emails[0:6]: 4 | new_mail = email['href'].replace('mailto:','') 5 | all_mail.append(new_mail) 6 | -------------------------------------------------------------------------------- /2023/day3/updated cell: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | 3 | info = { 4 | 'id':[], 5 | 'views':[] 6 | } 7 | 8 | for item in lotr_videos_ids['items']: 9 | vidId = item['id']['videoId'] 10 | r = youtube.videos().list( 11 | part="statistics,snippet", 12 | id=vidId, 13 | fields="items(statistics)" 14 | ).execute() 15 | 16 | views = r['items'][0]['statistics']['viewCount'] 17 | info['id'].append(vidId) 18 | info['views'].append(views) 19 | 20 | df = pd.DataFrame(data=info) 21 | 22 | 23 | 24 | 25 | all_mail = [] 26 | 27 | for email in emails[0:6]: 28 | new_mail = email['href'].replace('mailto:','') 29 | all_mail.append(new_mail) 30 | -------------------------------------------------------------------------------- /2023/day4/README.md: 
-------------------------------------------------------------------------------- 1 | # Day 4: Natural Language Processing 2 | 3 | | Time slot | Content | 4 | |---------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------| 5 | | 09:30-11:00 | [Introduction to NLP and text as data](slides-04-1.pdf): In a gentle introduction to NLP techniques, we will discuss the basics of bag-of-words (BOW) approaches, such as tokenization, stopword removal, stemming and lemmatization | 6 | | 11:00 - 12:30 | Exercise time! [practice with some NLP](/exercises-morning/) or experiment with [vectorizers](/exercises-vectorizers/) | 7 | | 12:30-13:30 | Lunch (tip: lunch lecture by Toni van der Meer!) | 8 | | 13:30-14:30 | [Advanced NLP and regular expressions](slides-04-2.pdf): In the second lecture of the day, we will delve a bit deeper into NLP approaches. We discuss the possibilities of NER in spaCy and introduce regular expressions | 9 | | 14:45-15:30 | Exercise time! [Play around with regular expressions](/exercises-afternoon/) or explore [spaCy](spacy-examples.ipynb) | 10 | | 15:30-16:00 | Wrap-up / final questions | 11 | -------------------------------------------------------------------------------- /2023/day4/example-ngrams.md: -------------------------------------------------------------------------------- 1 | ## N-grams 2 | 3 | ```python 4 | import nltk 5 | from gensim import corpora 6 | from gensim import models 7 | 8 | documents = ["In the train from Connecticut to New York", 9 | "He is a spokesman for New York City's Health Department", 10 | "New York has been one of the states hit hardest by Coronavirus"] 11 | 12 | documents_bigrams = [["_".join(tup) for tup in nltk.ngrams(doc.split(),2)] for doc in documents] 13 | 14 | len(documents) == len(documents_bigrams) 15 | # maybe we want both unigrams and bigrams in the feature set? 
16 | documents_uniandbigrams = [] 17 | for a,b in zip([doc.split() for doc in documents],documents_bigrams): 18 | documents_uniandbigrams.append(a + b) 19 | 20 | print(documents_uniandbigrams) 21 | ``` 22 | 23 | if you want to use this as input for a `sklearn` classifier, you can do the following: 24 | 25 | ```python 26 | myvectorizer = CountVectorizer(analyzer=lambda x:x) 27 | ``` 28 | 29 | And if you want to see what's happening, convert to a dense format (please only do this with a small toy sample, never on a large dataset): 30 | 31 | ```python 32 | X = myvectorizer.fit_transform(documents_uniandbigrams) 33 | df = pd.DataFrame(X.toarray().transpose(), index = myvectorizer.get_feature_names()) 34 | df 35 | documents_uniandbigrams 36 | ​ 37 | myvectorizer = CountVectorizer(analyzer=lambda x:x) 38 | X = myvectorizer.fit_transform(documents_uniandbigrams) 39 | df = pd.DataFrame(X.toarray().transpose(), index = myvectorizer.get_feature_names()) 40 | ``` 41 | 42 | ### Collocations with `NLTK` 43 | 44 | ```python 45 | import nltk 46 | documents = ["He travelled by train from New York to Connecticut and back to New York", 47 | "He is a spokesman for New York City's Health Department", 48 | "New York has been one of the states hit hardest by Coronavirus"] 49 | 50 | text = [nltk.Text(tkn for tkn in doc.split()) for doc in documents ] # this inspects frequencies WITHIN documents 51 | text[0].collocations(num=10) 52 | ``` 53 | 54 | ### Collocations with `Gensim` 55 | 56 | ```python 57 | from nltk.tokenize import TreebankWordTokenizer 58 | import pandas as pd 59 | import regex 60 | from sklearn.feature_extraction.text import CountVectorizer 61 | from gensim.models import KeyedVectors, Phrases 62 | from gensim.models.phrases import Phraser 63 | from glob import glob 64 | 65 | infowarsfiles = glob('articles/*/Infowars/*') 66 | documents = [] 67 | for filename in infowarsfiles: 68 | with open(filename) as f: 69 | documents.append(f.read()) 70 | 71 | mytokenizer = TreebankWordTokenizer() 72 | tokenized_texts = [mytokenizer.tokenize(t) for t in documents] 73 | 74 | phrases_model = Phrases(tokenized_texts, min_count=10, scoring="npmi", threshold=.5) 75 | score_dict = phrases_model.export_phrases() 76 | scores = pd.DataFrame(score_dict.items(), 77 | columns=["phrase", "score"]) 78 | scores.sort_values("score",ascending=False).head() 79 | ``` 80 | 81 | Using `Gensim`'s collocations in `sklearn`'s vectorizer 82 | 83 | ```python 84 | from gensim.models.phrases import Phraser 85 | import numpy as np 86 | 87 | phraser = Phraser(phrases_model) 88 | tokens_phrases = [phraser[doc] for doc in tokens] 89 | cv = CountVectorizer(tokenizer=lambda x: x, lowercase=False) # initiate a count or tfidf vectorizer 90 | ``` 91 | 92 | Inspecting the resulting dtm 93 | 94 | ```python 95 | from gensim.models.phrases import Phraser 96 | import numpy as np 97 | 98 | phraser = Phraser(phrases_model) 99 | tokens_phrases = [phraser[doc] for doc in tokens] 100 | cv = CountVectorizer(tokenizer=lambda x: x, lowercase=False) # initiate a count or tfidf vectorizer 101 | 102 | 103 | 104 | def termstats(dfm, vectorizer): 105 | """Helper function to calculate term and document frequency per term""" 106 | # Frequencies are the column sums of the DFM 107 | frequencies = dfm.sum(axis=0).tolist()[0] 108 | # Document frequencies are the binned count 109 | # of the column indices of DFM entries 110 | docfreqs = np.bincount(dfm.indices) 111 | freq_df=pd.DataFrame(dict(frequency=frequencies,docfreq=docfreqs), index=vectorizer.get_feature_names()) 
112 | return freq_df.sort_values("frequency", ascending=False) 113 | 114 | dtm = cv.fit_transform(tokens_phrases) 115 | termstats(dtm, cv).filter(like="hussein", axis=0) 116 | ``` 117 | -------------------------------------------------------------------------------- /2023/day4/exercises-afternoon/01tuesday-regex-exercise.md: -------------------------------------------------------------------------------- 1 | 2 | # Exercise with regular expressions 3 | 4 | Let’s take some time to write some regular expressions. Write a 5 | script that 6 | 7 | • extracts URLS form a list of strings 8 | • removes everything that is not a letter or number from a list of 9 | strings 10 | 11 | 12 | ```python 13 | list_w_urls = ["some text with a url http://www.youtube.com... ", 14 | "and another one!! https://www.facebook.com", 15 | "more urls www.baidu.com??", 16 | "And even more?!! %$##($^) https://www.yahoo.com and this one http://www.amazon.com and this one www.wikipedia.org" ] 17 | ``` 18 | -------------------------------------------------------------------------------- /2023/day4/exercises-afternoon/01tuesday-regex-solution.md: -------------------------------------------------------------------------------- 1 | 2 | # Possible solution to `regex` [exercise](01tuesday-regex-exercise.md) 3 | *Please note that alternative solutions may work just as well or even better* 4 | 5 | ## extracts URLS form a list of strings 6 | 7 | ```python 8 | import re 9 | 10 | for l in list_w_urls: 11 | m = re.findall('(?:(?:https?|ftp):\/\/)?[\w.]+\.[\w]+.', l) 12 | print(m) 13 | 14 | ``` 15 | 16 | `?` = matches either once or zero times 17 | `?:` = matches the group but does not captured it / save it. 18 | `\w` = matches a "word" character: a letter or digit or underbar [a-zA-Z0-9_]. 19 | 20 | 21 | ## remove everything that is not a letter or number from a list of strings 22 | 23 | ```python 24 | for e in list_w_urls: 25 | print(re.sub(r'[\W_]+', ' ', e)) 26 | ``` 27 | 28 | `[\W_]` = matches any non-word character (or underscore, which is weirdly enough considered a 'word character' - therefore, if we simply do `\W`, we miss the underscores) 29 | `+` = 1 or more times 30 | -------------------------------------------------------------------------------- /2023/day4/exercises-afternoon/02tuesday-exercise_nexis.md: -------------------------------------------------------------------------------- 1 | # A Practical Introduction to Machine Learning in Python 2 | Anne Kroon and Damian Trilling 3 | 4 | ## Day 2 (Tuesday Afternoon) 5 | 6 | ## Exercise: Parsing unstructured text files 7 | 8 | When working with text data, often we have to deal with unstructured files. Before we can start with our analysis, we have to transform such files to more structured forms of data. 9 | 10 | An example of such forms of unstructured data is the output of Nexis Uni, a large news database often used by social scientists. 11 | We will practise with some files downloaded from Nexis Uni. 12 | 13 | Download and unpack a set of .RTF files [here](corona_news.tar.gz). 14 | Windows users may need an additional program to unpack it, such as 7zip. 15 | 16 | Specific tasks 17 | 18 | 1. Write some code to read the data in. 19 | 3. Try to extract the newspaper title using regular expressions. 20 | 4. Do the same for the publication dates. 21 | 5. Finally, extract the full body of the text. 22 | 6. 
Think about a way to store the data 23 | 24 | 25 | Hints: 26 | 27 | In order to read .RTF files with python, we need to convert rtf files to strings, before we can start parsing and processing. 28 | This library can help: https://pypi.org/project/striprtf/ 29 | 30 | ```bash 31 | pip install striprtf 32 | ``` 33 | 34 | Afterwards, we can start converting our files: 35 | 36 | ```python 37 | from striprtf.striprtf import rtf_to_text 38 | 39 | rtf_string = open("exercises-afternoon/corona_news/news_corona_1.RTF").read() 40 | text = rtf_to_text(rtf_string) 41 | 42 | ``` 43 | 44 | This will return a string object. In order to split up the string by article, we can look at the structure of the data. 45 | As you might notice, all news articles went with 'End of Document '. We can use this information to split the string. 46 | 47 | ```python 48 | splitted_text = text.replace("\n", " ").split("End of Document ") 49 | ``` 50 | -------------------------------------------------------------------------------- /2023/day4/exercises-afternoon/02tuesday-exercise_nexis_solution.md: -------------------------------------------------------------------------------- 1 | # A Practical Introduction to Machine Learning in Python 2 | Anne Kroon and Damian Trilling 3 | 4 | This is just one solution. Maybe you came up with an even better one yourself! 5 | 6 | ### Reading the files in: 7 | 8 | ```python 9 | from striprtf.striprtf import rtf_to_text 10 | 11 | # read the files in 12 | filenames = ["news_corona_" + str(i) + ".RTF" for i in range(1, 4) ] 13 | rtf_string = [ open("exercises-afternoon/corona_news/" + f).read() for f in filenames ] 14 | 15 | # convert the files from rtf to string format 16 | text = [ rtf_to_text(i) for i in rtf_string ] 17 | 18 | # replace line breaks and split articles 19 | 20 | splitted_text = [ i.replace("\n", " ").split("End of Document ") for i in text ] 21 | 22 | ``` 23 | 24 | ### A function that parses the documents. 25 | 26 | ```python 27 | import re 28 | 29 | def parse_nexis_uni(news_string): 30 | ''' parses strings (nexis news articles), so that the title, date and full text are extracted. ''' 31 | 32 | parsed_results = [] 33 | for line in news_string: 34 | 35 | # newspaper title 36 | matchObj1=re.match(" +([a-zA-Z\s]+?) 
\d+",line) 37 | if matchObj1: 38 | newspaper = matchObj1.group(1) 39 | else: 40 | newspaper = "NaN" 41 | 42 | # date 43 | matchObj2 = re.match(r".*(\d{1,2}) ([jJ]anuari|[fF]ebruari|[mM]aart|[aA]pril|[mM]ei|[jJ]uni|[jJ]uli|[aA]ugustus|[sS]eptember|[Oo]ktober|[nN]ovember|[dD]ecember) (\d{4}).*", line) 44 | if matchObj2: 45 | day = matchObj2.group(1) 46 | month = matchObj2.group(2) 47 | year = matchObj2.group(3) 48 | date = (day, month, year ) 49 | else: 50 | date = "NaN" 51 | 52 | # full text 53 | matchObj3=re.match(".*Body(.*) Classification",line) 54 | if matchObj3: 55 | text = matchObj3.group(1).strip() 56 | else: 57 | text = "NaN" 58 | 59 | parsed_results.append( {'newspaper': newspaper, 60 | 'date' : date, 61 | 'text': text } ) 62 | 63 | return parsed_results 64 | 65 | ``` 66 | 67 | #### calling the function 68 | 69 | ```python 70 | results = [] 71 | for document in splitted_text: 72 | results.extend(parse_nexis_uni(document)) 73 | ``` 74 | -------------------------------------------------------------------------------- /2023/day4/exercises-afternoon/corona_news.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2023/day4/exercises-afternoon/corona_news.tar.gz -------------------------------------------------------------------------------- /2023/day4/exercises-morning/exercise-feature-engineering.md: -------------------------------------------------------------------------------- 1 | # Exercise 1: Working with textual data 2 | 3 | ### 0. Get the data. 4 | 5 | - Download `articles.tar.gz` from 6 | https://dx.doi.org/10.7910/DVN/ULHLCB 7 | 8 | If you experience difficulties downloading this (rather large) dataset, you can also download just a part of the data [here](https://surfdrive.surf.nl/files/index.php/s/bfNFkuUVoVtiyuk) 9 | 10 | - Unpack it. On Linux and MacOS, you can do this with `tar -xzf mydata.tar.gz` on the command line. On Windows, you may need an additional tool such as `7zip` for that (note that technically speaking, there is a `tar` archive within a `gz` archive, so unpacking may take *two* steps depending on your tool). 11 | 12 | 13 | ### 1. Inspect the structure of the dataset. 14 | What information do the following elements give you? 15 | 16 | - folder (directory) names 17 | - folder structure/hierarchy 18 | - file names 19 | - file contents 20 | 21 | ### 2. Discuss strategies for working with this dataset! 22 | 23 | - Which questions could you answer? 24 | - How could you deal with it, given the size and the structure? 25 | - How much memory1 (RAM) does your computer have? How large is the complete dataset? What does that mean? 26 | - Make a sketch (e.g., with pen&paper), how you could handle your workflow and your data to answer your question. 27 | 28 | 1 *memory* (RAM), not *storage* (harddisk)! 29 | 30 | ### 3. Read some (or all?) data 31 | 32 | Here is some example code that you can modify. Assuming that he folder `articles` is in the same folder as the notebook you are currently working on, you could, for instance, do the following to read a *part* of your dataset. 33 | 34 | ```python 35 | from glob import glob 36 | infowarsfiles = glob('articles/*/Infowars/*') 37 | infowarsarticles = [] 38 | for filename in infowarsfiles: 39 | with open(filename) as f: 40 | infowarsarticles.append(f.read()) 41 | 42 | ``` 43 | 44 | - Can you explain what the `glob` function does? 
45 | - What does `infowarsfiles` contain, and what does `infowarsarticles` contain? First make an educated guess based on the code snippet, then check it! Do *not* print the whole thing, but use `len`, `type` en slicing `[:10]` to get the info ou need. 46 | 47 | - Tip: take a random sample of the articles for practice purposes (if your code works, you can scale up!) 48 | 49 | ``` 50 | # taking a random sample of the articles for practice purposes 51 | articles =random.sample(infowarsarticles, 10) 52 | ``` 53 | 54 | ### 2. first analyses and pre-processing steps 55 | 56 | - Perform some first analyses on the data using string methods and regular expressions. 57 | Techniques you can try out include: 58 | 59 | a. lowercasing 60 | b. tokenization 61 | c. stopword removal 62 | d. stemming and/or lemmatizing) 63 | e. cleaning: removing punctuation, line breaks, double spaces 64 | 65 | 66 | ### 3. N-grams 67 | 68 | - Think about what type of n-grams you want to add to your feature set. Extract and inspect n-grams and/or collocations, and add them to your feature set if you think this is relevant. 69 | 70 | ### 4. Extract entities and other meaningful information 71 | 72 | Try to extract meaningful information from your texts. Depending on your interests and the nature of the data, you could: 73 | 74 | - use regular expressions to distinguish relevant from irrelevant texts, or to extract substrings 75 | - use NLP techniques such as Named Entity Recognition to extract entities that occur. 76 | 77 | ### 5. Train a supervised classifier 78 | 79 | Go back to your code belonging to yesterday's assignment. Perform the same classification task, but this time carefully consider which feature set you want to use. Reflect on the options listed above, and extract features that you think are relevant to include. Carefully consider **pre-processing steps**: what type of features will you feed your algorithm? Do you, for example, want to manually remove stopwords, or include ngrams? Use these features as input for your classifier, and investigate the effects hereof on performance of the classifier. Not that the purpose is not to build the perfect classifier, but to inspect the effects of different feature engineering decisions on the outcomes of your classification algorithm. 80 | 81 | 82 | ## BONUS 83 | 84 | - Compare that bottom-up approach with a top-down (keyword or regular-expression based) approach. 85 | -------------------------------------------------------------------------------- /2023/day4/exercises-morning/possible-solution-exercise-day2-vectorizers.md: -------------------------------------------------------------------------------- 1 | ### using manually crafted features as input for supervised machine learning with `sklearn` 2 | 3 | 4 | ```python 5 | import nltk 6 | from sklearn.model_selection import train_test_split 7 | 8 | from glob import glob 9 | import random 10 | 11 | 12 | def read_data(listofoutlets): 13 | texts = [] 14 | labels = [] 15 | for label in listofoutlets: 16 | for file in glob(f'../articles-small/*/{label}/*'): 17 | with open(file) as f: 18 | texts.append(f.read()) 19 | labels.append(label) 20 | return texts, labels 21 | 22 | documents, labels = read_data(['Infowars', 'BBC']) 23 | ``` 24 | 25 | Create bigrams and combine with unigrams 26 | 27 | ```python 28 | documents_bigrams = [["_".join(tup) for tup in nltk.ngrams(doc.split(),2)] for doc in documents] # creates bigrams 29 | documents_bigrams[7][:5] # inspect the results... 
30 | 31 | # maybe we want both unigrams and bigrams in the feature set? 32 | assert len(documents)==len(documents_bigrams) 33 | 34 | documents_uniandbigrams = [] 35 | for a,b in zip([doc.split() for doc in documents],documents_bigrams): 36 | documents_uniandbigrams.append(a + b) 37 | 38 | #and let's inspect the outcomes again. 39 | documents_uniandbigrams[7] 40 | ``` 41 | 42 | some sanity checks: 43 | 44 | ```python 45 | len(documents_uniandbigrams[7]),len(documents_bigrams[7]),len(documents[7].split()) 46 | assert len(documents_uniandbigrams) == len(labels) 47 | ``` 48 | 49 | Now lets fit a `sklearn` vectorizer on the manually crafted feature set: 50 | 51 | ```python 52 | from sklearn.feature_extraction.text import CountVectorizer 53 | X_train,X_test,y_train,y_test=train_test_split(documents_uniandbigrams, labels, test_size=0.3) 54 | # We do *not* want scikit-learn to tokenize a string into a list of tokens, 55 | # after all, we already *have* a list of tokens. lambda x:x is just a fancy way of saying: 56 | # do nothing! 57 | myvectorizer= CountVectorizer(analyzer=lambda x:x) 58 | ``` 59 | 60 | let's fit and transform 61 | 62 | ```python 63 | #Fit the vectorizer, and transform. 64 | X_features_train = myvectorizer.fit_transform(X_train) 65 | X_features_test = myvectorizer.transform(X_test) 66 | ``` 67 | 68 | Inspect the vocabulary and their id mappings 69 | 70 | ```python 71 | # inspect 72 | myvectorizer.vocabulary_ 73 | ``` 74 | 75 | Finally, run the model again 76 | 77 | ```python 78 | from sklearn.naive_bayes import MultinomialNB 79 | from sklearn.metrics import accuracy_score 80 | from sklearn.metrics import classification_report 81 | 82 | model = MultinomialNB() 83 | model.fit(X_features_train, y_train) 84 | y_pred = model.predict(X_features_test) 85 | 86 | print(f"Accuracy : {accuracy_score(y_test, y_pred)}") 87 | print(classification_report(y_test, y_pred)) 88 | ``` 89 | 90 | 91 | ### Final remark on ngrams in scikit learn 92 | 93 | Of course, you do not *have* to do all of this if you just want to use ngrams. Alternatively, you can simply use 94 | ``` 95 | myvectorizer = CountVectorizer(ngram_range=(1,2)) 96 | X_features_train = myvectorizer.fit_transform(X_train) 97 | ``` 98 | *if X_train are the **untokenized** texts.* 99 | 100 | What this little example illustrates, though, is that you can use *any* manually crafted feature set as input for scikit-learn. 101 | -------------------------------------------------------------------------------- /2023/day4/exercises-morning/possible-solution-exercise-day2.md: -------------------------------------------------------------------------------- 1 | 2 | ## Exercise 2: NLP and feature engineering 3 | 4 | ### 1. Read in the data 5 | 6 | Load the data... 7 | 8 | ```python 9 | from glob import glob 10 | import random 11 | import nltk 12 | from nltk.stem.snowball import SnowballStemmer 13 | import spacy 14 | 15 | 16 | infowarsfiles = glob('articles/*/Infowars/*') 17 | infowarsarticles = [] 18 | for filename in infowarsfiles: 19 | with open(filename) as f: 20 | infowarsarticles.append(f.read()) 21 | 22 | 23 | # taking a random sample of the articles for practice purposes 24 | articles =random.sample(infowarsarticles, 10) 25 | ``` 26 | 27 | Let's inspect the data, and start some pre-processing/ cleaning steps... 28 | 29 | ### 2. first analyses and pre-processing steps 30 | 31 | ##### a. lowercasing articles 32 | 33 | ```python 34 | articles_lower_cased = [art.lower() for art in articles] 35 | ``` 36 | ##### b. 
tokenization 37 | 38 | Basic solution, using the `.str` method `.split()`. Not very sophisticated, though. 39 | 40 | ```python 41 | articles_split = [art.split() for art in articles] 42 | ``` 43 | 44 | A more sophisticated solution: 45 | 46 | ```python 47 | from nltk.tokenize import TreebankWordTokenizer 48 | articles_tokenized = [TreebankWordTokenizer().tokenize(art) for art in articles ] 49 | ``` 50 | 51 | Even more sophisticated; create your own tokenizer that first split into sentences. In this way,`TreebankWordTokenizer` works better. 52 | 53 | ```python 54 | import regex 55 | 56 | nltk.download("punkt") 57 | class MyTokenizer: 58 | def tokenize(self, text): 59 | tokenizer = TreebankWordTokenizer() 60 | result = [] 61 | word = r"\p{letter}" 62 | for sent in nltk.sent_tokenize(text): 63 | tokens = tokenizer.tokenize(sent) 64 | tokens = [t for t in tokens 65 | if regex.search(word, t)] 66 | result += tokens 67 | return result 68 | 69 | mytokenizer = MyTokenizer() 70 | print(mytokenizer.tokenize(articles[0])) 71 | ``` 72 | 73 | ##### c. removing stopwords 74 | 75 | Define your stopwordlist: 76 | 77 | ```python 78 | from nltk.corpus import stopwords 79 | mystopwords = stopwords.words("english") 80 | mystopwords.extend(["add", "more", "words"]) # manually add more stopwords to your list if needed 81 | print(mystopwords) #let's see what's inside 82 | ``` 83 | 84 | Now, remove stopwords from the corpus: 85 | 86 | ```python 87 | articles_without_stopwords = [] 88 | for article in articles: 89 | articles_no_stop = "" 90 | for word in article.lower().split(): 91 | if word not in mystopwords: 92 | articles_no_stop = articles_no_stop + " " + word 93 | articles_without_stopwords.append(articles_no_stop) 94 | ``` 95 | 96 | Same solution, but with list comprehension: 97 | 98 | ```python 99 | articles_without_stopwords = [" ".join([w for w in article.lower().split() if w not in mystopwords]) for article in articles] 100 | ``` 101 | 102 | Different--probably more sophisticated--solution, by writing a function and calling it in a list comprehension: 103 | 104 | ```python 105 | def remove_stopwords(article, stopwordlist): 106 | cleantokens = [] 107 | for word in article: 108 | if word.lower() not in mystopwords: 109 | cleantokens.append(word) 110 | return cleantokens 111 | 112 | articles_without_stopwords = [remove_stopwords(art, mystopwords) for art in articles_tokenized] 113 | ``` 114 | 115 | It's good practice to frequently inspect the results of your code, to make sure you are not making mistakes, and the results make sense. For example, compare your results to some random articles from the original sample: 116 | 117 | ```python 118 | print(articles[8][:100]) 119 | print("-----------------") 120 | print(" ".join(articles_without_stopwords[8])[:100]) 121 | ``` 122 | 123 | ##### d. 
stemming and lemmatization 124 | 125 | ```python 126 | stemmer = SnowballStemmer("english") 127 | 128 | stemmed_text = [] 129 | for article in articles: 130 | stemmed_words = "" 131 | for word in article.lower().split(): 132 | stemmed_words = stemmed_words + " " + stemmer.stem(word) 133 | stemmed_text.append(stemmed_words.strip()) 134 | ``` 135 | 136 | Same solution, but with list comprehension: 137 | 138 | ```python 139 | stemmed_text = [" ".join([stemmer.stem(w) for w in article.lower().split()]) for article in articles] 140 | ``` 141 | 142 | Compare tokenization and lemmatization using `spaCy`: 143 | 144 | ```python 145 | import spacy 146 | nlp = spacy.load("en_core_web_sm") 147 | lemmatized_articles = [[token.lemma_ for token in nlp(art)] for art in articles] 148 | ``` 149 | 150 | Again, frequently inspect your code, and for example compare the results to the original articles: 151 | 152 | ```python 153 | print(articles[6][:100]) 154 | print("-----------------") 155 | print(stemmed_text[6][:100]) 156 | print("-----------------") 157 | print(" ".join(lemmatized_articles[6])[:100]) 158 | ``` 159 | 160 | 161 | ##### e. cleaning: removing punctuation, line breaks, double spaces 162 | 163 | ```python 164 | articles[7] # print an article to inspect (the practice sample above only contains 10 articles). 165 | ## Typical cleaning up steps: 166 | from string import punctuation 167 | articles = [art.replace('\n\n', '') for art in articles] # remove line breaks 168 | articles = ["".join([w for w in art if w not in punctuation]) for art in articles] # remove punctuation 169 | articles = [" ".join(art.split()) for art in articles] # remove double spaces by splitting the strings into words and joining these words again 170 | 171 | articles[7] # print the same article to see whether the changes are in line with what you want 172 | ``` 173 | 174 | ### 3. N-grams 175 | 176 | ```python 177 | articles_bigrams = [["_".join(tup) for tup in nltk.ngrams(art.split(),2)] for art in articles] # creates bigrams 178 | articles_bigrams[7][:5] # inspect the results... 179 | 180 | # maybe we want both unigrams and bigrams in the feature set? 181 | 182 | assert len(articles)==len(articles_bigrams) 183 | 184 | articles_uniandbigrams = [] 185 | for a,b in zip([art.split() for art in articles],articles_bigrams): 186 | articles_uniandbigrams.append(a + b) 187 | 188 | #and let's inspect the outcomes again. 189 | articles_uniandbigrams[7] 190 | len(articles_uniandbigrams[7]),len(articles_bigrams[7]),len(articles[7].split()) 191 | ``` 192 | 193 | Or, if you want to inspect collocations: 194 | 195 | ```python 196 | text = [nltk.Text(tkn for tkn in art.split()) for art in articles ] 197 | text[7].collocations(num=10) 198 | ``` 199 | 200 | ---------- 201 | 202 | ### 4. 
Extract entities and other meaningful information 203 | 204 | ```Python 205 | import nltk 206 | 207 | tokens = [nltk.word_tokenize(sentence) for sentence in articles] 208 | tagged = [nltk.pos_tag(sentence) for sentence in tokens] 209 | print(tagged[0]) 210 | ``` 211 | 212 | playing around with Spacy: 213 | 214 | ```python 215 | nlp = spacy.load('en') 216 | 217 | doc = [nlp(sentence) for sentence in articles] 218 | for i in doc: 219 | for ent in i.ents: 220 | if ent.label_ == 'PERSON': 221 | print(ent.text, ent.label_ ) 222 | 223 | ``` 224 | 225 | 226 | 227 | 228 | 229 | 230 | 231 | 232 | Removing stopwords: 233 | 234 | ```python 235 | mystopwords = set(stopwords.words('english')) # use default NLTK stopword list; alternatively: 236 | # mystopwords = set(open('mystopwordfile.txt').readlines()) #read stopword list from a textfile with one stopword per line 237 | documents = [" ".join([w for w in doc.split() if w not in mystopwords]) for doc in documents] 238 | documents[7] 239 | ``` 240 | 241 | Using N-grams as features: 242 | 243 | ```python 244 | documents_bigrams = [["_".join(tup) for tup in nltk.ngrams(doc.split(),2)] for doc in documents] # creates bigrams 245 | documents_bigrams[7][:5] # inspect the results... 246 | 247 | # maybe we want both unigrams and bigrams in the feature set? 248 | 249 | assert len(documents)==len(documents_bigrams) 250 | 251 | documents_uniandbigrams = [] 252 | for a,b in zip([doc.split() for doc in documents],documents_bigrams): 253 | documents_uniandbigrams.append(a + b) 254 | 255 | #and let's inspect the outcomes again. 256 | documents_uniandbigrams[7] 257 | len(documents_uniandbigrams[7]),len(documents_bigrams[7]),len(documents[7].split()) 258 | ``` 259 | 260 | Or, if you want to inspect collocations: 261 | 262 | ```python 263 | text = [nltk.Text(tkn for tkn in doc.split()) for doc in documents ] 264 | text[7].collocations(num=10) 265 | ``` 266 | 267 | ---- 268 | 269 | 270 | *hint: if you want to include n-grams as feature input, add the following argument to your vectorizer:* 271 | 272 | ```python 273 | myvectorizer= CountVectorizer(analyzer=lambda x:x) 274 | ``` 275 | -------------------------------------------------------------------------------- /2023/day4/exercises-vectorizers/Understanding_vectorizers.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "3903bc69", 6 | "metadata": {}, 7 | "source": [ 8 | "# Understanding vectorizers\n", 9 | "\n", 10 | "In the following code examples, we will experiment with vectorizers to understand a bit better how they work. Feel free to adjust the code, and try things out yourself.\n", 11 | "\n", 12 | "For now, we will practice with `sklearn`'s vectorizers. however, packages such as `gensim` offer their own build in functionality to vectorize the data. " 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "metadata": {}, 18 | "source": [ 19 | "Please keep in mind that we differentiate between `sparse` and `dense` matrixes. The following visualization may help you understand the difference. 
" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 2, 25 | "metadata": {}, 26 | "outputs": [ 27 | { 28 | "data": { 29 | "text/html": [ 30 | "" 31 | ], 32 | "text/plain": [ 33 | "" 34 | ] 35 | }, 36 | "metadata": {}, 37 | "output_type": "display_data" 38 | } 39 | ], 40 | "source": [ 41 | "from IPython.display import display, Image\n", 42 | "url = \"https://miro.medium.com/v2/resize:fit:4800/format:webp/1*1LLMA9VGH6x8mRKqT-Mhtw.gif\"\n", 43 | "# Display the GIF\n", 44 | "display(Image(url=url))" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 1, 50 | "id": "d6288fa8", 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "import pandas as pd\n", 55 | "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer" 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "id": "9efbfbd6", 61 | "metadata": {}, 62 | "source": [ 63 | "## Example 1: Inspect the output of a vectorizer in a dense format\n", 64 | "\n", 65 | "The following code cell will fit and transform three documents using a `Count`-based vectorizer. Next, the output is transformed to a *dense* matrix, and printed. \n", 66 | "\n", 67 | "1. Do you understand the output?\n", 68 | "2. Is it smart to transform output to a dense format? What will happen if you work with millions of documents, rather than 3 short sentences?\n", 69 | "3. what happens if you replace `CountVectorizer()` for `TfidfVectorizer()`?" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 2, 75 | "id": "49495cfd", 76 | "metadata": {}, 77 | "outputs": [ 78 | { 79 | "name": "stdout", 80 | "output_type": "stream", 81 | "text": [ 82 | " are everybody hello how students today what you\n", 83 | "0 0 0 1 0 1 0 0 0\n", 84 | "1 1 0 0 1 0 1 0 1\n", 85 | "2 0 0 0 0 0 0 1 0\n", 86 | "3 0 1 2 0 0 0 0 0\n" 87 | ] 88 | } 89 | ], 90 | "source": [ 91 | "texts = [\"hello students!\", \"how are you today?\", \"what?\", \"hello hello everybody\"]\n", 92 | "vect = CountVectorizer()# initialize the vectorizer\n", 93 | "\n", 94 | "X = vect.fit_transform(texts) #fit the vectorizer and transform the documents in one go\n", 95 | "print(pd.DataFrame(X.A, columns=vect.get_feature_names_out()).to_string())\n", 96 | "df = pd.DataFrame(X.toarray().transpose(), index = vect.get_feature_names_out())" 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "id": "72b8d55e", 102 | "metadata": {}, 103 | "source": [ 104 | "## Example 2: Inspect the output of a vectorizer in a sparse format\n", 105 | "\n", 106 | "Internally, `sklearn` represents the data in a *sparse* format, as this is computationally more efficient, and less memory is required.\n" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": 3, 112 | "id": "88bfaeba", 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [ 116 | "texts = [\"hello students!\", \"how are you today?\", \"what?\", \"hello hello everybody\"]\n", 117 | "count_vec = CountVectorizer() #initilize the vectorizer\n", 118 | "count_vec_fit = count_vec.fit_transform(texts) #fit the vectorizer and transform the documents in one go" 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "id": "e95a380b", 124 | "metadata": {}, 125 | "source": [ 126 | " 1.Inspect the shape of transformed texts. 
We can see that we have a 4x8 sparse matrix, meaning that we have 4 \n", 127 | " rows (=documents) and 8 unique tokens (=words, numbers)\n" 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": 4, 133 | "id": "d9363fb0", 134 | "metadata": {}, 135 | "outputs": [ 136 | { 137 | "data": { 138 | "text/plain": [ 139 | "<4x8 sparse matrix of type ''\n", 140 | "\twith 9 stored elements in Compressed Sparse Row format>" 141 | ] 142 | }, 143 | "execution_count": 4, 144 | "metadata": {}, 145 | "output_type": "execute_result" 146 | } 147 | ], 148 | "source": [ 149 | "count_vec_fit" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "id": "64e134c2", 155 | "metadata": {}, 156 | "source": [ 157 | " 2.Get the feature names. This will return the tokens that are in the vocabulary of the vectorizer" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": 5, 163 | "id": "b9c92ac2", 164 | "metadata": {}, 165 | "outputs": [ 166 | { 167 | "data": { 168 | "text/plain": [ 169 | "array(['are', 'everybody', 'hello', 'how', 'students', 'today', 'what',\n", 170 | " 'you'], dtype=object)" 171 | ] 172 | }, 173 | "execution_count": 5, 174 | "metadata": {}, 175 | "output_type": "execute_result" 176 | } 177 | ], 178 | "source": [ 179 | "count_vec.get_feature_names_out()" 180 | ] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "id": "14c6b9a0", 185 | "metadata": {}, 186 | "source": [ 187 | " 3. Represent the token's mapping to it's id values. The numbers do *not* represent the count of the words but the position of the words in the matrix" 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": 6, 193 | "id": "0cf16fdc", 194 | "metadata": {}, 195 | "outputs": [ 196 | { 197 | "data": { 198 | "text/plain": [ 199 | "{'hello': 2,\n", 200 | " 'students': 4,\n", 201 | " 'how': 3,\n", 202 | " 'are': 0,\n", 203 | " 'you': 7,\n", 204 | " 'today': 5,\n", 205 | " 'what': 6,\n", 206 | " 'everybody': 1}" 207 | ] 208 | }, 209 | "execution_count": 6, 210 | "metadata": {}, 211 | "output_type": "execute_result" 212 | } 213 | ], 214 | "source": [ 215 | "count_vec.vocabulary_ " 216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "id": "d4f3fb63", 221 | "metadata": {}, 222 | "source": [ 223 | " 4. Get sparse representation on document level" 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": 7, 229 | "id": "1a70295b", 230 | "metadata": {}, 231 | "outputs": [ 232 | { 233 | "name": "stdout", 234 | "output_type": "stream", 235 | "text": [ 236 | "hello students!\n", 237 | " (0, 2)\t1\n", 238 | " (0, 4)\t1\n", 239 | "\n", 240 | "how are you today?\n", 241 | " (0, 3)\t1\n", 242 | " (0, 0)\t1\n", 243 | " (0, 7)\t1\n", 244 | " (0, 5)\t1\n", 245 | "\n", 246 | "what?\n", 247 | " (0, 6)\t1\n", 248 | "\n", 249 | "hello hello everybody\n", 250 | " (0, 2)\t2\n", 251 | " (0, 1)\t1\n", 252 | "\n" 253 | ] 254 | } 255 | ], 256 | "source": [ 257 | "for i, document in zip(count_vec_fit, texts):\n", 258 | " print(document)\n", 259 | " print(i)\n", 260 | " print()" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "id": "52645b99", 266 | "metadata": {}, 267 | "source": [ 268 | "a. Do you understand the output printed above? \n", 269 | "b. What happens if you change the `count` to a `tfidf` vectorizer? 
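One way to check (a small sketch reusing the toy texts from above): swap in `TfidfVectorizer` and inspect the dense output; the cells then hold tf-idf weights instead of raw counts.

```python
# Sketch: the same toy corpus, vectorized with tf-idf weights instead of counts.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["hello students!", "how are you today?", "what?", "hello hello everybody"]
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(texts)

# Each cell is now a weight: terms that occur in fewer documents get boosted.
print(pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vec.get_feature_names_out()))
```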
" 270 | ] 271 | } 272 | ], 273 | "metadata": { 274 | "kernelspec": { 275 | "display_name": "Python 3 (ipykernel)", 276 | "language": "python", 277 | "name": "python3" 278 | }, 279 | "language_info": { 280 | "codemirror_mode": { 281 | "name": "ipython", 282 | "version": 3 283 | }, 284 | "file_extension": ".py", 285 | "mimetype": "text/x-python", 286 | "name": "python", 287 | "nbconvert_exporter": "python", 288 | "pygments_lexer": "ipython3", 289 | "version": "3.9.6" 290 | } 291 | }, 292 | "nbformat": 4, 293 | "nbformat_minor": 5 294 | } 295 | -------------------------------------------------------------------------------- /2023/day4/exercises-vectorizers/exercise-text-to-features.md: -------------------------------------------------------------------------------- 1 | # Exercise 1: Working with textual data 2 | 3 | ### 0. Get the data. 4 | 5 | - Download `articles.tar.gz` from 6 | https://dx.doi.org/10.7910/DVN/ULHLCB 7 | 8 | If you experience difficulties downloading this (rather large) dataset, you can also download just a part of the data [here](https://surfdrive.surf.nl/files/index.php/s/bfNFkuUVoVtiyuk) 9 | 10 | - Unpack it. On Linux and MacOS, you can do this with `tar -xzf mydata.tar.gz` on the command line. On Windows, you may need an additional tool such as `7zip` for that (note that technically speaking, there is a `tar` archive within a `gz` archive, so unpacking may take *two* steps depending on your tool). 11 | 12 | 13 | ### 1. Inspect the structure of the dataset. 14 | What information do the following elements give you? 15 | 16 | - folder (directory) names 17 | - folder structure/hierarchy 18 | - file names 19 | - file contents 20 | 21 | ### 2. Discuss strategies for working with this dataset! 22 | 23 | - Which questions could you answer? 24 | - How could you deal with it, given the size and the structure? 25 | - How much memory1 (RAM) does your computer have? How large is the complete dataset? What does that mean? 26 | - Make a sketch (e.g., with pen&paper), how you could handle your workflow and your data to answer your question. 27 | 28 | 1 *memory* (RAM), not *storage* (harddisk)! 29 | 30 | ### 3. Read some (or all?) data 31 | 32 | Here is some example code that you can modify. Assuming that he folder `articles` is in the same folder as the notebook you are currently working on, you could, for instance, do the following to read a *part* of your dataset. 33 | 34 | ```python 35 | from glob import glob 36 | infowarsfiles = glob('articles/*/Infowars/*') 37 | infowarsarticles = [] 38 | for filename in infowarsfiles: 39 | with open(filename) as f: 40 | infowarsarticles.append(f.read()) 41 | 42 | ``` 43 | 44 | - Can you explain what the `glob` function does? 45 | - What does `infowarsfiles` contain, and what does `infowarsarticles` contain? First make an educated guess based on the code snippet, then check it! Do *not* print the whole thing, but use `len`, `type` en slicing `[:10]` to get the info ou need. 46 | 47 | - Tip: take a random sample of the articles for practice purposes (if your code works, you can scale up!) 48 | 49 | ``` 50 | # taking a random sample of the articles for practice purposes 51 | articles =random.sample(infowarsarticles, 10) 52 | ``` 53 | 54 | ### 4. Vectorize the data 55 | 56 | Imagine you want to train a classifier that will predict whether articles come from a fake news source (e.g., `Infowars`) or a quality news outlet (e.g., `bbc`). In other words, you want to predict `source` based on linguistic variations in the articles. 
57 | 58 | To arrive at a model that will do just that, you have to transform 'text' to 'features'. 59 | 60 | - Can you vectorize the data? Try defining different vectorizers. Consider the following options: 61 | - `count` vs. `tfidf` vectorizers 62 | - with/ without pruning 63 | - with/ without stopword removal 64 | 65 | ### 5. Fit a classifier 66 | 67 | - Try out a simple supervised model. Find some inspiration [here](possible-solution-exercise-day1.md). Can you predict the `source` using linguistic variations in the articles? 68 | 69 | - Which combination of pre-processing steps + vectorizer gives the best results? 70 | 71 | ### BONUS: Inceasing efficiency + reusability 72 | The approach under (3) gets you very far. 73 | But for those of you who want to go the extra mile, here are some suggestions for further improvements in handling such a large dataset, consisting of thousands of files, and for deeper thinking about data handling: 74 | 75 | - Consider writing a function to read the data. Let your function take three parameters as input, `basepath` (where is the folder with articles located?), `month` and `outlet`, and return the articles that match this criterion. 76 | - Even better, make it a *generator* that yields the articles instead of returning a whole list. 77 | - Consider yielding a dict (with date, outlet, and the article itself) instead of yielding only the article text. 78 | - Think of the most memory-efficient way to get an overview of how often a given regular expression R is mentioned per outlet! 79 | - Under which circumstances would you consider having your function for reading the data return a pandas dataframe? 80 | -------------------------------------------------------------------------------- /2023/day4/exercises-vectorizers/possible-solution-exercise-day1.md: -------------------------------------------------------------------------------- 1 | ## Exercise 1 Working with textual data - possible solutions 2 | 3 | ---------- 4 | 5 | ### Vectorize the data 6 | 7 | ```python 8 | from glob import glob 9 | import random 10 | 11 | def read_data(listofoutlets): 12 | texts = [] 13 | labels = [] 14 | for label in listofoutlets: 15 | for file in glob(f'articles/*/{label}/*'): 16 | with open(file) as f: 17 | texts.append(f.read()) 18 | labels.append(label) 19 | return texts, labels 20 | 21 | X, y = read_data(['Infowars', 'BBC']) #choose your own newsoutlets 22 | 23 | ``` 24 | 25 | 26 | ```python 27 | #split the dataset in a train and test sample 28 | from sklearn.model_selection import train_test_split 29 | X_train,X_test,y_train,y_test=train_test_split(X, y, test_size=0.2) 30 | ``` 31 | 32 | Define some vectorizers. 33 | You can try out different variations: 34 | - `count` versus `tfidf` 35 | - with/ without a stopword list 36 | - with / without pruning 37 | 38 | 39 | ```python 40 | from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer 41 | 42 | myvectorizer= CountVectorizer(stop_words=mystopwords) # you can further modify this yourself. 43 | 44 | #Fit the vectorizer, and transform. 
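# note (sketch): `mystopwords` is assumed to be defined beforehand, e.g.
# mystopwords = ["the", "a", "and", "of", "in"]
# some variations you could swap in for comparison (TfidfVectorizer is already
# imported above):
#   myvectorizer = TfidfVectorizer(stop_words=mystopwords)   # tf-idf weighting
#   myvectorizer = CountVectorizer(min_df=5, max_df=0.75)    # with pruning
#   myvectorizer = TfidfVectorizer()                          # no stopwords, no pruning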
45 | X_features_train = myvectorizer.fit_transform(X_train) 46 | X_features_test = myvectorizer.transform(X_test) 47 | 48 | ``` 49 | ### Build a simple classifier 50 | 51 | Now, lets build a simple classifier and predict outlet based on textual features: 52 | 53 | ```python 54 | from sklearn.naive_bayes import MultinomialNB 55 | from sklearn.metrics import accuracy_score 56 | from sklearn.metrics import classification_report 57 | 58 | model = MultinomialNB() 59 | model.fit(X_features_train, y_train) 60 | y_pred = model.predict(X_features_test) 61 | 62 | print(f"Accuracy : {accuracy_score(y_test, y_pred)}") 63 | print(classification_report(y_test, y_pred)) 64 | 65 | ``` 66 | 67 | Can you improve this classifier when using different vectorizers? 68 | -------------------------------------------------------------------------------- /2023/day4/regex_examples.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "a4bf8020", 6 | "metadata": {}, 7 | "source": [ 8 | "`.` matches any character\n", 9 | "\n", 10 | "`*` the expression before occurs 0 or more times\n", 11 | "\n", 12 | "`+` the expression before occurs 1 or more times\n", 13 | " " 14 | ] 15 | }, 16 | { 17 | "cell_type": "code", 18 | "execution_count": 4, 19 | "id": "5830422c", 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "import re" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": 63, 29 | "id": "ac3fad21", 30 | "metadata": {}, 31 | "outputs": [ 32 | { 33 | "data": { 34 | "text/plain": [ 35 | "'** ** you hello'" 36 | ] 37 | }, 38 | "execution_count": 63, 39 | "metadata": {}, 40 | "output_type": "execute_result" 41 | } 42 | ], 43 | "source": [ 44 | "pattern = r\"[A-Z]+\"\n", 45 | "example_string = \"HOW ARE you hello\"\n", 46 | "subs = \"**\"\n", 47 | "\n", 48 | "re.sub(pattern, subs, example_string)" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": 72, 54 | "id": "444084cc", 55 | "metadata": {}, 56 | "outputs": [ 57 | { 58 | "data": { 59 | "text/plain": [ 60 | "[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '!!']" 61 | ] 62 | }, 63 | "execution_count": 72, 64 | "metadata": {}, 65 | "output_type": "execute_result" 66 | } 67 | ], 68 | "source": [ 69 | "#pattern= r\"[^a-zA-Z]+\" #other empty strings represent the sequences of characters that are not alphabetic characters but are separated by spaces\n", 70 | "#pattern = r\"\\d+\"\n", 71 | "pattern = r\"\\W+\"\n", 72 | "example_string = \"a sentence with stuff in 052953 and so forth!!\"\n", 73 | "\n", 74 | "re.findall(pattern, example_string)" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 13, 80 | "id": "b59362d5", 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [ 84 | "## r indicates that this is a raw string: backslashes are treated as literal characters and not escape characters" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 78, 90 | "id": "dd55b547", 91 | "metadata": {}, 92 | "outputs": [ 93 | { 94 | "data": { 95 | "text/plain": [ 96 | "['RT @TimSenders']" 97 | ] 98 | }, 99 | "execution_count": 78, 100 | "metadata": {}, 101 | "output_type": "execute_result" 102 | } 103 | ], 104 | "source": [ 105 | "pattern = 'RT ?:? 
@[a-zA-Z]*'\n", 106 | "example_string = 'iewjogejiwojg RT @TimSenders395 iegwjo'\n", 107 | "\n", 108 | "re.findall(pattern, example_string)" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 88, 114 | "id": "0ad26e5f", 115 | "metadata": {}, 116 | "outputs": [ 117 | { 118 | "data": { 119 | "text/plain": [ 120 | "['ABN', 'amro', 'ABNAMRO', 'abn amro']" 121 | ] 122 | }, 123 | "execution_count": 88, 124 | "metadata": {}, 125 | "output_type": "execute_result" 126 | } 127 | ], 128 | "source": [ 129 | "test_string = 'ABN and also amro. ABNAMRO and abn amro'\n", 130 | "pattern = r'\\b(ABN\\s+AMRO|ABNAMRO|ABN|AMRO)\\b'\n", 131 | "\n", 132 | "re.findall(pattern, test_string, re.IGNORECASE)" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": 80, 138 | "id": "d22dcd64", 139 | "metadata": {}, 140 | "outputs": [ 141 | { 142 | "data": { 143 | "text/plain": [ 144 | "'A DUTCH BANK and also A DUTCH BANK. A DUTCH BANK and A DUTCH BANK'" 145 | ] 146 | }, 147 | "execution_count": 80, 148 | "metadata": {}, 149 | "output_type": "execute_result" 150 | } 151 | ], 152 | "source": [ 153 | "replacement_string = 'A DUTCH BANK'\n", 154 | "re.sub(pattern, replacement_string, test_string, flags=re.IGNORECASE)" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "id": "dde41d36", 160 | "metadata": {}, 161 | "source": [ 162 | "`\\b:` Word boundary anchor.\n", 163 | "\n", 164 | "ABN`\\s+`AMRO: Matches \"ABN AMRO\" with one or more spaces in between." 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": null, 170 | "id": "15dd18de", 171 | "metadata": {}, 172 | "outputs": [], 173 | "source": [] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "id": "efbd0a77", 178 | "metadata": {}, 179 | "source": [ 180 | "## Making a custom regex application: printing matches in context" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": 17, 186 | "id": "5e04500f", 187 | "metadata": {}, 188 | "outputs": [], 189 | "source": [ 190 | "# make some example data\n", 191 | "texts = ['The top Republican on the House Intelligence Committee says he is prepared to impeach the head of the FBI and Deputy Attorney General if he doesnt get a two-page document he says prompted the Russia investigation.\\n\\nJust the fact that theyre not giving this to us tells me theres something wrong here, California Republican Rep. Devin Nunes told Fox News host Laura Ingraham on the The Ingraham Angle Tuesday night.\\n\\nI can tell you that were not just going to hold in contempt, we will have a plan to ',\n", 192 | " 'The Gulf Coast is preparing as Tropical Storm Michael developed in the Caribbean Sea and is expected to strengthen into a hurricane before making landfall around the middle of this week.\\n\\nFlorida Gov. Rick Scott ordered activation of the State Emergency Operations Center in Tallahassee to enhance coordination between federal, state and local agencies.\\n\\nOur state understands how serious tropical weather is and how devastating any hurricane or tropical storm can be, Scott said. As we continue to m',\n", 193 | " 'YouTube star Candace Owens says there is a card more valuable than VISA or AMERICAN EXPRESS called the black card.',\n", 194 | " 'Donald Trump can claim another victory after Mexican authorities agreed to disband the illegal alien caravans working their way through Mexico towards America.\\n\\nMexican immigration authorities said they plan on disbanding the Central American caravan by Wednesday in Oaxaca. 
The most vulnerable will get humanitarian visas, tweeted BuzzFeed reporter Adolfo Flores.\\n\\nEveryone else in the caravan, which has traveled through Mexico for days from Chiapas, will have to petition the Mexican government fo']" 195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": 19, 200 | "id": "d40df1dd", 201 | "metadata": {}, 202 | "outputs": [ 203 | { 204 | "name": "stdout", 205 | "output_type": "stream", 206 | "text": [ 207 | "\n" 208 | ] 209 | } 210 | ], 211 | "source": [ 212 | "# first show the principle:\n", 213 | "for r in re.finditer(r\"[A-Z][A-Z]+\", texts[0]): # words with two or more capital letters\n", 214 | " print(r)" 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": 21, 220 | "id": "f6b58a3e", 221 | "metadata": {}, 222 | "outputs": [ 223 | { 224 | "name": "stdout", 225 | "output_type": "stream", 226 | "text": [ 227 | "Now processing text number 0...\n", 228 | "ach the head of the FBI and Deputy Attorney\n", 229 | "\n", 230 | "**********************************\n", 231 | "\n", 232 | "Now processing text number 1...\n", 233 | "\n", 234 | "**********************************\n", 235 | "\n", 236 | "Now processing text number 2...\n", 237 | " more valuable than VISA or AMERICAN EXPRESS\n", 238 | "luable than VISA or AMERICAN EXPRESS called the \n", 239 | "an VISA or AMERICAN EXPRESS called the black ca\n", 240 | "\n", 241 | "**********************************\n", 242 | "\n", 243 | "Now processing text number 3...\n", 244 | "\n", 245 | "**********************************\n", 246 | "\n" 247 | ] 248 | } 249 | ], 250 | "source": [ 251 | "# let's exploit the fact that span() gives us first and last index (=position) within\n", 252 | "# the string\n", 253 | "# so we print the matched string +/- 20 characters\n", 254 | "for number, text in enumerate(texts):\n", 255 | " print(f\"Now processing text number {number}...\")\n", 256 | " for r in re.finditer(r\"[A-Z][A-Z]+\", text):\n", 257 | " print(text[r.span()[0]-20:r.span()[1]+20])\n", 258 | " print('\\n**********************************\\n')" 259 | ] 260 | } 261 | ], 262 | "metadata": { 263 | "kernelspec": { 264 | "display_name": "Python 3 (ipykernel)", 265 | "language": "python", 266 | "name": "python3" 267 | }, 268 | "language_info": { 269 | "codemirror_mode": { 270 | "name": "ipython", 271 | "version": 3 272 | }, 273 | "file_extension": ".py", 274 | "mimetype": "text/x-python", 275 | "name": "python", 276 | "nbconvert_exporter": "python", 277 | "pygments_lexer": "ipython3", 278 | "version": "3.8.10" 279 | } 280 | }, 281 | "nbformat": 4, 282 | "nbformat_minor": 5 283 | } 284 | -------------------------------------------------------------------------------- /2023/day4/slides-04-1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2023/day4/slides-04-1.pdf -------------------------------------------------------------------------------- /2023/day4/slides-04-1.tex: -------------------------------------------------------------------------------- 1 | % !TeX document-id = {f19fb972-db1f-447e-9d78-531139c30778} 2 | % !BIB program = biber 3 | 4 | %\documentclass[handout]{beamer} 5 | \documentclass[compress]{beamer} 6 | \usepackage[T1]{fontenc} 7 | \usetheme[block=fill,subsectionpage=progressbar,sectionpage=progressbar]{metropolis} 8 | \usepackage{graphicx} 9 | 10 | \usepackage{wasysym} 11 | \usepackage{etoolbox} 12 | \usepackage[utf8]{inputenc} 13 | 14 | 
\usepackage{threeparttable} 15 | \usepackage{subcaption} 16 | 17 | \usepackage{tikz-qtree} 18 | \setbeamercovered{still covered={\opaqueness<1->{5}},again covered={\opaqueness<1->{100}}} 19 | 20 | 21 | % color-coded listings; replace those above 22 | \usepackage{xcolor} 23 | \usepackage{minted} 24 | \definecolor{listingbg}{rgb}{0.87,0.93,1} 25 | \setminted[python]{ 26 | frame=none, 27 | framesep=1mm, 28 | baselinestretch=1, 29 | bgcolor=listingbg, 30 | fontsize=\scriptsize, 31 | linenos, 32 | breaklines 33 | } 34 | 35 | 36 | 37 | \usepackage{listings} 38 | 39 | \lstset{ 40 | basicstyle=\scriptsize\ttfamily, 41 | columns=flexible, 42 | breaklines=true, 43 | numbers=left, 44 | %stepsize=1, 45 | numberstyle=\tiny, 46 | backgroundcolor=\color[rgb]{0.85,0.90,1} 47 | } 48 | 49 | 50 | \lstnewenvironment{lstlistingoutput}{\lstset{basicstyle=\footnotesize\ttfamily, 51 | columns=flexible, 52 | breaklines=true, 53 | numbers=left, 54 | %stepsize=1, 55 | numberstyle=\tiny, 56 | backgroundcolor=\color[rgb]{.7,.7,.7}}}{} 57 | 58 | 59 | \lstnewenvironment{lstlistingoutputtiny}{\lstset{basicstyle=\tiny\ttfamily, 60 | columns=flexible, 61 | breaklines=true, 62 | numbers=left, 63 | %stepsize=1, 64 | numberstyle=\tiny, 65 | backgroundcolor=\color[rgb]{.7,.7,.7}}}{} 66 | 67 | 68 | 69 | \usepackage[american]{babel} 70 | \usepackage{csquotes} 71 | \usepackage[style=apa, backend = biber]{biblatex} 72 | \DeclareLanguageMapping{american}{american-UoN} 73 | \addbibresource{../../literature.bib} 74 | \renewcommand*{\bibfont}{\tiny} 75 | 76 | \usepackage{tikz} 77 | \usetikzlibrary{shapes,arrows,matrix} 78 | \usepackage{multicol} 79 | 80 | \usepackage{subcaption} 81 | 82 | \usepackage{booktabs} 83 | \usepackage{graphicx} 84 | 85 | 86 | 87 | \makeatletter 88 | \setbeamertemplate{headline}{% 89 | \begin{beamercolorbox}[colsep=1.5pt]{upper separation line head} 90 | \end{beamercolorbox} 91 | \begin{beamercolorbox}{section in head/foot} 92 | \vskip2pt\insertnavigation{\paperwidth}\vskip2pt 93 | \end{beamercolorbox}% 94 | \begin{beamercolorbox}[colsep=1.5pt]{lower separation line head} 95 | \end{beamercolorbox} 96 | } 97 | \makeatother 98 | 99 | 100 | 101 | \setbeamercolor{section in head/foot}{fg=normal text.bg, bg=structure.fg} 102 | 103 | 104 | 105 | \newcommand{\question}[1]{ 106 | \begin{frame}[plain] 107 | \begin{columns} 108 | \column{.3\textwidth} 109 | \makebox[\columnwidth]{ 110 | \includegraphics[width=\columnwidth,height=\paperheight,keepaspectratio]{../../pictures/mannetje.png}} 111 | \column{.7\textwidth} 112 | \large 113 | \textcolor{orange}{\textbf{\emph{#1}}} 114 | \end{columns} 115 | \end{frame}} 116 | 117 | 118 | 119 | \title[Teach-the-teacher: Python]{\textbf{Teach-the-teacher: Python} 120 | \\Day 4: »Processing textual data // NLP« } 121 | \author[Anne Kroon]{Anne Kroon\\ \footnotesize{a.c.kroon@uva.nl}} 122 | \date{December 4, 2023} 123 | \institute[UvA CW]{UvA RM Communication Science} 124 | 125 | 126 | \begin{document} 127 | 128 | \begin{frame}{} 129 | \titlepage{\tiny } 130 | \end{frame} 131 | 132 | \begin{frame}{Today} 133 | \tableofcontents 134 | \end{frame} 135 | 136 | 137 | \section{Bottom-up vs. top-down} 138 | 139 | \begin{frame}[standout] 140 | Automated content analysis can be either \textcolor{red}{bottom-up} (inductive, explorative, pattern recognition, \ldots) or \textcolor{red}{top-down} (deductive, based on a-priori developed rules, \ldots). Or in between. 
141 | \end{frame} 142 | 143 | 144 | \begin{frame}{The ACA toolbox} 145 | \makebox[\columnwidth]{ 146 | \includegraphics[width=\columnwidth,height=\paperheight,keepaspectratio]{../../media/boumanstrilling2016}} 147 | \\ 148 | \cite{Boumans2016} 149 | \end{frame} 150 | 151 | 152 | \begin{frame}{Bottom-up vs. top-down} 153 | \begin{block}{Bottom-up} 154 | \begin{itemize} 155 | \item Count most frequently occurring words 156 | \item Maybe better: Count combinations of words $\Rightarrow$ Which words co-occur together? 157 | \end{itemize} 158 | We \emph{don't} specify what to look for in advance 159 | \end{block} 160 | 161 | \onslide<2>{ 162 | \begin{block}{Top-down} 163 | \begin{itemize} 164 | \item Count frequencies of pre-defined words 165 | \item Maybe better: patterns instead of words 166 | \end{itemize} 167 | We \emph{do} specify what to look for in advance 168 | \end{block} 169 | } 170 | \end{frame} 171 | 172 | 173 | \begin{frame}[fragile]{A simple bottom-up approach} 174 | \begin{lstlisting} 175 | from collections import Counter 176 | 177 | texts = ["I really really really love him, I do", "I hate him"] 178 | 179 | for t in texts: 180 | print(Counter(t.split()).most_common(3)) 181 | \end{lstlisting} 182 | \begin{lstlistingoutput} 183 | [('really', 3), ('I', 2), ('love', 1)] 184 | [('I', 1), ('hate', 1), ('him', 1)] 185 | \end{lstlistingoutput} 186 | \end{frame} 187 | 188 | 189 | \begin{frame}[fragile]{A simple top-down approach} 190 | \begin{lstlisting} 191 | texts = ["I really really really love him, I do", "I hate him"] 192 | features = ['really', 'love', 'hate'] 193 | 194 | for t in texts: 195 | print(f"\nAnalyzing '{t}':") 196 | for f in features: 197 | print(f"{f} occurs {t.count(f)} times") 198 | \end{lstlisting} 199 | \begin{lstlistingoutput} 200 | Analyzing 'I really really really love him, I do': 201 | really occurs 3 times 202 | love occurs 1 times 203 | hate occurs 0 times 204 | 205 | Analyzing 'I hate him': 206 | really occurs 0 times 207 | love occurs 0 times 208 | hate occurs 1 times 209 | 210 | \end{lstlistingoutput} 211 | \end{frame} 212 | 213 | \question{When would you use which approach?} 214 | 215 | 216 | \begin{frame}{Some considerations} 217 | \begin{itemize}[<+->] 218 | \item Both can have a place in your workflow (e.g., bottom-up as first exploratory step) 219 | \item You have a clear theoretical expectation? Bottom-up makes little sense. 220 | \item But in any case: you need to transform your text into something ``countable''. 
221 | \end{itemize} 222 | \end{frame} 223 | 224 | 225 | \input{../../modules/working-with-text/basic-string-operations.tex} 226 | \input{../../modules/working-with-text/bow.tex} 227 | 228 | \begin{frame}[fragile]{General approach} 229 | \Large 230 | 231 | \textcolor{red}{Test on a single string, then make a for loop or list comprehension!} 232 | 233 | \pause 234 | 235 | \normalsize 236 | 237 | \begin{alertblock}{Own functions} 238 | If it gets more complex, you can write your ow= function and then use it in the list comprehension: 239 | \begin{lstlisting} 240 | def mycleanup(t): 241 | # do sth with string t here, create new string t2 242 | return t2 243 | 244 | results = [mycleanup(t) for t in allmytexts] 245 | \end{lstlisting} 246 | \end{alertblock} 247 | \end{frame} 248 | 249 | 250 | \begin{frame}[fragile]{Pandas string methods as alternative} 251 | If you select column with strings from a pandas dataframe, pandas offers a collection of string methods (via \texttt{.str.}) that largely mirror standard Python string methods: 252 | 253 | \begin{lstlisting} 254 | df['newcoloumnwithresults'] = df['columnwithtext'].str.count("bla") 255 | \end{lstlisting} 256 | 257 | 258 | \pause 259 | 260 | \begin{alertblock}{To pandas or not to pandas for text?} 261 | Partly a matter of taste. 262 | 263 | Not-too-large dataset with a lot of extra columns? Advanced statistical analysis planned? Sounds like pandas. 264 | 265 | It's mainly a lot of text? Wanna do some machine learning later on anyway? It's large and (potentially) messy? Doesn't sound like pandas is a good idea. 266 | \end{alertblock} 267 | 268 | \end{frame} 269 | 270 | 271 | 272 | %\begin{frame}[plain] 273 | % \printbibliography 274 | %\end{frame} 275 | 276 | 277 | 278 | \end{document} 279 | 280 | 281 | 282 | -------------------------------------------------------------------------------- /2023/day4/slides-04-2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2023/day4/slides-04-2.pdf -------------------------------------------------------------------------------- /2023/day4/slides-04-2.tex: -------------------------------------------------------------------------------- 1 | % !TeX document-id = {f19fb972-db1f-447e-9d78-531139c30778} 2 | % !BIB program = biber 3 | 4 | %\documentclass[handout]{beamer} 5 | \documentclass[compress]{beamer} 6 | \usepackage[T1]{fontenc} 7 | \usetheme[block=fill,subsectionpage=progressbar,sectionpage=progressbar]{metropolis} 8 | \usepackage{graphicx} 9 | 10 | \usepackage{wasysym} 11 | \usepackage{etoolbox} 12 | \usepackage[utf8]{inputenc} 13 | 14 | \usepackage{threeparttable} 15 | \usepackage{subcaption} 16 | 17 | \usepackage{tikz-qtree} 18 | \setbeamercovered{still covered={\opaqueness<1->{5}},again covered={\opaqueness<1->{100}}} 19 | 20 | 21 | \usepackage{listings} 22 | 23 | \lstset{ 24 | basicstyle=\scriptsize\ttfamily, 25 | columns=flexible, 26 | breaklines=true, 27 | numbers=left, 28 | %stepsize=1, 29 | numberstyle=\tiny, 30 | backgroundcolor=\color[rgb]{0.85,0.90,1} 31 | } 32 | 33 | 34 | 35 | \lstnewenvironment{lstlistingoutput}{\lstset{basicstyle=\footnotesize\ttfamily, 36 | columns=flexible, 37 | breaklines=true, 38 | numbers=left, 39 | %stepsize=1, 40 | numberstyle=\tiny, 41 | backgroundcolor=\color[rgb]{.7,.7,.7}}}{} 42 | 43 | 44 | \lstnewenvironment{lstlistingoutputtiny}{\lstset{basicstyle=\tiny\ttfamily, 45 | columns=flexible, 46 | breaklines=true, 47 | numbers=left, 48 
| %stepsize=1, 49 | numberstyle=\tiny, 50 | backgroundcolor=\color[rgb]{.7,.7,.7}}}{} 51 | 52 | 53 | 54 | \usepackage[american]{babel} 55 | \usepackage{csquotes} 56 | \usepackage[style=apa, backend = biber]{biblatex} 57 | \DeclareLanguageMapping{american}{american-UoN} 58 | \addbibresource{../references.bib} 59 | \renewcommand*{\bibfont}{\tiny} 60 | 61 | \usepackage{tikz} 62 | \usetikzlibrary{shapes,arrows,matrix} 63 | \usepackage{multicol} 64 | 65 | \usepackage{subcaption} 66 | 67 | \usepackage{booktabs} 68 | \usepackage{graphicx} 69 | 70 | 71 | 72 | \makeatletter 73 | \setbeamertemplate{headline}{% 74 | \begin{beamercolorbox}[colsep=1.5pt]{upper separation line head} 75 | \end{beamercolorbox} 76 | \begin{beamercolorbox}{section in head/foot} 77 | \vskip2pt\insertnavigation{\paperwidth}\vskip2pt 78 | \end{beamercolorbox}% 79 | \begin{beamercolorbox}[colsep=1.5pt]{lower separation line head} 80 | \end{beamercolorbox} 81 | } 82 | \makeatother 83 | 84 | 85 | 86 | \setbeamercolor{section in head/foot}{fg=normal text.bg, bg=structure.fg} 87 | 88 | 89 | 90 | \newcommand{\question}[1]{ 91 | \begin{frame}[plain] 92 | \begin{columns} 93 | \column{.3\textwidth} 94 | \makebox[\columnwidth]{ 95 | \includegraphics[width=\columnwidth,height=\paperheight,keepaspectratio]{../pictures/mannetje.png}} 96 | \column{.7\textwidth} 97 | \large 98 | \textcolor{orange}{\textbf{\emph{#1}}} 99 | \end{columns} 100 | \end{frame}} 101 | 102 | 103 | \title[Teach-the-teacher: Python]{\textbf{Teach-the-teacher: Python} 104 | \\Day 4: » Advanced NLP \& Regular Expressions « } 105 | \author[Anne Kroon]{Anne Kroon\\ \footnotesize{a.c.kroon@uva.nl}} 106 | \date{December 4, 2023} 107 | \institute[UvA CW]{UvA RM Communication Science} 108 | 109 | 110 | \begin{document} 111 | 112 | \begin{frame}{} 113 | \titlepage 114 | \end{frame} 115 | 116 | \begin{frame}{Today} 117 | \tableofcontents 118 | \end{frame} 119 | 120 | 121 | \section{Advanced NLP} 122 | 123 | \subsection{Parsing sentences} 124 | \begin{frame}{NLP: What and why?} 125 | \begin{block}{Why parse sentences?} 126 | \begin{itemize} 127 | \item To find out what grammatical function words have 128 | \item and to get closer to the meaning. 129 | \end{itemize} 130 | \end{block} 131 | \end{frame} 132 | 133 | \begin{frame}[fragile]{Parsing a sentence using NLTK} 134 | Tokenize a sentence, and ``tag'' the tokenized sentence: 135 | \begin{lstlisting} 136 | tokens = nltk.word_tokenize(sentence) 137 | tagged = nltk.pos_tag(tokens) 138 | print (tagged[0:6]) 139 | \end{lstlisting} 140 | gives you the following: 141 | \begin{lstlisting} 142 | [('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'), 143 | ('Thursday', 'NNP'), ('morning', 'NN')] 144 | \end{lstlisting} 145 | 146 | \onslide<2->{ 147 | And you could get the word type of "morning" with \texttt{tagged[5][1]}! 148 | } 149 | 150 | \end{frame} 151 | 152 | 153 | \begin{frame}[fragile]{Named Entity Recognition with spacy} 154 | Terminal: 155 | 156 | \begin{lstlisting} 157 | sudo pip3 install spacy 158 | sudo python3 -m spacy download nl # or en, de, fr .... 
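# note: in newer spaCy versions (v3+), models are addressed by their full
# pipeline name rather than the bare language code, e.g.
# sudo python3 -m spacy download nl_core_news_sm
# (and then spacy.load("nl_core_news_sm") in the Python code below)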
159 | \end{lstlisting} 160 | 161 | Python: 162 | 163 | \begin{lstlisting} 164 | import spacy 165 | nlp = spacy.load('nl') 166 | doc = nlp('Een 38-jarige vrouw uit Zeist en twee mannen moeten 24 maanden de cel in voor de gecoordineerde oplichting van Rabobank-klanten.') 167 | for ent in doc.ents: 168 | print(ent.text,ent.label_) 169 | \end{lstlisting} 170 | 171 | returns: 172 | 173 | \begin{lstlisting} 174 | Zeist LOC 175 | Rabobank ORG 176 | \end{lstlisting} 177 | 178 | \end{frame} 179 | 180 | 181 | 182 | \begin{frame}{More NLP} 183 | \url{http://nlp.stanford.edu} 184 | \url{http://spacy.io} 185 | \url{http://nltk.org} 186 | \url{https://www.clips.uantwerpen.be/pattern} 187 | \end{frame} 188 | 189 | 190 | 191 | \begin{frame}{Main takeaway} 192 | 193 | \begin{itemize} 194 | % \item It matters how you transform your text into numbers (``vectorization''). 195 | \item Preprocessing matters, be able to make informed choices. 196 | \item Keep this in mind when moving to Machine Learning. 197 | \end{itemize} 198 | \end{frame} 199 | 200 | 201 | \section[Regular expressions]{ACA using regular expressions} 202 | 203 | \begin{frame} 204 | Automated content analysis using regular expressions 205 | \end{frame} 206 | 207 | 208 | \subsection{What is a regexp?} 209 | \begin{frame}{Regular Expressions: What and why?} 210 | \begin{block}{What is a regexp?} 211 | \begin{itemize} 212 | \item<1-> a \emph{very} widespread way to describe patterns in strings 213 | \item<2-> Think of wildcards like {\tt{*}} or operators like {\tt{OR}}, {\tt{AND}} or {\tt{NOT}} in search strings: a regexp does the same, but is \emph{much} more powerful 214 | \item<3-> You can use them in many editors (!), in the Terminal, in STATA \ldots and in Python 215 | \end{itemize} 216 | \end{block} 217 | \end{frame} 218 | 219 | \begin{frame}{An example} 220 | \begin{block}{Regex example} 221 | \begin{itemize} 222 | \item Let's say we wanted to remove everything but words from a tweet 223 | \item We could do so by calling the \texttt{.replace()} method 224 | \item We could do this with a regular expression as well: \\ 225 | {\tt{ \lbrack \^{}a-zA-Z\rbrack}} would match anything that is not a letter 226 | \end{itemize} 227 | \end{block} 228 | \end{frame} 229 | 230 | \begin{frame}{Basic regexp elements} 231 | \begin{block}{Alternatives} 232 | \begin{description} 233 | \item[{\tt{\lbrack TtFf\rbrack}}] matches either T or t or F or f 234 | \item[{\tt{Twitter|Facebook}}] matches either Twitter or Facebook 235 | \item[{\tt{.}}] matches any character 236 | \end{description} 237 | \end{block} 238 | \begin{block}{Repetition}<2-> 239 | \begin{description} 240 | \item[{\tt{*}}] the expression before occurs 0 or more times 241 | \item[{\tt{+}}] the expression before occurs 1 or more times 242 | \end{description} 243 | \end{block} 244 | \end{frame} 245 | 246 | \begin{frame}{regexp quizz} 247 | \begin{block}{Which words would be matched?} 248 | \tt 249 | \begin{enumerate} 250 | \item<1-> \lbrack Pp\rbrack ython 251 | \item<2-> \lbrack A-Z\rbrack + 252 | \item<3-> RT ?:? @\lbrack a-zA-Z0-9\rbrack * 253 | \end{enumerate} 254 | \end{block} 255 | \end{frame} 256 | 257 | \begin{frame}{What else is possible?} 258 | See the table in the book! 
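A few common building blocks you will run into (not exhaustive):
\begin{description}
\item[{\tt{?}}] the expression before occurs 0 or 1 times
\item[{\tt{\{2,5\}}}] the expression before occurs 2 to 5 times
\item[{\tt{\lbrack \^{}a-z\rbrack}}] matches anything that is \emph{not} a lowercase letter
\item[{\tt{( )}}] groups (and captures) part of a pattern
\end{description}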
259 | \end{frame} 260 | 261 | \subsection{Using a regexp in Python} 262 | \begin{frame}{How to use regular expressions in Python} 263 | \begin{block}{The module \texttt{re}*} 264 | \begin{description} 265 | \item<1->[{\tt{re.findall("\lbrack Tt\rbrack witter|\lbrack Ff\rbrack acebook",testo)}}] returns a list with all occurrences of Twitter or Facebook in the string called {\tt{testo}} 266 | \item<1->[{\tt{re.findall("\lbrack 0-9\rbrack +\lbrack a-zA-Z\rbrack +",testo)}}] returns a list with all words that start with one or more numbers followed by one or more letters in the string called {\tt{testo}} 267 | \item<2->[{\tt{re.sub("\lbrack Tt\rbrack witter|\lbrack Ff\rbrack acebook","a social medium",testo)}}] returns a string in which all occurrences of Twitter or Facebook are replaced by "a social medium" 268 | \end{description} 269 | \end{block} 270 | 271 | \tiny{Use the less-known but more powerful module \texttt{regex} instead to support all dialects used in the book} 272 | \end{frame} 273 | 274 | 275 | \begin{frame}[fragile]{How to use regular expressions in Python} 276 | \begin{block}{The module re} 277 | \begin{description} 278 | \item<1->[{\tt{re.match(" +(\lbrack 0-9\rbrack +) of (\lbrack 0-9\rbrack +) points",line)}}] returns \texttt{None} unless it \emph{exactly} matches the string \texttt{line}. If it does, you can access the part between () with the \texttt{.group()} method. 279 | \end{description} 280 | \end{block} 281 | 282 | Example: 283 | \begin{lstlisting} 284 | line=" 2 of 25 points" 285 | result=re.match(" +([0-9]+) of ([0-9]+) points",line) 286 | if result: 287 | print(f"Your points: {result.group(1)}, Maximum points: {result.group(2)}") 288 | \end{lstlisting} 289 | Your points: 2, Maximum points: 25 290 | \end{frame} 291 | 292 | 293 | 294 | \begin{frame}{Possible applications} 295 | \begin{block}{Data preprocessing} 296 | \begin{itemize} 297 | \item Remove unwanted characters, words, \ldots 298 | \item Identify \emph{meaningful} bits of text: usernames, headlines, where an article starts, \ldots 299 | \item Filter (distinguish relevant from irrelevant cases) 300 | \end{itemize} 301 | \end{block} 302 | \end{frame} 303 | 304 | 305 | \begin{frame}{Possible applications} 306 | \begin{block}{Data analysis: Automated coding} 307 | \begin{itemize} 308 | \item Actors 309 | \item Brands 310 | \item Links or other markers that follow a regular pattern 311 | \item Numbers (!) 
312 | \end{itemize} 313 | \end{block} 314 | \end{frame} 315 | 316 | \begin{frame}[fragile,plain]{Example 1: Counting actors} 317 | \begin{lstlisting} 318 | import re, csv 319 | from glob import glob 320 | count1_list=[] 321 | count2_list=[] 322 | filename_list = glob("/home/damian/articles/*.txt") 323 | 324 | for fn in filename_list: 325 | with open(fn) as fi: 326 | artikel = fi.read() 327 | artikel = artikel.replace('\n',' ') 328 | 329 | count1 = len(re.findall('Israel.*(minister|politician.*|[Aa]uthorit)',artikel)) 330 | count2 = len(re.findall('[Pp]alest',artikel)) 331 | 332 | count1_list.append(count1) 333 | count2_list.append(count2) 334 | 335 | output=zip(filename_list,count1_list, count2_list) 336 | with open("results.csv", mode='w',encoding="utf-8") as fo: 337 | writer = csv.writer(fo) 338 | writer.writerows(output) 339 | \end{lstlisting} 340 | \end{frame} 341 | 342 | 343 | 344 | 345 | \begin{frame}[fragile]{Example 2: Which number has this Lexis Nexis article?} 346 | \begin{lstlisting} 347 | All Rights Reserved 348 | 349 | 2 of 200 DOCUMENTS 350 | 351 | De Telegraaf 352 | 353 | 21 maart 2014 vrijdag 354 | 355 | Brussel bereikt akkoord aanpak probleembanken; 356 | ECB krijgt meer in melk te brokkelen 357 | 358 | SECTION: Finance; Blz. 24 359 | LENGTH: 660 woorden 360 | 361 | BRUSSEL Europa heeft gisteren op de valreep een akkoord bereikt 362 | over een saneringsfonds voor banken. Daarmee staat de laatste 363 | \end{lstlisting} 364 | 365 | \end{frame} 366 | 367 | \begin{frame}[fragile]{Example 2: Check the number of a lexis nexis article} 368 | \begin{lstlisting} 369 | All Rights Reserved 370 | 371 | 2 of 200 DOCUMENTS 372 | 373 | De Telegraaf 374 | 375 | 21 maart 2014 vrijdag 376 | 377 | Brussel bereikt akkoord aanpak probleembanken; 378 | ECB krijgt meer in melk te brokkelen 379 | 380 | SECTION: Finance; Blz. 24 381 | LENGTH: 660 woorden 382 | 383 | BRUSSEL Europa heeft gisteren op de valreep een akkoord bereikt 384 | over een saneringsfonds voor banken. Daarmee staat de laatste 385 | \end{lstlisting} 386 | 387 | \begin{lstlisting} 388 | for line in tekst: 389 | matchObj=re.match(r" +([0-9]+) of ([0-9]+) DOCUMENTS",line) 390 | if matchObj: 391 | numberofarticle= int(matchObj.group(1)) 392 | totalnumberofarticles= int(matchObj.group(2)) 393 | \end{lstlisting} 394 | \end{frame} 395 | 396 | 397 | \begin{frame}{Practice yourself!} 398 | Let's take some time to write some regular expressions. 
399 | Write a script that 400 | \begin{itemize} 401 | \item extracts URLS form a list of strings 402 | \item removes everything that is not a letter or number from a list of strings 403 | \end{itemize} 404 | (first develop it for a single string, then scale up) 405 | 406 | More tips: 407 | \huge{\url{http://www.pyregex.com/}} 408 | \end{frame} 409 | 410 | 411 | 412 | %\begin{frame}[plain] 413 | % \printbibliography 414 | %\end{frame} 415 | 416 | 417 | 418 | \end{document} 419 | 420 | 421 | 422 | -------------------------------------------------------------------------------- /2023/day5/Day 5 - Machine Learning - Afternoon.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2023/day5/Day 5 - Machine Learning - Afternoon.pdf -------------------------------------------------------------------------------- /2023/day5/Day 5 - Machine Learning - Morning.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uvacw/teachteacher-python/38f12ca955e3a42ea7fb10800eccc60898da64f2/2023/day5/Day 5 - Machine Learning - Morning.pdf -------------------------------------------------------------------------------- /2023/day5/Day 5 Take-aways.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "80deb237-7fb5-4ecd-8186-05b1dfde3014", 6 | "metadata": {}, 7 | "source": [ 8 | "# Supervised Machine Learning \n", 9 | "\n", 10 | "## Main take aways:\n", 11 | "* Congrats on obtaining your driver's license! Now, please get out on the road and learn how to drive :) \n", 12 | "* Don't be impressed - you can certainly do it.\n", 13 | "* Just because you can do it does not mean you should do it.\n", 14 | "* All decisions regarding the SML process are arbritrary. The right choice is the one you can argue for best.\n", 15 | "* Don't reinvent the wheel\n", 16 | "* Google the error messages\n", 17 | " \n", 18 | "\n", 19 | "## More resources\n", 20 | "\n", 21 | "#### Overfitting\n", 22 | "https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/\n", 23 | "\n", 24 | "#### Hyperparameter tuning \n", 25 | "https://machinelearningmastery.com/hyperparameter-optimization-with-random-search-and-grid-search/\n", 26 | "\n", 27 | "#### Datasets and challenges\n", 28 | "https://www.kaggle.com/\n", 29 | "\n", 30 | "\n", 31 | "## Recommended readings\n", 32 | "\n", 33 | "#### Van Atteveldt et al. (book for Python learners)\n", 34 | "Van Atteveldt, W., Trilling, D., & Calderón, C. A. (2022). Computational Analysis of Communication. Wiley Blackwell.\n", 35 | "https://cssbook.net/\n", 36 | "\n", 37 | "#### Zhang et al. (Paper about shooting victims and thoughts and prayers in Tweets)\n", 38 | "Zhang, Y., Shah, D., Foley, J., Abhishek, A., Lukito, J., Suk, J., ... & Garlough, C. (2019). Whose lives matter? Mass shootings and social media discourses of sympathy and policy, 2012–2014. Journal of Computer-Mediated Communication, 24(4), 182-202.\n", 39 | "https://doi.org/10.1093/jcmc/zmz009\n", 40 | "\n", 41 | "#### Meppelink et al. (Paper about online health info and reliability)\n", 42 | "Meppelink, C. S., Hendriks, H., Trilling, D., van Weert, J. C., Shao, A., & Smit, E. S. (2021). Reliable or not? An automated classification of webpages about early childhood vaccination using supervised machine learning. 
Patient Education and Counseling, 104(6), 1460-1466.\n", 43 | "https://doi.org/10.1016/j.pec.2020.11.013\n", 44 | "\n" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": null, 50 | "id": "1dfd8fe2-d0eb-470f-966e-fb70393edf9d", 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [] 54 | } 55 | ], 56 | "metadata": { 57 | "kernelspec": { 58 | "display_name": "Python 3 (ipykernel)", 59 | "language": "python", 60 | "name": "python3" 61 | }, 62 | "language_info": { 63 | "codemirror_mode": { 64 | "name": "ipython", 65 | "version": 3 66 | }, 67 | "file_extension": ".py", 68 | "mimetype": "text/x-python", 69 | "name": "python", 70 | "nbconvert_exporter": "python", 71 | "pygments_lexer": "ipython3", 72 | "version": "3.9.6" 73 | } 74 | }, 75 | "nbformat": 4, 76 | "nbformat_minor": 5 77 | } 78 | -------------------------------------------------------------------------------- /2023/day5/Exercise 3/exercise3.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "54c12790-b072-4582-9d1f-f10af87a2fcb", 6 | "metadata": {}, 7 | "source": [ 8 | "### Exercise 3\n", 9 | "\n", 10 | "In this exercise, you will practice with both applying SML and also evaluating it. When doing the latter, use the materials and slides discussed in today's workshop. Work together with your neighbour on this exercise!" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "id": "d03327ea-5806-4011-9523-31bebfd30227", 16 | "metadata": {}, 17 | "source": [ 18 | "### Q1: Describing Supervised Machine Learning (SML)\n", 19 | "\n", 20 | "a. SeeFlex is a video streaming platform that initially focused solely on television shows aimed at children. However, the CEO recently decided to expand the content available on SeeFlex to content aimed at adults as well. New availble genres on SeeFlex are, for example, horror shows or dating shows. To help customers select content and mostly, to help parents to keep selecting only content that is suitable for their children, the CEO wants to employ Supervised Machine Learning (SML) to automatically indicate the genre that a specific piece of content belongs to based on its description. She can do this because she got her hands on a large dataset with pre-labeled content descriptions which she can use to train and validate machines. \n", 21 | "\n", 22 | "With your neighbour, discuss the suitability of SML. Provide one argument in favor of and one argument against using SML in this case.\n" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "id": "be973d47-0572-40e3-a1fc-990c49a7befd", 28 | "metadata": {}, 29 | "source": [ 30 | "### Q2: Executing Supervised Machine Learning (SML)\n", 31 | " \n", 32 | "a. Read in the dataset you need for this assignment ('SeeFlex_data.csv') and conduct some explorative analyses on it. Your analysis needs to result in an overview of how many pieces of content there are per genre. \n", 33 | "\n", 34 | "Hint: Are you getting a 'list index out of range' error when reading in the data? Check what delimiter you are using for this *comma*-seperated values file!\n" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "id": "9c374389-382a-40bf-abf1-5c4545ed94e5", 40 | "metadata": {}, 41 | "source": [ 42 | "b. Write a script for the CEO of SeeFlex using SML to categorize the content descriptions into genres. While doing so, keep in mind the following:\n", 43 | "* The CEO's goal is to automatically label content for all viewers. 
But because SeeFlex is mainly used by parents and their children, correctly identifying kids' content takes priority over correctly detecting other genres. \n", 44 | "* In your code, compare at least two different models (e.g., Logistic Regression, Decision Tree).\n", 45 | "* Your code needs to produce at least one metric that evaluates the classifiers - think about what metric is most important in the current situation.\n" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "id": "42ba478c-9301-4800-a502-9140797740bf", 51 | "metadata": {}, 52 | "source": [ 53 | "### Q3: Reflecting on your script and findings\n", 54 | "\n", 55 | "a. Discuss the results of Q2a. Your answer needs to:\n", 56 | "* Discuss how the content descriptions are distributed across genres\n", 57 | "* Discuss about why it is (not) relevant to inspect the distribution of content descriptions across genres\n", 58 | "* Discuss what the above means for the classifier you developed for the CEO of SeeFlex\n" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "id": "7e4f448a-188b-4a38-9c48-7c5e086ee843", 64 | "metadata": {}, 65 | "source": [ 66 | "b. How do your classifiers work: do they distinguish between the four different genres that content belongs to, or did you decide to merge some genres into one or more categories? Why did you decide to do it in this way? In your answer, reflect on the advantages and on the disadvantages of your approach for the CEO of SeeFlex.\n" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "id": "2b97e603-ae45-424c-96ac-1f535a6007f4", 72 | "metadata": {}, 73 | "source": [ 74 | "c. Based on the results of your validation metrics, what classifier would you recommend to the CEO of SeeFlex? Why? \n" 75 | ] 76 | } 77 | ], 78 | "metadata": { 79 | "kernelspec": { 80 | "display_name": "Python 3 (ipykernel)", 81 | "language": "python", 82 | "name": "python3" 83 | }, 84 | "language_info": { 85 | "codemirror_mode": { 86 | "name": "ipython", 87 | "version": 3 88 | }, 89 | "file_extension": ".py", 90 | "mimetype": "text/x-python", 91 | "name": "python", 92 | "nbconvert_exporter": "python", 93 | "pygments_lexer": "ipython3", 94 | "version": "3.9.6" 95 | } 96 | }, 97 | "nbformat": 4, 98 | "nbformat_minor": 5 99 | } 100 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # teachteacher-python 2 | "Teaching the Teacher" resources for colleagues that want to get started using computational methods in their teaching (using Python) 3 | 4 | ## Purpose 5 | 6 | This repository contains materials for colleagues who are new to teaching computational methods but want to do so in the future. 
In particular, this holds true for the following courses, but the tips and resources also apply to future yet-to-be-developed courses: 7 | 8 | 9 | | Course name | Resource link 1 | 10 | |-----------------------------------------------------|------------------------------------------------------| 11 | | Gesis Course Introduction to Machine Learning | https://github.com/annekroon/gesis-machine-learning | 12 | | Big Data and Automated Content Analysis | https://github.com/uvacw/teaching-bdaca | 13 | | Computational Communication Science I | https://github.com/uva-cw-ccs1/2223s2/ | 14 | | Computational Communication Science II | https://github.com/uva-cw-ccs2/2223s2/ | 15 | | Data Journalism | https://github.com/uvacw/datajournalism | 16 | | Digital Analytics* | https://github.com/uva-cw-digitalanalytics/2021s2 | 17 | 18 | *Ask Joanna or Theo for access 19 | 20 | ## Requirements 21 | 22 | You need to have a working Python environment and you need to be able to install Python packages on your system. There are several ways of achieving this, and it is important to note that not all of your students may have the same type of environment. In particular, one can either opt for the so-called Anaconda distribution or a native Python installation. There are pros and cons for both approaches. Currently, students in Data Journalism as well as in Digital Analytics are advised to install Anaconda; students in Big Data and Automated Content Analysis are explicitly given the choice. Please read [our Installation Guide](installation.md) for detailed instructions. 23 | 24 | 25 | ## Structure of the ``Teaching the teacher`` course 26 | 27 | As a pilot, we are holding a five-day course in which we combine: 28 | - teaching the necessary Python skills 29 | - teaching how to teach these skills 30 | - exercising and reflecting on best teaching practices. 31 | 32 | 33 | ## Additional resources 34 | 35 | A list of additional resources that could be of interest: 36 | 37 | - A 5-day workshop by Anne and Damian on Machine Learning in Python (for social scientists with no or minimal previous Python knowledge): https://github.com/annekroon/gesis-ml-learning/ 38 | 39 | - "The new book" (forthcoming open-access on https://cssbook.net and in print with Wiley): Van Atteveldt, W., Trilling, D., Arcila, C. (in press): Computational Analysis of Communication: A practical introduction to the analysis of texts, networks, and images with code examples in Python and R 40 | 41 | - "The old book" (the book used between 2015 and 2020 in the Big Data courses). Less focus on Pandas than in more modern approaches, slightly outdated coding style in some examples, and less depth than the "new" book. The Twitter API chapter is outdated and sentiment analysis as described in Chapter 6 should not be taught like this any more. Apart from that, it can still be a good resource to get started and/or to look things up. Trilling, D. (2020): Doing Computational Social Science with Python: An Introduction. Version 1.3.2. https://github.com/damian0604/bdaca/blob/master/book/bd-aca_book.pdf 42 | --------------------------------------------------------------------------------