├── .DS_Store ├── .gitignore ├── Assignments ├── Week01 │ ├── PySDS_EX_Week01_Day01.ipynb │ ├── PySDS_EX_Week01_Day01_ModelAnswers.ipynb │ ├── PySDS_EX_Week01_Day02.ipynb │ ├── PySDS_EX_Week01_Day02_ModelAnswers.ipynb │ ├── PySDS_EX_Week01_Day03.ipynb │ ├── PySDS_EX_Week01_Day03_ModelAnswers.ipynb │ └── PySDS_EX_Week01_Day04.ipynb ├── Week02 │ ├── PySDS_Ex_Week02_Day01.ipynb │ ├── PySDS_Ex_Week02_Day01_ModelAnswers.ipynb │ ├── PySDS_Ex_Week02_Day02.ipynb │ ├── PySDS_Ex_Week02_Day02_ModelAnswers.ipynb │ ├── PySDS_Ex_Week02_Day03.ipynb │ ├── PySDS_Ex_Week02_Day03_ModelAnswers.ipynb │ ├── PySDS_Ex_Week02_Day04.ipynb │ └── PySDS_Ex_Week02_Day04_exampleAnswer.ipynb ├── Week03 │ ├── PySDS_Ex_Week03_Day01.ipynb │ ├── PySDS_Ex_Week03_Day01_ModelAnswers.ipynb │ ├── PySDS_Ex_Week03_Day02.ipynb │ ├── PySDS_Ex_Week03_Day02_ModelAnswers.ipynb │ ├── PySDS_Ex_Week03_Day03.ipynb │ ├── PySDS_Ex_Week03_Day03_ModelAnswers.ipynb │ └── PySDS_Ex_Week03_Day04.ipynb └── Week04 │ ├── PySDS_ex_w4-1.ipynb │ ├── PySDS_ex_w4-1_codeCalc.ipynb │ └── PySDS_votingData.csv ├── Course_Material ├── Week_0 │ └── PySDS_week0_lecture1_JupyterLab_Basics.ipynb ├── Week_1 │ ├── PySDS_w1-1_Strings-Numbers-Lists-Dictionaries.ipynb │ ├── PySDS_w1-2a_Intro_to_Loops.ipynb │ ├── PySDS_w1-2b_Advanced_loops.ipynb │ ├── PySDS_w1-3_The_File_System.ipynb │ └── PySDS_w1-4_Functions_and_abstraction.ipynb ├── Week_2 │ ├── PySDS_Supplemental_FREE_coding.ipynb │ ├── PySDS_w2-1_PANDAS_and_DataFrames.ipynb │ ├── PySDS_w2-2_Text_and_File_Types.ipynb │ ├── PySDS_w2-3_More_Text_Dates_SQL.ipynb │ └── PySDS_w2-4_Merging_Grouping_Data.ipynb ├── Week_3 │ ├── PySDS_demo_parseReplies_election.py │ ├── PySDS_w3-1_Getting_Web_Data_from_Reddit.ipynb │ ├── PySDS_w3-2_Exploring_and_cleaning_Data.ipynb │ ├── PySDS_w3-3_Pseudocode_and_Web_Spiders.ipynb │ └── PySDS_w3-4_Working_with_APIs.ipynb └── Week_4 │ ├── PySDS_w4-1_Visualizing_Data_with_Seaborn_and_Bokeh.ipynb │ ├── PySDS_w4-2_Working_on_a_Server_Using_Screen.ipynb │ ├── 
PySDS_w4-2_exampleCode_twitterStreamer.py │ ├── PySDS_w4-3a_ResearchEthics.pdf │ ├── PySDS_w4-3b_ResearchQuestions.pdf │ ├── PySDS_w4-4_Basic_Github.ipynb │ └── PySDS_w4-4b_Latex.pdf ├── README.md └── Supplementary_Data_(public) ├── Canada.xml ├── MuppetsTable.xlsx ├── PySDS_PolCandidates.csv ├── example_lines.txt └── muppetEpisodes.json /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oxfordinternetinstitute/sds-python/8c2469c9ff52d53e1176a6692a27cfe6dda55bc1/.DS_Store -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *-checkpoint.ipynb 2 | .ipynb_checkpoints 3 | Supplementary_Data_(private)/* 4 | Supplementary_Data_(private) 5 | Examination_Work 6 | Examination_Work/* 7 | *.zip 8 | -------------------------------------------------------------------------------- /Assignments/Week01/PySDS_EX_Week01_Day01.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "**PySDS Week 01 Day 01 - Coding Practice**" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "# Assignment 1. A few quick exercises and a sustained data cleaning example\n", 15 | " \n", 16 | "For this assignment you will be expected to debug the following statements and answer a few questions. These exercises are grouped by topic. You will then be expected to submit the assignment by \n", 17 | "\n", 18 | "~~~\n", 19 | "10am October 10th, 2018\n", 20 | "~~~\n", 21 | "\n", 22 | "That is to say, before tomorrow's class. The assignments will then be randomly shuffled to different members of the class. At 2pm, at the beginning of the tutorial, you will be able to view and comment on someone else's assignment. 
\n", 23 | "\n", 24 | "Peer grading is new at the OII, but it has been shown to be an effective approach to both learning about other people's code as well as keeping people engaged. Because it is peer graded, we will not expect you to give a specific mark. You will instead be asked: \n", 25 | "\n", 26 | "1. Does the code run as expected: Y / N \n", 27 | "2. If No, please note the lines. \n", 28 | "3. Do you think the code makes sense\n", 29 | "4. Does anything else stand out\n", 30 | "\n", 31 | "We will be doing this through Canvas. This is uncharted territory, but I think after a couple days it will be smooth sailing. \n", 32 | "\n", 33 | "Please be sure to name the file appropriately. " 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "## Part 1. Debugging Strings" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": null, 46 | "metadata": {}, 47 | "outputs": [], 48 | "source": [ 49 | "# 1.1.1. Printing a string 1.: \n", 50 | "\n", 51 | "print(\"The following statements should be identical. Do not edit the one in triple quotes.\")\n", 52 | "\n", 53 | "print('''Greetings Fellow \"Humans\"!!!\n", 54 | "Isn't it time for 'data science'?''')\n", 55 | "\n", 56 | "# Answer\n", 57 | "print(\"Greetings Fellow \"Humans\"!!!\",'Isn't it time for 'data science'?')\n", 58 | "\n", 59 | " \n", 60 | " \n", 61 | "# Reviewer comments below here\n", 62 | "\n" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": null, 68 | "metadata": {}, 69 | "outputs": [], 70 | "source": [ 71 | "# 1.1.2 Using whitespace characters.\n", 72 | "\n", 73 | "print(\"The following statements should be identical. Do not edit the one in triple quotes.\")\n", 74 | "print(\"It is okay to use spaces or tabs. 
It is not okay to use triple quotes.\")\n", 75 | "print()\n", 76 | "print('''A program without \n", 77 | " a poem \n", 78 | " is like a fish \n", 79 | " without a bicycle.''')\n", 80 | "\n", 81 | "# Answer:\n", 82 | "print(\"A program without a poem is like a fish without a bicycle.\")\n", 83 | "\n", 84 | "\n", 85 | "# Reviewer comments below here\n", 86 | "\n", 87 | "\n" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": null, 93 | "metadata": {}, 94 | "outputs": [], 95 | "source": [ 96 | "# 1.1.3. Joining strings\n", 97 | "\n", 98 | "# Print the happy birthday song using your name and the variables below: \n", 99 | "\n", 100 | "v1 = \"Happy Birthday\"\n", 101 | "v2 = \"to you\"\n", 102 | "v3 = \"dear\"\n", 103 | "name = ...\n", 104 | "\n", 105 | "# Answer\n", 106 | "\n", 107 | "\n", 108 | "# Reviewer comments below here\n", 109 | "\n", 110 | "\n" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "## A Muppet-themed series of questions. \n", 118 | "\n", 119 | "The following questions are based around The Muppets (of course). We will be shaping some data and asking some rudimentary questions of that data. " 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": null, 125 | "metadata": {}, 126 | "outputs": [], 127 | "source": [ 128 | "# 1.1.4 Replacing values \n", 129 | "\n", 130 | "# These data are a mess! 
They were supposed to be comma separated \n", 131 | "# Clean up the text so there are commas instead of tabs\n", 132 | " \n", 133 | "MuppetInput = '''\n", 134 | "Name\tGender\tSpecies\tFirst Appearance\n", 135 | "Fozzie\tMale\tBear\t1976\n", 136 | "Kermit\tMale\tFrog\t1955\n", 137 | "Piggy\tFemale\tPig\t1974\n", 138 | "Gonzo\tMale\tUnknown\t1970\n", 139 | "Rowlf\tMale\tDog\t1962\n", 140 | "Beaker\tMale\tMuppet\t1977\n", 141 | "Janice\tFemale\tMuppet\t1975\n", 142 | "Hilda\tFemale\tMuppet\t1976\n", 143 | "'''\n", 144 | "\n", 145 | "# Answer\n", 146 | "\n", 147 | "MuppetOutput = ...\n", 148 | "\n", 149 | "\n", 150 | "# Reviewer comments\n" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": null, 156 | "metadata": {}, 157 | "outputs": [], 158 | "source": [ 159 | "# 1.1.5 Splitting strings into lists\n", 160 | "\n", 161 | "# Take the text in the previous exercise and split \n", 162 | "# it so that each line of text is now one element in a list.\n", 163 | "# Print the number of lines. Is it what you expected? \n", 164 | "# If not, look for ways to strip the whitespace before you split it. \n", 165 | "\n", 166 | "# Answer\n", 167 | "\n", 168 | "MuppetList = ...\n", 169 | "\n", 170 | "\n", 171 | "# Reviewer comments \n", 172 | "\n", 173 | "\n" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": null, 179 | "metadata": {}, 180 | "outputs": [], 181 | "source": [ 182 | "# 1.1.6 Separating the header from the rest. \n", 183 | "\n", 184 | "# Take the list and then split it so that you have two lists\n", 185 | "# List one is only of length one, it's the header.\n", 186 | "# The other list is all the rest of the list. \n", 187 | "# Print the length of the data. It should be 8. 
\n", 188 | "\n", 189 | "# Answer \n", 190 | "MuppetHeader = ...\n", 191 | "MuppetData = ...\n", 192 | "\n", 193 | "# Reviewer comments \n", 194 | "\n" 195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": null, 200 | "metadata": {}, 201 | "outputs": [], 202 | "source": [ 203 | "# 1.1.7 Creating a dictionary with Muppet Names as keys: \n", 204 | "\n", 205 | "# Part 0. Create an empty dictionary called MuppetDict\n", 206 | "\n", 207 | "\n", 208 | "# Part 1. \n", 209 | "# Take the first item of MuppetData, split it to that it is now a list with four items\n", 210 | "# Add this to MuppetDict with the name as the key and the three remaining items as the value.\n", 211 | "\n", 212 | "# Answer. \n", 213 | "MuppetDict = ...\n", 214 | "\n", 215 | "# Part 2. \n", 216 | "# Do part one for the remaining 7 items. \n", 217 | "\n", 218 | "\n", 219 | "\n", 220 | "# Reviewer Comments \n", 221 | "\n" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": null, 227 | "metadata": {}, 228 | "outputs": [], 229 | "source": [ 230 | "# 1.1.8 Sorting\n", 231 | "\n", 232 | "# Using MuppetDict and the print statement below and answer the following questions\n", 233 | "# Which muppet appears first alphabetically by name?\n", 234 | "# When was that muppet's first appearance? 
\n", 235 | "# Hint: remember you can return the keys as a list\n", 236 | "\n", 237 | "# Answer \n", 238 | "\n", 239 | "print(\"The muppet that appears first alphabetically is %s and their first appearance was %s\" % (...,...))\n", 240 | "\n", 241 | "\n", 242 | "# Reviewer comments " 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": null, 248 | "metadata": {}, 249 | "outputs": [], 250 | "source": [ 251 | "# Any other reviewer comments: \n", 252 | "\n" 253 | ] 254 | } 255 | ], 256 | "metadata": { 257 | "kernelspec": { 258 | "display_name": "Python 3", 259 | "language": "python", 260 | "name": "python3" 261 | }, 262 | "language_info": { 263 | "codemirror_mode": { 264 | "name": "ipython", 265 | "version": 3 266 | }, 267 | "file_extension": ".py", 268 | "mimetype": "text/x-python", 269 | "name": "python", 270 | "nbconvert_exporter": "python", 271 | "pygments_lexer": "ipython3", 272 | "version": "3.7.0" 273 | } 274 | }, 275 | "nbformat": 4, 276 | "nbformat_minor": 2 277 | } 278 | -------------------------------------------------------------------------------- /Assignments/Week01/PySDS_EX_Week01_Day01_ModelAnswers.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "**PySDS Week 01 Day 01 - Coding Practice**" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "# Assignment 1. A few quick exercises and a sustained data cleaning example\n", 15 | " \n", 16 | "For this assignment you will be expected to debug the following statements and answer a few questions. These exercises are grouped by topic. You will then be expected to submit the assignment by \n", 17 | "\n", 18 | "~~~\n", 19 | "10am October 10th, 2018\n", 20 | "~~~\n", 21 | "\n", 22 | "That is to say, before tomorrow's class. The assignments will then be randomly shuffled to different members of the class. 
At 2pm, at the beginning of the tutorial, you will be able to view and comment on someone else's assignment. \n", 23 | "\n", 24 | "Peer grading is new at the OII, but it has been shown to be an effective approach to both learning about other people's code as well as keeping people engaged. Because it is peer graded, we will not expect you to give a specific mark. You will instead be asked: \n", 25 | "\n", 26 | "1. Does the code run as expected: Y / N \n", 27 | "2. If No, please note the lines. \n", 28 | "3. Do you think the code makes sense\n", 29 | "4. Does anything else stand out\n", 30 | "\n", 31 | "We will be doing this through Canvas. This is uncharted territory, but I think after a couple days it will be smooth sailing. \n", 32 | "\n", 33 | "Please be sure to name the file appropriately. " 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "## Part 1. Debugging Strings" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": null, 46 | "metadata": {}, 47 | "outputs": [], 48 | "source": [ 49 | "# 1.1.1. Printing a string 1.: \n", 50 | "\n", 51 | "print(\"The following statements should be identical. Do not edit the one in triple quotes.\")\n", 52 | "\n", 53 | "print('''Greetings Fellow \"Humans\"!!!\n", 54 | "Isn't it time for 'data science'?''')\n", 55 | "\n", 56 | "# Answer\n", 57 | "print(\"Greetings Fellow \\\"Humans\\\"!!!\\nIsn\\'t it time for \\'data science\\'?\")\n", 58 | "\n", 59 | " \n", 60 | " \n", 61 | "# Reviewer comments below here\n", 62 | "\n" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": null, 68 | "metadata": {}, 69 | "outputs": [], 70 | "source": [ 71 | "# 1.1.2 Using whitespace characters.\n", 72 | "\n", 73 | "print(\"The following statements should be identical. Do not edit the one in triple quotes.\")\n", 74 | "print(\"It is okay to use spaces or tabs. 
It is not okay to use triple quotes.\")\n", 75 | "print()\n", 76 | "print('''A program without \n", 77 | " a poem \n", 78 | " is like a fish \n", 79 | " without a bicycle.''')\n", 80 | "\n", 81 | "# Answer:\n", 82 | "print(\"A program without a poem is like a fish without a bicycle.\")\n", 83 | "\n", 84 | "\n", 85 | "# Reviewer comments below here\n", 86 | "\n", 87 | "\n" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": null, 93 | "metadata": {}, 94 | "outputs": [], 95 | "source": [ 96 | "Answer: print(\"A program without \\n\\ta poem \\n\\t\\tis like a fish \\n\\t\\t\\t\\twithout a bicycle.\")" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [ 105 | "# 1.1.3. Joining strings\n", 106 | "\n", 107 | "# Print the happy birthday song using your name and the variables below: \n", 108 | "\n", 109 | "v1 = \"Happy Birthday\"\n", 110 | "v2 = \"to you\"\n", 111 | "v3 = \"dear\"\n", 112 | "name = ...\n", 113 | "\n", 114 | "# Answer\n", 115 | "\n", 116 | "\n", 117 | "# Reviewer comments below here\n", 118 | "\n", 119 | "\n" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": null, 125 | "metadata": {}, 126 | "outputs": [], 127 | "source": [ 128 | "name = \"Bernie\"\n", 129 | "# Solution 1.\n", 130 | "print(v1,v2,\"\\n\",v1,v2,\"\\n\",v1,v3,\"\\n\",name,v1,v2,\"\\n\") # Worst - using commas in the print command adds some ugly spaces\n", 131 | "\n", 132 | "# Solution 2. 
\n", 133 | "strout = v1 + \" \" + v2 + \"\\n\" + v1 + \" \" + v2 + \"\\n\" + v1 + \" \" + v3 + \" \" + name + \"\\n\" + v1 + \" \" + v2 + \"\\n\" \n", 134 | "print(strout)\n", 135 | "\n", 136 | "# Solution 3\n", 137 | "print ( \"%(v1)s %(v2)s\\n%(v1)s %(v2)s\\n%(v1)s %(v3)s %(name)s\\n%(v1)s %(v2)s\\n\" % {\"v1\":v1,\"v2\":v2,\"v3\":v3,\"name\":name}) # this uses string substitution\n" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "## A Muppet-themed series of questions. \n", 145 | "\n", 146 | "The following questions are based around The Muppets. We will be shaping some data and asking some rudimentary questions of that data. " 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": null, 152 | "metadata": {}, 153 | "outputs": [], 154 | "source": [ 155 | "# 1.1.4 Replacing values \n", 156 | "\n", 157 | "# These data are a mess! They were supposed to be comma separated \n", 158 | "# Clean up the text so there are commas instead of tabs\n", 159 | " \n", 160 | "MuppetInput = '''\n", 161 | "Name\tGender\tSpecies\tFirst Appearance\n", 162 | "Fozzie\tMale\tBear\t1976\n", 163 | "Kermit\tMale\tFrog\t1955\n", 164 | "Piggy\tFemale\tPig\t1974\n", 165 | "Gonzo\tMale\tUnknown\t1970\n", 166 | "Rowlf\tMale\tDog\t1962\n", 167 | "Beaker\tMale\tMuppet\t1977\n", 168 | "Janice\tFemale\tMuppet\t1975\n", 169 | "Hilda\tFemale\tMuppet\t1976\n", 170 | "'''\n", 171 | "\n", 172 | "MuppetOutput = ..." 
173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": null, 178 | "metadata": {}, 179 | "outputs": [], 180 | "source": [ 181 | "#Answer \n", 182 | "\n", 183 | "MuppetOutput = MuppetInput.replace(\"\\t\",\",\") # '\t' is more commonly represented by '\\t' - meaning 'tab'\n", 184 | "print(MuppetOutput)" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": null, 190 | "metadata": {}, 191 | "outputs": [], 192 | "source": [ 193 | "# 1.1.5 Splitting strings into lists\n", 194 | "\n", 195 | "# Take the text in the previous exercise and split \n", 196 | "# it so that each line of text is now one element in a list.\n", 197 | "# Print the number of lines. Is it what you expected? \n", 198 | "# If not, look for ways to strip the whitespace before you split it. \n", 199 | "\n", 200 | "MuppetList = ..." 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": null, 206 | "metadata": {}, 207 | "outputs": [], 208 | "source": [ 209 | "# Answer \n", 210 | "\n", 211 | "MuppetList = MuppetOutput.strip().split(\"\\n\") # you can apply multiple methods in one line\n", 212 | "print(len(MuppetList))" 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": null, 218 | "metadata": {}, 219 | "outputs": [], 220 | "source": [ 221 | "# 1.1.6 Separating the header from the rest. \n", 222 | "\n", 223 | "# Take the list and then split it so that you have two lists\n", 224 | "# List one is only of length one, it's the header.\n", 225 | "# The other list is all the rest of the list. \n", 226 | "# Print the length of the data. It should be 8. \n", 227 | "MuppetHeader = ...\n", 228 | "MuppetData = ..." 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": null, 234 | "metadata": {}, 235 | "outputs": [], 236 | "source": [ 237 | "# Answer\n", 238 | "MuppetHeader = [MuppetList[0]] # MuppetList[0] is just an element of a list. 
Put square brackets around it to put it in a list\n", 239 | "MuppetData = MuppetList[1:] # list slice\n", 240 | "print(MuppetHeader)\n", 241 | "print(MuppetData)\n", 242 | "print(len(MuppetData))" 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": null, 248 | "metadata": {}, 249 | "outputs": [], 250 | "source": [ 251 | "# 1.1.7 Creating a dictionary with Muppet Names as keys: \n", 252 | "\n", 253 | "# Part 0. Create an empty dictionary called MuppetDict\n", 254 | "\n", 255 | "MuppetDict = {}\n", 256 | "\n", 257 | "# Part 1. \n", 258 | "# Take the first item of MuppetData, split it so that it is now a list with four items\n", 259 | "# Add this to MuppetDict with the name as the key and the three remaining items as the value.\n", 260 | "\n", 261 | "\n", 262 | "# Part 2. \n", 263 | "# Do part one for the remaining 7 items. \n", 264 | "\n", 265 | "\n" 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": null, 271 | "metadata": {}, 272 | "outputs": [], 273 | "source": [ 274 | "MuppetDict = {}\n", 275 | "for i in MuppetData: # for each line\n", 276 | "    j = i.split(\",\") # split on the commas -> list\n", 277 | "    MuppetDict[j[0]] = j[1:] # dict key is first element of new list, value is the remaining elements\n", 278 | "\n", 279 | "# or\n", 280 | "MuppetDict = {}\n", 281 | "ll = MuppetData[0].split(\",\")\n", 282 | "MuppetDict[ll[0]] = ll[1:]\n", 283 | "ll = MuppetData[1].split(\",\")\n", 284 | "MuppetDict[ll[0]] = ll[1:]\n", 285 | "ll = MuppetData[2].split(\",\")\n", 286 | "MuppetDict[ll[0]] = ll[1:]\n", 287 | "ll = MuppetData[3].split(\",\")\n", 288 | "MuppetDict[ll[0]] = ll[1:]\n", 289 | "ll = MuppetData[4].split(\",\")\n", 290 | "MuppetDict[ll[0]] = ll[1:]\n", 291 | "ll = MuppetData[5].split(\",\")\n", 292 | "MuppetDict[ll[0]] = ll[1:]\n", 293 | "ll = MuppetData[6].split(\",\")\n", 294 | "MuppetDict[ll[0]] = ll[1:]\n", 295 | "ll = MuppetData[7].split(\",\")\n", 296 | "MuppetDict[ll[0]] = ll[1:]\n", 297 | "\n", 298 | 
"print(MuppetDict)" 299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": null, 304 | "metadata": {}, 305 | "outputs": [], 306 | "source": [ 307 | "# 1.1.8 Sorting\n", 308 | "\n", 309 | "# Using MuppetDict and the print statement below and answer the following questions\n", 310 | "# Which muppet appears first alphabetically by name?\n", 311 | "# When was that muppet's first appearance? \n", 312 | "# Hint: remember you can return the keys as a list\n", 313 | "\n", 314 | "#\n", 315 | "sortednames = list(MuppetDict.keys()) # get the list of keys\n", 316 | "sortednames.sort() # sort in place\n", 317 | "FirstMuppet = sortednames[0] # return the first item\n", 318 | "print(\"The muppet that appears first alphabetically is %s and their first appearance was %s\" % (FirstMuppet, MuppetDict[FirstMuppet][2])) # print output using string substitution\n", 319 | "\n", 320 | "# or use 'sorted()' with some indexing to write all in one line\n", 321 | "print(\"The muppet that appears first alphabetically is %s and their first appearance was %s\" % (sorted(MuppetDict.items())[0][0],sorted(MuppetDict.items())[0][1][2])) " 322 | ] 323 | } 324 | ], 325 | "metadata": { 326 | "kernelspec": { 327 | "display_name": "Python 3", 328 | "language": "python", 329 | "name": "python3" 330 | }, 331 | "language_info": { 332 | "codemirror_mode": { 333 | "name": "ipython", 334 | "version": 3 335 | }, 336 | "file_extension": ".py", 337 | "mimetype": "text/x-python", 338 | "name": "python", 339 | "nbconvert_exporter": "python", 340 | "pygments_lexer": "ipython3", 341 | "version": "3.7.0" 342 | } 343 | }, 344 | "nbformat": 4, 345 | "nbformat_minor": 2 346 | } 347 | -------------------------------------------------------------------------------- /Assignments/Week01/PySDS_EX_Week01_Day02.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "**PySDS Week 
01 Day 02 - Control and flow statements**" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "# Assignment 1.2 A few quick exercises and a sustained data cleaning example\n", 15 | " \n", 16 | "For this assignment you will be expected to create a single long choose your own adventure story based on user input. The story is going to be pretty underwhelming but should show some examples of some of the concepts used in class. The story can be of your own choosing but must have at least three different paths and two endings. It will take user input to decide the next story. Below I have some 'story elements' that can be used. \n", 17 | "\n", 18 | "~~~\n", 19 | "10am October 10th, 2018\n", 20 | "~~~\n", 21 | "\n", 22 | "That is to say, before tomorrow's class. The assignments will then be randomly shuffled to different members of the class. At 2pm, at the beginning of the tutorial, you will be able to view and comment on someone else's assignment. \n", 23 | "\n", 24 | "Peer grading is new at the OII, but it has been shown to be an effective approach to both learning about other people's code as well as keeping people engaged. Because it is peer graded, we will not expect you to give a specific mark. You will instead be asked: \n", 25 | "\n", 26 | "1. Does the code run as expected: Y / N \n", 27 | "2. If No, please note the lines. \n", 28 | "3. Do you think the code makes sense\n", 29 | "4. Does anything else stand out" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": null, 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "# Short Exercise 1. \n", 39 | "\n", 40 | "# Fozzie Bear! (based on a classic coding challenge)\n", 41 | "# Make a program that spits out numbers and the words Fozzie Bear. 
\n", 42 | "# If the line is a multiple of 3 print Fozzie\n", 43 | "# If the line is a multiple of 5 print Bear\n", 44 | "# If the line is a multiple of both, print \"Fozzie Bear\" (with the space in between). \n", 45 | "# Have the program run in the range 0 to 30. \n", 46 | "\n", 47 | "# Answer below here: \n", 48 | "\n" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": null, 54 | "metadata": {}, 55 | "outputs": [], 56 | "source": [ 57 | "# Short exercise 2: Versever \n", 58 | "\n", 59 | "# print 10 lines of 20 characters each in the following pattern:\n", 60 | "# verseverseverseverse\n", 61 | "# severseverseversever\n", 62 | "# Catch, you can only use one string with the word 'verse'\n", 63 | "# everything else should be done through slicing and loops.\n", 64 | "\n", 65 | "# Answer below here\n", 66 | "\n", 67 | "\n", 68 | "# Reviewer comments below here: \n", 69 | "'''\n", 70 | "\n", 71 | "'''" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": null, 77 | "metadata": {}, 78 | "outputs": [], 79 | "source": [ 80 | "# Short exercise 3. Versever II \n", 81 | "\n", 82 | "# Print 10 lines of the 5 characters in the following pattern:\n", 83 | "#\n", 84 | "# verse\n", 85 | "# ersev\n", 86 | "# rseve\n", 87 | "# sever\n", 88 | "# evers\n", 89 | "#\n", 90 | "# Again, the catch is you can only use one string with the word 'verse'\n", 91 | "# everything else should be done through slicing and loops. " 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": 1, 97 | "metadata": {}, 98 | "outputs": [], 99 | "source": [ 100 | "# Long exercise 1. A very short choose your own adventure\n", 101 | "# Please create a story that could, in theory, include every string below and \n", 102 | "# at least one element from each of the two lists. \n", 103 | "#\n", 104 | "# Please complete your code in the cell below this one. Also in the cells below that\n", 105 | "# are some design patterns that you can use to help you out. 
\n", 106 | "\n", 107 | "setting1 = '''It was a sunny day outside when Kermit went to the old auditorium. \n", 108 | "Along the way he was greeted by his friend %s. Kermit couldn't believe what %s told him.'''\n", 109 | "\n", 110 | "actor1 = [\"Piggy\",\"Fozzie\",\"Gonzo\",\"Hilda\"]\n", 111 | "\n", 112 | "quote1 = '''The auditorium has burned down! What shall we do?'''\n", 113 | "\n", 114 | "quote2 = '''The owner said if we can't come up with rent by Thursday, we will be evicted!'''\n", 115 | "\n", 116 | "quote3 = '''The flight for this weeks guest star was cancelled. We will have to find a way to pay for a new flight!'''\n", 117 | "\n", 118 | "reply1 = '''I know, kermit said. We should hold a %s'''\n", 119 | "\n", 120 | "event1 = [\"BBQ\",\"Telethon\",\"Bake sale\"]\n", 121 | "\n", 122 | "setting2 = '''At the %s, all the muppets were there, inclding %s for a special performance. They needed to raise $1000 but actually raised %s'''\n", 123 | "\n", 124 | "ending1 = '''Kermit, sullen, with his guitar in hand, knew it wouldn't be enough. \n", 125 | "Sitting on a log he thought, it's easier being green than making it. Better luck next time'''\n", 126 | "\n", 127 | "ending2 = '''With his muppet hands flailing in the air, Kermit knew they could do it. \n", 128 | "The show will go on after all.'''\n", 129 | "\n", 130 | "################################################\n", 131 | "# Additional user-defined statements \n", 132 | "################################################\n", 133 | "\n" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": 2, 139 | "metadata": {}, 140 | "outputs": [], 141 | "source": [ 142 | "# The choose your own adventure should include all of the above statements,\n", 143 | "# but can include more if you are feeling creative. 
Place all additional statements \n", 144 | "# in the cell above with appropriate variable names (you can choose these at will)\n", 145 | "\n", 146 | "# Answer below here\n", 147 | "\n", 148 | "\n", 149 | "\n", 150 | "# Reviewer comments below here\n", 151 | "\n", 152 | "\n" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": 8, 158 | "metadata": {}, 159 | "outputs": [ 160 | { 161 | "name": "stdout", 162 | "output_type": "stream", 163 | "text": [ 164 | "f\n" 165 | ] 166 | } 167 | ], 168 | "source": [ 169 | "######################################\n", 170 | "# Appendix 1. Here are some code patterns to help you out.\n", 171 | "# None of them are necessary, per se, but they should serve as inspiration.\n", 172 | "#\n", 173 | "######################################\n", 174 | "# Pattern 1. Getting a number\n", 175 | "\n", 176 | "a = input(\"Please print a number: \")\n", 177 | "\n", 178 | "if int(a) > 4: \n", 179 | " b = input(\"Please print a second number: \")\n", 180 | "else:\n", 181 | " b = input(\"Please print a letter: \")\n", 182 | "\n", 183 | "print(b)" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": 14, 189 | "metadata": {}, 190 | "outputs": [], 191 | "source": [ 192 | "from IPython import display" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": null, 198 | "metadata": {}, 199 | "outputs": [], 200 | "source": [ 201 | "#####################################\n", 202 | "# Pattern 2. 
Staying in a loop until you are satisfied\n", 203 | "\n", 204 | "while True: \n", 205 | " print(\"This is a statement about A and B\")\n", 206 | " a = input(\"Please make a selection: (A) or (B)\")\n", 207 | " \n", 208 | " if a == \"A\":\n", 209 | " display(\"you chose well.\")\n", 210 | " break\n", 211 | " elif a == \"B\":\n", 212 | " display(\"you chose well.\")\n", 213 | " break\n", 214 | " else:\n", 215 | " display(\"I'm sorry, that was not a valid selection\")\n", 216 | "\n", 217 | "while True: \n", 218 | " print(\"This is a statement about C or D\")\n", 219 | " a = input(\"Please make a selection: (C) or (D)\")\n", 220 | " \n", 221 | " if a == \"C\":\n", 222 | " display(\"you chose well.\")\n", 223 | " break\n", 224 | " elif a == \"D\":\n", 225 | " display(\"you chose well.\")\n", 226 | " break\n", 227 | " else:\n", 228 | " display(\"I'm sorry, that was not a valid selection\")\n" 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": 7, 234 | "metadata": {}, 235 | "outputs": [ 236 | { 237 | "name": "stdout", 238 | "output_type": "stream", 239 | "text": [ 240 | "That's not greater than 4. Try again.\n", 241 | "you selected 6 which is greater than 4\n" 242 | ] 243 | } 244 | ], 245 | "source": [ 246 | "#####################################\n", 247 | "# Pattern 3. Catching bad input with 'try / except'\n", 248 | "\n", 249 | "while True:\n", 250 | " a = input(\"Please print a number greater than 4: \")\n", 251 | " try: \n", 252 | " if float(a) > 4: \n", 253 | " print(\"you selected %s which is greater than 4\" % a)\n", 254 | " break\n", 255 | " else:\n", 256 | " print(\"That's not greater than 4. Try again.\")\n", 257 | " except: \n", 258 | " print(\"I'm sorry, that was not valid input. 
Please enter an number\")\n" 259 | ] 260 | } 261 | ], 262 | "metadata": { 263 | "kernelspec": { 264 | "display_name": "Python 3", 265 | "language": "python", 266 | "name": "python3" 267 | }, 268 | "language_info": { 269 | "codemirror_mode": { 270 | "name": "ipython", 271 | "version": 3 272 | }, 273 | "file_extension": ".py", 274 | "mimetype": "text/x-python", 275 | "name": "python", 276 | "nbconvert_exporter": "python", 277 | "pygments_lexer": "ipython3", 278 | "version": "3.6.5" 279 | } 280 | }, 281 | "nbformat": 4, 282 | "nbformat_minor": 2 283 | } 284 | -------------------------------------------------------------------------------- /Assignments/Week01/PySDS_EX_Week01_Day02_ModelAnswers.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "**PySDS Week 01 Day 02 - Control and flow statements**" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "# Assignment 1.2 A few quick exercises and a sustained data cleaning example\n", 15 | " \n", 16 | "For this assignment you will be expected to create a single long choose your own adventure story based on user input. The story is going to be pretty underwhelming but should show some examples of some of the concepts used in class. The story can be of your own choosing but must have at least three different paths and two endings. It will take user input to decide the next story. Below I have some 'story elements' that can be used. \n", 17 | "\n", 18 | "~~~\n", 19 | "10am October 10th, 2018\n", 20 | "~~~\n", 21 | "\n", 22 | "That is to say, before tomorrow's class. The assignments will then be randomly shuffled to different members of the class. At 2pm, at the beginning of the tutorial, you will be able to view and comment on someone else's assignment. 
\n", 23 | "\n", 24 | "Peer grading is new at the OII, but it has been shown to be an effective approach to both learning about other people's code as well as keeping people engaged. Because it is peer graded, we will not expect you to give a specific mark. You will instead be asked: \n", 25 | "\n", 26 | "1. Does the code run as expected: Y / N \n", 27 | "2. If No, please note the lines. \n", 28 | "3. Do you think the code makes sense\n", 29 | "4. Does anything else stand out" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": null, 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "# Short Exercise 1. \n", 39 | "\n", 40 | "# Fozzie Bear! (based on a classic coding challenge)\n", 41 | "# Make a program that spits out numbers and the words Fozzie Bear. \n", 42 | "# If the line is a multiple of 3 print Fozzie\n", 43 | "# If the line is a multiple of 5 print Bear\n", 44 | "# If the line is a multiple of both, print \"Fozzie Bear\" (with the space in between). \n", 45 | "# Have the program run in the range 0 to 30. 
\n", 46 | "\n", 47 | "# Answer below here: \n", 48 | "for i in range(30): # (yes, it was unclear whether to go from 0 to 30 inclusive or not)\n", 49 | "    if i%3==0 and i%5==0: # check for 3 and 5\n", 50 | "        print('Fozzie Bear')\n", 51 | "    elif i%3==0: # check for 3 \n", 52 | "        print('Fozzie')\n", 53 | "    elif i%5==0: # check for 5\n", 54 | "        print('Bear')\n", 55 | "    else: # if no conditions satisfied, print the number\n", 56 | "        print(i)\n", 57 | "\n", 58 | "\n", 59 | "print()\n", 60 | "\n", 61 | "# Alternative more general solution, which is easily scalable to any number of mod conditions.\n", 62 | "fb = {3:'Fozzie', 5:'Bear'} \n", 63 | "for i in range(30): # \n", 64 | "    toprint=[]\n", 65 | "    for k, v in fb.items(): # iterates through mod conditions dictionary for each number\n", 66 | "        if i%k==0:\n", 67 | "            toprint.append(v) # appends if condition satisfied\n", 68 | "    print(str(i)*(not bool(len(toprint))) + ' '.join(toprint)) # first part is a boolean statement which prints the number if 'toprint' is empty, second part joins the strings in toprint with spaces\n", 69 | "\n", 70 | "    " 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": null, 76 | "metadata": {}, 77 | "outputs": [], 78 | "source": [ 79 | "# Short exercise 2: Versever \n", 80 | "\n", 81 | "# print 10 lines of 20 characters each in the following pattern:\n", 82 | "# verseverseverseverse\n", 83 | "# severseverseversever\n", 84 | "# Catch, you can only use one string with the word 'verse'\n", 85 | "# everything else should be done through slicing and loops.\n", 86 | "\n", 87 | "# Answer below here\n", 88 | "s = 'verse'\n", 89 | "\n", 90 | "# with for loops\n", 91 | "t = s[3:]+s[:3] # use a string slice to construct 'sever'\n", 92 | "\n", 93 | "for i in [s,t]: # for each string\n", 94 | "    for j in range(10): # print 10 lines\n", 95 | "        print(i*4) # 4 on each line\n", 96 | "\n", 97 | "    \n", 98 | "    \n", 99 | "print()\n", 100 | "# alternative\n", 101 | "print(10*(4*s+'\\n')) # construct 10 
line long string\n", 102 | "print(10*(4*(s[3:]+s[:3])+'\\n')) # use a string slice to construct 'sever'\n", 103 | "\n", 104 | "\n", 105 | "# Reviewer comments below here: \n", 106 | "'''\n", 107 | "\n", 108 | "'''" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": null, 114 | "metadata": {}, 115 | "outputs": [], 116 | "source": [ 117 | "# Short exercise 3. Versever II \n", 118 | "\n", 119 | " \n", 120 | "# Print 10 lines of the 5 characters in the following pattern:\n", 121 | "#\n", 122 | "# verse\n", 123 | "# ersev\n", 124 | "# rseve\n", 125 | "# sever\n", 126 | "# evers\n", 127 | "#\n", 128 | "# Again, the catch is you can only use one string with the word 'verse'\n", 129 | "# everything else should be done through slicing and loops. \n", 130 | "\n", 131 | "s = 'verse'\n", 132 | "for n in range(2): # does the inner loop twice to ensure 10 lines\n", 133 | " for i in range(0,5): # each line is a combination of string slices at different positions\n", 134 | " print(s[i:]+s[:i])" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": null, 140 | "metadata": {}, 141 | "outputs": [], 142 | "source": [ 143 | "# Long exercise 1. A very short choose your own adventure\n", 144 | "# Please create a story that could, in theory, include every string below and \n", 145 | "# at least one element from each of the two lists. \n", 146 | "#\n", 147 | "# Please complete your code in the cell below this one. Also in the cells below that\n", 148 | "# are some design patterns that you can use to help you out. \n", 149 | "\n", 150 | "setting1a = '''It was a sunny day outside when Kermit went to the old auditorium.'''\n", 151 | "setting1b = '''Along the way he was greeted by his friend %s. Kermit couldn't believe what %s told him.'''\n", 152 | "\n", 153 | "actor1 = [\"Piggy\",\"Fozzie\",\"Gonzo\",\"Hilda\"]\n", 154 | "\n", 155 | "quote1 = '''The auditorium has burned down! 
What shall we do?'''\n", 156 | "\n", 157 | "quote2 = '''The owner said if we can't come up with rent by Thursday, we will be evicted!'''\n", 158 | "\n", 159 | "quote3 = '''The flight for this week's guest star was cancelled. We will have to find a way to pay for a new flight!'''\n", 160 | "\n", 161 | "reply1 = '''\"I know\", Kermit said. \"We should hold a %s\"'''\n", 162 | "\n", 163 | "event1 = [\"BBQ\",\"Telethon\",\"Bake sale\"]\n", 164 | "\n", 165 | "setting2 = '''At the %s, all the muppets were there, including %s for a special performance. They needed to raise $1000 but actually raised $%s'''\n", 166 | "\n", 167 | "ending1 = '''Kermit, sullen, with his guitar in hand, knew it wouldn't be enough. \n", 168 | "Sitting on a log he thought, it's easier being green than making it. Better luck next time'''\n", 169 | "\n", 170 | "ending2 = '''With his muppet hands flailing in the air, Kermit knew they could do it. \n", 171 | "The show will go on after all.'''\n", 172 | "\n", 173 | "################################################\n", 174 | "# Additional user-defined statements \n", 175 | "################################################\n", 176 | "\n" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": null, 182 | "metadata": {}, 183 | "outputs": [], 184 | "source": [ 185 | "# The choose your own adventure should include all of the above statements,\n", 186 | "# but can include more if you are feeling creative. 
Place all additional statements \n", 187 | "# in the cell above with appropriate variable names (you can choose these at will)\n", 188 | "\n", 189 | "# Answer below here\n", 190 | "quotelist = [quote1, quote2, quote3] #group the quotes\n", 191 | "\n", 192 | "\n", 193 | "print(setting1a) # print the start\n", 194 | "\n", 195 | "while True: # use while loops to ensure correct input\n", 196 | " q0 = input('\\tWho was also headed that way?\\n\\t\\t1: %s\\n\\t\\t2: %s\\n\\t\\t3: %s\\n\\t\\t4: %s\\n' %(actor1[0], actor1[1], actor1[2], actor1[3])) # take input\n", 197 | " try:\n", 198 | " if int(q0)>0:\n", 199 | " friend = actor1[int(q0)-1] # select the character\n", 200 | " del actor1[int(q0)-1] # remove from list for later\n", 201 | " print(setting1b %(friend, friend)) # print next sentence\n", 202 | " break #escape while loop if input correct\n", 203 | " print('Incorrect input, please choose a number from 1 to 4')\n", 204 | " except:\n", 205 | " print('Incorrect input, please choose a number from 1 to 4')\n", 206 | "\n", 207 | "while True: # take more input, same stuff for later loops too\n", 208 | " q1 = input('\\tWhat happened?\\n\\t\\t1: There was a fire at the auditorium.\\n\\t\\t2: They have been threatened with eviction.\\n\\t\\t3: The guest\\'s flight was cancelled.\\n')\n", 209 | " try:\n", 210 | " if int(q1)>0:\n", 211 | " print('\"%s\"'%quotelist[int(q1)-1])\n", 212 | " break\n", 213 | " print('Incorrect input, please choose a number from 1 to 3')\n", 214 | " except:\n", 215 | " print('Incorrect input, please choose a number from 1 to 3') \n", 216 | "\n", 217 | "while True:\n", 218 | " q2 = input('\\tWhat event did Kermit decide to hold?\\n\\t\\t1: %s\\n\\t\\t2: %s\\n\\t\\t3: %s\\n' %(event1[0], event1[1], event1[2]))\n", 219 | " try:\n", 220 | " if int(q2)>0:\n", 221 | " print(reply1 %event1[int(q2)-1])\n", 222 | " break\n", 223 | " print('Incorrect input, please choose a number from 1 to 3')\n", 224 | " except:\n", 225 | " print('Incorrect input, please 
choose a number from 1 to 3') \n", 226 | "\n", 227 | "while True:\n", 228 | " q3 = input('\\tWho was the guest star?\\n\\t\\t1: %s\\n\\t\\t2: %s\\n\\t\\t3: %s\\n' %(actor1[0], actor1[1], actor1[2]))\n", 229 | " try:\n", 230 | " if int(q3)>0 and int(q3)<4:\n", 231 | " break\n", 232 | " print('Incorrect input, please choose a number from 1 to 3')\n", 233 | " except:\n", 234 | " print('Incorrect input, please choose a number from 1 to 3') \n", 235 | " \n", 236 | "while True:\n", 237 | " q4 = input('\\tHow much money was raised? $')\n", 238 | " try:\n", 239 | " if float(q4)>=0:\n", 240 | " break\n", 241 | " print('Incorrect input, please choose a number greater than or equal to zero')\n", 242 | " except:\n", 243 | " print('Incorrect input, please choose a number greater than or equal to zero') \n", 244 | " \n", 245 | "\n", 246 | "print(setting2 %(event1[int(q2)-1], actor1[int(q3)-1],q4)) # next line\n", 247 | "\n", 248 | "if float(q4)<1000: # check amount to see which ending\n", 249 | " print(ending1)\n", 250 | "else:\n", 251 | " print(ending2)\n", 252 | " \n", 253 | "\n", 254 | "\n", 255 | "\n" 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": null, 261 | "metadata": {}, 262 | "outputs": [], 263 | "source": [ 264 | "######################################\n", 265 | "# Appendix 1. Here are some code patterns to help you out.\n", 266 | "# None of them are necessary, per se, but they should serve as inspiration.\n", 267 | "#\n", 268 | "######################################\n", 269 | "# Pattern 1. 
Getting a number\n", 270 | "\n", 271 | "a = input(\"Please print a number: \")\n", 272 | "\n", 273 | "if int(a) > 4: \n", 274 | "    b = input(\"Please print a second number: \")\n", 275 | "else:\n", 276 | "    b = input(\"Please print a letter: \")\n", 277 | "\n", 278 | "print(b)" 279 | ] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "execution_count": null, 284 | "metadata": {}, 285 | "outputs": [], 286 | "source": [ 287 | "from IPython.display import display # import the display function itself; 'from IPython import display' imports the module, which is not callable" 288 | ] 289 | }, 290 | { 291 | "cell_type": "code", 292 | "execution_count": null, 293 | "metadata": {}, 294 | "outputs": [], 295 | "source": [ 296 | "#####################################\n", 297 | "# Pattern 2. Staying in a loop until you are satisfied\n", 298 | "\n", 299 | "while True: \n", 300 | "    print(\"This is a statement about A and B\")\n", 301 | "    a = input(\"Please make a selection: (A) or (B)\")\n", 302 | "    \n", 303 | "    if a == \"A\":\n", 304 | "        display(\"you chose well.\")\n", 305 | "        break\n", 306 | "    elif a == \"B\":\n", 307 | "        display(\"you chose well.\")\n", 308 | "        break\n", 309 | "    else:\n", 310 | "        display(\"I'm sorry, that was not a valid selection\")\n", 311 | "\n", 312 | "while True: \n", 313 | "    print(\"This is a statement about C or D\")\n", 314 | "    a = input(\"Please make a selection: (C) or (D)\")\n", 315 | "    \n", 316 | "    if a == \"C\":\n", 317 | "        display(\"you chose well.\")\n", 318 | "        break\n", 319 | "    elif a == \"D\":\n", 320 | "        display(\"you chose well.\")\n", 321 | "        break\n", 322 | "    else:\n", 323 | "        display(\"I'm sorry, that was not a valid selection\")\n" 324 | ] 325 | }, 326 | { 327 | "cell_type": "code", 328 | "execution_count": null, 329 | "metadata": {}, 330 | "outputs": [], 331 | "source": [ 332 | "#####################################\n", 333 | "# Pattern 3. 
Catching bad input with 'try / except'\n", 334 | "\n", 335 | "while True:\n", 336 | "    a = input(\"Please print a number greater than 4: \")\n", 337 | "    try: \n", 338 | "        if float(a) > 4: \n", 339 | "            print(\"you selected %s which is greater than 4\" % a)\n", 340 | "            break\n", 341 | "        else:\n", 342 | "            print(\"That's not greater than 4. Try again.\")\n", 343 | "    except: \n", 344 | "        print(\"I'm sorry, that was not valid input. Please enter a number\")\n" 345 | ] 346 | } 347 | ], 348 | "metadata": { 349 | "kernelspec": { 350 | "display_name": "Python [default]", 351 | "language": "python", 352 | "name": "python3" 353 | }, 354 | "language_info": { 355 | "codemirror_mode": { 356 | "name": "ipython", 357 | "version": 3 358 | }, 359 | "file_extension": ".py", 360 | "mimetype": "text/x-python", 361 | "name": "python", 362 | "nbconvert_exporter": "python", 363 | "pygments_lexer": "ipython3", 364 | "version": "3.6.3" 365 | } 366 | }, 367 | "nbformat": 4, 368 | "nbformat_minor": 2 369 | } 370 | -------------------------------------------------------------------------------- /Assignments/Week01/PySDS_EX_Week01_Day03.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "**PySDS Week 01 Day 03 - Printing to a file and more loop practice**" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "# Assignment 1.3 Printing to a file and more loop practice. \n", 15 | "\n", 16 | "For this assignment you will be expected to write and read some text from a file. The sustained example is about reading a table of data and then writing that table back to a different file under some pre-determined conditions. \n", 17 | "\n", 18 | "~~~\n", 19 | "10am October 12th, 2018\n", 20 | "~~~\n", 21 | "\n", 22 | "That is to say, before tomorrow's class. The assignments will then be randomly shuffled to different members of the class. 
At 2pm, at the beginning of the tutorial, you will be able to view and comment on someone else's assignment. \n", 23 | "\n", 24 | "Peer grading is new at the OII, but it has been shown to be an effective approach to both learning about other people's code as well as keeping people engaged. Because it is peer graded, we will not expect you to give a specific mark. You will instead be asked: \n", 25 | "\n", 26 | "1. Does the code run as expected: Y / N ?\n", 27 | " 1. If No, please note the lines. (view lines above using view > Show line numbers). \n", 28 | "3. Do you think the code makes sense?\n", 29 | "4. Does anything else stand out?" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": null, 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "# Group the files in this directory by type.\n", 39 | "# Get all the files, \n", 40 | "# create a dictionary with the file extension as the key \n", 41 | "# and the list of files in that directory as the value\n", 42 | "# Print the keys and the length of the list of files so it would \n", 43 | "# look like the following: (the numbers will be different, of course)\n", 44 | "#\n", 45 | "# File type counts: \n", 46 | "# .py: 3\n", 47 | "# .ipynb: 4\n", 48 | "# .txt: 10\n", 49 | "#\n", 50 | "# Have two file paths. First do this with your current working directory.\n", 51 | "# Second do this with your downloads folder. \n", 52 | "\n", 53 | "# Answer\n", 54 | "import os\n", 55 | "CWD = os.getcwd()\n", 56 | "DOWNLOADS = ...\n", 57 | "\n", 58 | "\n", 59 | "\n", 60 | "# Reviewer comments below here \n", 61 | "'''\n", 62 | "\n", 63 | "\n", 64 | "\n", 65 | "'''" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": null, 71 | "metadata": {}, 72 | "outputs": [], 73 | "source": [ 74 | "# Write an ascii art pattern to a file, then read the file and print its contents.\n", 75 | "# It should be no more than 80 characters wide and not more than 80 lines long. 
\n", 76 | "# The pattern should repeat somehow, but it can be creative. A simple one might be: \n", 77 | "# XOXO\n", 78 | "# OXOX\n", 79 | "# XOXO\n", 80 | "# OXOX\n", 81 | "# The pattern needs to come from code.\n", 82 | "\n", 83 | "\n", 84 | "# Answer\n", 85 | "\n", 86 | "\n", 87 | "\n", 88 | "# Reviewer comments below here \n", 89 | "'''\n", 90 | "\n", 91 | "\n", 92 | "\n", 93 | "'''" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": null, 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [ 102 | "# write a script here that prints out a file with the name \"PySDS_week1lab3question2.py\"\n", 103 | "# Just like we printed out the test.py file in class. \n", 104 | "#\n", 105 | "# When running PySDS_...py, it should do the following: \n", 106 | "# prints out the number of arguments, \n", 107 | "# prints out each argument on its own line\n", 108 | "# prints a message asking the user to re-run the program \n", 109 | "# with more arguments. \n", 110 | "# This message should be polite if you have one or more arguments\n", 111 | "# and grouchy if you do not append any arguments. \n", 112 | "\n", 113 | "\n", 114 | "# Answer\n", 115 | "\n", 116 | "...\n", 117 | "\n", 118 | "fileout = open(\"PySDS_week1lab3question2.py\", 'w')\n", 119 | "\n", 120 | "...\n", 121 | "\n", 122 | "# Reviewer comments below here \n", 123 | "'''\n", 124 | "\n", 125 | "\n", 126 | "\n", 127 | "'''" 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": null, 133 | "metadata": {}, 134 | "outputs": [], 135 | "source": [ 136 | "# Stitching together 4 files. \n", 137 | "# The zip file: muppet_episodes_by_season.zip contains four text files\n", 138 | "# muppet_show_season_xx.txt\n", 139 | "# You have to merge these files together to create a single csv file.\n", 140 | "# You have to unzip the files to a directory, read each one in\n", 141 | "# delete the first line and append the others to a single file. 
\n", 142 | "\n", 143 | "\n", 144 | "# Answer\n", 145 | "\n", 146 | "\n", 147 | "\n", 148 | "# Reviewer comments below here \n", 149 | "'''\n", 150 | "\n", 151 | "\n", 152 | "'''" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": null, 158 | "metadata": {}, 159 | "outputs": [], 160 | "source": [ 161 | "# Other reviewer comments down here: \n", 162 | "'''\n", 163 | "\n", 164 | "\n", 165 | "'''" 166 | ] 167 | } 168 | ], 169 | "metadata": { 170 | "kernelspec": { 171 | "display_name": "Python 3", 172 | "language": "python", 173 | "name": "python3" 174 | }, 175 | "language_info": { 176 | "codemirror_mode": { 177 | "name": "ipython", 178 | "version": 3 179 | }, 180 | "file_extension": ".py", 181 | "mimetype": "text/x-python", 182 | "name": "python", 183 | "nbconvert_exporter": "python", 184 | "pygments_lexer": "ipython3", 185 | "version": "3.6.5" 186 | } 187 | }, 188 | "nbformat": 4, 189 | "nbformat_minor": 2 190 | } 191 | -------------------------------------------------------------------------------- /Assignments/Week01/PySDS_EX_Week01_Day03_ModelAnswers.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "**PySDS Week 01 Day 02 - Control and flow statements**" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "# Assignment 1.3 Printing to a file and more loop practice. \n", 15 | "\n", 16 | "For this assignment you will be expected to write and read some text from a file. The sustained example is about reading a table of data and then writing that table back to a different file under some pre-determined conditions. \n", 17 | "\n", 18 | "~~~\n", 19 | "10am October 12th, 2018\n", 20 | "~~~\n", 21 | "\n", 22 | "That is to say, before tomorrow's class. The assignments will then be randomly shuffled to different members of the class. 
At 2pm, at the beginning of the tutorial, you will be able to view and comment on someone else's assignment. \n", 23 | "\n", 24 | "Peer grading is new at the OII, but it has been shown to be an effective approach to both learning about other people's code as well as keeping people engaged. Because it is peer graded, we will not expect you to give a specific mark. You will instead be asked: \n", 25 | "\n", 26 | "1. Does the code run as expected: Y / N ?\n", 27 | "    1. If No, please note the lines. (view lines above using view > Show line numbers). \n", 28 | "3. Do you think the code makes sense?\n", 29 | "4. Does anything else stand out?" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": null, 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "# Group the files in this directory by type.\n", 39 | "# Get all the files, \n", 40 | "# create a dictionary with the file extension as the key \n", 41 | "# and the list of files in that directory as the value\n", 42 | "# Print the keys and the length of the list of files so it would \n", 43 | "# look like the following: (the numbers will be different, of course)\n", 44 | "#\n", 45 | "# File type counts: \n", 46 | "# .py: 3\n", 47 | "# .ipynb: 4\n", 48 | "# .txt: 10\n", 49 | "#\n", 50 | "# Have two file paths. First do this with your current working directory.\n", 51 | "# Second do this with your downloads folder. \n", 52 | "\n", 53 | "# Answer\n", 54 | "import os, glob\n", 55 | "CWD = os.getcwd()\n", 56 | "DOWNLOADS = os.path.expanduser('~/Downloads') #### expanduser is needed because glob does not expand '~'; this path is obviously dependent on your filesystem\n", 57 | "\n", 58 | "allfiles= glob.glob(CWD+'/*') # get all files in directory\n", 59 | "filetypes = set() # use a set to store the unique file types\n", 60 | "for i in allfiles:\n", 61 | "    if '.' 
in i: # identify files with an extension and add it to the filetypes set\n", 62 | " filetypes.add(i.split('.')[-1])\n", 63 | "\n", 64 | "filedict = {}\n", 65 | "for i in filetypes: # for each of the file types\n", 66 | " filedict[i] = glob.glob(\"%s%s*%s\" % (CWD, os.sep, i)) # use glob to get all files with the extension\n", 67 | "\n", 68 | "for k, v in filedict.items(): # print the summary\n", 69 | " print('.%s: %d' %(k, len(v)))\n", 70 | " \n", 71 | " \n", 72 | "# and now the same for downloads - could have just put this in another for loop\n", 73 | "allfiles= glob.glob(DOWNLOADS+'/*')\n", 74 | "filetypes = set()\n", 75 | "for i in allfiles:\n", 76 | " if '.' in i:\n", 77 | " filetypes.add(i.split('.')[-1])\n", 78 | "\n", 79 | "filedict = {}\n", 80 | "for i in filetypes: # for each of the file types\n", 81 | " filedict[i] = glob.glob(\"%s%s*%s\" % (DOWNLOADS, os.sep, i)) # use glob to get all files with the extension\n", 82 | "\n", 83 | "for k, v in filedict.items(): # print the summary\n", 84 | " print('.%s: %d' %(k, len(v)))\n", 85 | " \n", 86 | "# Reviewer comments below here \n", 87 | "'''\n", 88 | "\n", 89 | "\n", 90 | "\n", 91 | "'''" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "metadata": {}, 98 | "outputs": [], 99 | "source": [ 100 | "# Write an ascii art pattern to a file, then read the file and print its contents.\n", 101 | "# It should be no more than 80 characters wide and not more than 80 lines long. \n", 102 | "# The pattern should repeat somehow, but it can be creative. 
A simple one might be: \n", 103 | "# XOXO\n", 104 | "# OXOX\n", 105 | "# XOXO\n", 106 | "# OXOX\n", 107 | "# The pattern needs to come from code.\n", 108 | "\n", 109 | "\n", 110 | "# Answer\n", 111 | "\n", 112 | "\n", 113 | "# creating a simple pattern\n", 114 | "pattern = ''\n", 115 | "for i in range(50):\n", 116 | " pattern += i*'XO' + '\\n'\n", 117 | "\n", 118 | "# writing to file\n", 119 | "filepath = ''\n", 120 | "\n", 121 | "fileout = open(filepath+\"artattack.txt\",'w')\n", 122 | "fileout.write(pattern)\n", 123 | "fileout.close()\n", 124 | "\n", 125 | "# reading from file and printing\n", 126 | "fileout = open(filepath+\"artattack.txt\",'r')\n", 127 | "print(fileout.read())\n", 128 | "\n", 129 | "# Reviewer comments below here \n", 130 | "'''\n", 131 | "\n", 132 | "\n", 133 | "\n", 134 | "'''" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": null, 140 | "metadata": {}, 141 | "outputs": [], 142 | "source": [ 143 | "# write a script here that prints out a file with the name \"PySDS_week1lab3question2.py\"\n", 144 | "# Just like we printed out the test.py file in class. \n", 145 | "#\n", 146 | "# When running PySDS_...py, it should do the following: \n", 147 | "# prints out the number of arguments, \n", 148 | "# prints out each argument on its own line\n", 149 | "# prints a message asking the user to re-run the program \n", 150 | "# with more arguments. \n", 151 | "# This message should be polite if you have one or more arguments\n", 152 | "# and grouchy if you do not append any arguments. 
\n", 153 | "\n", 154 | "\n", 155 | "# Answer\n", 156 | "\n", 157 | "# It's definitely easiest to write the program in a normal cell first, then copy/paste into the quotes, then write to a file\n", 158 | "\n", 159 | "# NB: use a raw string (r''') so escapes like \\n and \\' are written to the file literally rather than interpreted here\n", 160 | "filetext = r'''\n", 161 | "import sys\n", 162 | "\n", 163 | "args = sys.argv[1:] # get arguments from command line input\n", 164 | "\n", 165 | "running=True # use this to offer to rerun the program\n", 166 | "while running == True:\n", 167 | "    n_args = len(args) # number of args\n", 168 | "    if args: # checks if there are items in args list\n", 169 | "        print('Great! Thanks for appending %d additional argument(s)!\\nYour arguments were:\\n%s' %(n_args, '\\n'.join(args))) # return summary\n", 170 | "\n", 171 | "    else:\n", 172 | "        print('Grrr! You didn\\'t append any arguments!') # return summary\n", 173 | "\n", 174 | "    while True:\n", 175 | "        rerun = input('Would you like to rerun the program (y/n)?\\n').lower() # ask to run again\n", 176 | "        if rerun == 'y':\n", 177 | "            print('Ok, starting again.\\n')\n", 178 | "            args = input('Please type any arguments, separated by spaces:\\n').split() # ask for new arguments and puts them in the args variable\n", 179 | "            break # \n", 180 | "        elif rerun == 'n':\n", 181 | "            print('Ok, quitting')\n", 182 | "            running = False\n", 183 | "            break # quit the entire program\n", 184 | "        else:\n", 185 | "            print(\"I'm sorry, that was not valid input. 
Please enter 'y' or 'n'\") #handles incorrect input\n", 186 | "\n", 187 | "\n", 188 | "\n", 189 | "'''\n", 190 | "\n", 191 | "\n", 192 | "fileout = open(filepath+\"PySDS_week1lab3question2.py\", 'w') # opens python file to write to\n", 193 | "fileout.write(filetext) \n", 194 | "fileout.close()\n", 195 | "\n", 196 | "# There is admittedly some ambiguity in identifying whether we act on all the arguments including the .py file name, or whether it's just the extra appended arguments\n", 197 | "\n", 198 | "\n", 199 | "# Reviewer comments below here \n", 200 | "'''\n", 201 | "\n", 202 | "\n", 203 | "\n", 204 | "'''" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": null, 210 | "metadata": {}, 211 | "outputs": [], 212 | "source": [ 213 | "import sys\n", 214 | "\n", 215 | "args = sys.argv[1:] # get arguments from command line input\n", 216 | "\n", 217 | "running=True # use this to offer to rerun the program\n", 218 | "while running == True:\n", 219 | "    n_args = len(args) # number of args\n", 220 | "    if args: # checks if there are items in args list\n", 221 | "        print('Great! Thanks for appending %d additional argument(s)!\\nYour arguments were:\\n%s' %(n_args, '\\n'.join(args))) # return summary\n", 222 | "\n", 223 | "    else:\n", 224 | "        print('Grrr! You didn\\'t append any arguments!') # return summary\n", 225 | "\n", 226 | "    while True:\n", 227 | "        rerun = input('Would you like to rerun the program (y/n)?\\n').lower() # ask to run again\n", 228 | "        if rerun == 'y':\n", 229 | "            print('Ok, starting again.\\n')\n", 230 | "            args = input('Please type any arguments, separated by spaces:\\n').split() # ask for new arguments and puts them in the args variable\n", 231 | "            break # \n", 232 | "        elif rerun == 'n':\n", 233 | "            print('Ok, quitting')\n", 234 | "            running = False\n", 235 | "            break # quit the entire program\n", 236 | "        else:\n", 237 | "            print(\"I'm sorry, that was not valid input. 
Please enter 'y' or 'n'\") #handles incorrect input\n", 238 | "\n" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": null, 244 | "metadata": {}, 245 | "outputs": [], 246 | "source": [ 247 | "# Stitching together 4 files. \n", 248 | "# The zip file: muppet_episodes_by_season.zip contains four text files\n", 249 | "# muppet_show_season_xx.txt\n", 250 | "# You have to merge these files together to create a single csv file.\n", 251 | "# You have to unzip the files to a directory, read each one in\n", 252 | "# delete the first line and append the others to a single file. \n", 253 | "\n", 254 | "\n", 255 | "# Answer\n", 256 | "filepath='muppet_episodes_by_season.zip/' # edit filepath\n", 257 | "\n", 258 | "tot_lines=[]\n", 259 | "for i in range(1,5):\n", 260 | " filein = open(filepath+'muppet_show_season_%d.txt' %i, 'r') # read each file\n", 261 | " filelines = filein.readlines()[1:] # get the file lines, excluding line 0\n", 262 | " filein.close()\n", 263 | " tot_lines.extend(filelines) # store the file lines in larger list\n", 264 | " \n", 265 | " \n", 266 | "fileout = open(filepath+'merged_muppets.txt', 'w') # write to file \n", 267 | "fileout.writelines(tot_lines)\n", 268 | "fileout.close()\n", 269 | " \n", 270 | "fileout = open(filepath+'merged_muppets.txt', 'r') # read the file\n", 271 | "print(fileout.read())\n", 272 | "fileout.close()\n", 273 | "\n", 274 | "# Reviewer comments below here \n", 275 | "'''\n", 276 | "\n", 277 | "\n", 278 | "'''" 279 | ] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "execution_count": null, 284 | "metadata": {}, 285 | "outputs": [], 286 | "source": [ 287 | "# Other reviewer comments down here: \n", 288 | "'''\n", 289 | "\n", 290 | "'''" 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": null, 296 | "metadata": {}, 297 | "outputs": [], 298 | "source": [] 299 | } 300 | ], 301 | "metadata": { 302 | "kernelspec": { 303 | "display_name": "Python [default]", 304 | "language": 
"python", 305 | "name": "python3" 306 | }, 307 | "language_info": { 308 | "codemirror_mode": { 309 | "name": "ipython", 310 | "version": 3 311 | }, 312 | "file_extension": ".py", 313 | "mimetype": "text/x-python", 314 | "name": "python", 315 | "nbconvert_exporter": "python", 316 | "pygments_lexer": "ipython3", 317 | "version": "3.6.3" 318 | } 319 | }, 320 | "nbformat": 4, 321 | "nbformat_minor": 2 322 | } 323 | -------------------------------------------------------------------------------- /Assignments/Week01/PySDS_EX_Week01_Day04.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "**PySDS Week 01 Day 04 v.1.1 - Exercise - Friday Formative**" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "# Week 1. Friday Formative\n", 15 | "\n", 16 | "The Friday formative is a series of exercises designed to build up to a cohesive project. This week we are making a shopping cart. \n", 17 | "\n", 18 | "You have to make a program that can be run as a standalone python program. You can build it in JupyterLab but submit it as a standalone executable. We recommend porting it over to Spyder and working there. When it is run from the terminal / console it should do the following:\n", 19 | "\n", 20 | "1. Provide a welcome message. \n", 21 | "2. Present a list of goods for the user to choose from. You must present at least three different goods. They don't have to be muppets or food, but they should make sense as a type of object, such as music albums. They should all be priced differently. \n", 22 | "3. The user can choose which to buy, or select some value for exit.\n", 23 | "4. When one is chosen, the user is asked how many or what quantity of that item. \n", 24 | "5. Then the main menu appears again, but there is an additional option to check out. \n", 25 | "6. 
When the user checks out, it prints the total price and the basket of goods. \n", 26 | "7. If you are satisfied with this workflow and have time to spare, consider the following extensions: \n", 27 | "    - check the basket \n", 28 | "    - delete an item from the basket\n", 29 | "    - have the user enter their name and address. \n", 30 | "    - another feature you come up with.\n", 31 | "   Please note your extension in the code with a comment such as:\n", 32 | "    ~~~\n", 33 | "    # Extension \n", 34 | "    ~~~\n", 35 | "    \n", 36 | "## Purpose (is this really data science?)\n", 37 | "Why this particular assignment? This assignment will require you to wield loops, dictionaries, and maybe larger data structures in a way that emphasises your existing learning. That could be a dictionary, a list or a more generic object as a collection. I would recommend the object-oriented approach but strictly speaking you can get away without it. \n", 38 | "\n", 39 | "It also allows a little creativity, both in the item selection / wording and in the implementation of the features. It's nice to mark diverse implementations more than very rote exercises.\n", 40 | "\n", 41 | "Some things you'll want to consider about your data structure: \n", 42 | "- Are you storing the items as a list, or perhaps as a dictionary with the number of each item as the value?\n", 43 | "- Do you put the price of the goods in this basket class? \n", 44 | "    - Or do you make a dictionary with goods and price? \n", 45 | "- Do you have an addItem() method in your basket object or do you interact with basket.items directly as a list? \n", 46 | "- Once you have a basket, you will want to come up with the items and a way to print the items, their price and ask the user to select an item. \n", 47 | "- Will your printing of output be attractive and easy to read?\n", 48 | "\n", 49 | "## Rubric\n", 50 | "The mark will be out of 30. For this assignment: \n", 51 | "1. **Functionality [10pts]** - does it do what we specified above?\n", 52 | "2. 
**Usability [5pts]** - Is it clear? Are both the code and the program well formatted? \n", 53 | "3. **Robustness [5pts]** - What happens with bad input? How many changes can be made and will they all work as expected? \n", 54 | "4. **Parsimony [5pts]** - Subjectively speaking, is the code well organized, using coherent data structures? \n", 55 | "5. **Creativity [5pts]** - How are you extending the specified but very basic checkout? \n", 56 | "\n", 57 | "Assignments will be returned with comments in \n", 58 | "'''COMMENTS + MARK''' at the top of the file. " 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": null, 64 | "metadata": {}, 65 | "outputs": [], 66 | "source": [] 67 | } 68 | ], 69 | "metadata": { 70 | "kernelspec": { 71 | "display_name": "Python 3", 72 | "language": "python", 73 | "name": "python3" 74 | }, 75 | "language_info": { 76 | "codemirror_mode": { 77 | "name": "ipython", 78 | "version": 3 79 | }, 80 | "file_extension": ".py", 81 | "mimetype": "text/x-python", 82 | "name": "python", 83 | "nbconvert_exporter": "python", 84 | "pygments_lexer": "ipython3", 85 | "version": "3.6.5" 86 | } 87 | }, 88 | "nbformat": 4, 89 | "nbformat_minor": 2 90 | } 91 | -------------------------------------------------------------------------------- /Assignments/Week02/PySDS_Ex_Week02_Day01.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "**PySDS Week 02 Day 01 v.1 - Exercise - Managing DataFrames**" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "# Recall the small table from week 1. Here it is again. 
\n", 17 | "\n", 18 | "MuppetInput = '''\n", 19 | "Name\tGender\tSpecies\tFirst Appearance\n", 20 | "Fozzie\tMale\tBear\t1976\n", 21 | "Kermit\tMale\tFrog\t1955\n", 22 | "Piggy\tFemale\tPig\t1974\n", 23 | "Gonzo\tMale\tUnknown\t1970\n", 24 | "Rowlf\tMale\tDog\t1962\n", 25 | "Beaker\tMale\tMuppet\t1977\n", 26 | "Janice\tFemale\tMuppet\t1975\n", 27 | "Hilda\tFemale\tMuppet\t1976\n", 28 | "'''\n", 29 | "\n", 30 | "# Step 1. \n", 31 | "# This time please convert it into a DataFrame \n", 32 | "x = pd.read_csv(io.StringIO(MuppetInput), sep='\\t') # requires: import io, pandas as pd\n", 33 | "# Step 2. Please answer the following questions using the DataFrame:\n", 34 | "\n", 35 | "# A. What are the details for Fozzie Bear?\n", 36 | "# - Return these printed in a sentence of the form\n", 37 | "# - <Name> is a <Species> who first appeared in <First Appearance> \n", 38 | "\n", 39 | "# B. Who appeared before 1976 (i.e. the year the Muppet Show started)?\n", 40 | "# Return this as a DataFrame. \n", 41 | "\n", 42 | "# Step 3. Adding a row of data\n", 43 | "# Please add a row for \n", 44 | "# \"Rizzo\", a male rat who first appeared in 1980. \n", 45 | "\n", 46 | "# Answer \n", 47 | "\n", 48 | "\n", 49 | "\n", 50 | "# Reviewer's comments\n", 51 | "\n" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "# Creating a DataFrame of Muppet show episodes by season.\n", 61 | "# * Note, there is a fifth season of the muppets which was not included.\n", 62 | "# This will be addressed next week.\n", 63 | "#\n", 64 | "# Step 1. \n", 65 | "#\n", 66 | "# Using the text from the previous week contained within \n", 67 | "# muppet_episodes_by_season.zip\n", 68 | "# unzip the files (that can be done manually), read each one of them\n", 69 | "# in as a DataFrame, then merge the dataframes to have a single \n", 70 | "# dataframe. \n", 71 | "\n", 72 | "# Step 2. \n", 73 | "#\n", 74 | "# Did any of the guest stars appear more than once? \n", 75 | "# Did every season have the same number of episodes? 
\n", 77 | "\n", 78 | "\n", 79 | "# Answer \n", 80 | "\n", 81 | "\n", 82 | "\n", 83 | "# Reviewer's comments\n", 84 | "\n" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": null, 90 | "metadata": {}, 91 | "outputs": [], 92 | "source": [ 93 | "# Working with a column of data \n", 94 | "\n", 95 | "# Using the table of data above, split the guest star's name \n", 96 | "# so that the first name is in its own column. \n", 97 | "# Find and print the name of the guest star with the longest first name. \n", 98 | "# Print it in a sentence that also includes the guest star's episode number.\n", 99 | "\n", 100 | "# Bonus! Now recall that we were looking for a first name, not a group name.\n", 101 | "# So in your answer, split the text. \n", 102 | "# If the entry has more or fewer than two parts, discard it.\n", 103 | "# Does this make a difference? \n", 104 | "\n", 105 | "# Answer\n", 106 | "\n", 107 | "\n", 108 | "# Reviewer's comments\n", 109 | "\n" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": null, 115 | "metadata": {}, 116 | "outputs": [], 117 | "source": [ 118 | "# Plotting the data. \n", 119 | "# Create a histogram of the distribution of first names \n", 120 | "# (from the previous question). \n", 121 | "# Hint: series.plot(kind=\"hist\")\n", 122 | "# What is the average name length? \n", 123 | "# Is it heavy tailed (answer this to the best of your knowledge)? 
\n", 124 | "\n", 125 | "\n", 126 | "# Answer \n", 127 | "\n", 128 | "\n", 129 | "\n", 130 | "# Reviewer's comments\n", 131 | "\n" 132 | ] 133 | } 134 | ], 135 | "metadata": { 136 | "kernelspec": { 137 | "display_name": "Python 3", 138 | "language": "python", 139 | "name": "python3" 140 | }, 141 | "language_info": { 142 | "codemirror_mode": { 143 | "name": "ipython", 144 | "version": 3 145 | }, 146 | "file_extension": ".py", 147 | "mimetype": "text/x-python", 148 | "name": "python", 149 | "nbconvert_exporter": "python", 150 | "pygments_lexer": "ipython3", 151 | "version": "3.6.5" 152 | } 153 | }, 154 | "nbformat": 4, 155 | "nbformat_minor": 2 156 | } 157 | -------------------------------------------------------------------------------- /Assignments/Week02/PySDS_Ex_Week02_Day02.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "**PySDS Week 02 Day 02 v.1 - Exercise - File Types and Text Processing I**" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Today we will be doing some example regular expressions (yay), and some dataframe manipulation. Recall that we used the Canada wikipedia page as an example. Below is some code that you can use to pull in a Wikipedia page as data. Today, you will be asked to read in several pages, compare them on a number of features in a dataframe and report on what you found. Below is the code that you can use to download a Wikipedia page. 
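As an aside before that download code: the feature-counting questions that follow lean on regular expressions, so here is a minimal, self-contained sketch of the kind of patterns involved. The sample wikitext string and the simple `[[...]]` / `[http://...]` patterns are illustrative assumptions; real Wikipedia markup has more edge cases.

```python
import re

# A made-up snippet of wikitext for illustration only.
sample = "Canada is in [[North America]]. See [[Ottawa]] and [http://example.com an external link]."

# Internal wikilinks look like [[Target]] or [[Target|label]] in wikitext.
internal = re.findall(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]", sample)

# External links start with a bare URL inside single brackets.
external = re.findall(r"\[(https?://[^\s\]]+)", sample)

print(internal)  # ['North America', 'Ottawa']
print(external)  # ['http://example.com']
```

Counting links for the dataframe is then just `len(internal)` and `len(external)` per page.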
" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": null, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "import urllib.parse, urllib.request\n", 24 | "import bs4 \n", 25 | "\n", 26 | "# You can set the page argument to any string that has a Wikipedia page.\n", 27 | "\n", 28 | "def getWikiPage(page=\"United Kingdom\"): \n", 29 | " '''Returns the XML found using special export of a Wikipedia page.'''\n", 30 | " \n", 31 | " # Here we use urllib.parse.quote to turn spaces and special characters into\n", 32 | " # the characters needed for an html string. So for example spaces become %20\n", 33 | "\n", 34 | " URL = \"http://en.wikipedia.org/wiki/Special:Export/%s\" % urllib.parse.quote(page)\n", 35 | "\n", 36 | " print(URL,\"\\n\")\n", 37 | "\n", 38 | " req = urllib.request.Request( URL, headers={'User-Agent': 'OII SDS class 2018.1/Hogan'})\n", 39 | " infile = urllib.request.urlopen(req)\n", 40 | "\n", 41 | " return infile.read()\n", 42 | "\n", 43 | "# Testing\n", 44 | "data = getWikiPage()\n", 45 | "soup = bs4.BeautifulSoup(data.decode('utf8'), \"lxml\")\n", 46 | "print(soup.mediawiki.page.revision.id)\n" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "metadata": {}, 53 | "outputs": [], 54 | "source": [ 55 | "# Now, select 10 countries and place them in a list. \n", 56 | "# These will be rows in a dataframe. \n", 57 | "# For each of the ten countries, \n", 58 | "# find the following features from parsing their wikipedia page: \n", 59 | "# 1. The number of internal wikilinks. \n", 60 | "# 2. The number of external wikilinks. \n", 61 | "# 3. The length of the page (in characters)\n", 62 | "# 4. The population of the country. \n", 63 | "# - This last one will be very tricky. It's okay if you cannot get the \n", 64 | "# regex working, or if you have to build multiple regexes. \n", 65 | "# Please simply document this. 
\n", 66 | "\n", 67 | "# Print the following: \n", 68 | "# The rank order of each of the columns. \n", 69 | "# For example, for wikilinks you might print \n", 70 | "# (note numbers below are not accurate)\n", 71 | "\n", 72 | "# Table 1. Number of wikilinks \n", 73 | "# Canada 46\n", 74 | "# Germany 45\n", 75 | "# France 24\n", 76 | "# Netherlands 12\n", 77 | "# ...\n", 78 | "\n", 79 | "# answer below here\n", 80 | "\n", 81 | "\n", 82 | "\n", 83 | "\n", 84 | "# Reviewer's comments\n", 85 | "\n", 86 | "\n", 87 | "\n" 88 | ] 89 | } 90 | ], 91 | "metadata": { 92 | "kernelspec": { 93 | "display_name": "Python 3", 94 | "language": "python", 95 | "name": "python3" 96 | }, 97 | "language_info": { 98 | "codemirror_mode": { 99 | "name": "ipython", 100 | "version": 3 101 | }, 102 | "file_extension": ".py", 103 | "mimetype": "text/x-python", 104 | "name": "python", 105 | "nbconvert_exporter": "python", 106 | "pygments_lexer": "ipython3", 107 | "version": "3.6.5" 108 | } 109 | }, 110 | "nbformat": 4, 111 | "nbformat_minor": 2 112 | } 113 | -------------------------------------------------------------------------------- /Assignments/Week02/PySDS_Ex_Week02_Day03.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "**PySDS Week 02 Day 03 v.1 - Exercise - Dates and more DataFrames**" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Today we will continue to use the PySDS_PolCandidates.csv table and answer some more involved questions as DataFrame practice. \n", 15 | "\n", 16 | "First, I would like you to begin with a few practice exercises on parsing date-times. Then, using only filters, grouping and other features of DataFrames you should be able to accomplish the questions below. 
" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 9, 22 | "metadata": {}, 23 | "outputs": [ 24 | { 25 | "name": "stdout", 26 | "output_type": "stream", 27 | "text": [ 28 | "1986-12-10 03:04:50\n" 29 | ] 30 | } 31 | ], 32 | "source": [ 33 | "# Date Parsing exercises: \n", 34 | "from datetime import datetime \n", 35 | "\n", 36 | "time_now = datetime.now()\n", 37 | "time_1 = \"June 20, 1985 12:35pm\"\n", 38 | "time_2 = \"10/10/10 10:10:10 +1000\" # hint, the +1000 means UTC +10 hours \n", 39 | "time_3 = \"534567890\" #UTC time; hint: datetime.utcfromtimestamp(xx)\n", 40 | "\n", 41 | "# Question 1. Using now(), which I realise will be a slightly \n", 42 | "# different time for everyone, report the time elapsed between \n", 43 | "# times 1,2,3 and now()\n", 44 | "\n", 45 | "# Question 2. For each of the times above, what day of the week was it? \n", 46 | "\n", 47 | "\n", 48 | "# Answer\n", 49 | "\n", 50 | "\n", 51 | "\n", 52 | "# Reviewer comments \n", 53 | "\n", 54 | "\n" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 36, 60 | "metadata": {}, 61 | "outputs": [ 62 | { 63 | "data": { 64 | "text/html": [ 65 | "<div>
\n", 66 | "\n", 79 | "\n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | "
LabourConservativeAll Parties
NoneNaNNaNNaN
Only TwitterNaNNaNNaN
Only FacebookNaNNaNNaN
Only WebpageNaNNaNNaN
Twitter and FacebookNaNNaNNaN
Facebook and WebpageNaNNaNNaN
Twitter and WebpageNaNNaNNaN
Twitter, Facebook and WebpageNaNNaNNaN
\n", 139 | "
" 140 | ], 141 | "text/plain": [ 142 | " Labour Conservative All Parties\n", 143 | "None NaN NaN NaN\n", 144 | "Only Twitter NaN NaN NaN\n", 145 | "Only Facebook NaN NaN NaN\n", 146 | "Only Webpage NaN NaN NaN\n", 147 | "Twitter and Facebook NaN NaN NaN\n", 148 | "Facebook and Webpage NaN NaN NaN\n", 149 | "Twitter and Webpage NaN NaN NaN\n", 150 | "Twitter, Facebook and Webpage NaN NaN NaN" 151 | ] 152 | }, 153 | "metadata": {}, 154 | "output_type": "display_data" 155 | } 156 | ], 157 | "source": [ 158 | "# Extended exercise part 1. \n", 159 | "\n", 160 | "# Using the data \"PySDS_PolCandidates.csv\" fill in the DataFrame below \n", 161 | "# with data. Also, try to ensure that it is formatted nicely. \n", 162 | "import pandas as pd \n", 163 | "\n", 164 | "media_combo_df = pd.DataFrame(columns=[\"Labour\",\"Conservative\",\"All Parties\"],\n", 165 | " index=[\"None\",\n", 166 | " \"Only Twitter\",\n", 167 | " \"Only Facebook\",\n", 168 | " \"Only Webpage\",\n", 169 | " \"Twitter and Facebook\",\n", 170 | " \"Facebook and Webpage\",\n", 171 | " \"Twitter and Webpage\",\n", 172 | " \"Twitter, Facebook and Webpage\"\n", 173 | " ])\n", 174 | "display(media_combo_df)\n", 175 | "\n", 176 | "# Each cell should be the count of users of the total. So, if it is the \n", 177 | "# [None, Labour] cell it would be the number of Labour candidates\n", 178 | "# who did not have either Twitter, Web or Facebook. \n", 179 | "\n", 180 | "# Here are some hints: If you ensure that the empty columns in the \n", 181 | "# PolCandidates.csv file are null, you can then use boolean logic to \n", 182 | "# select your variables. 
For example, \n", 183 | "# x = df['have_twitter'].notnull()\n", 184 | "# y = df['have_facebook'].notnull() \n", 185 | "# then \n", 186 | "# have_both = df[x & y] \n", 187 | "# will get you the rows of the people who have both and \n", 188 | "# have_both['party'].value_counts() \n", 189 | "# will get you the count, by party, of the people \n", 190 | "# who have both twitter and facebook. " 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": 32, 196 | "metadata": {}, 197 | "outputs": [ 198 | { 199 | "data": { 200 | "text/html": [ 201 | "
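The boolean-mask hint above can be seen end to end on a tiny made-up table. The column values and party labels here are illustrative stand-ins for the real PySDS_PolCandidates.csv data.

```python
import pandas as pd

# Hypothetical mini version of the candidates table, for illustration only.
df = pd.DataFrame({
    "party": ["Labour Party", "Labour Party", "Conservative Party", "Conservative Party"],
    "have_twitter": ["@a", None, "@c", None],
    "have_facebook": ["fb/a", "fb/b", None, None],
})

x = df["have_twitter"].notnull()   # boolean Series: row has a Twitter handle
y = df["have_facebook"].notnull()  # boolean Series: row has a Facebook page

have_both = df[x & y]              # rows with both accounts
counts = have_both["party"].value_counts()
print(counts)                      # Labour Party: 1

# Negating masks gives the "None" combination.
neither = df[~x & ~y]
print(len(neither))                # 1
```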
\n", 202 | "\n", 215 | "\n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | "
var1var2
00.00.0%
10.110.0%
20.220.0%
30.330.0%
40.440.0%
50.550.0%
60.660.0%
70.770.0%
80.880.0%
90.990.0%
\n", 276 | "
" 277 | ], 278 | "text/plain": [ 279 | " var1 var2\n", 280 | "0 0.0 0.0%\n", 281 | "1 0.1 10.0%\n", 282 | "2 0.2 20.0%\n", 283 | "3 0.3 30.0%\n", 284 | "4 0.4 40.0%\n", 285 | "5 0.5 50.0%\n", 286 | "6 0.6 60.0%\n", 287 | "7 0.7 70.0%\n", 288 | "8 0.8 80.0%\n", 289 | "9 0.9 90.0%" 290 | ] 291 | }, 292 | "execution_count": 32, 293 | "metadata": {}, 294 | "output_type": "execute_result" 295 | } 296 | ], 297 | "source": [ 298 | "# Extended exercise part 2. \n", 299 | "\n", 300 | "# The raw counts in the table are useful, \n", 301 | "# but showing the relative percentage would be even more useful. \n", 302 | "# Create a new table that is formatted like the above, however, in \n", 303 | "# this table show the percent of the column total. \n", 304 | "# So for Labour that would be the percentage of Labour candidates\n", 305 | "# who had 'only webpage', not the percentage of all candidates who\n", 306 | "# are Labour and only have a webpage. \n", 307 | "\n", 308 | "# Hint to display a DataFrame as a percentage, try this: \n", 309 | "df = pd.DataFrame(pd.Series(range(10))/10,columns=[\"var1\"])\n", 310 | "df['var2'] = df['var1'].map(lambda n: '{:,.1%}'.format(n))\n", 311 | "df\n", 312 | "\n", 313 | "# Answer below here\n", 314 | "\n", 315 | "\n", 316 | "\n", 317 | "# Reviewers comments below here\n", 318 | "\n", 319 | "\n" 320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": 33, 325 | "metadata": {}, 326 | "outputs": [ 327 | { 328 | "name": "stdout", 329 | "output_type": "stream", 330 | "text": [ 331 | "4.5\n" 332 | ] 333 | } 334 | ], 335 | "source": [ 336 | "# Extended exercise part 3. \n", 337 | "\n", 338 | "# Sum each of the columns in the previous exercise. \n", 339 | "# Do each of the columns sum to 100%? They should. \n", 340 | "# Use this exercise as a check that \n", 341 | "# each column sums to the expected total. \n", 342 | "\n", 343 | "# hint. 
\n", 344 | "print(df[\"var1\"].sum())\n", 345 | "\n", 346 | "# Answer below here\n", 347 | "\n", 348 | "\n", 349 | "\n", 350 | "\n", 351 | "# Reviewers comments below here \n", 352 | "\n", 353 | "\n" 354 | ] 355 | } 356 | ], 357 | "metadata": { 358 | "kernelspec": { 359 | "display_name": "Python 3", 360 | "language": "python", 361 | "name": "python3" 362 | }, 363 | "language_info": { 364 | "codemirror_mode": { 365 | "name": "ipython", 366 | "version": 3 367 | }, 368 | "file_extension": ".py", 369 | "mimetype": "text/x-python", 370 | "name": "python", 371 | "nbconvert_exporter": "python", 372 | "pygments_lexer": "ipython3", 373 | "version": "3.7.0" 374 | } 375 | }, 376 | "nbformat": 4, 377 | "nbformat_minor": 2 378 | } 379 | -------------------------------------------------------------------------------- /Assignments/Week02/PySDS_Ex_Week02_Day03_ModelAnswers.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "**PySDS Week 02 Day 03 v.1 - Exercise - Dates and more DataFrames**" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Today we will continue to use the PySDS_PolCandidates.csv table and answer some more involved questions as DataFrame practice. \n", 15 | "\n", 16 | "This is not directly related to much of the material from today. As a consequence, I would like you to begin with a few practice exercises on parsing date-times. Then, using only filters, grouping and other features of DataFrames you should be able to accomplish the questions below. 
" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 171, 22 | "metadata": {}, 23 | "outputs": [ 24 | { 25 | "name": "stdout", 26 | "output_type": "stream", 27 | "text": [ 28 | "12173 days, 9:19:39.011223\n", 29 | "2930 days, 20:44:29.011223\n", 30 | "11635 days, 17:49:49.011223\n", 31 | "time1 Thursday\n", 32 | "time2 Sunday\n", 33 | "time3 Wednesday\n" 34 | ] 35 | } 36 | ], 37 | "source": [ 38 | "# Date Parsing exercises: \n", 39 | "from datetime import datetime\n", 40 | "from datetime import timezone\n", 41 | "\n", 42 | "time_now = datetime.now(timezone.utc) # specify the timezone\n", 43 | "time_1 = \"June 20, 1985 12:35pm\"\n", 44 | "time_2 = \"10/10/10 10:10:10 +1000\" # hint, the +1000 means UTC +10 hours \n", 45 | "time_3 = \"534567890\" #UTC time; hint: datetime.utcfromtimestamp(xx)\n", 46 | "\n", 47 | "# Question 1. Using now(), which I realise will be a slightly \n", 48 | "# different time for everyone. report the time elapsed between \n", 49 | "# times 1,2,3\n", 50 | "\n", 51 | "# Question 2. For each of the times above, what day of the week was it? 
\n", 52 | "\n", 53 | "\n", 54 | "# Answer\n", 55 | "time_1 = datetime.strptime(time_1, '%B %d, %Y %I:%M%p').astimezone() # %I for 12-hour clock; forces timezone aware\n", 56 | "time_2 = datetime.strptime(time_2, '%d/%m/%y %H:%M:%S %z')\n", 57 | "time_3 = datetime.utcfromtimestamp(int(time_3)).astimezone()\n", 58 | "\n", 59 | "\n", 60 | "print(time_now-time_1)\n", 61 | "print(time_now-time_2)\n", 62 | "print(time_now-time_3)\n", 63 | "\n", 64 | "weekdaymap = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'] # 0-6 mapped to day names\n", 65 | "\n", 66 | "print('time1', weekdaymap[time_1.weekday()]) # weekday() returns 0-6, used as an index into weekdaymap\n", 67 | "print('time2', weekdaymap[time_2.weekday()])\n", 68 | "print('time3', weekdaymap[time_3.weekday()])\n", 69 | "\n", 70 | "# Reviewer comments \n", 71 | "\n", 72 | "\n" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 173, 78 | "metadata": {}, 79 | "outputs": [ 80 | { 81 | "data": { 82 | "text/html": [ 83 | "<div>
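One subtlety behind the `.astimezone()` calls in the model answer above: Python refuses to subtract a naive datetime from a timezone-aware one, which is why every timestamp is forced to be aware before the elapsed times are computed. A minimal demonstration with made-up times:

```python
from datetime import datetime, timezone

aware = datetime(2010, 10, 10, 10, 10, 10, tzinfo=timezone.utc)
naive = datetime(2010, 10, 10, 0, 0, 0)

# Mixing aware and naive datetimes in arithmetic raises TypeError.
try:
    aware - naive
except TypeError as e:
    print("Cannot mix aware and naive:", e)

# Making both aware fixes it.
also_aware = naive.replace(tzinfo=timezone.utc)
delta = aware - also_aware
print(delta)  # 10:10:10
```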
\n", 84 | "\n", 97 | "\n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | "
Labour PartyConservative PartyTotal
None29.00.0549
Only Twitter211.02.0672
Only Facebook2.00.065
Only Webpage9.073.0395
Twitter and Facebook66.01.0313
Facebook and Webpage4.028.0113
Twitter and Webpage180.0318.01064
Twitter, Facebook and Webpage88.0209.0800
\n", 157 | "
" 158 | ], 159 | "text/plain": [ 160 | " Labour Party Conservative Party Total\n", 161 | "None 29.0 0.0 549\n", 162 | "Only Twitter 211.0 2.0 672\n", 163 | "Only Facebook 2.0 0.0 65\n", 164 | "Only Webpage 9.0 73.0 395\n", 165 | "Twitter and Facebook 66.0 1.0 313\n", 166 | "Facebook and Webpage 4.0 28.0 113\n", 167 | "Twitter and Webpage 180.0 318.0 1064\n", 168 | "Twitter, Facebook and Webpage 88.0 209.0 800" 169 | ] 170 | }, 171 | "metadata": {}, 172 | "output_type": "display_data" 173 | } 174 | ], 175 | "source": [ 176 | "# Extended exercise part 1. \n", 177 | "\n", 178 | "# Using the data \"PySDS_PolCandidates.csv\" fill in the DataFrame below \n", 179 | "# with data. Also, try to ensure that it is formatted nicely. \n", 180 | "import pandas as pd \n", 181 | "\n", 182 | "media_combo_df = pd.DataFrame(columns=[\"Labour Party\",\"Conservative Party\",\"Total\"],\n", 183 | " index=[\"None\",\n", 184 | " \"Only Twitter\",\n", 185 | " \"Only Facebook\",\n", 186 | " \"Only Webpage\",\n", 187 | " \"Twitter and Facebook\",\n", 188 | " \"Facebook and Webpage\",\n", 189 | " \"Twitter and Webpage\",\n", 190 | " \"Twitter, Facebook and Webpage\"\n", 191 | " ])\n", 192 | "\n", 193 | "# Each cell should be the count of users of the total. So, if it is the \n", 194 | "# [None, Labour] cell it would be the number of Labour candidates\n", 195 | "# who did not have either Twitter, Web or Facebook. \n", 196 | "\n", 197 | "# Here are some hints: If you ensure that the empty columns in the \n", 198 | "# PolCandidates.csv file are null, you can then use boolean logic to \n", 199 | "# select your variables. 
For example, \n", 200 | "# x = df['have_twitter'].notnull()\n", 201 | "# y = df['have_facebook'].notnull() \n", 202 | "# then \n", 203 | "# have_both = df[x & y] \n", 204 | "# will get you the rows of the people who have both and \n", 205 | "# have_both['party'].value_counts() \n", 206 | "# will get you the count, by party, of the people \n", 207 | "# who have both twitter and facebook. \n", 208 | "\n", 209 | "filepath = ''\n", 210 | "\n", 211 | "df = pd.read_csv(filepath+'PySDS_PolCandidates.csv')\n", 212 | "\n", 213 | "# set up expressions for users with each service\n", 214 | "x = df['twitter_username'].notnull()\n", 215 | "y = df['facebook_page_url'].notnull()\n", 216 | "z = df['party_ppc_page_url'].notnull()\n", 217 | "lc = df['party'].isin(['Labour Party', 'Conservative Party']) # only those in lab/con parties\n", 218 | "\n", 219 | "# go through different boolean configurations of having the services\n", 220 | "bools = {'None':~x&~y&~z, 'Only Twitter':x&~y&~z, 'Only Facebook':~x&y&~z, 'Only Webpage':~x&~y&z,'Twitter and Facebook':x&y&~z,\n", 221 | " 'Facebook and Webpage':~x&y&z, 'Twitter and Webpage':x&~y&z, 'Twitter, Facebook and Webpage':x&y&z}\n", 222 | "\n", 223 | "# iterate through index names and different configurations\n", 224 | "for k, v in bools.items():\n", 225 | " media_combo_df.loc[k] = df[lc&v]['party'].value_counts() # value counts for those in lab & con and that satisfy the combination of tw,fb,web\n", 226 | " media_combo_df.loc[k, 'Total'] = df[v]['party'].value_counts().sum() # sum of the total value counts for all parties\n", 227 | "\n", 228 | "media_combo_df = media_combo_df.fillna(0) # fill the remaining nas with 0\n", 229 | "display(media_combo_df)\n" 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": 174, 235 | "metadata": {}, 236 | "outputs": [ 237 | { 238 | "data": { 239 | "text/html": [ 240 | "
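As a cross-check on the dictionary-of-masks approach above, pandas can often build this kind of combination table in one call with `pd.crosstab`. This is a sketch on a tiny made-up table; the two-service `combo` labelling is illustrative, not the full eight-way Twitter/Facebook/Webpage breakdown.

```python
import pandas as pd

# Hypothetical mini candidates table, for illustration only.
df = pd.DataFrame({
    "party": ["Labour Party", "Labour Party", "Conservative Party"],
    "twitter_username": ["@a", None, None],
    "facebook_page_url": [None, "fb/b", None],
})

def combo(row):
    """Label a row with its combination of services."""
    parts = []
    if pd.notnull(row["twitter_username"]):
        parts.append("Twitter")
    if pd.notnull(row["facebook_page_url"]):
        parts.append("Facebook")
    return " and ".join(parts) if parts else "None"

df["combo"] = df.apply(combo, axis=1)
table = pd.crosstab(df["combo"], df["party"])
print(table)
```

Each cell is a count of rows, so the result matches what the mask-per-combination loop produces.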
\n", 241 | "\n", 254 | "\n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | "
Labour PartyConservative PartyTotal
None4.9%0.0%13.8%
Only Twitter35.8%0.3%16.9%
Only Facebook0.3%0.0%1.6%
Only Webpage1.5%11.6%9.9%
Twitter and Facebook11.2%0.2%7.9%
Facebook and Webpage0.7%4.4%2.8%
Twitter and Webpage30.6%50.4%26.8%
Twitter, Facebook and Webpage14.9%33.1%20.1%
\n", 314 | "
" 315 | ], 316 | "text/plain": [ 317 | " Labour Party Conservative Party Total\n", 318 | "None 4.9% 0.0% 13.8%\n", 319 | "Only Twitter 35.8% 0.3% 16.9%\n", 320 | "Only Facebook 0.3% 0.0% 1.6%\n", 321 | "Only Webpage 1.5% 11.6% 9.9%\n", 322 | "Twitter and Facebook 11.2% 0.2% 7.9%\n", 323 | "Facebook and Webpage 0.7% 4.4% 2.8%\n", 324 | "Twitter and Webpage 30.6% 50.4% 26.8%\n", 325 | "Twitter, Facebook and Webpage 14.9% 33.1% 20.1%" 326 | ] 327 | }, 328 | "metadata": {}, 329 | "output_type": "display_data" 330 | } 331 | ], 332 | "source": [ 333 | "# Extended exercise part 2. \n", 334 | "\n", 335 | "# The raw counts in the table are useful, \n", 336 | "# but showing the relative percentage would be even more useful. \n", 337 | "# Create a new table that is formatted like the above, however, in \n", 338 | "# this table show the percent of the column total. \n", 339 | "# So for Labour that would be the percentage of Labour candidates\n", 340 | "# who had 'only webpage', not the percentage of all candidates who\n", 341 | "# are Labour and only have a webpage. 
\n", 342 | "\n", 343 | "# Hint to display a DataFrame as a percentage, try this: \n", 344 | "# df = pd.DataFrame(pd.Series(range(10))/10,columns=[\"var1\"])\n", 345 | "# df['var2'] = df['var1'].map(lambda n: '{:,.1%}'.format(n))\n", 346 | "\n", 347 | "# display(df)\n", 348 | "# Answer below here\n", 349 | "\n", 350 | "# convert to decimal\n", 351 | "perc_df = media_combo_df/media_combo_df.sum() \n", 352 | "\n", 353 | "# convert to percentage\n", 354 | "for i in perc_df:\n", 355 | " perc_df[i] = perc_df[i].map(lambda n: '{:,.1%}'.format(n))\n", 356 | "\n", 357 | "display(perc_df)\n", 358 | "# Reviewers comments below here\n", 359 | "\n", 360 | "\n" 361 | ] 362 | }, 363 | { 364 | "cell_type": "code", 365 | "execution_count": 176, 366 | "metadata": {}, 367 | "outputs": [ 368 | { 369 | "name": "stdout", 370 | "output_type": "stream", 371 | "text": [ 372 | "Labour Party 99.89999999999999\n", 373 | "Conservative Party 100.0\n", 374 | "Total 99.80000000000001\n" 375 | ] 376 | } 377 | ], 378 | "source": [ 379 | "# Extended exercise part 3. \n", 380 | "\n", 381 | "# Sum each of the columns in the previous exercise. \n", 382 | "# Do each of the columns sum to 100%? They should. \n", 383 | "# Use this exercise as a check that \n", 384 | "# each column sums to the expected total. \n", 385 | "\n", 386 | "# hint. 
\n", 387 | "# print(df[\"var1\"].sum())\n", 388 | "\n", 389 | "# Answer below here\n", 390 | "\n", 391 | "for i in perc_df.columns:\n", 392 | " print(i, perc_df[i].apply(lambda x: float(x[:-1])).sum()) # remove '%' and convert string to float, then sum column\n", 393 | "\n", 394 | "# discrepancy from 100% comes from rounding to 1 decimal place in previous cell\n", 395 | "\n", 396 | "# Reviewers comments below here \n", 397 | "\n", 398 | "\n" 399 | ] 400 | }, 401 | { 402 | "cell_type": "code", 403 | "execution_count": null, 404 | "metadata": {}, 405 | "outputs": [], 406 | "source": [] 407 | } 408 | ], 409 | "metadata": { 410 | "kernelspec": { 411 | "display_name": "Python 3", 412 | "language": "python", 413 | "name": "python3" 414 | }, 415 | "language_info": { 416 | "codemirror_mode": { 417 | "name": "ipython", 418 | "version": 3 419 | }, 420 | "file_extension": ".py", 421 | "mimetype": "text/x-python", 422 | "name": "python", 423 | "nbconvert_exporter": "python", 424 | "pygments_lexer": "ipython3", 425 | "version": "3.7.0" 426 | } 427 | }, 428 | "nbformat": 4, 429 | "nbformat_minor": 2 430 | } 431 | -------------------------------------------------------------------------------- /Assignments/Week02/PySDS_Ex_Week02_Day04.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "**PySDS Week 02 Day 04 v.1 - Friday Formative - Merging DataFrames**" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "# Exercise 1. Merging and reporting on data\n", 15 | "\n", 16 | "Recall that we have a table called PySDS_PolCandidates.csv. This table has a list of candidates with Twitter accounts. We also now have a database of tweets captured on the 5th and 6th of May, 2015 by British Politicians. The expanded dataset includes the set of tweets as replies to these politicians, but that is not being used here." 
17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 28, 22 | "metadata": {}, 23 | "outputs": [ 24 | { 25 | "name": "stdout", 26 | "output_type": "stream", 27 | "text": [ 28 | "Before filtering there were Ellipsis Tweets and Ellipsis accounts. \n", 29 | "After filtering there were Ellipsis Tweets and Ellipsis accounts.\n" 30 | ] 31 | } 32 | ], 33 | "source": [ 34 | "# Question 1.1: There are accounts in the roottweets database that are \n", 35 | "# not in the PolCandidates list and vice versa. \n", 36 | "# Filter the roottweets table / dataframe down to only the candidates \n", 37 | "# in the PolCandidates table. Then enter the values in the sentence below. \n", 38 | "\n", 39 | "\n", 40 | "######################################\n", 41 | "# Answer Below Here \n", 42 | "\n", 43 | "before_tweets = ...\n", 44 | "before_accounts = ... \n", 45 | "after_tweets = ...\n", 46 | "after_accounts = ...\n", 47 | "\n", 48 | "print( \"Before filtering there were %s Tweets and %s accounts.\" % (before_tweets, before_accounts),\n", 49 | " \"\\nAfter filtering there were %s Tweets and %s accounts.\" % (after_tweets,after_accounts)\n", 50 | " )" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "#####################################\n", 60 | "# Question 1.1\n", 61 | "# TA comments below here \n", 62 | "\n", 63 | "# ___ / 5. \n", 64 | "# Comments:\n", 65 | "'''\n", 66 | "\n", 67 | "'''\n" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": 29, 73 | "metadata": {}, 74 | "outputs": [ 75 | { 76 | "name": "stdout", 77 | "output_type": "stream", 78 | "text": [ 79 | "The Ellipsis candidates from the Conservative party sent Ellipsis root tweets. The top tweeter was Ellipsis with Ellipsis tweets\n", 80 | "The Ellipsis candidates from the Labour party sent Ellipsis root tweets. 
The top tweeter was Ellipsis with Ellipsis tweets\n" 81 | ] 82 | } 83 | ], 84 | "source": [ 85 | "# Question 1.2: Using the newly filtered table, merge in the candidates' political \n", 86 | "# party from PolCandidates. Use this to enter values in the sentence below. \n", 87 | "\n", 88 | "######################################\n", 89 | "# Answer Below Here \n", 90 | "\n", 91 | "conservative_candidates_count = ...\n", 92 | "conservative_tweets_count = ...\n", 93 | "top_con_tweeter = ...\n", 94 | "top_con_tweet_count = ... \n", 95 | "\n", 96 | "labour_candidates_count = ...\n", 97 | "labour_tweets_count = ...\n", 98 | "top_labour_tweeter = ...\n", 99 | "top_labour_tweet_count = ...\n", 100 | "\n", 101 | "print(\"The %s candidates from the Conservative party sent %s root tweets. The top tweeter was %s with %s tweets\" \\\n", 102 | " % (conservative_candidates_count, conservative_tweets_count, top_con_tweeter, top_con_tweet_count))\n", 103 | "\n", 104 | "print(\"The %s candidates from the Labour party sent %s root tweets. The top tweeter was %s with %s tweets\" \\\n", 105 | " % (labour_candidates_count, labour_tweets_count, top_labour_tweeter, top_labour_tweet_count))\n" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": null, 111 | "metadata": {}, 112 | "outputs": [], 113 | "source": [ 114 | "#####################################\n", 115 | "# Question 1.2\n", 116 | "# TA comments below here \n", 117 | "\n", 118 | "# ___ / 5. \n", 119 | "# Comments:\n", 120 | "'''\n", 121 | "\n", 122 | "'''" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "metadata": {}, 128 | "source": [ 129 | "# Exercise 2. An acrostic of tweets. 
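A minimal sketch of the filter-then-merge pattern that Questions 1.1 and 1.2 above ask for. The tiny DataFrames and the `account`/`party` column names are made-up stand-ins; the real roottweets and PolCandidates schemas may differ:

```python
import pandas as pd

# Hypothetical miniature stand-ins for the roottweets and PolCandidates tables.
roottweets = pd.DataFrame({
    "account": ["@a", "@a", "@b", "@z"],
    "text": ["t1", "t2", "t3", "t4"],
})
candidates = pd.DataFrame({
    "account": ["@a", "@b"],
    "party": ["Labour", "Conservative"],
})

# Q1.1-style filter: keep only tweets whose account appears in the candidate list.
filtered = roottweets[roottweets["account"].isin(candidates["account"])]

# Q1.2-style merge: attach each tweet's party affiliation.
merged = filtered.merge(candidates, on="account", how="left")

print(len(roottweets), filtered["account"].nunique(), len(merged))
```

Counting per party then reduces to `merged.groupby("party").size()` and similar aggregations.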
" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": 1, 135 | "metadata": {}, 136 | "outputs": [ 137 | { 138 | "data": { 139 | "text/plain": [ 140 | "'\\nUsing tweets that I made an acrostic: \\n\\nTweets \\nRarely \\nAccommodate\\nPoliticians\\n\\nUsing the same set of tweets, now you try to make one: \\n[ ]\\n'" 141 | ] 142 | }, 143 | "execution_count": 1, 144 | "metadata": {}, 145 | "output_type": "execute_result" 146 | } 147 | ], 148 | "source": [ 149 | "#################################################################\n", 150 | "#\n", 151 | "# Perhaps\n", 152 | "# You'd\n", 153 | "# Take\n", 154 | "# Hacking\n", 155 | "# Over\n", 156 | "# Nothing?\n", 157 | "#\n", 158 | "# See https://en.wikipedia.org/wiki/Acrostic\n", 159 | "#\n", 160 | "# Fun Fact! Lewis Carroll's Through the Looking Glass contained a \n", 161 | "# poem with an acrostic of the full name of the real-life Alice. \n", 162 | "# \n", 163 | "#################################################################\n", 164 | "\n", 165 | "# This exercise consists of two parts. In the first, you have to\n", 166 | "# print out an acrostic. You select a codephrase, and then the words that \n", 167 | "# are printed on each line should come from the tweets database. They do not \n", 168 | "# have to come from the filtered table unless you want the party affiliation.\n", 169 | "# \n", 170 | "# The horizontal words for the acrostic should be the first word of the \n", 171 | "# tweet. They should also be filtered somehow, such as 'tweets from the \n", 172 | "# Liberal Democrat party', 'tweets with a url', or 'tweets that have an \n", 173 | "# @mention' in them.\n", 174 | "#\n", 175 | "# The second part is that you have to then provide a user input prompt\n", 176 | "# so that a user can see if they can make an acrostic with the same \n", 177 | "# set of tweets. If they can (i.e. the codephrase's letters are all contained\n", 178 | "# within the set of tweets), print out the acrostic. 
Otherwise, let the user \n", 179 | "# know that the program cannot find an acrostic with that phrase. Ask them to \n", 180 | "# please try another phrase, or type \"exit()\" to exit. \n", 181 | "#\n", 182 | "'''\n", 183 | "Using tweets that I made an acrostic: \n", 184 | "\n", 185 | "Tweets \n", 186 | "Rarely \n", 187 | "Accommodate\n", 188 | "Politicians\n", 189 | "\n", 190 | "Using the same set of tweets, now you try to make one: \n", 191 | "[ ]\n", 192 | "'''\n", 193 | "\n", 194 | "\n", 195 | "# Notes: \n", 196 | "# - Each line in the acrostic should be a unique word, even if the codephrase \n", 197 | "# has two of the same letter. \n", 198 | "# - Your acrostic codephrase has to be longer than 5 characters. \n", 199 | "# - Don't worry about representing lower/upper case, spaces, or punctuation in \n", 200 | "# your acrostic, but assume that users will try to type that in \n", 201 | "# the input box.\n", 202 | "# - If the user's attempted acrostic codephrase doesn't work\n", 203 | "# then it should let the user try again. \n", 204 | "# - The codephrase should make sense, but I fully expect the word list\n", 205 | "# from tweets not to make a lot of sense. \n", 206 | "# - If you find that the first word doesn't cut it, you can take the first \n", 207 | "# 'non-tweet' as in the first non-[\"rt\", \"@mention\", \"#hashtag\"]\n", 208 | "#\n", 209 | "# hint: df['first_word'] = df[\"text\"].map(lambda x: cleanWord(x))\n", 210 | "\n", 211 | "#\n", 212 | "#\n", 213 | "# Rubric\n", 214 | "# 5 pts. Functionality: Does your code work as directed (to test: \n", 215 | "# we would enter your codephrase as input)\n", 216 | "# 5 pts. Robustness: Will user input break the code? How does it handle junk characters?\n", 217 | " \n", 218 | "# 5 pts. Code factoring: e.g., how well did you use functions/data structures \n", 219 | "# to help manage your queries?\n", 220 | "# 5 pts. 
Complexity of the filter on the tweets: A relative / subjective \n", 221 | "# assessment based on how you decided to filter and select tweets)\n", 222 | "\n", 223 | "######################################\n", 224 | "# Answer Below Here " 225 | ] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "execution_count": null, 230 | "metadata": {}, 231 | "outputs": [], 232 | "source": [ 233 | "#####################################\n", 234 | "# TA comments below here \n", 235 | "\n", 236 | "# Functionality: \n", 237 | "# ___ / 5. \n", 238 | "# Comments \n", 239 | "'''\n", 240 | "\n", 241 | "'''\n", 242 | "\n", 243 | "# Robustness: \n", 244 | "# ___ / 5. \n", 245 | "# Comments \n", 246 | "'''\n", 247 | "\n", 248 | "'''\n", 249 | "\n", 250 | "# Code Factoring: \n", 251 | "# ___ / 5. \n", 252 | "# Comments \n", 253 | "'''\n", 254 | "\n", 255 | "'''\n", 256 | "\n", 257 | "# Filter Complexity: \n", 258 | "# ___ / 5. \n", 259 | "# Comments \n", 260 | "'''\n", 261 | "\n", 262 | "'''" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": null, 268 | "metadata": {}, 269 | "outputs": [], 270 | "source": [] 271 | } 272 | ], 273 | "metadata": { 274 | "kernelspec": { 275 | "display_name": "Python 3", 276 | "language": "python", 277 | "name": "python3" 278 | }, 279 | "language_info": { 280 | "codemirror_mode": { 281 | "name": "ipython", 282 | "version": 3 283 | }, 284 | "file_extension": ".py", 285 | "mimetype": "text/x-python", 286 | "name": "python", 287 | "nbconvert_exporter": "python", 288 | "pygments_lexer": "ipython3", 289 | "version": "3.7.0" 290 | } 291 | }, 292 | "nbformat": 4, 293 | "nbformat_minor": 2 294 | } 295 | -------------------------------------------------------------------------------- /Assignments/Week02/PySDS_Ex_Week02_Day04_exampleAnswer.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Bernie's acrostic 
code\n", 8 | "See this as an example of the assignment, question 3." 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": null, 14 | "metadata": {}, 15 | "outputs": [], 16 | "source": [ 17 | "import sqlite3\n", 18 | "import pandas as pd \n", 19 | "\n", 20 | "def cleanTweet(df):\n", 21 | " df['first_word'] = df[\"text\"].map(lambda x: cleanWord(x,-1))\n", 22 | " df['first_letter'] = df[\"first_word\"].map(lambda x: charZero(x))\n", 23 | " return df\n", 24 | "\n", 25 | "def cleanWord (text,whichword=0):\n", 26 | " l = text.split()\n", 27 | " new_word_list = []\n", 28 | " for i in l:\n", 29 | " if i[0] == \"@\" or i ==\"RT\" or i[0] == \"#\" or i[0:4] == \"http\":\n", 30 | " pass\n", 31 | " elif i[0].isalpha() and i[-1].isalpha():\n", 32 | " new_word_list.append(i)\n", 33 | "\n", 34 | " if len(new_word_list) >= 1:\n", 35 | " \n", 36 | " return new_word_list[whichword]\n", 37 | " else:\n", 38 | " return \"\"\n", 39 | "\n", 40 | "def charZero(word):\n", 41 | " if len(word) >= 1:\n", 42 | " return word[0].lower()\n", 43 | " else:\n", 44 | " return None\n", 45 | " \n", 46 | "def cleanOutstr( inword):\n", 47 | " l = []\n", 48 | " for i in inword:\n", 49 | " if i.isalpha():\n", 50 | " l.append(i.lower())\n", 51 | " return l\n", 52 | "\n", 53 | "def getAcrostic(df,codephrase=\"gross, the best words\"):\n", 54 | " outstr = \"\"\n", 55 | " wordSet = set([])\n", 56 | " for i in cleanOutstr(codephrase): \n", 57 | " x = df[df['first_letter']==i].index\n", 58 | "# print(x)\n", 59 | " \n", 60 | " found = False\n", 61 | " for j in x: \n", 62 | " word = df.iloc[j][\"first_word\"]\n", 63 | " if word not in wordSet:\n", 64 | " wordSet.add(word)\n", 65 | " outstr += word + \"\\n\"\n", 66 | " found = True\n", 67 | " break\n", 68 | " if not found:\n", 69 | " return False\n", 70 | " return outstr\n", 71 | "\n", 72 | "\n", 73 | "\n", 74 | "df = pd.read_sql(\"select * from roottweets\",sqlite3.connect(\"PySDS_ElectionData_2015_may5-6.db\"))\n", 75 | "df1 = 
cleanTweet(df)\n", 76 | "\n", 77 | "DEMO = \"letter\"\n", 78 | "output = \"Using tweets that are not filtered I made an acrostic\\n\\n\"\n", 79 | "output += getAcrostic(df1, DEMO)\n", 80 | "output += \"\\n\\nUsing the same set of tweets, now you try to make one:\"" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": null, 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [ 89 | "print(output)\n", 90 | "print (getAcrostic(df1,input()))" 91 | ] 92 | } 93 | ], 94 | "metadata": { 95 | "kernelspec": { 96 | "display_name": "Python 3", 97 | "language": "python", 98 | "name": "python3" 99 | }, 100 | "language_info": { 101 | "codemirror_mode": { 102 | "name": "ipython", 103 | "version": 3 104 | }, 105 | "file_extension": ".py", 106 | "mimetype": "text/x-python", 107 | "name": "python", 108 | "nbconvert_exporter": "python", 109 | "pygments_lexer": "ipython3", 110 | "version": "3.7.0" 111 | } 112 | }, 113 | "nbformat": 4, 114 | "nbformat_minor": 2 115 | } 116 | -------------------------------------------------------------------------------- /Assignments/Week03/PySDS_Ex_Week03_Day02.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "**PySDS Week 03 Day 02 v.1 - Exercise - Merging and reporting on data**" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "# Exercise 1. Create a new codebook\n", 15 | "\n", 16 | "For this exercise, please go through all the steps in class with respect to cleaning the data on roottweets, except do this for the replytweets table. Put this in a function that you can call. It is okay if the function is very specific to replytweets, but the more generic the better. If it could also be used to clean up roottweets from a raw SQL call, this would be ideal. 
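A hedged sketch of the kind of reusable cleaning function Exercise 1 asks for. The specific steps and the `id`/`text` column names are illustrative assumptions, not the exact in-class recipe for replytweets:

```python
import pandas as pd

def clean_tweets(df, text_col="text", id_col="id"):
    """Generic tweet-table cleaner; steps are illustrative, not the class recipe."""
    out = df.copy()
    out = out.drop_duplicates(subset=[id_col])   # drop duplicate rows by id
    out[text_col] = out[text_col].str.strip()    # trim stray whitespace
    out = out[out[text_col].str.len() > 0]       # drop empty tweets
    out["length"] = out[text_col].str.len()      # a handy derived column
    return out

# Tiny demonstration table.
demo = pd.DataFrame({"id": [1, 1, 2, 3],
                     "text": [" hi ", " hi ", "", "ok"]})
print(clean_tweets(demo))
```

Because it only takes a DataFrame and column names, the same function could be pointed at roottweets from a raw SQL call.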
" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": null, 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "# Exercise 1. \n", 26 | "###############################################\n", 27 | "# Answer below here \n", 28 | "\n", 29 | "\n", 30 | "\n", 31 | "\n", 32 | "\n", 33 | "##############################################\n", 34 | "# Reviewer comments below here \n", 35 | "\n", 36 | "\n", 37 | "\n", 38 | "\n" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "# Exercise 2. Finding the happy tweets. \n", 46 | "\n", 47 | "Using the indepedent samples ttest function, split the data into at least two groups (e.g., by length of tweets / has emoji / has @mention, etc...). Compare two of theese groups using an independent samples t-test. (See example below). Try to find a split that will lead to a significant difference between the two splits. After trying three different splits, if there is no significant difference, simply move on. Report all three splits. If you get a significant difference on the first split, great! This can be done with either the roottweets table or the replytweets table." 
48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 10, 53 | "metadata": {}, 54 | "outputs": [ 55 | { 56 | "name": "stdout", 57 | "output_type": "stream", 58 | "text": [ 59 | "Populating the interactive namespace from numpy and matplotlib\n", 60 | "Ttest_indResult(statistic=0.7662141663581955, pvalue=0.4583568003180871)\n", 61 | "0.0493953176745734\n" 62 | ] 63 | } 64 | ], 65 | "source": [ 66 | "###############################\n", 67 | "# Example ttest code \n", 68 | "from scipy import stats\n", 69 | "%pylab inline\n", 70 | "\n", 71 | "r1 = [1,3,5,7,9,11]\n", 72 | "r2 = [1,3,4,6,8,3,6,7]\n", 73 | "r3 = [80,10,20,31,4,45]\n", 74 | "print(stats.ttest_ind(r1,r2)) # for paired samples it's stats.ttest_rel(x,y)\n", 75 | "print(stats.ttest_ind(r1,r3).pvalue)\n", 76 | "\n", 77 | "################################\n", 78 | "# Answer below here \n", 79 | "\n", 80 | "\n", 81 | "\n", 82 | "################################\n", 83 | "# Peer review comments below here \n", 84 | "\n", 85 | "\n", 86 | "\n" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "# Exercise 3. Finding the tweetstorm tweet. \n", 94 | "\n", 95 | "We want to find out what tweet inspired the most negative replies. First, create a 'grouped_reply_tweets' table/DataFrame. It should have the roottweet_id, the count of replies, and the average sentiment score for pos, neg, neu. \n", 96 | "\n", 97 | "Filter this table to those roottweets that have > 1 replies. Look for the tweet(s) with the maximum average negative sentiment. If more than one tweet has the same max negative sentiment, take the roottweet(s) with the most replies. Use these tweet IDs to look up the tweet(s) in the roottweets table. What tweet was it that prompted such negativity? Report your output as follows: \n", 98 | "\n", 99 | "```\n", 100 | "The maximum negative sentiment score was %s. The replies that got this score were:\n", 101 | "\n", 102 | "Tweet 1.\n", 103 | " \n", 104 | "\n", 105 | "Tweet 2. 
\n", 106 | "\n", 107 | "\n", 108 | "etc...\n", 109 | "\n", 110 | "The root tweet that inspired such negativity was written by @. It was: \n", 111 | "\n", 112 | "\n", 113 | "```" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": null, 119 | "metadata": {}, 120 | "outputs": [], 121 | "source": [ 122 | "################################\n", 123 | "# Answer below here \n", 124 | "\n", 125 | "\n", 126 | "\n", 127 | "################################\n", 128 | "# Peer review comments below here \n", 129 | "\n", 130 | "\n" 131 | ] 132 | } 133 | ], 134 | "metadata": { 135 | "kernelspec": { 136 | "display_name": "Python 3", 137 | "language": "python", 138 | "name": "python3" 139 | }, 140 | "language_info": { 141 | "codemirror_mode": { 142 | "name": "ipython", 143 | "version": 3 144 | }, 145 | "file_extension": ".py", 146 | "mimetype": "text/x-python", 147 | "name": "python", 148 | "nbconvert_exporter": "python", 149 | "pygments_lexer": "ipython3", 150 | "version": "3.7.0" 151 | } 152 | }, 153 | "nbformat": 4, 154 | "nbformat_minor": 2 155 | } 156 | -------------------------------------------------------------------------------- /Assignments/Week03/PySDS_Ex_Week03_Day03.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "**PySDS Week 03 Day 03 v.1 - Exercise - Webcrawlers**" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Modifying a scraper to suit needs\n", 15 | "\n" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": null, 21 | "metadata": {}, 22 | "outputs": [], 23 | "source": [ 24 | "# Exercise 1. \n", 25 | "# How pervaisive is the notion of a social network at the OII? \n", 26 | "# Modify the crawler that we showed in class. You can either use a new scraper \n", 27 | "# from scrapy, beautifulSoup, mechanicalSoup or modify the code we have. 
\n", 28 | "#\n", 29 | "# Part 1. Write pseudocode from the following instructions:\n", 30 | "#\n", 31 | "# Use the department's homepage (http://www.oii.ox.ac.uk) as your seed. \n", 32 | "# Navigate to all links that you find that also have www.oii.ox.ac.uk in them.\n", 33 | "# If a page includes the word \"network\" or \"networks\", then mark it in the \n", 34 | "# \"has network\" pile. Otherwise mark it in the \"not mentioned\" pile. \n", 35 | "# When you run out of links, return the number in each pile. \n", 36 | "#\n", 37 | "# NOTE> Please exempt any page with http://www.oii.ox.ac.uk/study See updated code snippet. \n", 38 | "\n", 39 | "################################\n", 40 | "# Answer below here \n", 41 | "\n", 42 | "\n", 43 | "\n", 44 | "################################\n", 45 | "# Peer review comments below here \n", 46 | "\n", 47 | "\n" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": null, 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "# Part 2. Creating a scraper that looks for oii links. \n", 57 | "\n", 58 | "# If you are using a subclass of HTTPparser, consider reviewing the documentation \n", 59 | "# https://docs.python.org/3/library/html.parser.html\n", 60 | "#\n", 61 | "# Write a subclass of HTTPparser (or another means) \n", 62 | "# that returns links if they have 'www.oii.ox.ac.uk'\n", 63 | "# in the full path. You should think about how you are going to test this.\n", 64 | "\n", 65 | "################################\n", 66 | "# Answer below here \n", 67 | "\n", 68 | "\n", 69 | "\n", 70 | "################################\n", 71 | "# Peer review comments below here \n", 72 | "\n", 73 | "\n" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": null, 79 | "metadata": {}, 80 | "outputs": [], 81 | "source": [ 82 | "# Part 3. 
Translate your pseudocode spider to working code \n", 83 | "\n", 84 | "# Here we will want to run a crawler.\n", 85 | "# Call your getLinks() method from within your working code. \n", 86 | "# I assume it will be an extension of my code in the lecture.\n", 87 | "#\n", 88 | "# Please note, you should use a set or other form of counter to ensure\n", 89 | "# that you do not visit the same link twice. I have warned IT about today. \n", 90 | "# But still...let's try not to DDOS the department webpage. \n", 91 | "# Note 1. Please exempt any page with /study. See updated code snippet. \n", 92 | "# Note 2. Max links = 100\n", 93 | "# Note 3. time.sleep(0.2)\n", 94 | "\n", 95 | "################################\n", 96 | "# Answer below here \n", 97 | "\n", 98 | "\n", 99 | "\n", 100 | "################################\n", 101 | "# Peer review comments below here \n", 102 | "\n", 103 | "\n" 104 | ] 105 | } 106 | ], 107 | "metadata": { 108 | "kernelspec": { 109 | "display_name": "Python 3", 110 | "language": "python", 111 | "name": "python3" 112 | }, 113 | "language_info": { 114 | "codemirror_mode": { 115 | "name": "ipython", 116 | "version": 3 117 | }, 118 | "file_extension": ".py", 119 | "mimetype": "text/x-python", 120 | "name": "python", 121 | "nbconvert_exporter": "python", 122 | "pygments_lexer": "ipython3", 123 | "version": "3.7.0" 124 | } 125 | }, 126 | "nbformat": 4, 127 | "nbformat_minor": 2 128 | } 129 | -------------------------------------------------------------------------------- /Assignments/Week03/PySDS_Ex_Week03_Day04.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "**PySDS Week 03 Day 04 v.1** - Friday Formative" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "# Exercise: DIYAPI\n", 15 | "\n", 16 | "For today's formative, you will have to authenticate to an API and download a 
sufficient amount of data to test a statistical relationship. The relationship does not have to be statistically significant but you should test for it. Below are some basic methods of testing statistical relationships. This list is definitely non-exhaustive. You are not asked whether the data is representative or what is the statistical power of the relationship. \n", 17 | "\n", 18 | "Constraints. \n", 19 | "- It **cannot be an API** used in the course already. It can't be Twitter, Wikipedia, TheTVDb or Reddit, for example. \n", 20 | "- There must be some sort of **authentication** involved. This can happen through a specialised python package or through a DIY approach as shown with TheTVDb. \n", 21 | "\n", 22 | "Rubric for today:\n", 23 | "\n", 24 | "**functional** - does your code run [8pts]: This is a little more complex than in past formatives. This is because running is most likely to involve authentication that is not available to the researcher. If you cannot create a dummy account whose credentials you can share with the researcher, try to store the token as printed here. We cannot guarantee we can run the code, but we will try. \n", 25 | "\n", 26 | "**Code well organized** [4pts]: Do you have spaghetti code like the top example in the Friday lecture or more cohesive code like what was reported at the end of the lecture? Did you use comments to denote careful decisions? \n", 27 | "\n", 28 | "**Data wrangled well** [4pts]: The data that is returned - how has it been treated? Were any checks done for missing data or looking for outliers? \n", 29 | "\n", 30 | "**Test/output makes sense** [4pts]: When reporting on your statistical relationship, are you making sense of it correctly? Are you framing a significant or non-significant relationship appropriately?"
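A minimal sketch of token-based authentication using only the standard library; the endpoint URL and token below are placeholders, and real APIs vary in how they expect credentials (bearer headers, query parameters, or a package of their own):

```python
import urllib.request

API_URL = "https://api.example.com/v1/items"   # placeholder endpoint
TOKEN = "YOUR-TOKEN-HERE"                      # placeholder credential

def build_request(url, token):
    """Attach a bearer token the way many token-based APIs expect."""
    return urllib.request.Request(
        url, headers={"Authorization": "Bearer %s" % token}
    )

req = build_request(API_URL, TOKEN)
print(req.get_header("Authorization"))
# To actually fetch: urllib.request.urlopen(req).read()
```

Keeping the token in one variable (or a separate file you don't submit) makes it easy to swap in the researcher's own credentials.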
31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "Below are some code snippets to help you plan your analysis" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 4, 43 | "metadata": {}, 44 | "outputs": [ 45 | { 46 | "data": { 47 | "text/plain": [ 48 | "(0.17857142857142855, 0.6726038174415168, 1, array([[1.2, 1.8],\n", 49 | " [2.8, 4.2]]))" 50 | ] 51 | }, 52 | "execution_count": 4, 53 | "metadata": {}, 54 | "output_type": "execute_result" 55 | } 56 | ], 57 | "source": [ 58 | "x = [1,2,3,4,5, 6, 7, 8, 9]\n", 59 | "y = [2,4,6,8,10,12,17,16,18]\n", 60 | "\n", 61 | "from scipy import stats\n", 62 | "\n", 63 | "# A non-exhaustive list of some statistical tests:\n", 64 | "#\n", 65 | "# independent samples t-test\n", 66 | "# https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html\n", 67 | "\n", 68 | "# Two different distributions (are the means different?)\n", 69 | "stats.ttest_ind(x,y)\n", 70 | "\n", 71 | "# paired samples t-test (two measurements for the same cases - did the mean change?)\n", 72 | "stats.ttest_rel(x,y)\n", 73 | "\n", 74 | "# Mann-Whitney U (two non-parametric vars: Does the mean rank-order change?) \n", 75 | "stats.mannwhitneyu(x,y)\n", 76 | "\n", 77 | "# Pearson's r. (The correlation between two variables)\n", 78 | "stats.pearsonr(x,y)\n", 79 | "\n", 80 | "table = [[1,2],[3,4]]\n", 81 | "# One way anova (significant difference on a continuous variable for n categories)\n", 82 | "stats.f_oneway(*table) # each group is passed as a separate argument\n", 83 | "\n", 84 | "# Chi square 
(Observed versus expected categories in a categorical table)\n", 85 | "stats.chi2_contingency(table)\n" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 6, 91 | "metadata": {}, 92 | "outputs": [], 93 | "source": [ 94 | "####################################\n", 95 | "# Answer code below here \n", 96 | "\n", 97 | "\n", 98 | "\n" 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": {}, 104 | "source": [ 105 | "####################################\n", 106 | "## Text explanation of relationship below here:\n", 107 | "\n" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": 5, 113 | "metadata": {}, 114 | "outputs": [ 115 | { 116 | "data": { 117 | "text/plain": [ 118 | "'\\n\\n'" 119 | ] 120 | }, 121 | "execution_count": 5, 122 | "metadata": {}, 123 | "output_type": "execute_result" 124 | } 125 | ], 126 | "source": [ 127 | "#####################################\n", 128 | "# TA comments below here \n", 129 | "\n", 130 | "# Functionality: \n", 131 | "# ___ / 8. \n", 132 | "# Comments \n", 133 | "'''\n", 134 | "\n", 135 | "'''\n", 136 | "\n", 137 | "# Organization: \n", 138 | "# ___ / 4. \n", 139 | "# Comments \n", 140 | "'''\n", 141 | "\n", 142 | "'''\n", 143 | "\n", 144 | "# Code wrangling: \n", 145 | "# ___ / 4. \n", 146 | "# Comments \n", 147 | "'''\n", 148 | "\n", 149 | "'''\n", 150 | "\n", 151 | "# Stats interpretation: \n", 152 | "# ___ / 4. 
\n", 153 | "# Comments \n", 154 | "'''\n", 155 | "\n", 156 | "'''" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": null, 162 | "metadata": {}, 163 | "outputs": [], 164 | "source": [] 165 | } 166 | ], 167 | "metadata": { 168 | "kernelspec": { 169 | "display_name": "Python 3", 170 | "language": "python", 171 | "name": "python3" 172 | }, 173 | "language_info": { 174 | "codemirror_mode": { 175 | "name": "ipython", 176 | "version": 3 177 | }, 178 | "file_extension": ".py", 179 | "mimetype": "text/x-python", 180 | "name": "python", 181 | "nbconvert_exporter": "python", 182 | "pygments_lexer": "ipython3", 183 | "version": "3.7.0" 184 | } 185 | }, 186 | "nbformat": 4, 187 | "nbformat_minor": 2 188 | } 189 | -------------------------------------------------------------------------------- /Assignments/Week04/PySDS_ex_w4-1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Week 4. Assignment. \"Come code with me\" \n", 8 | "\n", 9 | "Hat tip to Patrick for the name. This week, you will have only one formative and it is a group-work exercise. You will be downloading and working with the data from IMDB. It is available from: https://www.imdb.com/interfaces/ \n", 10 | "\n", 11 | "Your goal is to present some findings under the following conditions:\n", 12 | "1. It has to, in some way, **relate to muppets**. It can be as loose as looking at some of the Muppet guest stars or doing some analysis on the show over time, or others in the franchise. \n", 13 | "2. It has to use the **user ratings table** somehow. Whether it is ratings of the show, other shows, doesn't matter as long as it's the user ratings table and linked in some way to another table. \n", 14 | "\n", 15 | "The findings should be found in a python notebook that is submitted jointly (i.e., one notebook per team). 
Your fellow classmates will get to see this notebook while you're presenting. Your choice of presentation format (i.e., the notebook, some html page, powerpoint) is up to you as long as it is used to communicate the findings that are present in the notebook and to some extent the journey to get to those findings. \n", 16 | "\n", 17 | "If you want to bring in other data such as that from TVdb that's ok, as long as the above two conditions are met and you only submit one file. So adding a .csv file is not possible, getting data from a web query is. Adding data that is pasted into the notebook is technically fine, but kinda hacky. Sometimes you might want to use a string, number or index that is easier to get from the web and copy it in directly, such as a titleID for a show. Certainly! If it is easier that way and scales to your needs here, do it. \n", 18 | "\n", 19 | "Your code to produce the findings should be cleanly packaged and submitted by one person. The code will then be available for the other groups to read. The code will be due Wednesday at 2pm as a single notebook file. Then at 2:30 each team will have 12 minutes to discuss what they found with the other groups. This will take the form of a presentation to the class. \n", 20 | "\n", 21 | "When all the presentations have been completed, we will have a ranked vote, one vote per person (on a qualtrics survey I'll mail out). Only the winners will be revealed. The ranked vote means that the group in first gets 1 point, the group in second 2 points, etc... The ranks are then summed up for each of the groups and the one with the lowest rank score wins. " 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "## Some strategies for dealing with this data: \n", 29 | "- This is a lot of data. Choose what you want to look at first. Limiting the scope of the data analysis is important. 
\n", 30 | "- Try to read in the data, slice it down (for example, removing television episodes will cut the data in more than half, slicing further by year can do even more). To get it more manageable, destroy the larger object. It will be faster loading in the smaller one. \n", 31 | "- Regular expressions will be slow but possible.\n", 32 | "- If you merge very large data sets it will take a long time unless you configure the database properly. \n", 33 | "- Each zip file from imdb will decompress quite considerably on your computer. But you can make some efficiency gains by loading in the .gzip file directly as seen in the example below. " 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": 14, 39 | "metadata": {}, 40 | "outputs": [ 41 | { 42 | "name": "stderr", 43 | "output_type": "stream", 44 | "text": [ 45 | "C:\\Users\\bernie\\Anaconda3\\lib\\site-packages\\IPython\\core\\interactiveshell.py:2785: DtypeWarning: Columns (5) have mixed types. Specify dtype option on import or set low_memory=False.\n", 46 | " interactivity=interactivity, compiler=compiler, result=result)\n" 47 | ] 48 | }, 49 | { 50 | "name": "stdout", 51 | "output_type": "stream", 52 | "text": [ 53 | "5353317\n" 54 | ] 55 | } 56 | ], 57 | "source": [ 58 | "import pandas as pd \n", 59 | "\n", 60 | "path_to_data = \"C:\\\\Users\\\\bernie\\\\Documents\\\\GitHub\\\\sds-python\\\\Data_outside_github\\\\\"\n", 61 | "\n", 62 | "filein = pd.read_csv(\"%s\\\\title.basics.tsv.gz\" % path_to_data,sep=\"\\t\")\n", 63 | "print(len(filein))" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "### Consider dividing tasks\n", 71 | "You might want to subdivide the team and two-three people work in parallel, or have one person typing commands while others come up with what to ask of the data in a round table. With respect to tasks some people might be good at putting a presentation together aesthetically while others might be more comfortable talking. 
If one person is doing all the coding, you're doing it wrong. Think about using google chat, slack or telegram for copying and pasting code snippets to each other.\n", 72 | "\n", 73 | "Find a spot in St. Cross or OII somewhere where there might be less cross talk between teams. \n", 74 | "\n", 75 | "Finally, this task is useful but don't go overboard as a team. If you're working after hours, keep it reasonable. This is not _eXtreme_ hackathoning, it's an exercise in collaborative curiosity. It's the day before Halloween. It should be what you could reasonably expect given the circumstances. " 76 | ] 77 | } 78 | ], 79 | "metadata": { 80 | "kernelspec": { 81 | "display_name": "Python 3", 82 | "language": "python", 83 | "name": "python3" 84 | }, 85 | "language_info": { 86 | "codemirror_mode": { 87 | "name": "ipython", 88 | "version": 3 89 | }, 90 | "file_extension": ".py", 91 | "mimetype": "text/x-python", 92 | "name": "python", 93 | "nbconvert_exporter": "python", 94 | "pygments_lexer": "ipython3", 95 | "version": "3.7.0" 96 | } 97 | }, 98 | "nbformat": 4, 99 | "nbformat_minor": 2 100 | } 101 | -------------------------------------------------------------------------------- /Assignments/Week04/PySDS_ex_w4-1_codeCalc.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stdout", 10 | "output_type": "stream", 11 | "text": [ 12 | "Ole's career ruining presentation 30.0\n", 13 | "dtype: float64\n" 14 | ] 15 | } 16 | ], 17 | "source": [ 18 | "import pandas as pd \n", 19 | "import scipy.stats.mstats as mstats\n", 20 | "import numpy as np\n", 21 | "\n", 22 | "vote_df = pd.read_csv(\"PySDS_votingData.csv\")\n", 23 | "\n", 24 | "df = pd.DataFrame()\n", 25 | "\n", 26 | "for c,i in enumerate(vote_df.iterrows()):\n", 27 | " l = list(i[1]) # The list of values from the df\n", 28 | " l[l[0]] = np.nan # replacing own 
group with nan\n", 29 | " l = np.array(l[1:]) # getting the ranks for each of the choices\n", 30 | " nl = np.ma.masked_invalid(l) # excluding the np.nan\n", 31 | " nl = mstats.rankdata(nl) # rank everything, np.nan is always 0\n", 32 | " nl[nl == 0] = np.nan # put nan back in. \n", 33 | " df[c] = nl # add to dataframe\n", 34 | " \n", 35 | "df = df.T\n", 36 | "df.columns = [\n", 37 | " \"Ker-bit the frog and networks of muppets\",\n", 38 | " \"Alicia's group \\\"The Ideal Muppet Show\\\"\",\n", 39 | " \"Victor's \\\"64 % Keynote\\\"\",\n", 40 | " \"Ole's career ruining presentation\",\n", 41 | " \"Lee's \\\"A star is born\\\" \"]\n", 42 | "\n", 43 | "ranks = df.sum()\n", 44 | "print( ranks.sort_values()[0:1] ) # if we just say [0] it returns value, not index + value" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": null, 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [] 53 | } 54 | ], 55 | "metadata": { 56 | "kernelspec": { 57 | "display_name": "Python 3", 58 | "language": "python", 59 | "name": "python3" 60 | }, 61 | "language_info": { 62 | "codemirror_mode": { 63 | "name": "ipython", 64 | "version": 3 65 | }, 66 | "file_extension": ".py", 67 | "mimetype": "text/x-python", 68 | "name": "python", 69 | "nbconvert_exporter": "python", 70 | "pygments_lexer": "ipython3", 71 | "version": "3.7.0" 72 | } 73 | }, 74 | "nbformat": 4, 75 | "nbformat_minor": 2 76 | } 77 | -------------------------------------------------------------------------------- /Assignments/Week04/PySDS_votingData.csv: -------------------------------------------------------------------------------- 1 | Q1,Q2_1,Q2_2,Q2_3,Q2_4,Q2_5 2 | 2,4,3,2,1,5 3 | 3,5,4,3,1,2 4 | 3,4,5,2,1,3 5 | 5,3,1,4,2,5 6 | 4,4,3,5,1,2 7 | 2,2,5,3,1,4 8 | 1,5,2,4,1,3 9 | 5,5,3,4,2,1 10 | 1,1,3,5,2,4 11 | 2,3,1,4,2,5 12 | 3,2,4,5,1,3 13 | 4,4,2,5,1,3 14 | 5,3,4,5,2,1 15 | 4,3,4,5,1,2 16 | 3,2,3,1,5,4 17 | 1,1,4,3,2,5 18 | 5,4,2,5,3,1 19 | 4,5,4,2,1,3 20 | 1,1,4,3,2,5 21 | 2,4,1,5,2,3 22 | 3,5,4,1,3,2 
23 | 4,3,4,5,1,2 24 | 5,3,2,5,1,4 25 | 1,5,3,1,2,4 26 | 2,2,3,4,5,1 27 | -------------------------------------------------------------------------------- /Course_Material/Week_0/PySDS_week0_lecture1_JupyterLab_Basics.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "**PySDS Week 0. Lecture 1. V.2** Author: Bernie Hogan " 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "# Welcome to python and jupyter!" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "This is a Jupyter notebook. This is not like either a computer program or a word document. It has elements of both! This is both a strength and sometimes a weakness of Jupyter Notebooks. \n", 22 | "\n", 23 | "In the coming weeks you'll be using a lot of these. So it is important to get acquainted with their features and limits. " 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "# How to navigate a jupyter notebook in JupyterLab\n", 31 | "\n", 32 | "## How to create a text cell\n", 33 | "Everything is in a cell. The two main types of cells are 'code' and 'markdown'. \n", 34 | "\n", 35 | "A markdown cell is one that has text in it. Markdown is a simple way to add features to text, like _italics_, headers, ~~strikethrough~~, and **bold**. If you click on this cell you can see the code that produced this text. \n", 36 | "\n", 37 | "1. Use two tildes (the ~ character) for ~~strikethrough~~\n", 38 | "2. Use two asterisks (the * character) for **bold**\n", 39 | "3. Use underscores (the _ character) for _italics_. \n", 40 | "4. Lists are auto generated. \n", 41 | "5. You can also embed code in a nice format in markdown using three tildes and the language. 
See here: \n", 42 | "\n", 43 | "    ~~~ python\n", 44 | "    print(\"Hello World\")\n", 45 | "    \n", 46 | "    ~~~\n", 47 | "6. Use hash (the # symbol) at the beginning of a line to make it a header. \n", 48 | "7. Use the dollar symbol for formulae (and I use it for $commands$ )." 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "## How to create a new cell / navigate with the keyboard\n", 56 | "\n", 57 | "On the left hand side is a blue bar noting which cell is in focus. The background of the cell will also give you a hint whether the cell is in focus and editable. When it is in focus and editable you can type in it; when it is not in focus it looks pretty. \n", 58 | "\n", 59 | "To change the focus from one cell to another, you can click on a new cell or 'run' the current cell. \n", 60 | "\n", 61 | "To run the current cell, press $shift-enter$. \n", 62 | "\n", 63 | "If the cell is not in focus you can tell because there is no cursor and pressing up or down will move the blue bar on the left hand side up and down. If it is in focus you can tell because pressing up and down will move the cursor within the cell. \n", 64 | "\n", 65 | "To create a new cell, press $a$ for above or $b$ for below the current cell. \n", 66 | "\n", 67 | "To delete a cell, you can press $d d$, that's d twice. In both cases, this happens when you are not in focus; otherwise you would just be typing the letter a or the letters dd. \n", 68 | "\n", 69 | "To get out of focus you can either: \n", 70 | "- Run the current cell ($shift-enter$), \n", 71 | "- Escape the cell ($escape key$).\n", 72 | "\n", 73 | "To get in focus you can either:\n", 74 | "- Press $enter$,\n", 75 | "- $Double-click$ with the mouse.\n", 76 | "\n", 77 | "To change a cell from 'code' to 'Markdown' you can either:\n", 78 | "- Use the menu at the top,\n", 79 | "- When the cell is not in focus you can press $c$ for code and $m$ for Markdown. 
" 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "metadata": {}, 85 | "source": [ 86 | "# Running code in a jupyter notebook#\n", 87 | "\n", 88 | "Jupyter notebooks each come with their own 'kernel'. This means they store variables across cells. Each cell can be run and then the variables are used later. \n", 89 | "\n", 90 | "This is somewhat different than other ways of coding. Later in the course, we will briefly introduce _vi_ and _Spyder_ for running python programs. \n", 91 | "\n", 92 | "To run a cell, again it is $shift - enter$. Run the cell below to print 'hello world' as well as print a simple calculation." 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": 3, 98 | "metadata": {}, 99 | "outputs": [ 100 | { 101 | "name": "stdout", 102 | "output_type": "stream", 103 | "text": [ 104 | "Hello world!\n" 105 | ] 106 | } 107 | ], 108 | "source": [ 109 | "print(\"Hello world!\")\n", 110 | "x = 3\n", 111 | "y = 6" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": 7, 117 | "metadata": {}, 118 | "outputs": [ 119 | { 120 | "name": "stdout", 121 | "output_type": "stream", 122 | "text": [ 123 | "16\n" 124 | ] 125 | } 126 | ], 127 | "source": [ 128 | "# You will want to run this cell again after reading below.\n", 129 | "print(x + y)" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": 5, 135 | "metadata": {}, 136 | "outputs": [ 137 | { 138 | "name": "stdout", 139 | "output_type": "stream", 140 | "text": [ 141 | "demo\n" 142 | ] 143 | } 144 | ], 145 | "source": [ 146 | "print(\"demo\")" 147 | ] 148 | }, 149 | { 150 | "cell_type": "markdown", 151 | "metadata": {}, 152 | "source": [ 153 | "Notice that once we ran the first cell, we can access those variables in another cell. 
We did not assign a new value to x or y, so when we said \n", 154 | "~~~ python\n", 155 | "print(x + y) \n", 156 | "~~~\n", 157 | "\n", 158 | "then it printed $9$ which was the value of what was in the parentheses (3 + 6). Now, here's a tricky part. Below we will assign a new value to x. Then go back and run that cell above." 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": 6, 164 | "metadata": {}, 165 | "outputs": [], 166 | "source": [ 167 | "x = 10" 168 | ] 169 | }, 170 | { 171 | "cell_type": "markdown", 172 | "metadata": {}, 173 | "source": [ 174 | "Notice that the value of x is now $16$ rather than $9$. You can also see that on the left there is a number in square brackets that shows the order in which the cells ran. If you never ran the cell, it won't have a number and the variables that are created in that cell won't be accessible to other cells in the file. \n", 175 | "\n", 176 | "Be very careful with the order in which you run your code. If you change a variable's value and run your code out of order, you might run into trouble. There is an experimental version of Jupyter called DataFlow Jupyter that can preserve order, but it's a bit too experimental for us at the moment (check back later!). \n", 177 | "\n", 178 | "Because it's tricky to notice cell order we tend to use some conventions: \n", 179 | "- use single letters for disposable variables that are cell-specific and created on demand. \n", 180 | "- use clear notation for variables that persist (typically with underscores or camelCase)\n", 181 | " - These variables should only be initialised once. Otherwise, we should consider using an 'object', which we will discuss later.\n", 182 | "- use ALL CAPS for variables that are static and never change." 183 | ] 184 | }, 185 | { 186 | "cell_type": "markdown", 187 | "metadata": {}, 188 | "source": [ 189 | "# Last points\n", 190 | "\n", 191 | "This course will introduce a number of concepts. 
The goals are not just to get you programming, but to get you to think about the ways in which programming is integrated into scientific claims. This involves learning a little computer science, a little coding, a little hardware knowledge and some software tools. We will throw in some examples from politics, sociology, philosophy and terms from economics where possible. It is indeed very broad, but partly this is to demonstrate the similarities between disciplines so that we can understand how the management of data plays a key role in claim-making in all these disciplines. \n", 192 | "\n", 193 | "Next week we will begin with the basics of programming. We will learn about variables, loops, if statements and functions. We will also learn about file management and GitHub. See you soon! " 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": null, 199 | "metadata": {}, 200 | "outputs": [], 201 | "source": [] 202 | } 203 | ], 204 | "metadata": { 205 | "kernelspec": { 206 | "display_name": "Python 3", 207 | "language": "python", 208 | "name": "python3" 209 | }, 210 | "language_info": { 211 | "codemirror_mode": { 212 | "name": "ipython", 213 | "version": 3 214 | }, 215 | "file_extension": ".py", 216 | "mimetype": "text/x-python", 217 | "name": "python", 218 | "nbconvert_exporter": "python", 219 | "pygments_lexer": "ipython3", 220 | "version": "3.7.0" 221 | } 222 | }, 223 | "nbformat": 4, 224 | "nbformat_minor": 2 225 | } 226 | -------------------------------------------------------------------------------- /Course_Material/Week_2/PySDS_Supplemental_FREE_coding.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "**PySDS Week 2. Supplemental Lecture 1. 
V.1** Author: Bernie Hogan" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "# Code should be FREE\n", 15 | "\n", 16 | "Below is an explanation of the idea that code must be FREE. It's a bit of a pun: in the coding world, there is a major movement in most technical arenas to make code Free and Open Source. While it is important to familiarise yourself with the notion of free and open source coding, this is a different matter. Here FREE is a mnemonic to help you understand how to focus your coding efforts. \n", 17 | "\n", 18 | "- **F**unctional \n", 19 | "- **R**obust\n", 20 | "- **E**conomical\n", 21 | "- **E**fficient\n" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "## Functional code\n", 29 | "\n", 30 | "Functional code is code that gives the expected result. If you are building a way to calculate a series of numbers or run a slice from a DataFrame, in all cases you want the answer to be correct, or as expected. \n", 31 | "\n", 32 | "In the case of functions and methods, this does not mean that you are meeting all eventualities. It means that you are abiding by the pre-condition / post-condition contract. This contract is important for building modular code. Each module (whether it is a script, a class file or a program in its own right) has a sense of what is the correct input. This is the 'pre-condition'. If this pre-condition is satisfied then the post-condition will be correct. \n", 33 | "\n", 34 | "For example, if you give me an integer and I say I will square it, then for every integer I should be able to do this. \n", 35 | "\n", 36 | "``` python\n", 37 | "def square(number):\n", 38 | "    squarednumber = number * number \n", 39 | "    return squarednumber\n", 40 | "```\n", 41 | "\n", 42 | "Treated as a 'black box', we can say that if you give this function a number then it will return the correct, squared value.\n", 43 | "\n", 44 | "Finally, a note on language. 
This is functional code in the sense that it functions as expected. There is also a notion of 'functional programming', which is a style of programming. That's not what we mean here. We mean code that gets the correct result. " 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "## Robust code\n", 52 | "\n", 53 | "Code that is functional with the precondition might not be functional in other contexts. What if the user sends in a string (which we know cannot be squared)? What if a user sends in a really long integer, longer than you would normally expect, but still an integer? This is where we need to think about how to ensure that our code is not simply functioning, but robust. \n", 54 | "\n", 55 | "One common way to ensure robust code is to check for data types. Another is to use try/except statements. The most thorough way is to use _unit tests_. In our example, we can make the code more robust by checking that the input is a number. If it is, we square it, and if it is not, we return False. " 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 15, 61 | "metadata": {}, 62 | "outputs": [ 63 | { 64 | "data": { 65 | "text/plain": [ 66 | "False" 67 | ] 68 | }, 69 | "execution_count": 15, 70 | "metadata": {}, 71 | "output_type": "execute_result" 72 | } 73 | ], 74 | "source": [ 75 | "import numbers \n", 76 | "\n", 77 | "def square(number):\n", 78 | "    if isinstance(number, numbers.Number):\n", 79 | "        squarednumber = number * number \n", 80 | "        return squarednumber\n", 81 | "    else:\n", 82 | "        return False\n", 83 | "\n", 84 | "square(\"b\")" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": {}, 90 | "source": [ 91 | "## Economical \n", 92 | "\n", 93 | "Code that is economical is code that doesn't waste space or add extra layers of complexity. Here we can borrow from economics the ideas of fixed costs and marginal costs. 
A fixed cost is the initial outlay that is required to produce the first good. Economists often call them 'widgets' as a generic term. So, the cost of building the widget plant, designing the widget and hiring the widget makers are all examples of fixed costs. The marginal costs are the costs of producing the second, third and n'th widget. \n", 94 | "\n", 95 | "So if we have to do some data processing and we type out the same task repeatedly, our code is not very economical. Oftentimes, it is said that reusable code is good code. This does not necessarily mean reusable by someone else. It can also mean reused within the same program. In the example above, our code is somewhat economical in that it uses a function to perform some action and that function can be reused. But you will also notice that it creates a new variable \"squarednumber\". We can simply get rid of that and return the square directly. \n", 96 | "``` python\n", 97 | "def square(number):\n", 98 | "    if isinstance(number, numbers.Number):\n", 99 | "        return number * number\n", 100 | "    else:\n", 101 | "        return False\n", 102 | "```\n", 103 | "\n", 104 | "We can further think about ways to simplify our code. For example, we could make the function \"powersOf\" rather than square and make 2 the default:\n", 105 | "\n", 106 | "``` python\n", 107 | "def powersOf(number,power = 2):\n", 108 | "    return number ** power\n", 109 | "```\n", 110 | "In this second case, the code might not be the most efficient (because it creates a second variable, \"power\"). But it is economical in terms of functions. One could make a completely general purpose function with hundreds or thousands of arguments. That wouldn't be economical either because of all the possible contingencies. There is an art to thinking about what level of abstraction works best. It is where we shift our skills from being strictly programmatic to thinking of coding as writing."
111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "## Efficient \n", 118 | "\n", 119 | "All else equal, we want our code to be as efficient as possible. However, that's assuming the first three things are taken care of. In python, there are a variety of means to make code more efficient. For example, there are the [time and timeit modules](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/01.07-Timing-and-Profiling.ipynb) which can be used to get a sense of how long a task takes. \n", 120 | "\n", 121 | "Overall, efficient code tends to make use of low level python features. For example, \"broadcasting\" is more efficient than going through a for loop one at a time. But these sorts of tricks tend to become known as they are needed. In most cases, the greatest speed bump in code is reading bad code, not running slow code. That's obviously not the case for the biggest data, but even then code has to be functional and robust first and foremost and economical for the developers.\n", 122 | "\n", 123 | "Strategies for developing efficient code, including _profiling_ and _timeit_, are covered in later lectures in Data Analytics at Scale."
124 | ] 125 | } 126 | ], 127 | "metadata": { 128 | "kernelspec": { 129 | "display_name": "Python 3", 130 | "language": "python", 131 | "name": "python3" 132 | }, 133 | "language_info": { 134 | "codemirror_mode": { 135 | "name": "ipython", 136 | "version": 3 137 | }, 138 | "file_extension": ".py", 139 | "mimetype": "text/x-python", 140 | "name": "python", 141 | "nbconvert_exporter": "python", 142 | "pygments_lexer": "ipython3", 143 | "version": "3.7.0" 144 | } 145 | }, 146 | "nbformat": 4, 147 | "nbformat_minor": 2 148 | } 149 | -------------------------------------------------------------------------------- /Course_Material/Week_3/PySDS_demo_parseReplies_election.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*-# 2 | 3 | import re,os,time,urllib,sys,urllib2 4 | 5 | from bs4 import BeautifulSoup 6 | 7 | from sqlalchemy import * 8 | 9 | 10 | dbname = "may5-6" 11 | dbexists = False 12 | if os.path.exists(os.getcwd() + os.sep + "data" + os.sep + dbname + ".db"): 13 | dbexists = True 14 | db = create_engine('sqlite:///data/%s.db' % dbname) 15 | 16 | con = db.connect() 17 | 18 | metadata = MetaData(db) 19 | 20 | tweet_reply_table = Table('replytweets', 21 | metadata, 22 | Column('tweet_id', String(20), unique = True, primary_key=True), 23 | Column('user_id', String(20)), 24 | Column('username', String(20)), 25 | Column('text', String(140)), 26 | Column('root_tweet_id',String(20)), 27 | Column('root_tweet_username',String(32)), 28 | keep_existing=True) 29 | 30 | roottweets = Table('roottweets', metadata, autoload=True) 31 | 32 | if not tweet_reply_table.exists(): 33 | tweet_reply_table.create() 34 | 35 | 36 | def insertData(conn,table,tweet,root_name,root_tweet): 37 | 38 | try: 39 | ins = table.insert(prefixes=['OR IGNORE']).values( 40 | tweet_id = tweet[2], 41 | username = tweet[0], 42 | text = tweet[1], 43 | root_tweet_id = root_tweet, 44 | root_tweet_username = root_name 45 | ) 46 | print conn.execute(ins) 
47 | 48 | except Exception as e: 49 | print(str(e)) 50 | return False 51 | 52 | return True 53 | 54 | 55 | def getReplies(username,tweet_id,opener): 56 | url = "https://mobile.twitter.com/%s/status/%s" % ( username, tweet_id) 57 | print url 58 | try: 59 | filein = opener.open(url) 60 | return filein.read() 61 | except Exception as e: 62 | print e 63 | return False 64 | 65 | def parseReplies(urltext,user): 66 | urltext = urltext.replace("timeline replies", "timeline_replies") 67 | soup = BeautifulSoup(urltext.replace("timeline replies", "timeline_replies")) 68 | 69 | usernames = [] 70 | # for i in tcs: 71 | x = soup.find_all(class_ ="username") 72 | for j in x: usernames.append(j.text.strip()[1:]) 73 | 74 | 75 | tweets = [] 76 | # for i in tcs: 77 | 78 | x = soup.find_all("div", class_ ="dir-ltr") 79 | for j in x: 80 | tweets.append(j.text.strip()) 81 | 82 | tweetids = [] 83 | x = soup.find_all(class_="tweet-text") 84 | for j in x: tweetids.append(j['data-id'].strip()) 85 | 86 | if len(usernames) == len(tweets) == len(tweetids): 87 | replies = [] 88 | tweetlist = zip(usernames,tweets,tweetids) 89 | for i in tweetlist: 90 | if i[0] == user: 91 | pass 92 | elif user in i[1]: 93 | replies.append(i) 94 | return replies 95 | else: 96 | print "bad" 97 | return [] 98 | 99 | 100 | 101 | #****************************** 102 | # BUILD THE OPENER 103 | opener = urllib2.build_opener() 104 | opener.addheaders = [('User-agent', 'OII-DSR-2014-BG/1.0')] 105 | 106 | #****************************** 107 | # Get the root tweets to iterate through 108 | # First get completed tweet IDs 109 | result = con.execute(select([tweet_reply_table.c.root_tweet_id])) 110 | tweetdonelist = set([]) 111 | 112 | for row in result: 113 | tweetdonelist.add(row["root_tweet_id"]) 114 | 115 | print len(tweetdonelist) 116 | 117 | 118 | # Second get the cursor for the table 119 | topresults = select([roottweets.c.username,roottweets.c.tweet_id]) 120 | result = con.execute(topresults) 121 | 122 | topcount = 0 
123 | topcount2 = 0 124 | 125 | try: 126 | for row in result: 127 | topcount += 1 128 | 129 | 130 | if row["tweet_id"] in tweetdonelist or topcount < 299: 131 | continue 132 | else: 133 | tweetdonelist.add(row["tweet_id"]) 134 | 135 | 136 | if topcount > 350: 137 | break 138 | print row["username"],row["tweet_id"] 139 | 140 | replylist = getReplies(row["username"],row["tweet_id"],opener) 141 | if replylist: 142 | pass 143 | else: 144 | print "uh oh" 145 | continue 146 | 147 | for i in parseReplies(replylist,row["username"]): 148 | insertData(con,tweet_reply_table,i,row["username"],row["tweet_id"]) 149 | 150 | time.sleep(.5) 151 | 152 | if topcount %100 == 0: 153 | print "Working on tweet #%s, from %s" % (topcount, row["username"]) 154 | 155 | topcount2 += 1 156 | 157 | except Exception as e: 158 | print e 159 | 160 | 161 | print topcount2 162 | 163 | -------------------------------------------------------------------------------- /Course_Material/Week_3/PySDS_w3-4_Working_with_APIs.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "**PySDS Week 3 Lecture 1. V.1**\n", 8 | "Last author: B. Hogan" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "# Week 3. Day 4. - API Access practice \n", 16 | "\n", 17 | "Learning goals: \n", 18 | "- Get vs Post requests \n", 19 | "- Authenticating OAuth \n", 20 | "- Paging through a query\n" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "# Get vs. Post requests \n", 28 | "\n", 29 | "Recall that previously we used a get request in order to send a url string to a server. Everything after the domain name was used to find the right file and then present some important details to the server, such as those we found after the argument string. When we type in a URL in a browser we are similarly sending a GET request. 
\n", 30 | "\n", 31 | "A POST request similarly sends up a URL to a server. It similarly has a series of headers including a user-agent string. However, POST requests also contain a 'payload', which is a dictionary of key value pairs. The values are data for the server and the keys are what kind of data. \n", 32 | "\n", 33 | "POST requests are more secure than GET requests. For example, a POST request should happen every time you click submit after entering some credentials. By sending it through POST, the client can encrypt the data in the payload. Otherwise, you would be able to see the URL with your username and password as arguments in the URL string. Worse, if this is HTTP and not HTTPS, then the URL string is not encrypted in transit. This means that every server log, from the university's server logs to anyone who happened to be sniffing traffic on the wifi will be able to see your username and password. POST avoids that by putting these things in a payload. " 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "Example from TheTVDB. The site, The TVDB is an independent site of user-generated content \\[UGC]. It's donation-based and has an API for access. It is not associated with IMDB. The site's data is licensed under Creative Commons 3.0. The nice thing about the site is that it is pretty austere and it has a clear API (which is pretty new judging by the forums, and it shows with respect to its usability). \n", 41 | "\n", 42 | "Now when we log into the site, just like what I noted above, you have to fill out some details, namely your email and your password. Below is a snippet of the HTML code for that process. To see this yourself you can go to: https://www.thetvdb.com/login and then right-click -> \"show page source\". The page source is pretty long, but this snippet is in the middle. \n", 43 | "\n", 44 | "~~~ html\n", 45 | "
<!-- login form markup stripped from this copy: a POST form with email and password inputs, a hidden ccm_token field, a submit button, and Forgot Password / Register links -->
\n", 75 | "~~~ \n", 76 | "\n", 77 | "The snippet shows that in order to log in, you have to click a button. Then it will send a post request to https://www.thetvdb.com/login/authenticate/concrete with the values of the forms. It also will use the value from the ccm_token in order to prevent cross site forgeries. So, you see, post happens all the time. \n", 78 | "\n", 79 | "We are going to have to create a post request if we want to get an API key from TheTVDB. " 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "metadata": {}, 86 | "outputs": [], 87 | "source": [ 88 | "print(len(\"\"))" 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": {}, 94 | "source": [ 95 | "When we go to TheTVDB's api page we are told that we need a token. They have a very handy on-site tester where you can fill in credentials and then submit. We will first create a token through this API. \n", 96 | "\n", 97 | "Notice that it produces the following request: \n", 98 | "\n", 99 | "~~~ bash\n", 100 | "curl -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' -d '{\n", 101 | " \"apikey\": \"\",\n", 102 | " \"userkey\": \"\",\n", 103 | " \"username\": \"bernie.hogan4a5\"\n", 104 | "}' 'https://api.thetvdb.com/login'\n", 105 | "~~~\n", 106 | "\n", 107 | "This is a 'curl' request. Curl is a common tool for downloading data from the web. It has a lot of arguments and parameters. If you ran this from a terminal window it would return the response right in the window with some tweaking. We however, are just going to use it to learn a few things, then create our own request using the 'requests' library in python. 
" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [ 116 | "import requests\n", 117 | "\n", 118 | "payload = {\n", 119 | " \"apikey\": \"\",\n", 120 | " \"userkey\": \"\",\n", 121 | " \"username\": \"bernie.hogan4a5\"}\n", 122 | "\n", 123 | "headers = {\"Content-Type\":\"application/json\"}\n", 124 | "\n", 125 | "r = requests.post(\"https://api.thetvdb.com/login\", json=payload)\n", 126 | "print(r)" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": null, 132 | "metadata": {}, 133 | "outputs": [], 134 | "source": [ 135 | "r.json()" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": null, 141 | "metadata": {}, 142 | "outputs": [], 143 | "source": [ 144 | "token = r.json()['token']" 145 | ] 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "metadata": {}, 150 | "source": [ 151 | "Now that we have our token we can use a series of get requests to collect data from the API. The token can be an argument in our argument string. Now notice that this time around (unlike with the Wikipedia example) we will not be creating the argument string by hand. We will be able to put that together more programmatically with ```requests```. But first we need to know what to ask for. \n", 152 | "\n", 153 | "No surprise, let's download data for The Muppet Show. Now this data should be familiar as it is the very first data that you worked with on day one. In fact, much of what we have done is meant to come back full circle now. In week one we used a database of the first four seasons of the Muppets, but notably there are five seasons. The data for the fifth one would not have come through the first API query. Instead we have to page through the results. Today we will page through those results and add the data to a ```DataFrame```. \n", 154 | "\n", 155 | "But first...how do we get this? Let's go over to the API tester and see what's available. 
\n", 156 | "\n", 157 | "We can see that the API says **'Series : Information about a specific series'**. Looks good; let's show that one. Underneath are a series of API end points, such as \n", 158 | "```get /series/{id}/episodes/summary```. These are URLs that, along with some arguments in the argument string, will return some data to an authenticated client. Well, they are part of the URL. Actually, they are the part that comes after ```http://api.thetvdb.com/```\n", 159 | "\n", 160 | "But how to:\n", 161 | "- Authorize ourselves on that page? (see demo - copy and paste token into browser)\n", 162 | "- Get the series ID? (see demo - using the search end point once we are authorized will get us the series ID as a number in the json response). Hint: it is 72476\n" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": null, 168 | "metadata": {}, 169 | "outputs": [], 170 | "source": [ 171 | "import requests\n", 172 | "\n", 173 | "series_id = 72476\n", 174 | "headers = {\"Authorization\":\"Bearer %s\" % token}\n", 175 | "r = requests.get(\"http://api.thetvdb.com/series/%s/episodes\" % series_id, headers=headers)\n", 176 | "r" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": null, 182 | "metadata": {}, 183 | "outputs": [], 184 | "source": [ 185 | "if r.status_code == 200:\n", 186 | "    response_data = r.json()" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": null, 192 | "metadata": {}, 193 | "outputs": [], 194 | "source": [ 195 | "from pandas.io.json import json_normalize \n", 196 | "\n", 197 | "muppetTable = json_normalize(response_data[\"data\"])\n", 198 | "display(muppetTable)" 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": null, 204 | "metadata": {}, 205 | "outputs": [], 206 | "source": [ 207 | "display(muppetTable.tail())" 208 | ] 209 | }, 210 | { 211 | "cell_type": "markdown", 212 | "metadata": {}, 213 | "source": [ 214 | "Notice that the table included 
five episodes from season 5. But these episodes were not included in your earlier data, and surely they aren't the only episodes from season 5? Nope, in fact, in the json we have a paging form up top. Observe:" 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": null, 220 | "metadata": {}, 221 | "outputs": [], 222 | "source": [ 223 | "response_data.keys()" 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": null, 229 | "metadata": {}, 230 | "outputs": [], 231 | "source": [ 232 | "response_data[\"links\"]" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": null, 238 | "metadata": {}, 239 | "outputs": [], 240 | "source": [ 241 | "series_id = 72476\n", 242 | "headers = {\"Authorization\":\"Bearer %s\" % token}\n", 243 | "# \"page\":str(response_data[\"links\"][\"next\"])}\n", 244 | "print(headers)\n", 245 | "r = requests.get(\"http://api.thetvdb.com/series/%s/episodes?page=2\" % series_id, headers=headers)\n", 246 | "r" 247 | ] 248 | }, 249 | { 250 | "cell_type": "code", 251 | "execution_count": null, 252 | "metadata": {}, 253 | "outputs": [], 254 | "source": [ 255 | "if r.status_code == 200:\n", 256 | " response_data = r.json()\n", 257 | " print(\"Received data\")\n", 258 | "\n", 259 | " muppetTable2 = json_normalize(response_data[\"data\"])\n", 260 | " display(muppetTable2)" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": null, 266 | "metadata": {}, 267 | "outputs": [], 268 | "source": [ 269 | "import pandas as pd \n", 270 | "\n", 271 | "total = pd.concat([muppetTable,muppetTable2])\n", 272 | "display(total)" 273 | ] 274 | }, 275 | { 276 | "cell_type": "code", 277 | "execution_count": null, 278 | "metadata": {}, 279 | "outputs": [], 280 | "source": [ 281 | "len(total[total[\"airedSeason\"] != 0])" 282 | ] 283 | }, 284 | { 285 | "cell_type": "markdown", 286 | "metadata": {}, 287 | "source": [ 288 | "Now we can put it all together in a single workflow. 
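The paging logic that follows can be rehearsed offline. Here the API responses are faked in the same shape the endpoint returns (a ```data``` list plus a ```links``` dict whose ```next``` field is empty on the last page); note that recent pandas versions expose ```json_normalize``` directly as ```pd.json_normalize```:

```python
import pandas as pd

# Two faked pages in the shape the episodes endpoint returns.
pages = {
    1: {"data": [{"airedSeason": 1, "episodeName": "Ep 1"}],
        "links": {"next": 2}},
    2: {"data": [{"airedSeason": 5, "episodeName": "Ep 2"}],
        "links": {"next": None}},
}

frames, page = [], 1
while True:
    response_data = pages[page]          # stands in for r.json()
    frames.append(pd.json_normalize(response_data["data"]))
    if response_data["links"]["next"]:
        page = response_data["links"]["next"]
    else:
        break

total = pd.concat(frames, ignore_index=True)
print(len(total))
```

The real version only swaps the dictionary lookup for a ```requests.get``` call; the loop structure is identical.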
" 289 | ] 290 | }, 291 | { 292 | "cell_type": "code", 293 | "execution_count": null, 294 | "metadata": {}, 295 | "outputs": [], 296 | "source": [ 297 | "def getToken(apikey,userkey,username):\n", 298 | " \n", 299 | " import requests\n", 300 | "\n", 301 | " payload = {\n", 302 | " \"apikey\": apikey,\n", 303 | " \"userkey\": userkey,\n", 304 | " \"username\": username}\n", 305 | " \n", 306 | " headers = {\"Content-Type\":\"application/json\"}\n", 307 | "\n", 308 | " r = requests.post(\"https://api.thetvdb.com/login\", json=payload)\n", 309 | " if r.status_code == 200:\n", 310 | " return r.json()[\"token\"]\n", 311 | " else:\n", 312 | " print(\"Error: Status Code %s\" % r.status_code)\n", 313 | " return None\n", 314 | "\n", 315 | "def getEpisodeList(series_id,token):\n", 316 | " import pandas as pd\n", 317 | " \n", 318 | " episode_list = []\n", 319 | " headers = {\"Authorization\":\"Bearer %s\" % token}\n", 320 | " \n", 321 | " page = 1\n", 322 | " \n", 323 | " while True:\n", 324 | " url = \"http://api.thetvdb.com/series/%s/episodes?page=%s\" % (series_id,page)\n", 325 | " r = requests.get( url, headers=headers)\n", 326 | " if r.status_code == 200:\n", 327 | " response_data = r.json()\n", 328 | " episode_list.append(json_normalize(response_data[\"data\"]))\n", 329 | " if response_data['links'][\"next\"]:\n", 330 | " page = response_data['links'][\"next\"]\n", 331 | " else:\n", 332 | " break\n", 333 | " \n", 334 | " else:\n", 335 | " print(\"Error: Status Code %s\" % r.status_code)\n", 336 | " return None\n", 337 | "\n", 338 | " return pd.concat(episode_list)\n", 339 | " \n", 340 | " \n", 341 | "token = getToken(\"\",\n", 342 | " \"\",\n", 343 | " \"bernie.hogan4a5\")\n", 344 | "\n", 345 | "df = getEpisodeList(72476,token)\n", 346 | "print(len(df))\n", 347 | "df.tail()" 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": null, 353 | "metadata": {}, 354 | "outputs": [], 355 | "source": [ 356 | "print(len(df))" 357 | ] 358 | } 359 | ], 360 | 
"metadata": { 361 | "kernelspec": { 362 | "display_name": "Python 3", 363 | "language": "python", 364 | "name": "python3" 365 | }, 366 | "language_info": { 367 | "codemirror_mode": { 368 | "name": "ipython", 369 | "version": 3 370 | }, 371 | "file_extension": ".py", 372 | "mimetype": "text/x-python", 373 | "name": "python", 374 | "nbconvert_exporter": "python", 375 | "pygments_lexer": "ipython3", 376 | "version": "3.7.0" 377 | } 378 | }, 379 | "nbformat": 4, 380 | "nbformat_minor": 2 381 | } 382 | -------------------------------------------------------------------------------- /Course_Material/Week_4/PySDS_w4-2_Working_on_a_Server_Using_Screen.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "**PySDS Week 4. Lecture 2. V.1**\n", 8 | "Last author: B. Hogan" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "# Week 4. Day 2. Working on a server \n", 16 | "\n", 17 | "Large code examples will not run well on a personal computer. The IMDB code is really at the limit of what you might expect to be practically doing on a laptop. For many tasks there is a reason to run them remotely. For example:\n", 18 | "- You want to listen to a stream of data and you don't want to keep your laptop open and connected.\n", 19 | "- You have too much data for your computer to load.\n", 20 | "- You need processing power that's not available locally.\n", 21 | "\n", 22 | "For small tasks, a boost in RAM might make a big difference, but for tasks on gigabytes of data or persistent connections, it won't. What makes a difference is using a dedicated machine with a known history of continuous uptime.\n", 23 | "\n", 24 | "Linux is an operating system like Mac or Windows. It is most commonly seen in scientific work and in server administration. 
It does not always support the hardware of consumer devices (for example, there have yet to be reports of a Linux distribution that can drive the fingerprint reader on the Yoga 920, much to my disappointment). Linux, based on the Unix operating system, can be administered quite extensively from a command prompt. In fact, the prompt is in a shell that is its own language. Typically on a Mac or Linux you would be using bash, the Bourne-again shell. The good thing about this is that terminals are then easy to access remotely. \n", 25 | "\n", 26 | "We can access another computer's shell remotely if we know the address of the server and it is configured for SSH. In that case we use the following syntax (in the Mac terminal, the Linux shell and Cygwin for Windows). \n", 27 | "``` bash\n", 28 | "ssh USERNAME@domain.com\n", 29 | "```\n", 30 | "or, to specify a port, \n", 31 | "``` bash\n", 32 | "ssh -p PORT USERNAME@domain.com\n", 33 | "```\n", 34 | "In this case the domain is ```.oii.ox.ac.uk```. \n", 35 | "\n", 36 | "* Important note: I have had trouble giving my password via Windows PowerShell, so I recommend downloading and installing Cygwin with the optional OpenSSH modules when you get to the install screen. This will be shown in class.\n", 37 | "\n", 38 | "When you first log in it will ask you to trust the key; select yes. Then either type your password or copy and paste it. This is fragile. Please be systematic and careful. It will lock you out after 5 attempts. I have been given instructions to reset the lock out, but I cannot guarantee I'll be able to use them properly. **Measure twice, cut once**. " 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "## Navigating the server\n", 46 | "\n", 47 | "The server can be navigated via the same commands as the Mac, for it is Linux. 
This includes \n", 48 | "- ```cd``` change directory, recall ```~``` is home, ```.``` is here and ```..``` is up one \n", 49 | "- ```ls``` list directory, argument -a means all, i.e. \"ls -a\"\n", 50 | "- ```man``` the help page, so for help on other arguments for ls it would be ```man ls```\n", 51 | "- ```touch``` creates a new file." 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "## Copying files to a server \n", 59 | "\n", 60 | "To copy files to a server you can use scp (or secure copy) through both Cygwin and \*nix systems. To do this scp is run from the terminal (outside of ssh) with remote and local file paths as arguments. \n", 61 | "\n", 62 | "``` bash\n", 63 | "# download: remote -> local\n", 64 | "scp user@remote_host:remote_file local_file \n", 65 | "```\n", 66 | "\n", 67 | "The local file can also be a directory where the file would go. To upload, the local file is placed first:\n", 68 | "\n", 69 | "``` bash\n", 70 | "# upload: local -> remote\n", 71 | "scp local_file user@remote_host:remote_file\n", 72 | "```\n", 73 | "\n", 74 | "So if we have the python file \"twitterServer.py\" on our computer at ~/Desktop/twitterServer.py then you would type:\n", 75 | "\n", 76 | "```\n", 77 | "cd ~/Desktop/\n", 78 | "scp twitterServer.py inetXXXX@.oii.ox.ac.uk:\n", 79 | "```\n", 80 | "And it should prompt you first for a password. If successful it will show a file copy progress line and then complete." 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "## Editing text on the server\n", 88 | "\n", 89 | "There are a few ways to edit text on a server. There are two basic text editors: ```nano``` and ```vi```. Many hardcore programmers love vi because it employs a huge variety of keyboard shortcuts. It's for the same reason that most people find vi to be a huge pain. There are even games out there to help you improve your vi skills. 
But I personally think it will be futile without serious commitment. Regardless, I actually don't mind tweaking things in ```vi``` when I'm working on a server.\n", 90 | "\n", 91 | "vi is started with the command ```vi```. You are then presented with a blank screen with \n", 92 | "\n", 93 | "```bash \n", 94 | "~ \n", 95 | "~ \n", 96 | "~ \n", 97 | "~``` \n", 98 | "going down the left hand side. This is the 'end of the document'. You cannot type right away in vi, but instead have to switch to one of its editing modes. Pressing ```i``` will do that; then you can type. Press escape and you are out of editing mode. Then in order to issue a system command you have to press ```:```. To write, you would type ```w``` and press enter. To save and exit you would type ```wq```; to quit without saving it is ```q!```. I will demonstrate this, but then return to it here. It is confusing, but it has a logic to it, just a foreign one to most students here. \n", 99 | "\n", 100 | "Follow along as we will first create a python file in ```vi```, then copy it to the server, log into the server, run it and then exit. \n", 101 | "\n", 102 | "The file is going to be called \"example.py\". It will be really simple: \n", 103 | "\n", 104 | "``` python\n", 105 | "import time,datetime\n", 106 | "\n", 107 | "while 1:\n", 108 | " print(\"The time is now: %s\" % datetime.datetime.now())\n", 109 | " time.sleep(3)\n", 110 | " \n", 111 | "```" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "## Running a program on a server\n", 119 | "\n", 120 | "You will notice when we run it on the server that it keeps going until we stop it, which we can do with a keyboard interrupt. But what happens if we want to leave the server? Does it continue running? No. The shell that you create when logging into the terminal only lives while you are running it. It is destroyed when the connection is destroyed. 
\n", 121 | "\n", 122 | "In order to keep it running on the server, it has to be run from a shell that is not tied to ssh. To do this we use a **multiplexer**. That is a program that is going to create a second shell window for us that we can check in on and leave. If we have left it then we can get back to it. \n", 123 | "\n", 124 | "To do this we use ```screen```. This program is a multiplexer that will spawn a new instance of a terminal for you to use every time you type screen. It then displays that window. From this second window you can run commands, then exit the screen and the commands will still keep running. Let's first ```screen``` then run the python file. \n", 125 | "\n", 126 | "How do we escape this screen? It does not give a huge amount of feedback, but you would want to press ```ctrl-a, d```. Control-a first lets screen know you are going to enter a command. Then ```d``` is the command for **detaching**. This should bring you back to the main terminal window. To reattach you should type: ```screen -r```. If you happen to have more than one screen it will list these with process identifiers called ```pid```s. You can type ```screen -r ``` followed by the pid to reattach to the correct screen. As a tip, you can name a screen when you first create it by typing ```screen -S ``` followed by a name, and then reattach with that name." 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": {}, 132 | "source": [ 133 | "# Section 2. Creating a Twitter Stream listener \n", 134 | "\n", 135 | "Creating a Twitter stream listener is useful if you want to collect your own live data from the site. First let's check that the module was imported correctly. 
" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": null, 141 | "metadata": {}, 142 | "outputs": [], 143 | "source": [ 144 | "try: \n", 145 | " import tweepy\n", 146 | "except ModuleNotFoundError:\n", 147 | " import sys\n", 148 | " !{sys.executable} -m pip install git+https://github.com/tweepy/tweepy.git\n", 149 | " import tweepy" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "metadata": {}, 155 | "source": [ 156 | "If we don't get an error then we should be all good. Now let's go over to Twitter to get some API keys. We start at https://developer.twitter.com/ and then go to \"apps\" under your name. We want to create a new app, get the keys, get the secret keys and then make use of them. We can do this in a similar way to what we did with API keys from reddit. (i.e. create the json, close it, delete them from the script. Bear in mind you will have to upload both the json and the script to the server later. " 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": null, 162 | "metadata": {}, 163 | "outputs": [], 164 | "source": [ 165 | "1/0\n", 166 | "import json \n", 167 | "\n", 168 | "keys = {\"CONSUMER_KEY\":\"\", \n", 169 | " \"CONSUMER_SECRET\":\"\", \n", 170 | " \"ACCESS_TOKEN\":\"\", \n", 171 | " \"ACCESS_TOKEN_SECRET\":\"\",\n", 172 | " \"gmail\":\"\"}\n", 173 | "\n", 174 | "with open(\"twitter_keys.json\",'w') as infile:\n", 175 | " infile.write( json.dumps(keys) )" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": null, 181 | "metadata": {}, 182 | "outputs": [], 183 | "source": [ 184 | "TWEETFILE = \"Tweet_Output.dat\"\n", 185 | "\n", 186 | "keys = json.loads(open(\"twitter_keys.json\").read())\n", 187 | "\n", 188 | "auth = tweepy.OAuthHandler(keys['CONSUMER_KEY'],keys['CONSUMER_SECRET'])\n", 189 | "auth.set_access_token(keys['ACCESS_TOKEN'], keys['ACCESS_TOKEN_SECRET'])\n", 190 | "\n", 191 | "api = tweepy.API(auth)\n", 192 | "\n", 193 | "if api:\n", 194 | " 
print(\"Successfully Authenticated\")\n", 195 | "else:\n", 196 | " print(\"Problems with authentication\")\n", 197 | "\n", 198 | "class CustomStreamListener(tweepy.StreamListener):\n", 199 | "\n", 200 | " def __init__ (self,limit=100,outfile=\"fileout.dat\",counter=10):\n", 201 | " self.count = 0\n", 202 | " self.limit = limit\n", 203 | " self.counter = counter\n", 204 | " self.fileout = open(outfile,'a')\n", 205 | " \n", 206 | " def on_error(self, status_code):\n", 207 | " print ('Encountered error with status code:', status_code)\n", 208 | " \n", 209 | " return True # Don't kill the stream\n", 210 | "\n", 211 | " def on_timeout(self):\n", 212 | " print('Timeout...')\n", 213 | " time.sleep(1)\n", 214 | " return True # Don't kill the stream\n", 215 | "\n", 216 | " def on_data(self, data):\n", 217 | " self.count += 1\n", 218 | " if self.count % self.counter == 0:\n", 219 | " print(\"Processing Tweet: %s\" % self.count)\n", 220 | " if self.count == self.limit:\n", 221 | " self.fileout.close()\n", 222 | " return False\n", 223 | " else:\n", 224 | " self.fileout.write(data.strip() + \"\\n\")\n", 225 | " \n", 226 | "# Notice that this instantiates the stream listener but it does not start it. \n", 227 | "streaming_api = tweepy.streaming.Stream(auth,CustomStreamListener(), timeout=60)\n", 228 | "\n", 229 | "# This is the filter we use; filters on twitter can be very complex. \n", 230 | "TWEET_FILTER = [\"Trump\"]\n", 231 | "\n", 232 | "# This starts the stream listener. \n", 233 | "streaming_api.filter(follow=None, track=TWEET_FILTER)" 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "metadata": {}, 239 | "source": [ 240 | "# Section 3. Email warnings\n", 241 | "\n", 242 | "Building in an email warning is a useful way to alert you if something goes wrong on the server. We use gmail since Google enables us to have app passwords that are specific to the program and don't require two factor authentication. 
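The message assembled below is a bare header-and-body string. The standard library's ```email``` package builds the same thing with proper headers, which is less fragile; a sketch (the addresses are placeholders, and nothing is actually sent here):

```python
from email.mime.text import MIMEText

# Build the warning message with real MIME headers.
msg = MIMEText("The stream produced an error. Please check the server.")
msg["From"] = "sender@example.com"       # placeholder address
msg["To"] = "receiver@example.com"       # placeholder address
msg["Subject"] = "Help, the stream is broken!"

# .as_string() yields what smtplib's sendmail() would be handed.
print(msg.as_string())
```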
" 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": null, 248 | "metadata": {}, 249 | "outputs": [], 250 | "source": [ 251 | "import time\n", 252 | "import smtplib\n", 253 | "import datetime\n", 254 | "\n", 255 | "def send_email(test=True, text = \"\",pw=\"\"):\n", 256 | "\n", 257 | " if pw == \"\":\n", 258 | " print(\"Did not include a password\")\n", 259 | " return False\n", 260 | " else:\n", 261 | " gmail_pwd = pw # Use your own password! - see https://security.google.com/settings/security/apppasswords\n", 262 | "\n", 263 | " gmail_user = \"bernie.hogan@gmail.com\"\n", 264 | " FROM = \"bernie.hogan@gmail.com\"\n", 265 | " TO = [\"\"]\n", 266 | " SUBJECT = \"Help, the stream is broken!\"\n", 267 | " TEXT = \"The stream produced an error. Please return to the server and check it out. %s\" % text\n", 268 | " # Prepare actual message\n", 269 | " message = \"\"\"From: %s\\nTo: %s\\nSubject: %s\\n\\n%s\n", 270 | " \"\"\" % (FROM, \", \".join(TO), SUBJECT, TEXT)\n", 271 | "\n", 272 | " print(message)\n", 273 | " try:\n", 274 | " server = smtplib.SMTP(\"smtp.gmail.com\", 587)\n", 275 | " server.ehlo()\n", 276 | " server.starttls()\n", 277 | " server.login(gmail_user, gmail_pwd)\n", 278 | " server.sendmail(FROM, TO, message)\n", 279 | " server.close()\n", 280 | " print('successfully sent the mail')\n", 281 | " except:\n", 282 | " print(\"failed to send mail\")\n", 283 | "\n", 284 | "send_email(text = \"False alarm, just starting the program %s\" % datetime.datetime.now() ,pw=keys[\"gmail\"])" 285 | ] 286 | }, 287 | { 288 | "cell_type": "markdown", 289 | "metadata": {}, 290 | "source": [ 291 | "Now you can embed this method into your program, wrap the stream listener in a try / except statement and if it fails, on the exception it will email you to say that there was an issue. 
Like so: " 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": null, 297 | "metadata": {}, 298 | "outputs": [], 299 | "source": [ 300 | "import json \n", 301 | "keys = json.loads(open(\"twitter_keys.json\").read())\n" 302 | ] 303 | }, 304 | { 305 | "cell_type": "code", 306 | "execution_count": null, 307 | "metadata": {}, 308 | "outputs": [], 309 | "source": [ 310 | "try:\n", 311 | " # This starts the stream listener. \n", 312 | " streaming_api.filter(follow=None, track=TWEET_FILTER)\n", 313 | " 1/0\n", 314 | "except Exception as e:\n", 315 | " send_email(text = \"We received the following error that stopped the program: %s\" % e)" 316 | ] 317 | }, 318 | { 319 | "cell_type": "markdown", 320 | "metadata": {}, 321 | "source": [ 322 | "# Section 4. Checking the data. \n", 323 | "\n", 324 | "First we will want to get the data out of the server using ```scp```, then we will want to parse it. I placed data in a flat file with one tweet object per line. These days Twitter has an 'extended_tweet' object for long tweets. 
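The truncation check used below can be exercised on hand-made records before pointing it at real data (the field names follow Twitter's extended-tweet payload; the records themselves are invented):

```python
import json

# Two invented records in the shape the stream listener writes out.
lines = [
    json.dumps({"truncated": False, "text": "a short tweet"}),
    json.dumps({"truncated": True, "text": "a long tw...",
                "extended_tweet": {"full_text": "a long tweet in full"}}),
]

for line in lines:
    x = json.loads(line)
    # Truncated tweets carry their full text in extended_tweet.
    text = x["extended_tweet"]["full_text"] if x["truncated"] else x["text"]
    print(text)
```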
\n", 325 | "\n", 326 | "See the code snippet below" 327 | ] 328 | }, 329 | { 330 | "cell_type": "code", 331 | "execution_count": null, 332 | "metadata": {}, 333 | "outputs": [], 334 | "source": [ 335 | "TWEETFILE = \"Tweet_Output.dat\"\n", 336 | "\n", 337 | "with open(TWEETFILE) as filein:\n", 338 | " for i in filein.readlines(): \n", 339 | " if len(i) > 1:\n", 340 | " x = json.loads(i.strip())\n", 341 | " if x[\"truncated\"]:\n", 342 | " print(x[\"extended_tweet\"][\"full_text\"],\"\\n\")\n", 343 | " else:\n", 344 | " print(x[\"text\"],\"\\n\")" 345 | ] 346 | } 347 | ], 348 | "metadata": { 349 | "kernelspec": { 350 | "display_name": "Python 3", 351 | "language": "python", 352 | "name": "python3" 353 | }, 354 | "language_info": { 355 | "codemirror_mode": { 356 | "name": "ipython", 357 | "version": 3 358 | }, 359 | "file_extension": ".py", 360 | "mimetype": "text/x-python", 361 | "name": "python", 362 | "nbconvert_exporter": "python", 363 | "pygments_lexer": "ipython3", 364 | "version": "3.7.0" 365 | } 366 | }, 367 | "nbformat": 4, 368 | "nbformat_minor": 2 369 | } 370 | -------------------------------------------------------------------------------- /Course_Material/Week_4/PySDS_w4-2_exampleCode_twitterStreamer.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import time 3 | import smtplib 4 | import sys 5 | import tweepy 6 | import json 7 | 8 | class CustomStreamListener(tweepy.StreamListener): 9 | 10 | def __init__ (self,limit=100,outfile="fileout.dat",counter=10): 11 | self.count = 0 12 | self.limit = limit 13 | self.counter = counter 14 | self.fileout = open(outfile,'a') 15 | 16 | def on_error(self, status_code): 17 | print ('Encountered error with status code:', status_code) 18 | return True # Don't kill the stream 19 | 20 | def on_timeout(self): 21 | print('Timeout...') 22 | time.sleep(1) 23 | return True # Don't kill the stream 24 | 25 | def on_data(self, data): 26 | self.count += 1 27 | if 
self.count % self.counter == 0: 28 | print("Processing Tweet: %s" % self.count) 29 | if self.count == self.limit: 30 | self.fileout.close() 31 | return False 32 | else: 33 | self.fileout.write(data.strip() + "\n") 34 | 35 | def send_email(test=True, text = "",pw=""): 36 | 37 | if pw == "": 38 | print("Did not include a password") 39 | return False 40 | gmail_user = "bernie.hogan@gmail.com" 41 | gmail_pwd = pw # Use your own password! - see https://security.google.com/settings/security/apppasswords 42 | FROM = "bernie.hogan@gmail.com" 43 | if test: 44 | TO = ["bernie.hogan@oii.ox.ac.uk"] 45 | SUBJECT = "Help, the stream is broken!" 46 | TEXT = "The stream produced an error. Please return to the server and check it out. %s" % text 47 | # Prepare actual message 48 | message = """From: %s\nTo: %s\nSubject: %s\n\n%s 49 | """ % (FROM, ", ".join(TO), SUBJECT, TEXT) 50 | 51 | print(message) 52 | try: 53 | server = smtplib.SMTP("smtp.gmail.com", 587) 54 | server.ehlo() 55 | server.starttls() 56 | server.login(gmail_user, gmail_pwd) 57 | server.sendmail(FROM, TO, message) 58 | server.close() 59 | print('successfully sent the mail') 60 | except: 61 | print("failed to send mail") 62 | 63 | 64 | 65 | def main(argv): 66 | TWEETFILE = "Tweet_Output.dat" 67 | 68 | keys = json.loads(open("twitter_keys.json").read()) 69 | 70 | auth = tweepy.OAuthHandler(keys['CONSUMER_KEY'],keys['CONSUMER_SECRET']) 71 | auth.set_access_token(keys['ACCESS_TOKEN'], keys['ACCESS_TOKEN_SECRET']) 72 | 73 | api = tweepy.API(auth) 74 | 75 | if api: 76 | print("Successfully Authenticated") 77 | else: 78 | print("Problems with authentication") 79 | 80 | # Notice that this instantiates the stream listener but it does not start it. 81 | streaming_api = tweepy.streaming.Stream(auth, 82 | CustomStreamListener(limit=10,outfile=TWEETFILE,counter=2), 83 | timeout=60) 84 | 85 | # This is the filter we use; filters on twitter can be very complex. 
86 | TWEET_FILTER = ["Trump"] 87 | 88 | # This starts the stream listener. 89 | try: 90 | streaming_api.filter(follow=None, track=TWEET_FILTER) 91 | except Exception as e: 92 | send_email(text = "We received the following error that stopped the program: %s" % e) 93 | 94 | if __name__ == "__main__": 95 | main(sys.argv) 96 | # keys = json.loads(open("twitter_keys.json").read()) 97 | 98 | 99 | -------------------------------------------------------------------------------- /Course_Material/Week_4/PySDS_w4-3a_ResearchEthics.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oxfordinternetinstitute/sds-python/8c2469c9ff52d53e1176a6692a27cfe6dda55bc1/Course_Material/Week_4/PySDS_w4-3a_ResearchEthics.pdf -------------------------------------------------------------------------------- /Course_Material/Week_4/PySDS_w4-3b_ResearchQuestions.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oxfordinternetinstitute/sds-python/8c2469c9ff52d53e1176a6692a27cfe6dda55bc1/Course_Material/Week_4/PySDS_w4-3b_ResearchQuestions.pdf -------------------------------------------------------------------------------- /Course_Material/Week_4/PySDS_w4-4_Basic_Github.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "**PySDS Week 4. Lecture 4. v1** Author: Bernie Hogan" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "# GitHub Basics \n", 15 | "\n", 16 | "At this point you have encountered GitHub in your work. We seem to download a lot of libraries and other software from there. What is it? GitHub is an online platform for storing code and software built from that code. GitHub is but one of many hosting services built around ```git```, which is itself a version control system, or VCS. 
There are a number of VCSs. Mercurial is one alternative to Git, as is the original ```CVS```, which is what was commonly used to manage files before Git came along. Fun fact: Git was created by Linus Torvalds, the creator of Linux. " 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "I'll be honest. I don't think I can do a better job than the Git people themselves. Let's switch over to their introductory page on Git for a bit. Work through that page and then pop back here. \n", 24 | "\n", 25 | "https://guides.github.com/introduction/git-handbook/\n", 26 | "\n", 27 | "\\ \n", 28 | "\n", 29 | "Welcome back. Now that we've seen how Git works, I recommend considering using a desktop application, such as GitHub Desktop. It's not necessary. Some people really like Git in the shell. But I prefer the features of the app. If you have the app installed you can clone directly into it. \n", 30 | "\n", 31 | "Here are some important things to remember:\n", 32 | "- You can **clone** a project, which creates your own independent version. \n", 33 | "- You can further **fork** a project, which links back to the original but is your own variant. This is what you can use to make \"pull requests\", which means that the original project will take code from your forked project. \n", 34 | "\n", 35 | "Let's look at this repository. You cannot see it yet as it has not been made public: \n", 36 | "https://github.com/oxfordinternetinstitute/sds-python\n", 37 | "\n", 38 | "On my computer, you can see a 'clone in desktop' button. Let's do that. If you don't have the app, you can also use the same code that we used in the example from the GitHub guides. \n", 39 | "\n", 40 | "This will now give you, in a single directory, all of the files from this course that we need, with the exception of the Twitter data, which I cannot share publicly as per Twitter's policy. 
\n", 41 | "\n", 42 | "## Why here and not Canvas?\n", 43 | "\n", 44 | "Canvas is great for course management but afterwards, it will be hard for me to keep up these files. For now, I'm going to maintain them on GitHub. That means that this includes the assignments _and_ the model answers (mainly written by Patrick Gildersleve). Thus, next year I'll have to come up with newer, even more nonsensical exercises like Haiku, Acrostics and choose-your-own adventures. \n", 45 | "\n", 46 | "If you recall from forking the repository, you can consider giving back by creating a pull request from a forked repository. If it's a good addition, I'll merge it in! This course is never perfect and never finished. But let's hope that, just like our minds, the repository keeps growing and getting ever more effective at the tasks we set out for it. \n", 47 | "\n", 48 | "## When will it be made public? \n", 49 | "\n", 50 | "As soon as possible. You have all the files you need in Canvas. However, in Git we have to make sure that we are not sharing data we shouldn't. That includes Twitter data. There is also no reason to share some of the example files. If you'll notice, my repository has a '.gitignore' file. One of the tasks left to do is to ensure that all the correct files that are supposed to be ignored by Git are. \n", 51 | "\n", 52 | "We also want to display all of the model answers for the various formatives and ensure that all my teaching notes are integrated rather than deleted. It's a lot of work, but it should be done next week. I know the students in the other course will be quite keen to see what we have been up to." 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "Final note.\n", 60 | "\n", 61 | "As we have been given a lecture on LaTeX, it is worth showing that pandas can automatically format a table for LaTeX. 
" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": null, 67 | "metadata": {}, 68 | "outputs": [], 69 | "source": [ 70 | "import pandas as pd \n", 71 | "\n", 72 | "x = pd.DataFrame([[2,3],[3,4]])\n", 73 | "display(x)\n", 74 | "print(x.to_latex())" 75 | ] 76 | } 77 | ], 78 | "metadata": { 79 | "kernelspec": { 80 | "display_name": "Python 3", 81 | "language": "python", 82 | "name": "python3" 83 | }, 84 | "language_info": { 85 | "codemirror_mode": { 86 | "name": "ipython", 87 | "version": 3 88 | }, 89 | "file_extension": ".py", 90 | "mimetype": "text/x-python", 91 | "name": "python", 92 | "nbconvert_exporter": "python", 93 | "pygments_lexer": "ipython3", 94 | "version": "3.7.0" 95 | } 96 | }, 97 | "nbformat": 4, 98 | "nbformat_minor": 2 99 | } 100 | -------------------------------------------------------------------------------- /Course_Material/Week_4/PySDS_w4-4b_Latex.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oxfordinternetinstitute/sds-python/8c2469c9ff52d53e1176a6692a27cfe6dda55bc1/Course_Material/Week_4/PySDS_w4-4b_Latex.pdf -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Fundamentals of Social Data Science 2 | 3 | This is the course repository for the introductory python course in Oxford's [Social Data Science program](https://www.oii.ox.ac.uk/study/msc-in-social-data-science/). 4 | 5 | This course, currently named [Python for Social Data Science](https://www.oii.ox.ac.uk/study/courses/python-for-social-data-science/), teaches the skills needed to begin working in social data science. This focuses first on programming skills in python. It includes some key basic programming skills, such as lists and functions, as well as more abstract and complex topics like API access, file types, text processing and DataFrames. 
6 | 7 | Data science is an emerging discipline concerned with the processing and management of data. Because data is now so prevalent, complex and voluminous, there is a niche for those with specific skills in data processing. Data science has four pillars: theory, data access, data wrangling and data analysis. 8 | 9 | This course only goes into very rudimentary detail on theory and analysis. Theory in our degree program is taught through the course "Foundations of Social Data Science". Analysis is partially taught with a focus on descriptives. But it is undoubtedly the case that data science is steeped in statistics. In our program statistics skills are taught in two other courses, Statistics for Social Data Science and Intro to Machine Learning. That said, some skills in this course really do benefit from knowing a little theory and a little statistics, so you will encounter some basic theoretical ideas in the pages in this repository. For example, in Week 4, lecture 1 on visualization, several visualization frameworks are included and the visualizations will be based on some standard statistical distributions, like a normal distribution or an exponential distribution. 10 | 11 | ### The repository 12 | The repository for 2019-2020 will be held on Oxford's Canvas courseware platform until the course is complete. 13 | 14 | The repository for 2018-2019 consists of four main folders: 15 | - course work, 16 | - assignments, 17 | - supplementary data (public), 18 | - supplementary data (withheld). 19 | 20 | The latter one is really just more data that should be in the supplementary data folder but for licensing reasons we will not be sharing that data in a public GitHub repository. In the case of the database of tweets, this is unavoidable, as the licensing agreement for use of the Twitter API forbids sharing collected tweets. 21 | 22 | In the assignments folder are the assignments and some model answers (where available). 
The model answers were typically drafted by teaching assistant Patrick Gildersleve. Updated versions of these can also be found at Patrick's own GitHub repository for [model answers](https://github.com/pgilders/OII-PySDS-ModelAnswers). 23 | 24 | ## About this course. 25 | Some of the information about programming in this course is partial. It is not meant to be incorrect, but we are definitely omitting or papering over certain topics for brevity's sake. The goal here is to give people the skills and wisdom to put a study together for analysis and publication. In that sense, this course is more directed but also more concise than a repository such as Jake Vanderplas' [Python for Data Science](https://github.com/jakevdp/PythonDataScienceHandbook). Jake's repository, by contrast, is much closer to emphasizing completeness. 26 | 27 | You will also notice various Muppet-themed examples throughout the course. This is because, as a television show, the Muppets offer an extensive amount of accessible data, from episode guides to Wikipedia profiles of characters. This can help give us a flavor for high-dimensional data while steering a little clear of the complex theoretical issues that are sure to arise in a more focused and sustained project. 28 | 29 | ### How to run these files 30 | The course is written almost entirely in Jupyter notebooks. We recommend downloading the [Anaconda package for scientific Python](https://www.anaconda.com/download/), installing it and then launching the Anaconda Navigator. The navigator provides access to a host of scientific programming tools, particularly JupyterLab. Run JupyterLab and then navigate to a folder that includes the .ipynb files. 31 | 32 | ### Running this course in a browser 33 | Python notebooks can be run directly in the browser using Google Colab.
To do this, simply get a URL from a notebook page, such as github.com/oxfordinternetinstitute/sds-python/blob/master/Course_Material/Week_0/PySDS_week0_lecture1.ipynb. Then replace github.com with colab.research.google.com/github, so you would navigate to: https://colab.research.google.com/github/oxfordinternetinstitute/sds-python/blob/master/Course_Material/Week_0/PySDS_week0_lecture1.ipynb to run that page in the browser. 34 | 35 | ## Week 1. Introduction to Programming in Python. 36 | This week teaches the basics, from variable types, loops and exceptions to functions. 37 | 38 | ## Week 2. Data Wrangling 39 | This week covers the basics of working with the Python for Data Analysis library (pandas). We look at DataFrames, SQL, XML and JSON. We parse datetime objects and introduce regular expressions. 40 | 41 | ## Week 3. Accessing and exploring data 42 | This week examines ways to collect data from the web, how to examine that data and how to turn it into a table or DataFrame for analysis. 43 | 44 | ## Week 4. Research and presentation skills 45 | This week presents skills in data visualization, working on a server, ethical data access and LaTeX. It is more oriented towards lectures than the prior weeks. To that end, you will see several .pdf files here based on lectures created in PowerPoint and LaTeX beamer.
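The GitHub-to-Colab URL substitution described above can be sketched in a few lines of Python. The helper name `github_to_colab` is ours for illustration, not part of the course material; it simply performs the string swap by hand.

```python
def github_to_colab(url: str) -> str:
    """Rewrite a github.com notebook URL so it opens in Google Colab."""
    prefix = "https://github.com/"
    if not url.startswith(prefix):
        raise ValueError("expected a URL starting with " + prefix)
    # Swap the github.com host for colab.research.google.com/github,
    # keeping the rest of the path (user/repo/blob/branch/file) intact.
    return "https://colab.research.google.com/github/" + url[len(prefix):]

print(github_to_colab(
    "https://github.com/oxfordinternetinstitute/sds-python"
    "/blob/master/Course_Material/Week_0/PySDS_week0_lecture1.ipynb"
))
```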
46 | -------------------------------------------------------------------------------- /Supplementary_Data_(public)/MuppetsTable.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oxfordinternetinstitute/sds-python/8c2469c9ff52d53e1176a6692a27cfe6dda55bc1/Supplementary_Data_(public)/MuppetsTable.xlsx -------------------------------------------------------------------------------- /Supplementary_Data_(public)/example_lines.txt: -------------------------------------------------------------------------------- 1 | Testing Line 1 2 | Exploring Line 2 3 | Line 3 over here! 4 | What? You wanted line 4? 5 | No, I expected the fifth line 6 | I think you spelled line 6 wrong 7 | Line 7 here: Forget it. 8 | I'm the troublemaker! (Line 8 hehehe) 9 | Dr. Hogan thinks I'm line 9 :P 10 | 11 | Line 11 being difficult here. --------------------------------------------------------------------------------