├── .gitignore ├── README.md ├── day1 ├── git_followup.sh ├── git_lifejacket.sh ├── hello_world.py ├── plots │ ├── languages.png │ ├── python_comfort.png │ └── topics.png └── pythonintro.pdf ├── day2 ├── backup │ ├── datasci.json │ ├── error.json │ ├── log.txt │ └── notes.mdown ├── day2log.txt ├── download_github.py └── notes15jan.mdown ├── day3 ├── day3log.txt ├── day3notes.md ├── lecture.py └── photo.png ├── day4 ├── ISO_3166-1 ├── backup │ ├── country.py │ ├── log.py │ └── simple.py ├── decade.py ├── languages.csv ├── log.py ├── notes.mdown ├── population.csv └── prepare.py ├── day5 ├── README.md ├── backup │ └── truck_counts.csv ├── day5log.py ├── map │ ├── data.json │ ├── map.html │ └── sf_nhoods.json ├── maps.ipynb ├── mobile_food.csv ├── mpg_plots.py ├── notes.md ├── sf_nhoods.json ├── truck_map.py └── vehicles.csv ├── day6 ├── analysis.py ├── announce.py ├── babynamer.py ├── day6log.txt ├── dearyeji.txt ├── findnames.py ├── textgraph.py └── yob2013.txt ├── day7 ├── README.md ├── aa.db ├── atus.db ├── atus_blank.py ├── backup │ ├── atus.py │ └── lecture.py ├── day7log1.py └── day7log2.py ├── day8 ├── backup │ ├── notes.py │ └── titanic.py ├── betafit.py ├── day8.py ├── final.mdown ├── kaggle.py ├── sleepminutes.csv └── titanic.csv ├── exercise1 ├── exercise1.md ├── usda.py └── usda_blank.py ├── exercise2 ├── exercise2.md ├── random_walk.py └── random_walk_solution.py ├── exercise3 ├── exercise3.mdown ├── wpgraph.pdf └── wpgraph.py ├── exercise4 ├── convert.py └── exercise4.md ├── feedback ├── feedback.csv └── feedback.ipynb └── iid_talk ├── Makefile ├── iidata workshop.ipynb └── slides.mdown /.gitignore: -------------------------------------------------------------------------------- 1 | __pycache__ 2 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | [Course Videos](http://dsi.ucdavis.edu/PythonMiniCourse/) 2 | 3 | # Python for Data Mining 4 | 5 | [Python][] is a programming language designed to have clear, concise, and 6 | expressive code. 7 | An extremely popular general-purpose language, Python has been used for tasks 8 | as diverse as web development, teaching, and systems administration. 9 | This mini-course provides an introduction to Python for data mining. 10 | 11 | Messy data has an inconsistent or inconvenient format, and may have missing 12 | values. 13 | Noisy data has measurement error. 14 | *Data mining extracts meaningful information from messy, noisy data.* 15 | This is a start-to-finish process that includes gathering, cleaning, 16 | visualizing, modeling, and reporting. 17 | 18 | Programming and research best practices are a secondary focus of the 19 | mini-course, because [Python is a philosophy][zen] as well as a language. 20 | Core concepts include: writing organized, well-documented code; being a 21 | self-sufficient learner; using version control for code management and 22 | collaboration; ensuring reproducibility of results; producing concise, 23 | informative analyses and visualizations. 24 | 25 | We will meet for four weeks during the Winter 2015 quarter at the 26 | University of California, Davis. 27 | 28 | [zen]: http://legacy.python.org/dev/peps/pep-0020/ 29 | [Python]: https://www.python.org/ 30 | 31 | ## Target Audience 32 | The mini-course is open to undergraduate and graduate students from all 33 | departments. 
34 | We recommend that students have prior programming experience 35 | and a basic understanding of statistical methods, 36 | so they can follow along with the examples. 37 | For instance, completion of STA 108 and STA 141 is sufficient 38 | (but not required). 39 | 40 | ## Topics 41 | 42 | ### Core Python 43 | The mini-course will kick off with a quick introduction to the syntax of 44 | Python, including operators, data types, control flow statements, function 45 | definition, and string manipulation. 46 | Slower, in-depth coverage will be given to uniquely Pythonic features such as 47 | built-in data structures, list comprehensions, iterators, and docstrings. 48 | 49 | Authoring packages and other advanced topics may also be discussed. 50 | 51 | ### Scientific Computing 52 | Support for stable, high-performance vector operations is provided by the NumPy 53 | package. 54 | NumPy will be introduced early and used often, because it's the foundation for 55 | most other scientific computing packages. 56 | We will also cover SciPy, which extends NumPy with functions for 57 | linear algebra, optimization, and elementary statistics. 58 | 59 | Specialized packages will be discussed during the final week. 60 | 61 | ### Data Manipulation 62 | The pandas package provides tabular data structures and convenience functions 63 | for manipulating them. 64 | This includes a two-dimensional data frame similar to the one found in R. 65 | Pandas will be covered extensively, because it makes it easy to 66 | 67 | + Read and write many formats (CSV, JSON, HDF, database) 68 | + Filter and restructure data 69 | + Handle missing values gracefully 70 | + Perform group-by operations (`apply` functions) 71 | 72 | ### Data Visualization 73 | 74 | Many visualization packages are available for Python, but the mini-course will 75 | focus on Seaborn, which is a user-friendly abstraction of the venerable 76 | matplotlib package. 77 | 78 | Other packages such as ggplot2, Vincent, Bokeh, and mpld3 may also be covered. 79 | 80 | ## Programming Environment 81 | Python 3 has syntax changes and new features that break compatibility with 82 | Python 2. 83 | All of the major scientific computing packages have added support for Python 3 84 | over the last few years, so it will be our focus. 85 | We recommend the [Anaconda][] Python 3 distribution, 86 | which bundles most packages we'll use into one download. 87 | Any other packages needed can be installed using `pip` or `conda`. 88 | 89 | Python code is supported by a vast array of editors. 90 | 91 | + [Spyder IDE][Spyder], included in Anaconda, 92 | is a Python equivalent of RStudio, 93 | designed with scientific computing in mind. 94 | + [PyCharm IDE][PyCharm] and [Sublime][] provide good user interfaces. 95 | + Terminal-based text editors, such as [Vim][] and [Emacs][], are a great 96 | choice for ambitious students. They can be used with any language. 97 | See [here][Text Editors] for more details. Clark and Nick both use Vim. 98 | 99 | [Anaconda]: http://continuum.io/downloads 100 | [Spyder]: https://code.google.com/p/spyderlib/ 101 | [PyCharm]: https://www.jetbrains.com/pycharm/ 102 | [Sublime]: http://www.sublimetext.com/ 103 | [Vim]: http://www.vim.org/ 104 | [Emacs]: https://www.gnu.org/software/emacs/ 105 | [Text Editors]: http://heather.cs.ucdavis.edu/~matloff/ProgEdit/ProgEdit.html 106 | 107 | ## References 108 | 109 | No books are required, but we recommend Wes McKinney's book: 110 | 111 | + McKinney, W. (2012). 
_Python for Data Analysis: Data Wrangling with Pandas, 112 | NumPy, and IPython_. O'Reilly Media. 113 | 114 | Python and most of the packages we'll use have excellent documentation, which 115 | can be found at the following links. 116 | 117 | + [Python 3](https://docs.python.org/3/) 118 | + [NumPy](http://docs.scipy.org/doc/numpy/) 119 | + [SciPy](http://docs.scipy.org/doc/scipy/reference/) 120 | + [pandas](http://pandas.pydata.org/pandas-docs/stable/) 121 | + [matplotlib](http://matplotlib.org/contents.html) 122 | + [Seaborn](http://stanford.edu/~mwaskom/software/seaborn/tutorial.html) 123 | + [scikit-learn](http://scikit-learn.org/stable/documentation.html) 124 | + [IPython](http://ipython.org/documentation.html) 125 | 126 | Due to Python's popularity, a large number of general references are available. 127 | While these don't focus specifically on data analysis, they're helpful for 128 | learning the language and its idioms. 129 | Some of our favorites are listed below, many of which are free. 130 | 131 | + Swaroop, C. H. (2003). _[A Byte of Python][]_. ([PDF][ABoP PDF]) 132 | + Reitz, K. _[Hitchhiker's Guide to Python][Hitchhiker's Guide]_. 133 | ([PDF][HGoP PDF]) 134 | + Lutz, M. (2014). _Python Pocket Reference_. O'Reilly Media. 135 | + Beazley, D. (2009). _Python Essential Reference_. Addison-Wesley. 136 | + Pilgrim, M., & Willison, S. (2009). _[Dive Into Python 3][]_. Apress. 137 | + [Non-programmer's Tutorial for Python 3][Non] 138 | + [Beginner's Guide to Python][Beginner's Guide] 139 | + [Five Lifejackets to Throw to the New Coder][New Coder] 140 | + [Pyvideo][Pyvideo]\* 141 | + [StackOverflow][]. Please be conscious of the [rules][SO Rules]! 142 | 143 | \* Videos featuring Guido Van Rossum, Raymond Hettinger, Travis Oliphant, 144 | Fernando Perez, David Beazley, and Alex Martelli are suggested. 145 | 146 | 147 | [A Byte of Python]: http://www.swaroopch.com/notes/python/ 148 | [ABoP PDF]: http://files.swaroopch.com/python/byte_of_python.pdf 149 | 150 | [Hitchhiker's Guide]: http://docs.python-guide.org/en/latest/ 151 | [HGop PDF]: https://media.readthedocs.org/pdf/python-guide/latest/python-guide.pdf 152 | 153 | [Dive Into Python 3]: http://www.diveintopython3.net/ 154 | [Non]: http://en.wikibooks.org/wiki/Non-Programmer%27s_Tutorial_for_Python_3 155 | [Beginner's Guide]: https://wiki.python.org/moin/BeginnersGuide 156 | [New Coder]: http://newcoder.io/ 157 | [Pyvideo]: http://pyvideo.org/ 158 | [StackOverflow]: http://stackoverflow.com/questions/tagged/python 159 | [SO Rules]: http://stackoverflow.com/tour 160 | 161 | -------------------------------------------------------------------------------- /day1/git_followup.sh: -------------------------------------------------------------------------------- 1 | 2 | # More on Git 3 | # =========== 4 | # This file describes some everyday tasks in git. 5 | 6 | # Creating Repositories 7 | # ===================== 8 | # You don't have to use GitHub to create a new repository. You can create an 9 | # new local repository named NAME with: 10 | # 11 | # git init [NAME] 12 | # 13 | git init my-repo 14 | 15 | # If you want to use a repository you created locally with a remote repository, 16 | # you have to tell git where the remote repository is: 17 | # 18 | # git remote [add NAME URL] 19 | # 20 | 21 | # Add a remote called origin: 22 | git remote add origin https://github.com/nick-ulle/my-repo.git 23 | 24 | # List all remotes for the current repo: 25 | git remote 26 | 27 | # After adding the remote repository, you can push and pull as usual. 
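# For example, a first push of the local 'master' branch might look like this
# (a sketch; the `-u` flag sets origin/master as the default upstream, so
# later you can run plain `git push` and `git pull`):
git push -u origin master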
28 | 
29 | # Branching
30 | # =========
31 | # Git supports keeping multiple working versions of a repository at once. These
32 | # are represented as branches. Every repository starts with a branch called
33 | # 'master'.
34 | #
35 | # To make a new branch named NAME, use:
36 | #
37 | # git branch [NAME]
38 | #
39 | git branch experimental
40 | 
41 | # On its own, `git branch` will list the repository's branches:
42 | 
43 | git branch
44 | 
45 | # You can switch branches with:
46 | #
47 | # git checkout NAME
48 | #
49 | git checkout experimental
50 | 
51 | # You can delete a branch with the following command:
52 | #
53 | # git branch -d NAME
54 | #
55 | git checkout master
56 | git branch -d experimental
57 | 
58 | git branch
59 | 
60 | # Now let's recreate that branch and switch to it.
61 | git branch experimental
62 | git checkout experimental
63 | 
64 | # If we make some changes on the branch and commit them, they don't get applied
65 | # to any other branch.
66 | touch testing.py
67 | git add testing.py
68 | git commit
69 | 
70 | ls
71 | git log
72 | 
73 | git checkout master
74 | ls
75 | git log
76 | 
77 | git branch -v
78 | 
79 | # To merge commits from another branch into the current branch, use:
80 | #
81 | # git merge BRANCH
82 | #
83 | git merge experimental
84 | 
85 | git status
86 | git log
87 | ls
88 | 
89 | # How should you use branches?
90 | #
91 | # Branches are useful for isolating experimental changes from code you already
92 | # have working. They also make it easy to manage multiple drafts.
93 | #
94 | # Use branches judiciously according to your own work habits, especially for
95 | # projects where you're the only contributor. In larger, public projects, try
96 | # to follow guidelines agreed upon by all the contributors.
97 | #
98 | # A popular branching workflow is described here:
99 | #
100 | # http://nvie.com/posts/a-successful-git-branching-model/
101 | #
102 | 
103 | # Stashing Changes
104 | # ================
105 | # You might want to switch branches, saving your work on the current branch
106 | # without committing it. The solution to this is stashing:
107 | #
108 | # git stash [pop]
109 | #
110 | 
111 | # For example, suppose you add a Python file:
112 | touch foo.py
113 | git add foo.py
114 | 
115 | # Then you realize you want to switch branches and work on something completely
116 | # different. Stash your current work to get it out of the way:
117 | git stash
118 | 
119 | # Switch to a different branch and do other work, committing when you're done:
120 | # (If you're following along, create the branch and switch to it with
121 | #
122 | # git checkout -b other-branch
123 | #
124 | # )
125 | git checkout other-branch
126 | 
127 | # When you're ready to go back to work on foo.py, you can switch back to the
128 | # original branch and pop the stash:
129 | git checkout master
130 | git stash pop
131 | ls
132 | 
133 | # Revising Commits
134 | # ================
135 | # What if you forget to add a change, make a typo in the commit message, etc.?
136 | # Fix it with:
137 | #
138 | # git commit --amend
139 | #
140 | vim README.md
141 | git add README.md
142 | git commit --amend
143 | 
144 | # NEVER amend a commit you've already pushed to a remote repository. It will
145 | # interfere with git's normal operation! Instead, you should make a new commit
146 | # for any changes you forgot.
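# For example, if a pushed commit left something out, stage the change and
# commit again (a sketch -- `notes.md` stands in for whatever file you forgot):
git add notes.md
git commit -m "Add notes.md, left out of the previous commit"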
147 | 148 | # Restoring Previous Commits 149 | # ========================== 150 | # Since git tracks all of the commits you make, you can easily restore the 151 | # state of the repository at an earlier point in time. You can do this with: 152 | # 153 | # git reset COMMIT [FILE] 154 | # 155 | # where COMMIT is a commit hash (listed with `git log`). You only need 156 | # to type the first few characters of the commit hash. For example: 157 | 158 | git reset f4fca 159 | 160 | # A more in-depth explanation of `git reset` is given in the git documentation. 161 | 162 | -------------------------------------------------------------------------------- /day1/git_lifejacket.sh: -------------------------------------------------------------------------------- 1 | 2 | # Git Lifejacket 3 | # ============== 4 | # Git is a distributed version control system. What does that mean? 5 | # 6 | # * Git makes it easy to share files (distributed). 7 | # * Git tracks the changes you make to files (version control system). 8 | # 9 | # Use git to... 10 | # 11 | # 1. Quickly distribute sets of files. 12 | # 2. Keep multiple versions of a file without making copies. 13 | # 3. Selectively undo changes you made 3 days or even 3 months ago. 14 | # 4. Efficiently merge work from several different people. 15 | # 5. ... 16 | # 17 | # This tutorial is a git lifejacket, meant to get you up and running. 18 | # 19 | # For more, see (try!) the git followup notes, posted online. Also check out 20 | # the git documentation at: 21 | # 22 | # http://www.git-scm.com/doc 23 | # 24 | # You can also get help with commands by appending `--help` to the end. For 25 | # example: 26 | git status --help 27 | 28 | # Git Repositories 29 | # ================ 30 | # A git repository (or 'repo') is just a set of files tracked by git. 31 | # 32 | # GitHub.com is a host for git repositories on the web, widely-used by 33 | # open-source projects. GitHub provides free hosting for public repositories. 34 | 35 | # You might want to work on a repository someone else created. Download a copy 36 | # of a repository from the web by cloning it: 37 | # 38 | # git clone URL 39 | # 40 | git clone https://github.com/nick-ulle/2015-python.git 41 | 42 | # Check the status of a repository with: 43 | # 44 | # git status 45 | # 46 | git status 47 | 48 | # You can also create new, empty repositories on GitHub, and then clone them: 49 | 50 | git clone https://github.com/nick-ulle/demo.git 51 | 52 | # Committing Changes 53 | # ================== 54 | # After changing some files, you can save a snapshot of the repository by 55 | # making a commit. This is a 2 step process. 56 | 57 | # Step 1: 58 | # 59 | # Add, or 'stage', the changes you want to save in the commit: 60 | # 61 | # git add FILE 62 | # 63 | git add README.md 64 | 65 | # To stage every file in the current repository: 66 | # 67 | # git add --all 68 | 69 | # Use `git status` to see which files are staged. 70 | 71 | # Step 2: 72 | # 73 | # Save the staged changes with the command: 74 | # 75 | # git commit 76 | # 77 | git commit 78 | 79 | # The commit command will ask you to type a message summarizing the changes. 80 | # The first line should be a short description, no more than 50 characters. 81 | # If you want to write more, insert a blank line. For example: 82 | # 83 | # Adds README.md, describing the repository. 84 | # 85 | # The added README.md also contains a classic programming phrase and the 86 | # meaning of life, the universe, and everything. 
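# (Aside: for a short, one-line message you can skip the editor with the
# `-m` flag, for example:
#
# git commit -m "Adds README.md, describing the repository."
#
# )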
87 | 
88 | # If you examine the repository status, git no longer sees any changes. This
89 | # is because they've been committed.
90 | 
91 | # When should you make a commit?
92 | #
93 | # Common advice is to commit early and often. Commits are your save points,
94 | # and you never know when you'll need to go back. You could commit every time
95 | # you finish a small piece of work, such as writing a function.
96 | 
97 | # You can see a history of the last N commits with the command:
98 | #
99 | # git log [-N]
100 | #
101 | git log
102 | git log -3
103 | 
104 | # Your Turn!
105 | # ==========
106 | # Clone the demo repository from
107 | #
108 | # https://www.github.com/nick-ulle/demo.git
109 | #
110 | # then make a new file, type something in it, and commit the file.
111 | 
112 | # Try making your GitHub account, creating a repo, cloning it, and making a commit!
113 | 
114 | # Working With Remote Repositories
115 | # ================================
116 | # What if you want to share your work online (say, GitHub)? An online
117 | # repository is a 'remote' repository.
118 | 
119 | # Given permission (e.g., you own the repo), you can push commits to a remote
120 | # repository with the command:
121 | #
122 | # git push [REMOTE BRANCH]
123 | #
124 | 
125 | # Push commits to the remote repo 'origin' on branch 'master':
126 | 
127 | git push origin master
128 | 
129 | # You can also retrieve commits other people have made to a repository. Do
130 | # this with:
131 | #
132 | # git pull [REMOTE BRANCH]
133 | #
134 | git pull origin master
135 | 
136 | 
--------------------------------------------------------------------------------
/day1/hello_world.py:
--------------------------------------------------------------------------------
1 | 
2 | # To open Python...
3 | # * Windows users: find ipython (3) in the Start Menu
4 | # * All others: enter `ipython` in terminal (without quotes)
5 | 
6 | print('Hello world!')
7 | 
8 | # Now check that packages we'll use often were installed correctly.
9 | # All of the following import lines should run without any message.
10 | 11 | # Testing NumPy 12 | import numpy as np 13 | 14 | # Testing SciPy 15 | import scipy as sp 16 | 17 | # Testing Pandas 18 | import pandas as pd 19 | 20 | # Testing Matplotlib 21 | import matplotlib.pyplot as plt 22 | 23 | # Finally, check that you can make plots: 24 | plt.plot(range(5)) 25 | plt.show() 26 | 27 | -------------------------------------------------------------------------------- /day1/plots/languages.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nick-ulle/2015-python/18e324ea50435d94c72ec853eec56ea6dfa5fe05/day1/plots/languages.png -------------------------------------------------------------------------------- /day1/plots/python_comfort.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nick-ulle/2015-python/18e324ea50435d94c72ec853eec56ea6dfa5fe05/day1/plots/python_comfort.png -------------------------------------------------------------------------------- /day1/plots/topics.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nick-ulle/2015-python/18e324ea50435d94c72ec853eec56ea6dfa5fe05/day1/plots/topics.png -------------------------------------------------------------------------------- /day1/pythonintro.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nick-ulle/2015-python/18e324ea50435d94c72ec853eec56ea6dfa5fe05/day1/pythonintro.pdf -------------------------------------------------------------------------------- /day2/backup/error.json: -------------------------------------------------------------------------------- 1 | {"documentation_url": "https://developer.github.com/v3/search", "errors": [{"code": "missing", "field": "q", "resource": "Search"}], "message": "Validation Failed"} -------------------------------------------------------------------------------- /day2/backup/log.txt: -------------------------------------------------------------------------------- 1 | clear 2 | davis = {'cool': True, 3 | 'students': 35e3} 4 | davis 5 | # This is a dictionary. The whole Python language is built on dictionaries 6 | davis['cool'] 7 | davis['students'] 8 | clear 9 | davis 10 | davis.keys() 11 | davis.values() 12 | # Keys and values together are called 'items' 13 | davis.items() 14 | davis 15 | clear 16 | 'cool' in Davis 17 | 'cool' in davis 18 | # This says that 'cool' is a key in the davis dictionary 19 | davis['cool'] 20 | # To answer this you need to know the philosophy of Python 21 | import this 22 | davis 23 | davis['stats'] 24 | # Lets add stats to davis 25 | davis['stats'] = 'excellent' 26 | davis 27 | depts = ['stats', 'math', 'ecology'] 28 | clear 29 | depts 30 | depts = ['stats', 'math', 'ecology'] 31 | davis 32 | depts 33 | type(davis) 34 | type(depts) 35 | type(123) 36 | type(123.1) 37 | type('123.1') 38 | type(True) 39 | davis['cool'] 40 | type(davis['cool']) 41 | clear 42 | depts 43 | nums = [0, 10, 20, 30, 40] 44 | # We are learning how to explore the basic data structures 45 | len 46 | ?len 47 | len? 48 | nums 49 | len(nums) 50 | # how do we pick out the first two? 51 | nums[:2] 52 | # Python begins counting at 0!!! 
53 | nums[0] 54 | nums[1] 55 | nums[2] 56 | nums 57 | nums[:2] 58 | # This is called 'slicing' 59 | nums[-1] 60 | nums[-2] 61 | nums[-3] 62 | nums 63 | nums[2:] 64 | # nums up to, but excluding the 2nd element 65 | nums[:2] 66 | # nums beginning with the 2nd element 67 | nums[2:] 68 | nums 69 | nums[:2] 70 | nums[2:] 71 | nums 72 | # What do you observe? 73 | nums[:2] + nums[2:] 74 | nums 75 | nums[:2] + nums[2:] == nums 76 | k = 3 77 | nums[:k] + nums[k:] == nums 78 | k = 10 79 | len(nums) 80 | nums[:k] + nums[k:] == nums 81 | # Can also do negative slices 82 | nums[-3:] 83 | nums 84 | nums[:k] + nums[k:] == nums 85 | # The positive side of 0 indexing is that we get this pretty identity 86 | # We have learned slicing 87 | import this 88 | import this 89 | # flat is better than nested 90 | davis 91 | depts 92 | # We've seen it in Python - now in JSON! 93 | # Ways to transfer data 94 | davis 95 | import json 96 | json.dumps? 97 | json.dumps(davis) 98 | davis 99 | davis 100 | davis['dept'] = dept 101 | davis['dept'] = depts 102 | davis 103 | davis['dept'] 104 | type(davis['dept']) 105 | json.dumps(davis) 106 | davis 107 | # We have seen that JSON looks just like Python lists and dicts 108 | # Time to fetch some live data! 109 | # import requests 110 | import requests 111 | requests.get 112 | baseurl = 'https://api.github.com/search/repositories' 113 | response = requests.get(baseurl) 114 | response 115 | response 116 | response.status_code 117 | response 118 | response.content 119 | response.json 120 | response.json() 121 | r = response.json() 122 | type(r) 123 | r.keys() 124 | r 125 | r['errors'] 126 | type(r['errors']) 127 | len(r['errors']) 128 | r['errors'] 129 | r['errors'][0] 130 | type(r['errors'][0]) 131 | r 132 | payload = {'q': 'data science', 'sort': 'stars'} 133 | ds = requests.get(baseurl, params=payload) 134 | ds 135 | ds.json() 136 | dsj = ds.json() 137 | type(dsj) 138 | len(dsj) 139 | dsj.keys() 140 | dsj['total_count'] 141 | dsj['incomplete_results'] 142 | dsji = dsji['items'] 143 | dsji = dsj['items'] 144 | clear 145 | dsji 146 | clear 147 | len(dsji) 148 | type(dsji) 149 | dsji[0] 150 | # Lets get the description for all of them! 151 | type(dsji[0]) 152 | [x['description'] for x in dsji] 153 | %run download_github.py 154 | ds 155 | %run download_github.py 156 | ds 157 | -------------------------------------------------------------------------------- /day2/backup/notes.mdown: -------------------------------------------------------------------------------- 1 | Lecture 2 2 | Thu Jan 8 19:32:53 PST 2015 3 | 4 | # Today's Goals: 5 | 6 | - Introduce basic Python 7 | - Learn two data structures- lists and dicts 8 | - See that Python lists and dicts correspond to JSON 9 | - Write a client for a REST API 10 | 11 | Why this is awesome: 12 | If you can write a client for any REST API you can programmatically 13 | access an essentially unlimited amount of data. 14 | So we're done looking at the 'iris' data :) 15 | 16 | Other names for Python dictionary: 17 | - key / value pair 18 | - dict 19 | - hash table 20 | - hash map 21 | - associative array 22 | 23 | Ways of transferring data: 24 | 25 | 1. verbally 26 | 2. look in a book 27 | 3. email 28 | 4. download CSV file from the web 29 | 5. 
query remote database or system 30 | - Direct connection to underlying database technology *Danger zone!* 31 | - Go through a REST API layer *Easier!* 32 | 33 | Use common format like JSON, XMl, CSV 34 | CSV also called 'flat file' 35 | 36 | Remote databases are very powerful 37 | 38 | 39 | Data Golf: 40 | 1 million integers in a Python list 41 | -------------------------------------------------------------------------------- /day2/day2log.txt: -------------------------------------------------------------------------------- 1 | a = 123 2 | a 3 | a 4 | # The answer to what happens when.... 5 | # Open an interpreter, try it, and see!! 6 | type(a) 7 | type(1.0) 8 | type('hello') 9 | type(True) 10 | type(None) 11 | davis = {'state': 'California', 'students': 35000} 12 | type(davis) 13 | # The whole python language is built on dictionaries (dicts) 14 | davis.keys() 15 | davis.values() 16 | davis.items() 17 | davis 18 | davis['state'] 19 | davis['state'] = 'CA' 20 | davis 21 | davis['cool'] = True 22 | davis 23 | type(davis['cool']) 24 | # that was the dictionary 25 | # Moving on to the list 26 | nums = [0, 10, 20, 30, 40] 27 | nums 28 | type(nums) 29 | ?len 30 | nums 31 | len(nums) 32 | # We want to pick elements from the list 33 | nums[0] 34 | # Python starts counting at 0!!!!!!!!!!! 35 | # Very important 36 | nums 37 | nums[0] 38 | nums[1] 39 | nums[2] 40 | nums 41 | nums[len(nums)] 42 | nums 43 | nums[-1] 44 | # In python this is called 'slicing' 45 | nums[-2] 46 | nums[0:2] 47 | # That says 'up to, but excluding the 2nd element' 48 | nums[:2] 49 | # To get all elements starting with the 2nd: 50 | nums[2:] 51 | nums[1:] 52 | clear 53 | nums[:2] 54 | nums[2:] 55 | k = 3 56 | nums[:k] 57 | nums[k:] 58 | nums[:k] + nums[k:] 59 | nums 60 | nums + 5 61 | nums + [5] 62 | nums 63 | # Suppose we want to add 5 to each element 64 | # In math notation you might write this: 65 | # {x + 5 for x in nums} 66 | [x + 5 for x in nums] 67 | # This is called a list comprehension 68 | nums 69 | davis 70 | a = [12, ['a', 'b']] 71 | a 72 | import json 73 | json.dumps? 74 | json.dumps(davis) 75 | davis 76 | davis 77 | davis['nums'] = nums 78 | davis 79 | json.dumps(davis) 80 | json.loads? 81 | json.dumps(davis) 82 | dstring = json.dumps(davis) 83 | dstring 84 | davis2 = json.loads(dstring) 85 | davis2 86 | davis 87 | davis2 88 | davis == davis2 89 | a = 'It's a beautiful day!' 90 | a = "It's a beautiful day!" 91 | a 92 | b = ''' 93 | Hello everybody. 94 | blah blah 95 | ''' 96 | b 97 | clear 98 | davis 99 | 'cool' in davis 100 | clear 101 | davis 102 | 2 ** 10 103 | 2 ** 20 104 | 2 ** 10 105 | before = 18.7 106 | x = [float(x) for x in range(int(1e6))] 107 | len(x) 108 | x[:4] 109 | x[-5:] 110 | after = 44.2 111 | before 112 | after - before 113 | x = [float(x) for x in range(4)] 114 | x 115 | a = range(int(1e12)) 116 | a 117 | b = range(10, 100) 118 | b 119 | next(b) 120 | b.step 121 | import requests 122 | requests.get? 123 | google = requests.get('http://www.google.com') 124 | google = requests.get('https://www.google.com') 125 | cd backup/ 126 | ls 127 | google = requests.get('https://www.google.com') 128 | google = requests.get('https://www.google.com') 129 | google 130 | google = requests.get('https://www.google.com') 131 | # How do we learn about unknown objects? 
132 | # Use the `type` function 133 | type(google) 134 | google.status_code 135 | google.text 136 | import requests 137 | base = 'https://api.github.com/search/repositories' 138 | response = requests.get(base) 139 | response 140 | type(response) 141 | response.text 142 | response.json? 143 | response.json() 144 | a = response.json() 145 | a 146 | a 147 | type(a) 148 | len(a) 149 | a 150 | a.keys() 151 | payload = {'q': 'data science'} 152 | type(payload) 153 | payload['sort'] = 'stars' 154 | payload 155 | response = requests.get(base, params=payload) 156 | a = response.json() 157 | response 158 | a 159 | type(a) 160 | len(a) 161 | a.keys() 162 | a['total_count'] 163 | a['items'] 164 | len(a['items']) 165 | type(a['items']) 166 | b = a['items'] 167 | b 168 | b[0] 169 | b[0].keys() 170 | c = [[x['full_name'], x['watchers']] for x in b] 171 | c 172 | ls 173 | pwd 174 | dir() 175 | davis 176 | a = 'hello' 177 | a = 10 178 | a.real 179 | % history -f day2log.txt 180 | -------------------------------------------------------------------------------- /day2/download_github.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Script to download some interesting data from Github.com 3 | 4 | Thu Dec 18 17:13:16 PST 2014 5 | 6 | There are about 25k repositories that match the query 'data science'. 7 | We'll look at the most popular ones. 8 | ''' 9 | 10 | import json 11 | import requests 12 | 13 | 14 | def query_github(params): 15 | ''' 16 | Query Github using dict of params 17 | ''' 18 | base = 'https://api.github.com/search/repositories' 19 | response = requests.get(base, params=params) 20 | return response.json() 21 | 22 | 23 | if __name__ == '__main__': 24 | 25 | payload = {'q': 'data science', 'sort': 'stars'} 26 | ds = query_github(payload) 27 | -------------------------------------------------------------------------------- /day2/notes15jan.mdown: -------------------------------------------------------------------------------- 1 | Lecture 2 2 | 3 | Thu Jan 15 14:10:15 PST 2015 4 | 5 | # Today's Goals 6 | - Introduce basic python 7 | - Learn two important data structures lists and dicts 8 | - See that Python lists and dicts correspond to JSON 9 | - Write a client for a REST API 10 | 11 | # Why this is awesome!! 12 | You can access an *enormous* amount of data. 13 | 14 | Python dictionaries: 15 | - Key / Value Pair 16 | - Hash table 17 | - Hash map 18 | - Associative array 19 | 20 | Data golf!!!! 21 | 22 | floating point number - 8 bytes 23 | 24 | How large is a Python list with 1 million floating point numbers? 25 | A lower bound should be about 8 MB. 26 | 27 | Data transfer: 28 | - Talking 29 | - Writing stuff down in books 30 | - Email each other data in a CSV attachment 31 | - download a csv file from the web 32 | - direct machine to machine transfer from a larger backend system 33 | (database) 34 | 1. Direct connection to that technology 35 | 2. 
To use REST API (abstraction layer) 36 | 37 | Common forms for data: 38 | XML, CSV, JSON 39 | 40 | -------------------------------------------------------------------------------- /day3/day3log.txt: -------------------------------------------------------------------------------- 1 | a = [1, 2, 3] 2 | a + 5 3 | [x + 5 for x in a] 4 | import numpy 5 | import numpy as np 6 | b = np.array(a) 7 | a 8 | b 9 | type(b) 10 | np.array([1, 2, 6, 4]) 11 | c = np.arange(1e6) 12 | 41.7 - 33.8 13 | d = [float(x) for x in range(int(1e6))] 14 | 73.2 - 41.7 15 | a 16 | a + 5 17 | b + 5 18 | %timeit print('Hello world!') 19 | c 20 | %timeit c + 5 21 | d 22 | %timeit [x + 5 for x in d] 23 | type(d) 24 | len(d) 25 | b 26 | b_copy = b 27 | b 28 | b_copy 29 | b_copy[1] = 7 30 | b_copy 31 | b 32 | import this 33 | b = np.array([1, 2, 3]) 34 | b_copy = b.copy() 35 | b 36 | b_copy 37 | b_copy[1] = 7 38 | b_copy 39 | b 40 | A = np.array([[1, 2], [3, 4]]) 41 | A 42 | A = np.array([1, 2, 3, 4]).reshape(2, 2) 43 | A 44 | np.zeros(4).reshape(2, 2) 45 | I = np.eye(2) 46 | I 47 | A 48 | A * I 49 | A.dot(I) 50 | np.dot(A, I) 51 | A @ I 52 | A[1, 0] 53 | A 54 | np.random.binomial? 55 | steps = np.random.binomial(1, 0.5, 100) 56 | steps 57 | steps[steps == 0] = -1 58 | steps 59 | positions = steps.cumsum() 60 | positions 61 | import matplotlib.pyplot as plt 62 | plt.plot(positions) 63 | plt.show() 64 | import skimage.io as io 65 | img = io.imread('photo.png', as_grey=True) 66 | img 67 | img 68 | img.shape 69 | type(img) 70 | ctr = np.mean(img, axis=0) 71 | ctr.shape 72 | ctr_img = img - ctr 73 | plt.imshow(img) 74 | plt.show() 75 | plt.imshow(img).set_cmap('gray') 76 | plt.show() 77 | plt.imshow(ctr_img).set_cmap('gray') 78 | plt.show() 79 | np.linalg.svd? 80 | _, _, v = np.linalg.svd(ctr_img) 81 | v.shape 82 | v = v.T 83 | reduced_v = v[:, :100] 84 | v.shape 85 | reduced_v.shape 86 | prin_comp = ctr_img.dot(reduced_v) 87 | prin_comp.shape 88 | reconst = prin_comp.dot(reduced_v.T) + ctr 89 | plt.imshow(reconst).set_cmap('gray') 90 | plt.show() 91 | fig, ax = plt.subplots(1, 2) 92 | ax[0].imshow(img).set_cmap('gray') 93 | ax[1].imshow(reconst).set_cmap('gray') 94 | fig.show() 95 | img.size 96 | reconst = prin_comp.dot(reduced_v.T) + ctr 97 | prin_comp.size + reduced_v.size + ctr.size 98 | original_size = img.size 99 | new_size = prin_comp.size + reduced_v.size + ctr.size 100 | new_size / original_size 101 | reconst.size 102 | %history -f code_log.txt 103 | -------------------------------------------------------------------------------- /day3/day3notes.md: -------------------------------------------------------------------------------- 1 | 2 | 1 million floating point values in a list: 24 - 32 MB (should be 8 MB) 3 | 4 | 1. Lists take up too much memory! 5 | 2. We'd like to have vectorization. 6 | 3. Lists are kind of slow. 7 | 8 | NumPy provides an alternative: the n-dimensional array. 9 | 10 | DATA GOLF! 11 | How much memory does a 1 million element ndarray of floats use? 12 | 8 MB! 13 | 14 | NumPy takes 8.6 ms to add 5 to 1 million element ndarray. 15 | Python lists take 501ms! 16 | 17 | Be careful with references! 18 | 19 | We saw NumPy gives us arrays, matrices, ... But what else? 20 | 21 | * random number generation (numpy.random) 22 | * fast Fourier transforms (numpy.fft) 23 | * polynomials (numpy.polynomial) 24 | * linear algebra (numpy.linalg) 25 | * support for calling C libraries 26 | 27 | # Simple Random Walk! 28 | 29 | Random walk with 100 steps 30 | 31 | 1. 
Flip a coin 100 times--these are the steps (0 means down). 32 | 2. Take cumulative sums 33 | 34 | # PCA / SVD 35 | The SVD is a very important decomposition, especially to statistics. 36 | A great statistics example is Principal Components Analysis (PCA). 37 | What is PCA typically used for? 38 | 39 | Say X is a centered n-by-p data matrix. Then 40 | 41 | X = UDV' = λ₁u₁v₁' + ... + λₖuₖvₖ' + ... + λₚuₚvₚ' (n < p) 42 | 43 | PCA takes the first k principal components, λ₁u₁, ..., λₖuₖ. 44 | A slightly different perspective: PCA approximates the original data by 45 | 46 | X ~ λ₁u₁v₁' + ... + λₖuₖvₖ' 47 | 48 | Lossy data compression! 49 | 50 | DATA GOLF! 51 | Original image: 180,500 52 | Compressed image: 86,461 53 | -------------------------------------------------------------------------------- /day3/lecture.py: -------------------------------------------------------------------------------- 1 | 2 | # Today's Agenda: 3 | # 4 | # 1. Introduce NumPy 5 | # 2. Simple random walk 6 | # 3. Visualize SVD/PCA 7 | 8 | a = [1, 2, 3] 9 | 10 | # Python's lists are awesome, but they have a few serious limitations for 11 | # numerical computing: 12 | # 13 | # 1. They use a lot of memory (~32 MB for 1 million 64-bit floats). 14 | # 2. They're slow. 15 | # 3. They're not vectorized. We could use list comprehensions, but do you 16 | # really want to write 17 | 18 | [x + 5 for x in a] 19 | 20 | # instead of 21 | 22 | a + 5 23 | 24 | # So we need a better solution when we do numerical computing. Enter NumPy! 25 | 26 | import numpy 27 | 28 | # We're going to use NumPy all the time, so let's save some typing with an 29 | # alias: 30 | 31 | import numpy as np 32 | 33 | # Using `np` for NumPy is a common convention. 34 | # 35 | # You can convert a Python list to a NumPy array with `array()`. 36 | 37 | np.array([1, 2]) 38 | b = np.array(a) 39 | 40 | # DATA GOLF! A list of 1 million floats uses ~32 MB. 41 | # How much memory does a NumPy array of 1 million floats use? 42 | 43 | big_np_array = np.arange(1e6) 44 | big_list = [float(x) for x in range(int(1e6))] 45 | 46 | # NumPy also knows we want to do numerical work, where vectorized operations 47 | # make sense. 48 | 49 | b + 5 50 | 51 | # These vectorized operations are also faster than corresponding operations on 52 | # lists. 53 | 54 | %timeit big_np_array + 5 55 | %timeit [x + 5 for x in big_list] 56 | 57 | # Indexing and slicing uses `[ ]`, the same as Python lists. 58 | 59 | b[1:] 60 | 61 | # WARNING: Python objects, including lists and NumPy arrays, are stored as 62 | # references. Compared to R, they might not behave as you'd expect. 63 | 64 | b_copy = b 65 | b_copy[1] = 7 66 | 67 | # What are the elements of `b_copy`? What're the elements of `b`? 68 | b_copy 69 | b 70 | 71 | # ZEN: Explicit is better than implicit. 72 | # If you wanted to make a copy, do it explicitly! 73 | 74 | b = np.array(a) 75 | 76 | b_copy = b.copy() 77 | b_copy[1] = 7 78 | 79 | b_copy 80 | b 81 | 82 | # Matrices can be defined directly 83 | 84 | np.array([[1, 2], [3, 4]]) 85 | 86 | # or by reshaping an array 87 | 88 | A = np.array([1, 2, 3, 4]).reshape(2, 2) 89 | A 90 | 91 | # How do you do matrix multiplication? 92 | I = np.eye(2) 93 | I 94 | 95 | A * I 96 | 97 | # Using `*` does element-by-element multiplication. Instead, call the `dot()` 98 | # method or `dot()` function. 99 | 100 | A.dot(I) 101 | np.dot(A, I) 102 | 103 | # In Python 3.5, there will be an operator, `@`, for matrix multiplication. 104 | 105 | # What else do we get with NumPy? 
106 | # 107 | # * linear algebra (numpy.linalg) 108 | # * random number generation (numpy.random) 109 | # * Fourier transforms (numpy.fft) 110 | # * polynomials (numpy.polynomial) 111 | 112 | # Let's write a simple random walk! 113 | # 114 | # 1. Get the step (up or down) at each time. That is, take N independent 115 | # samples from a binomial(1, p) or Bernoulli(p) distribution. 116 | # 2. Transform 0s to -1s (down). 117 | # 3. Use cumulative sums to calculate the position at each time. 118 | 119 | N = 10 120 | p = 0.5 121 | steps = np.random.binomial(1, p, N) 122 | 123 | steps[steps == 0] = -1 124 | steps 125 | 126 | positions = steps.cumsum() 127 | 128 | import matplotlib.pyplot as plt 129 | 130 | # Turn on Matplotlib's interactive mode. 131 | plt.ion() 132 | 133 | plt.plot(positions) 134 | 135 | # The SVD is a very important decomposition, especially to statistics. 136 | # A great statistics example is Principal Components Analysis (PCA). 137 | # What is PCA typically used for? 138 | 139 | # Say X is a centered n-by-p data matrix. Then 140 | # 141 | # X = UDV' = λ₁u₁v₁' + ... + λₖuₖvₖ' + ... + λₚuₚvₚ' (n < p) 142 | # 143 | # PCA takes the first k principal components, λ₁u₁, ..., λₖuₖ. 144 | # A slightly different perspective: PCA approximates the original data by 145 | # 146 | # X ~ λ₁u₁v₁' + ... + λₖuₖvₖ' 147 | # 148 | # Lossy data compression! 149 | 150 | # To load an image, use SciPy, Matplotlib, or Scikit-Image. 151 | import skimage as ski, skimage.io 152 | 153 | img = ski.io.imread('photo.png', as_grey=True) 154 | 155 | plt.imshow(img).set_cmap('gray') 156 | 157 | # Center the image and take the SVD. 158 | mean = np.mean(img, axis=0) 159 | 160 | ctr_img = img - mean 161 | 162 | plt.imshow(ctr_img).set_cmap('gray') 163 | 164 | _, _, v = np.linalg.svd(ctr_img) 165 | 166 | # `linalg.svd()` returns V', not V. 167 | v = v.T 168 | 169 | # Remove some columns from V (terms in the SVD sum). 170 | v_reduced = v[:, 0:100] 171 | prin_comp = ctr_img.dot(v_reduced) 172 | 173 | # Reconstruct the data. 174 | reconst = np.dot(prin_comp, v_reduced.T) + mean 175 | 176 | # Another way: 177 | reconst = prin_comp.dot(v_reduced.T) + mean 178 | 179 | fig, axs = plt.subplots(1, 2) 180 | for ax, im in zip(axs, [img, reconst]): 181 | ax.imshow(im).set_cmap('gray') 182 | fig.show() 183 | 184 | # DATA GOLF! How much compression? 185 | # Original image: 186 | 187 | original_size = img.size 188 | 189 | # Compressed image: 190 | 191 | compressed_size = mean.size + v_reduced.size + prin_comp.size 192 | 193 | compressed_size / original_size 194 | 195 | -------------------------------------------------------------------------------- /day3/photo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nick-ulle/2015-python/18e324ea50435d94c72ec853eec56ea6dfa5fe05/day3/photo.png -------------------------------------------------------------------------------- /day4/backup/country.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Lecture 4 3 | 4 | Wed Jan 21 20:41:05 PST 2015 5 | 6 | Today's Goals: 7 | - functions 8 | - scripting 9 | - pandas 10 | 11 | A python script is just a plain text file with a .py extension. It contains 12 | a bunch of commands which will be executed in the order that they appear. 13 | 14 | Pandas gives us the DataFrame and all the power that comes with it. 15 | 16 | Data Munging: Putting your data into the correct form and type for 17 | analysis. 
This is the foundation for any analysis you wish to do. It pays 18 | to be able to do it efficiently. 19 | 20 | Glamour: * 21 | Utility: ***** 22 | 23 | Pandas makes data munging easier. 24 | 25 | Documentation and doctests are sooooo easy in Python. 26 | ''' 27 | 28 | def decade(year): 29 | ''' 30 | Return the decade of a given year 31 | 32 | >>> decade(1954) 33 | 1950 34 | 35 | ''' 36 | return 10 * (year // 10) 37 | 38 | 39 | if __name__ == '__main__': 40 | import doctest 41 | doctest.testmod() 42 | -------------------------------------------------------------------------------- /day4/backup/log.py: -------------------------------------------------------------------------------- 1 | clear 2 | # The first thing we do is import the libraries we need 3 | # This can get a little tiresome 4 | # If I want to use everything in the Numpy library: 5 | # First let's see what I have in base Python: 6 | dir() 7 | # Very little 8 | # Suppose I now want everything from the Numpy library for my interactive work: 9 | from numpy import * 10 | dir() 11 | # We just brought everything from Numpy into the workspace 12 | # More precisely- they are all global variables and functions now 13 | # This is convenient for interactive work 14 | # But not recommended for scripting / programming 15 | # Ipython makes this easy: 16 | %pylab 17 | # This imported everything from numpy and matplotlib and set up the plotting 18 | import seaborn 19 | # seaborn gives you pretty graphics :) 20 | plot([1, 2, 1, 2]) 21 | import pandas as pd 22 | import pandas as funstuff 23 | clear 24 | # Functions are a _wonderful_ thing 25 | def greeter(name): 26 | print('hello', name) 27 | greeter('Qi') 28 | ls 29 | %run country.py 30 | # read in the population data 31 | p = pd.read_csv('population.csv') 32 | type(p) 33 | p.shape 34 | p.dtypes 35 | # Date should be a timestamp, not object 36 | # We need to fix it! 37 | pd.to_datetime? 38 | p 39 | p.columns 40 | # Lets start out with the basics 41 | # Columns of a DataFrame are called Series 42 | # The name comes from time series 43 | # hint- they work really well for time series 44 | pd.DateOffset? 45 | pd.date_range? 
46 | a = pd.date_range('2014-01-01') 47 | # I've just illustrated a valuable technique- 48 | If you don't know what will happen, just try it and see what error you get 49 | # If you don't know what will happen, just try it and see what error you get 50 | a = pd.date_range('2014-01-01', periods=10) 51 | a 52 | aser = pd.Series(a) 53 | aser 54 | import numpy as np 55 | a = np.ones(5) 56 | a 57 | aser = pd.Series(a) 58 | aser 59 | aser.index 60 | # Pandas is designed to preserve the structure of your data 61 | # That means the index will be maintained throughout the operations 62 | bser = pd.Series(np.ones(10)) 63 | bser 64 | aser 65 | bser 66 | aser + bser 67 | # Alignment was preserved 68 | aser 69 | aser.index[0] 70 | # This would be a crazy thing to doaser.index[0] = 27 71 | aser.index[0] = 27 72 | aser.index 73 | a = aser 74 | a 75 | a.index 76 | # Index is an attribut of a dataframe or series 77 | a.index = pd.date_range('2014-01-01', periods=5) 78 | a 79 | b = bser 80 | b.index = pd.date_range('2014-01-03', periods=10) 81 | b 82 | a 83 | b 84 | # Dataframes try hard to preserve alignment 85 | a + b 86 | a.data 87 | a.as_matrix() 88 | # Indexing is important 89 | # You get speed and data integrity 90 | p 91 | p.dtypes 92 | p.columns 93 | # Selecting columns is like key lookups in a dictionary 94 | p['Date'] 95 | p['Date'][:10] 96 | p['Date'].head() 97 | pd.to_datetime(p['Date']) 98 | # Before we had strings 99 | # These are nanosecond timestamps 100 | p['Date'] = pd.to_datetime(p['Date']) 101 | # I just updated my script with the new commands 102 | # lets make sure it worked 103 | %run country.py 104 | p.dtypes 105 | # type is correct 106 | # We have another table: 107 | c = pd.read_csv('languages.csv') 108 | c 109 | c.shape 110 | c.dtypes 111 | # Our goal is to put the language into the population table 112 | p.head() 113 | # But we don't have country names 114 | wiki = pd.read_html('http://en.wikipedia.org/wiki/ISO_3166-1') 115 | # You may expect this to be a DataFrame 116 | # But HTML pages can contain many tables 117 | type(wiki) 118 | len(wiki) 119 | wiki[0] 120 | wiki[0].head() 121 | # We only need columns 0 and 2 122 | w = wiki[0].iloc[:, [0, 2]] 123 | # df.iloc lets us do integer selction of rows, columns 124 | w.shape 125 | w.dtypes 126 | w.head() 127 | w.columns 128 | w.columns = ['country', 'ISO'] 129 | w.columns 130 | # Now for the join 131 | l 132 | dir() 133 | sh short name (upper/lower case) Alpha-3 code5 134 | c 135 | l = c 136 | l 137 | l 138 | w 139 | w.head() 140 | w.columns 141 | l.columns 142 | l.merge(w) 143 | p.columns 144 | p2 = p.merge(l.merge(w)) 145 | p2.columns 146 | p2.dtypes 147 | p2.head() 148 | # A successful join 149 | p.head() 150 | p['ISO'].unique() 151 | p.shape 152 | p.head() 153 | p.pivot? 154 | p.pivot(index='Date', columns='ISO', values='population') 155 | p3 = p.pivot(index='Date', columns='ISO', values='population') 156 | p3.dtypes 157 | p3.head() 158 | p3.iloc[:10, ] 159 | p3.iloc[10:20, ] 160 | p3 161 | p3.plot() 162 | p3.dtypes 163 | p2 164 | p2.head() 165 | p2['language'].unique() 166 | ppop = p2['population'] 167 | p2.groupby? 168 | p2.groupby('language') 169 | grouper = p2.groupby('language') 170 | type(grouper) 171 | grouper.name 172 | grouper.ngroups 173 | grouper.mean() 174 | # Let's put this together 175 | p2.groupby('language').mean() 176 | p2.groupby('language').count() 177 | grouper.mad? 178 | grouper.size? 
179 | grouper.size() 180 | # Suppose we want to know the average population in each country by decade 181 | p 182 | p2 183 | p3 184 | # Suppose we want to know the average population in each country by decade 185 | # If p3 had a column with the decades we could do a groupby 186 | p3.index 187 | p3.index.year 188 | p3 189 | p3['year'] = p3.index.year 190 | p3.dtypes 191 | p3.head() 192 | # We need to get the decade from the year 193 | # ie 1954 -> 1950 194 | 1954 // 10 195 | # floor division by 10 196 | 10 * 1954 // 10 197 | 10 * (1954 // 10) 198 | %run country.py 199 | %run country.py 200 | %run country.py 201 | decade 202 | ?decade 203 | decade(2018) 204 | p2 205 | p3.dtypes 206 | p3['year'].apply(decade) 207 | p3['decade'] = p3['year'].apply(decade) 208 | p3.head() 209 | p3.groupby('decade').mean() 210 | # So what if I want a column with this? 211 | g = p3['CHN'].groupby('decade') 212 | g = p3[['CHN', 'decade']].groupby('decade') 213 | g.transform? 214 | g.transform(np.mean) 215 | p3['china10yr'] = p3[['CHN', 'decade']].groupby('decade').transform(np.mean) 216 | p3.plot() 217 | p3[['CHN', 'china10yr']].plot() 218 | a = np.arange(20) 219 | a 220 | a // 10 221 | ls 222 | a = pd.read_html('ISO_3166-1') 223 | -------------------------------------------------------------------------------- /day4/backup/simple.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Write what your program does here in the header 3 | ''' 4 | 5 | # Imports come first 6 | import json 7 | 8 | 9 | print('This runs when the script is imported') 10 | 11 | 12 | def greet(person): 13 | ''' 14 | This is a function docstring 15 | ''' 16 | print('hello', person) 17 | 18 | 19 | if __name__ == '__main__': 20 | 21 | # Put the action code here 22 | # Also a good place for tests 23 | 24 | print('the __name__ is __main__!') 25 | greet('Matt') 26 | -------------------------------------------------------------------------------- /day4/decade.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Tools for working with dates 3 | ''' 4 | 5 | #import pandas as pd 6 | 7 | def decade(year): 8 | ''' 9 | Given a year, return the decade 10 | 11 | >>> decade(1986) 12 | 1980 13 | 14 | ''' 15 | return 10 * (year // 10) 16 | 17 | 18 | if __name__ == '__main__': 19 | import doctest 20 | doctest.testmod() 21 | -------------------------------------------------------------------------------- /day4/languages.csv: -------------------------------------------------------------------------------- 1 | country,language 2 | Canada,English 3 | China,Mandarin 4 | United States of America,English 5 | -------------------------------------------------------------------------------- /day4/log.py: -------------------------------------------------------------------------------- 1 | # first lets look at what we have 2 | dir() 3 | # lets pull in everything from Numpy 4 | from numpy import * 5 | dir() 6 | %pylab 7 | plot? 
8 | # everything in matplotlib and numpy have become global variables 9 | # This is recommended interactive work- but not for scripting or programming 10 | import seaborn 11 | # Seaborn gives pretty graphics 12 | plot([0, 1, 0]) 13 | # In summary just use %pylab 14 | import pandas as pd 15 | import pandas as penguins 16 | # I could have given pandas any name I want 17 | import numpy as np 18 | # Pandas maintains data integrity across operations 19 | a = pd.Series(np.ones(10)) 20 | a 21 | a 22 | type(a) 23 | # the dataframe is like a table 24 | # the columns in a dataframe are called series 25 | a 26 | b = pd.Series(np.ones(5)) 27 | b 28 | type(b) 29 | b 30 | b.index 31 | a + b 32 | pd.date_range('2014-01-01', periods=10) 33 | a 34 | pd.date_range('2014-01-01', periods=10) 35 | a.index 36 | a.index[1] 37 | a.index[1] = 29 38 | a.index = pd.date_range('2014-01-01', periods=10) 39 | a 40 | a.index = pd.date_range('2014-01-01', periods=10) 41 | b.index = pd.date_range('2014-01-03', periods=5) 42 | a 43 | b 44 | a + b 45 | a 46 | a[:10] 47 | a[4] 48 | # Now for real data 49 | # Pandas makes it very easy to read and write data 50 | # in various formats 51 | p = pd.read_csv('population.csv') 52 | type(p) 53 | # How big is it? 54 | p.shape 55 | # We have 186 rows 56 | # and 3 columns 57 | p.dtypes 58 | p.head() 59 | p.dtypes 60 | p['Date'].head() 61 | p['Date'][0] 62 | d1 = p['Date'][0] 63 | type(d1) 64 | # Lets convert to datetime 65 | pd.to_datetime(d1['Date']) 66 | pd.to_datetime(p['Date']) 67 | p['Date'] = pd.to_datetime(p['Date']) 68 | p.dtypes 69 | # How many countries are there? 70 | p['ISO'] 71 | p['ISO'].unique() 72 | p['ISO'].head() 73 | p.head() 74 | p.dtypes 75 | p.pivot? 76 | p2 = p.pivot(index='Date', columns='ISO', values='population') 77 | p2.head() 78 | # Now we can plot 79 | p2.plot() 80 | # I have another table 81 | lang = pd.read_csv('languages.csv') 82 | lang 83 | p.head() 84 | lang 85 | wiki = pd.read_html('http://en.wikipedia.org/wiki/ISO_3166-1') 86 | type(wiki) 87 | len(wiki) 88 | wiki[0] 89 | w = wiki[0] 90 | w.head() 91 | w2 = w.iloc[1:, [0, 2]] 92 | w2.head() 93 | w2.columns 94 | p.columns 95 | lang.columns 96 | w2.columns = ['country', ISO'] 97 | w2.columns = ['country', 'ISO'] 98 | w2.head() 99 | lang.merge(w2) 100 | p.head() 101 | p.merge(lang.merge(w2)) 102 | p2 = p.merge(lang.merge(w2)) 103 | p2.columns 104 | p2.head() 105 | # given a year- how do we get the decade? 106 | % run decade.py 107 | # 1958 -> 1950 108 | 1958 // 10 109 | 10 * (1958 // 10) 110 | %run decade.py 111 | decade? 112 | import decade 113 | p2.dtypes 114 | p2['year'] = p2.index.year 115 | p2.index[:10] 116 | p2.head() 117 | p2.index = p2['Date'] 118 | p2.index.year 119 | p2['year'] = p2.index.year 120 | p2.dtypes 121 | p2.head() 122 | p2['year'].apply(decade) 123 | p2['year'].apply(decade.decade) 124 | p2['decade'] = p2['year'].apply(decade.decade) 125 | p2.head() 126 | # The question was- what was the mean population in China by decade? 
127 | p2.groupby(['country', 'decade']) 128 | p2.groupby(['country', 'decade']).mean() 129 | p3 = p2.groupby(['country', 'decade']).mean() 130 | p3 131 | pd.__version__ 132 | %history -f 'day4log.py' 133 | -------------------------------------------------------------------------------- /day4/notes.mdown: -------------------------------------------------------------------------------- 1 | Lecture 4 2 | 3 | Thu Jan 22 14:09:21 PST 2015 4 | 5 | The story of Wes: 6 | * keep integrity of data 7 | * express data analysis operations cleanly and efficiently 8 | * handle time series very well 9 | 10 | Pandas - panel data analysis - DataFrame 11 | 12 | Data Munging - preparing your data 13 | Glamour - @ 14 | Utility - @@@@@ 15 | 16 | Why this is awesome: 17 | It's foundational for all subsequent analysis 18 | Looking for concise, expressive code 19 | -------------------------------------------------------------------------------- /day4/population.csv: -------------------------------------------------------------------------------- 1 | Date,ISO,population 2 | 1950-01-01 00:00:00,CHN,562.579779 3 | 1951-01-01 00:00:00,CHN,567.100762 4 | 1952-01-01 00:00:00,CHN,574.5362739999999 5 | 1953-01-01 00:00:00,CHN,584.1913539999999 6 | 1954-01-01 00:00:00,CHN,594.7248440000001 7 | 1955-01-01 00:00:00,CHN,606.729654 8 | 1956-01-01 00:00:00,CHN,619.135938 9 | 1957-01-01 00:00:00,CHN,633.214551 10 | 1958-01-01 00:00:00,CHN,646.703076 11 | 1959-01-01 00:00:00,CHN,654.3494300000001 12 | 1960-01-01 00:00:00,CHN,650.6605129999999 13 | 1961-01-01 00:00:00,CHN,644.669932 14 | 1962-01-01 00:00:00,CHN,653.302104 15 | 1963-01-01 00:00:00,CHN,674.248708 16 | 1964-01-01 00:00:00,CHN,696.064936 17 | 1965-01-01 00:00:00,CHN,715.546458 18 | 1966-01-01 00:00:00,CHN,735.903786 19 | 1967-01-01 00:00:00,CHN,755.3201190000001 20 | 1968-01-01 00:00:00,CHN,776.152777 21 | 1969-01-01 00:00:00,CHN,798.640508 22 | 1970-01-01 00:00:00,CHN,820.403282 23 | 1971-01-01 00:00:00,CHN,842.4556779999999 24 | 1972-01-01 00:00:00,CHN,863.439051 25 | 1973-01-01 00:00:00,CHN,883.019765 26 | 1974-01-01 00:00:00,CHN,901.3180649999999 27 | 1975-01-01 00:00:00,CHN,917.898537 28 | 1976-01-01 00:00:00,CHN,932.5887270000001 29 | 1977-01-01 00:00:00,CHN,946.093816 30 | 1978-01-01 00:00:00,CHN,958.835162 31 | 1979-01-01 00:00:00,CHN,972.136875 32 | 1980-01-01 00:00:00,CHN,984.73646 33 | 1981-01-01 00:00:00,CHN,997.000718 34 | 1982-01-01 00:00:00,CHN,1012.490488 35 | 1983-01-01 00:00:00,CHN,1028.3565350000001 36 | 1984-01-01 00:00:00,CHN,1042.75605 37 | 1985-01-01 00:00:00,CHN,1058.007717 38 | 1986-01-01 00:00:00,CHN,1074.522563 39 | 1987-01-01 00:00:00,CHN,1093.7257120000002 40 | 1988-01-01 00:00:00,CHN,1112.866405 41 | 1989-01-01 00:00:00,CHN,1130.729412 42 | 1990-01-01 00:00:00,CHN,1148.36447 43 | 1991-01-01 00:00:00,CHN,1163.610388 44 | 1992-01-01 00:00:00,CHN,1177.535611 45 | 1993-01-01 00:00:00,CHN,1190.761826 46 | 1994-01-01 00:00:00,CHN,1203.8017439999999 47 | 1995-01-01 00:00:00,CHN,1216.3784440000002 48 | 1996-01-01 00:00:00,CHN,1227.882189 49 | 1997-01-01 00:00:00,CHN,1238.1256640000001 50 | 1998-01-01 00:00:00,CHN,1247.502219 51 | 1999-01-01 00:00:00,CHN,1255.9924990000002 52 | 2000-01-01 00:00:00,CHN,1263.6375309999999 53 | 2001-01-01 00:00:00,CHN,1270.744232 54 | 2002-01-01 00:00:00,CHN,1277.59472 55 | 2003-01-01 00:00:00,CHN,1284.3033159999998 56 | 2004-01-01 00:00:00,CHN,1291.001804 57 | 2005-01-01 00:00:00,CHN,1297.765318 58 | 2006-01-01 00:00:00,CHN,1304.2618850000001 59 | 2007-01-01 00:00:00,CHN,1310.583544 60 | 2008-01-01 
00:00:00,CHN,1317.065677 61 | 2009-01-01 00:00:00,CHN,1323.591583 62 | 2010-01-01 00:00:00,CHN,1330.141295 63 | 2011-01-01 00:00:00,CHN, 64 | 1950-01-01 00:00:00,USA, 65 | 1951-01-01 00:00:00,USA, 66 | 1952-01-01 00:00:00,USA, 67 | 1953-01-01 00:00:00,USA, 68 | 1954-01-01 00:00:00,USA, 69 | 1955-01-01 00:00:00,USA, 70 | 1956-01-01 00:00:00,USA, 71 | 1957-01-01 00:00:00,USA, 72 | 1958-01-01 00:00:00,USA, 73 | 1959-01-01 00:00:00,USA, 74 | 1960-01-01 00:00:00,USA,180.67 75 | 1961-01-01 00:00:00,USA,183.69 76 | 1962-01-01 00:00:00,USA,186.54 77 | 1963-01-01 00:00:00,USA,189.24 78 | 1964-01-01 00:00:00,USA,191.89 79 | 1965-01-01 00:00:00,USA,194.3 80 | 1966-01-01 00:00:00,USA,196.56 81 | 1967-01-01 00:00:00,USA,198.71 82 | 1968-01-01 00:00:00,USA,200.71 83 | 1969-01-01 00:00:00,USA,202.68 84 | 1970-01-01 00:00:00,USA,205.05 85 | 1971-01-01 00:00:00,USA,207.66 86 | 1972-01-01 00:00:00,USA,209.9 87 | 1973-01-01 00:00:00,USA,211.91 88 | 1974-01-01 00:00:00,USA,213.85 89 | 1975-01-01 00:00:00,USA,215.97 90 | 1976-01-01 00:00:00,USA,218.04 91 | 1977-01-01 00:00:00,USA,220.24 92 | 1978-01-01 00:00:00,USA,222.59 93 | 1979-01-01 00:00:00,USA,225.06 94 | 1980-01-01 00:00:00,USA,227.73 95 | 1981-01-01 00:00:00,USA,229.97 96 | 1982-01-01 00:00:00,USA,232.19 97 | 1983-01-01 00:00:00,USA,234.31 98 | 1984-01-01 00:00:00,USA,236.35 99 | 1985-01-01 00:00:00,USA,238.47 100 | 1986-01-01 00:00:00,USA,240.65 101 | 1987-01-01 00:00:00,USA,242.8 102 | 1988-01-01 00:00:00,USA,245.02 103 | 1989-01-01 00:00:00,USA,247.34 104 | 1990-01-01 00:00:00,USA,250.13 105 | 1991-01-01 00:00:00,USA,253.49 106 | 1992-01-01 00:00:00,USA,256.89 107 | 1993-01-01 00:00:00,USA,260.26 108 | 1994-01-01 00:00:00,USA,263.44 109 | 1995-01-01 00:00:00,USA,266.56 110 | 1996-01-01 00:00:00,USA,269.67 111 | 1997-01-01 00:00:00,USA,272.91 112 | 1998-01-01 00:00:00,USA,276.12 113 | 1999-01-01 00:00:00,USA,279.3 114 | 2000-01-01 00:00:00,USA,282.38 115 | 2001-01-01 00:00:00,USA,285.31 116 | 2002-01-01 00:00:00,USA,288.1 117 | 2003-01-01 00:00:00,USA,290.82 118 | 2004-01-01 00:00:00,USA,293.46 119 | 2005-01-01 00:00:00,USA,296.19 120 | 2006-01-01 00:00:00,USA,299.0 121 | 2007-01-01 00:00:00,USA,302.0 122 | 2008-01-01 00:00:00,USA,304.8 123 | 2009-01-01 00:00:00,USA,307.44 124 | 2010-01-01 00:00:00,USA,309.98 125 | 2011-01-01 00:00:00,USA,312.24 126 | 1950-01-01 00:00:00,CAN, 127 | 1951-01-01 00:00:00,CAN, 128 | 1952-01-01 00:00:00,CAN, 129 | 1953-01-01 00:00:00,CAN, 130 | 1954-01-01 00:00:00,CAN, 131 | 1955-01-01 00:00:00,CAN, 132 | 1956-01-01 00:00:00,CAN, 133 | 1957-01-01 00:00:00,CAN, 134 | 1958-01-01 00:00:00,CAN, 135 | 1959-01-01 00:00:00,CAN, 136 | 1960-01-01 00:00:00,CAN,17.91 137 | 1961-01-01 00:00:00,CAN,18.27 138 | 1962-01-01 00:00:00,CAN,18.61 139 | 1963-01-01 00:00:00,CAN,18.96 140 | 1964-01-01 00:00:00,CAN,19.33 141 | 1965-01-01 00:00:00,CAN,19.68 142 | 1966-01-01 00:00:00,CAN,20.05 143 | 1967-01-01 00:00:00,CAN,20.41 144 | 1968-01-01 00:00:00,CAN,20.73 145 | 1969-01-01 00:00:00,CAN,21.03 146 | 1970-01-01 00:00:00,CAN,21.32 147 | 1971-01-01 00:00:00,CAN,21.96 148 | 1972-01-01 00:00:00,CAN,22.22 149 | 1973-01-01 00:00:00,CAN,22.49 150 | 1974-01-01 00:00:00,CAN,22.81 151 | 1975-01-01 00:00:00,CAN,23.14 152 | 1976-01-01 00:00:00,CAN,23.45 153 | 1977-01-01 00:00:00,CAN,23.73 154 | 1978-01-01 00:00:00,CAN,23.96 155 | 1979-01-01 00:00:00,CAN,24.2 156 | 1980-01-01 00:00:00,CAN,24.52 157 | 1981-01-01 00:00:00,CAN,24.82 158 | 1982-01-01 00:00:00,CAN,25.12 159 | 1983-01-01 00:00:00,CAN,25.37 160 | 1984-01-01 00:00:00,CAN,25.61 161 | 1985-01-01 
00:00:00,CAN,25.84 162 | 1986-01-01 00:00:00,CAN,26.1 163 | 1987-01-01 00:00:00,CAN,26.45 164 | 1988-01-01 00:00:00,CAN,26.79 165 | 1989-01-01 00:00:00,CAN,27.28 166 | 1990-01-01 00:00:00,CAN,27.69 167 | 1991-01-01 00:00:00,CAN,28.04 168 | 1992-01-01 00:00:00,CAN,28.37 169 | 1993-01-01 00:00:00,CAN,28.68 170 | 1994-01-01 00:00:00,CAN,29.0 171 | 1995-01-01 00:00:00,CAN,29.3 172 | 1996-01-01 00:00:00,CAN,29.61 173 | 1997-01-01 00:00:00,CAN,29.91 174 | 1998-01-01 00:00:00,CAN,30.16 175 | 1999-01-01 00:00:00,CAN,30.4 176 | 2000-01-01 00:00:00,CAN,30.69 177 | 2001-01-01 00:00:00,CAN,31.02 178 | 2002-01-01 00:00:00,CAN,31.35 179 | 2003-01-01 00:00:00,CAN,31.64 180 | 2004-01-01 00:00:00,CAN,31.94 181 | 2005-01-01 00:00:00,CAN,32.25 182 | 2006-01-01 00:00:00,CAN,32.58 183 | 2007-01-01 00:00:00,CAN,32.93 184 | 2008-01-01 00:00:00,CAN,33.32 185 | 2009-01-01 00:00:00,CAN,33.73 186 | 2010-01-01 00:00:00,CAN,34.13 187 | 2011-01-01 00:00:00,CAN,34.48 188 |
-------------------------------------------------------------------------------- /day4/prepare.py: --------------------------------------------------------------------------------
1 | import pandas as pd
2 | import Quandl
3 |
4 | china = Quandl.get("FRED/POPTTLCNA173NUPN")
5 | usa = Quandl.get("FRED/USAPOPL")
6 | canada = Quandl.get("FRED/CANPOPL")
7 |
8 | # Convert from thousands to millions
9 | china = china / 1000
10 |
11 | p = pd.concat([china, usa, canada], axis=1)
12 | p.columns = ['china', 'usa', 'canada']
13 |
14 | p.reset_index(inplace=True)
15 |
16 | popmelt = pd.melt(p, id_vars='Date')
17 |
18 | popmelt.columns = ['Date', 'ISO', 'population']
19 |
20 | popmelt.loc[popmelt['ISO'] == 'china', 'ISO'] = 'CHN'
21 | popmelt.loc[popmelt['ISO'] == 'usa', 'ISO'] = 'USA'
22 | popmelt.loc[popmelt['ISO'] == 'canada', 'ISO'] = 'CAN'
23 |
24 | popmelt.to_csv('population.csv', index=False)
25 |
26 | languages = {'country':
27 |              ['Canada', 'China', 'United States of America'],
28 |              'language':
29 |              ['English', 'Mandarin', 'English']}
30 |
31 | ldf = pd.DataFrame(languages)
32 | ldf.to_csv('languages.csv', index=False)
33 |
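`prepare.py` reshapes the concatenated wide table into the long `Date,ISO,population` layout with `pd.melt()` (the writes to the `ISO` column use `.loc` to avoid pandas's chained-assignment pitfall). A toy sketch of the same reshape, using a two-row frame with values rounded from `population.csv`:

```python
import pandas as pd

# A small wide table: one column per country.
wide = pd.DataFrame({'Date': ['2009-01-01', '2010-01-01'],
                     'CHN': [1323.59, 1330.14],
                     'USA': [307.44, 309.98]})

# melt() stacks the country columns into (variable, value) pairs,
# repeating the id column for each.
long = pd.melt(wide, id_vars='Date')
long.columns = ['Date', 'ISO', 'population']
print(long)
```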
-------------------------------------------------------------------------------- /day5/README.md: --------------------------------------------------------------------------------
1 |
2 | To view the map, you must download the entire `map` directory. The map is
3 | powered by [OpenStreetMap](https://www.openstreetmap.org/).
4 |
5 | The mobile food permits data & San Francisco neighborhoods GeoJSON were sourced
6 | from San Francisco's [open data collection](https://data.sfgov.org/).
7 |
8 | The vehicle mileage data was sourced from the Environmental Protection Agency's
9 | National Vehicle and Fuel Emissions Laboratory, which makes data freely
10 | available [here](http://fueleconomy.gov/feg/download.shtml).
11 |
12 |
-------------------------------------------------------------------------------- /day5/backup/truck_counts.csv: --------------------------------------------------------------------------------
1 | Neighborhood,Trucks 2 | South of Market,88 3 | Financial District,75 4 | Mission,61 5 | Central Waterfront,47 6 | Bayview,38 7 | Mission Bay,28 8 | Potrero Hill,26 9 | Produce Market,24 10 | Downtown / Union Square,22 11 | Hunters Point,20 12 | India Basin,18 13 | Bret Harte,17 14 | Dogpatch,15 15 | Northern Waterfront,14 16 | Showplace Square,13 17 | South Beach,11 18 | Rincon Hill,9 19 | Pacific Heights,9 20 | Mission Dolores,7 21 | Silver Terrace,4 22 | Western Addition,4 23 | Bernal Heights,4 24 | Civic Center,4 25 | Little Hollywood,4 26 | Presidio Heights,4 27 | Hayes Valley,3 28 | Candlestick Point SRA,3 29 | Cow Hollow,3 30 | Castro,3 31 | Apparel City,3 32 | Panhandle,2 33 | Lower Nob Hill,2 34 | North Beach,2 35 | Parnassus Heights,2 36 | Inner Sunset,2 37 | Cathedral Hill,2 38 | Tenderloin,2 39 | Anza Vista,2 40 | Marina,2 41 | Golden Gate Heights,2 42 | Russian Hill,2 43 | Lakeshore,1 44 | Portola,1 45 | Ingleside,1 46 | Polk Gulch,1 47 | Laurel Heights / Jordan Park,1 48 | Lower Haight,1 49 | Eureka Valley,1 50 | Outer Mission,1 51 | Crocker Amazon,1 52 | Lone Mountain,1 53 | Outer Richmond,1 54 | Noe Valley,1 55 | Westwood Park,1 56 | Laguna Honda,1 57 | Alamo Square,1 58 |
-------------------------------------------------------------------------------- /day5/day5log.py: --------------------------------------------------------------------------------
1 | import matplotlib.pyplot as plt
2 | import numpy as np
3 | x = np.linspace(0, 1, 100)
4 | y_cos = np.cos(x)
5 | y_sin = np.sin(x)
6 | plt.plot(x, y_cos)
7 | plt.show()
8 | plt.ion()
9 | plt.plot(x, y_cos)
10 | plt.plot(x, y_cos, 'r-x')
11 | plt.plot?
12 | plt.plot(x, y_cos, color='red', linestyle='-', marker='x')
13 | plt.plot(x, y_cos, x, y_sin)
14 | plt.plot(x, y_cos)
15 | plt.plot(x, y_sin)
16 | # use plt.subplots() to make multiple plots.
17 | fig, ax = plt.subplots(2, 1)
18 | ax.plot(x, y_sin)
19 | ax[0].plot(x, y_sin)
20 | ax[1].plot(x, y_cos)
21 | # plt.draw() forces the plot display to update.
22 | plt.draw()
23 | # plot in XKCD style!
24 | with plt.xkcd():
25 |     plt.plot(x, y_sin)
26 |     plt.plot(x, y_cos)
27 | %history -f day5log.py
28 |
-------------------------------------------------------------------------------- /day5/map/data.json: --------------------------------------------------------------------------------
1 | [{"Potrero Hill": 26, "Alamo Square": 1, "Civic Center": 4, "Presidio Heights": 4, "Downtown / Union Square": 22, "Dogpatch": 15, "Cathedral Hill": 2, "Tenderloin": 2, "South Beach": 11, "Lower Nob Hill": 2, "India Basin": 18, "Outer Mission": 1, "Golden Gate Heights": 2, "Bayview": 38, "Cow Hollow": 3, "Showplace Square": 13, "Eureka Valley": 1, "Russian Hill": 2, "Little Hollywood": 4, "Mission": 61, "Inner Sunset": 2, "Marina": 2, "Silver Terrace": 4, "Central Waterfront": 47, "Northern Waterfront": 14, "Lone Mountain": 1, "Anza Vista": 2, "Ingleside": 1, "Noe Valley": 1, "North Beach": 2, "Castro": 3, "Lakeshore": 1, "Laurel Heights / Jordan Park": 1, "Western Addition": 4, "Financial District": 75, "Polk Gulch": 1, "Rincon Hill": 9, "Lower Haight": 1, "Outer Richmond": 1, "Mission Bay": 28, "South of Market": 88, "Laguna Honda": 1, "Bret Harte": 17, "Apparel City": 3, "Mission Dolores": 7, "Parnassus Heights": 2, "Portola": 1, "Candlestick Point SRA": 3, "Hayes Valley": 3, "Crocker Amazon": 1, "Pacific Heights": 9, "Produce Market": 24, "Panhandle": 2, "Hunters Point": 20, "Bernal Heights": 4, "Westwood Park": 1}]
-------------------------------------------------------------------------------- /day5/map/map.html: --------------------------------------------------------------------------------
[151 lines of folium-generated HTML/JavaScript; the markup did not survive in this copy.]
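`data.json` above wraps a single neighborhood-to-truck-count dict in a one-element list. A minimal sketch of reading it back (assuming the working directory is `day5/map`):

```python
import json

# data.json holds [{neighborhood: truck count, ...}].
with open('data.json') as f:
    counts = json.load(f)[0]

# The five neighborhoods with the most trucks.
top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:5]
print(top)
```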
-------------------------------------------------------------------------------- /day5/mpg_plots.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | ''' Plots for EPA fuel economy data.
3 | '''
4 |
5 | import pandas as pd
6 | import seaborn as sns
7 | import matplotlib.pyplot as plt
8 |
9 | def main():
10 |     ''' Run a demo of the functions in this module.
11 |     '''
12 |     vehicles = pd.read_csv('vehicles.csv')
13 |
14 |     make_mpg_plot(vehicles)
15 |
16 | def make_mpg_plot(vehicles):
17 |     ''' Make a plot of MPG by Make and Year.
18 |     '''
19 |     # Take means by make and year.
20 |     groups = vehicles.groupby(['year', 'make'])
21 |     avg_mpg = groups.city_mpg.mean()
22 |
23 |     # Convert the indexes to columns, then pivot.
24 |     avg_mpg = avg_mpg.reset_index()
25 |     mpg_matrix = avg_mpg.pivot('year', 'make', 'city_mpg')
26 |
27 |     # Subset down to a few brands.
28 |     brands = ['Honda', 'Toyota', 'Hyundai', 'BMW', 'Mercedes-Benz',
29 |               'Ferrari', 'Lamborghini', 'Maserati', 'Fiat', 'Bentley',
30 |               'Ford', 'Chevrolet', 'Dodge', 'Jeep']
31 |     brand_matrix = mpg_matrix[brands]
32 |
33 |     # Plot the data using seaborn and matplotlib.
34 |     # The `subplots()` function returns a figure and an axes.
35 |     fig, ax = plt.subplots(1, 1)
36 |     sns.heatmap(brand_matrix, cmap='BuGn', ax=ax)
37 |     ax.set_xlabel('Make')
38 |     ax.set_ylabel('Year')
39 |     ax.set_title('MPG by Make and Year')
40 |
41 |     # Setting a tight layout ensures the entire plot fits in the plotting
42 |     # window.
43 |     fig.tight_layout()
44 |
45 |     # Show the figure.
46 |     fig.show()
47 |     # Alternatively:
48 |     # fig.savefig('mpg_plot.png')
49 |
50 | # Only call `main()` when the script is executed directly (not when it's
51 | # imported with `import mpg_plots`). This allows the script to be used as an
52 | # application AND as a module in other scripts.
53 | if __name__ == '__main__':
54 |     main()
55 |
56 |
-------------------------------------------------------------------------------- /day5/notes.md: --------------------------------------------------------------------------------
1 |
2 | There are many different packages for plotting with Python. These are split
3 | into two ecosystems, "matplotlib" and "javascript".
4 |
5 | * The "matplotlib" ecosystem has highly-customizable tools well-suited for
6 |   printed graphics and interactive (offline) applications. Some members:
7 |     + matplotlib - general plotting
8 |     + seaborn - "pretty" statistical plotting
9 |     + ggplot - grammar of graphics plotting
10 |     + basemap - old maps package, using shapefiles
11 |     + cartopy - new maps package, using GeoJSON and recent Python GIS libraries
12 | * The "javascript" ecosystem is less mature, but growing fast. These modules
13 |   are designed for interactive web graphics. Each leverages a javascript
14 |   library. Many require an HTML server to work. Some members:
15 |     + bokeh (D3.js) - general plotting, especially large/streaming data
16 |     + vincent (vega.js) - general plotting
17 |     + folium (leaflet.js) - interactive world maps (similar to Google Maps)
18 |     + kartograph (kartograph.js) - interactive maps
19 |
20 | Since it's the most mature, we'll look at matplotlib first.
21 |
22 | matplotlib can be used interactively (PyLab) or in a non-interactive style. The
23 | interactive functions are in `matplotlib.pyplot`.
24 |
25 | The basic plotting function is `plot()`. It can produce line and scatter plots.
26 | Plots can be customized quickly with format strings, or more verbosely using
27 | parameters. Note that `plot()` can plot several lines in one call.
28 |
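A minimal sketch of the two customization styles mentioned above, plus several lines in one call (assuming only numpy and matplotlib):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)

# Terse: a format string ('r--o' = red, dashed, circle markers).
plt.plot(x, np.sin(x), 'r--o')

# Verbose: the same look spelled out as keyword parameters.
plt.plot(x, np.cos(x), color='red', linestyle='--', marker='o')

# Several lines in one call: x1, y1, fmt1, x2, y2, fmt2, ...
plt.plot(x, np.sin(x), 'b-', x, np.cos(x), 'g-')
plt.show()
```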
29 | Other plots are available:
30 |
31 | * acorr
32 | * bar
33 | * barh
34 | * boxplot
35 | * contour
36 | * errorbar
37 | * hist
38 | * scatter
39 | * violinplot
40 | * xcorr
41 |
42 | Writing non-interactive matplotlib code gives you more control over the
43 | resulting plots. In order to understand the documentation, you need to learn
44 | some jargon:
45 |
46 | * Figure - a drawing surface
47 | * Axes - a single plot, including its axes, title, lines, etc...
48 | * Artist - a single element of a plot; for example, a line
49 |
50 | It's possible to use non-interactive methods directly at the interpreter, but
51 | easier to use them in a script. Every Python script is a module, meaning it
52 | can be imported in other scripts. Scripts are also very easy to document using
53 | docstrings (triple-quoted strings).
54 |
-------------------------------------------------------------------------------- /day5/truck_map.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | ''' Map food trucks by neighborhood in San Francisco.
3 | '''
4 |
5 | import json
6 | import folium
7 | import pandas as pd
8 | import shapely.geometry as geom
9 |
10 | def main():
11 |     ''' Produce a map of neighborhood food trucks in San Francisco.
12 |     '''
13 |     # Load the neighborhoods GeoJSON data, and extract the neighborhood names
14 |     # and shapes.
15 |     nhoods = load_geo_json('sf_nhoods.json')
16 |     features = extract_features(nhoods)
17 |
18 |     # Load the food truck data, and extract (longitude, latitude) pairs.
19 |     trucks = pd.read_csv('mobile_food.csv')
20 |     loc = trucks[['Longitude', 'Latitude']].dropna()
21 |
22 |     # Determine the neighborhood each food truck is in.
23 |     trucks['Neighborhood'] = loc.apply(get_nhood, axis=1, features=features)
24 |
25 |     counts = trucks['Neighborhood'].value_counts()
26 |     counts = counts.reset_index()
27 |     counts.columns = pd.Index(['Neighborhood', 'Trucks'])
28 |
29 |     # Create a map.
30 |     my_map = folium.Map(location=[37.77, -122.45], zoom_start=12)
31 |
32 |     my_map.geo_json(geo_path = 'sf_nhoods.json',
33 |                     key_on='feature.properties.name',
34 |                     data = counts, columns = ['Neighborhood', 'Trucks'],
35 |                     fill_color='BuGn')
36 |
37 |     my_map.create_map('map.html')
38 |
39 | def get_nhood(truck, features):
40 |     ''' Identify the neighborhood of a given point.
41 |     '''
42 |     truck = geom.Point(*truck)
43 |     for name, boundary in features:
44 |         if truck.within(boundary):
45 |             return name
46 |
47 |     return None
48 |
49 | def load_geo_json(path):
50 |     ''' Load a GeoJSON file.
51 |     '''
52 |     with open(path) as file:
53 |         geo_json = json.load(file)
54 |
55 |     return geo_json
56 |
57 | def extract_features(geo_json):
58 |     ''' Extract names and geometries from a geo_json dict.
59 |     '''
60 |     features = []
61 |     for feature in geo_json['features']:
62 |         name = feature['properties']['name']
63 |         geometry = geom.asShape(feature['geometry'])
64 |
65 |         features.append((name, geometry))
66 |
67 |     return features
68 |
69 | if __name__ == '__main__':
70 |     main()
71 |
72 |
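`truck_map.py` assigns each truck to a neighborhood by testing point-in-polygon containment with shapely. A self-contained toy version of that test, with a made-up unit-square boundary standing in for a neighborhood shape:

```python
import shapely.geometry as geom

# A made-up square "neighborhood" and two candidate points.
boundary = geom.Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
inside = geom.Point(0.5, 0.5)
outside = geom.Point(2.0, 2.0)

print(inside.within(boundary))    # True
print(outside.within(boundary))   # False
```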
-------------------------------------------------------------------------------- /day6/analysis.py: --------------------------------------------------------------------------------
1 | '''
2 | We'd like to do network analysis on a body of text.
3 |
4 | Build a graph where the nodes are names, and an edge exists if the two
5 | names appear in the same sentence.
6 |
7 | The first rule about regular expressions:
8 | Avoid them when possible!!!!
9 |
10 | Data Golf!
11 | Today it's for a prize - a book on Python
12 | Hints:
13 | - An ASCII character of text is 1 byte
14 | - A typical word is 6 characters
15 | - A typical novel has 200K words
16 | - War and Peace is 3x a normal book
17 |
18 | How big is War and Peace?
19 |
20 | regex should match capital letters not at the beginning of a sentence.
21 | '''
22 |
23 | import re
24 | from itertools import combinations
25 | import networkx as nx
26 |
27 | names = 'Nick Clark Rick Qi'.split(' ')
28 |
29 | namepairs = set(combinations(names, 2))
30 |
31 | pattern = re.compile(r'''
32 |     [^^]        # Not the beginning of a string
33 |     (?
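`analysis.py` is cut off above, mid-pattern, and the original regex is not recoverable from this copy. A minimal sketch of the docstring's plan, using a naive sentence split in place of the original regex (an illustration, not the original implementation):

```python
import re
from itertools import combinations

import networkx as nx

names = 'Nick Clark Rick Qi'.split()
text = 'Nick met Clark. Later, Rick and Qi joined Nick.'

graph = nx.Graph()
# Naive sentence split: break on ., !, or ? followed by whitespace.
for sentence in re.split(r'[.!?]\s+', text):
    # Link every pair of names that co-occur in this sentence.
    present = [name for name in names if name in sentence]
    graph.add_edges_from(combinations(present, 2))

print(sorted(graph.edges()))
```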
[The dump is damaged here: the remainder of analysis.py and the start of a notebook, including fragments of a SQL query, are missing. Only a rendered DataFrame preview of apartment listings survives:]

|   | title                                             | date_posted         | price  |
|---|---------------------------------------------------|---------------------|--------|
| 0 | $995 / 3br - 1350ft2 - Lancaster Apartment Uni... | 2016-04-13 15:34:06 | 995.0  |
| 1 | $1935 / 2br - 1154ft2 - No place like The Colo... | 2016-04-16 17:55:45 | 1935.0 |
| 2 | $1825 / 2br - 1056ft2 - No place like The Colo... | 2016-04-16 17:58:01 | 1825.0 |
| 3 | $650 / 1br - FURNISHED 1 BED GUEST QUARTERS, D... | 2016-04-17 21:26:15 | 650.0  |
| 4 | $1599 / 2br - 951ft2 - Big Savings!               | 2016-04-28 11:58:28 | 1599.0 |
| 5 | $1335 / 1br - 701ft2 - Looking for your next h... | 2016-04-28 13:12:20 | 1335.0 |
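The preview pairs each listing title with a numeric `price` column. One plausible way to derive such a column (an illustration, not the missing notebook's actual code) is a vectorized regex extract with pandas:

```python
import pandas as pd

titles = pd.Series([
    '$995 / 3br - 1350ft2 - Lancaster Apartment Uni...',
    '$1599 / 2br - 951ft2 - Big Savings!',
])

# Pull the leading dollar amount out of each title.
price = titles.str.extract(r'^\$(\d+)', expand=False).astype(float)
print(price)
```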