├── .gitignore ├── README.md ├── day1 ├── git_followup.sh ├── git_lifejacket.sh ├── hello_world.py ├── plots │ ├── languages.png │ ├── python_comfort.png │ └── topics.png └── pythonintro.pdf ├── day2 ├── backup │ ├── datasci.json │ ├── error.json │ ├── log.txt │ └── notes.mdown ├── day2log.txt ├── download_github.py └── notes15jan.mdown ├── day3 ├── day3log.txt ├── day3notes.md ├── lecture.py └── photo.png ├── day4 ├── ISO_3166-1 ├── backup │ ├── country.py │ ├── log.py │ └── simple.py ├── decade.py ├── languages.csv ├── log.py ├── notes.mdown ├── population.csv └── prepare.py ├── day5 ├── README.md ├── backup │ └── truck_counts.csv ├── day5log.py ├── map │ ├── data.json │ ├── map.html │ └── sf_nhoods.json ├── maps.ipynb ├── mobile_food.csv ├── mpg_plots.py ├── notes.md ├── sf_nhoods.json ├── truck_map.py └── vehicles.csv ├── day6 ├── analysis.py ├── announce.py ├── babynamer.py ├── day6log.txt ├── dearyeji.txt ├── findnames.py ├── textgraph.py └── yob2013.txt ├── day7 ├── README.md ├── aa.db ├── atus.db ├── atus_blank.py ├── backup │ ├── atus.py │ └── lecture.py ├── day7log1.py └── day7log2.py ├── day8 ├── backup │ ├── notes.py │ └── titanic.py ├── betafit.py ├── day8.py ├── final.mdown ├── kaggle.py ├── sleepminutes.csv └── titanic.csv ├── exercise1 ├── exercise1.md ├── usda.py └── usda_blank.py ├── exercise2 ├── exercise2.md ├── random_walk.py └── random_walk_solution.py ├── exercise3 ├── exercise3.mdown ├── wpgraph.pdf └── wpgraph.py ├── exercise4 ├── convert.py └── exercise4.md ├── feedback ├── feedback.csv └── feedback.ipynb └── iid_talk ├── Makefile ├── iidata workshop.ipynb └── slides.mdown /.gitignore: -------------------------------------------------------------------------------- 1 | __pycache__ 2 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | [Course Videos](http://dsi.ucdavis.edu/PythonMiniCourse/) 2 | 3 | # Python for Data Mining 4 | 5 | [Python][] is a programming language designed to have clear, concise, and 6 | expressive code. 7 | An extremely popular general-purpose language, Python has been used for tasks 8 | as diverse as web development, teaching, and systems administration. 9 | This mini-course provides an introduction to Python for data mining. 10 | 11 | Messy data has an inconsistent or inconvenient format, and may have missing 12 | values. 13 | Noisy data has measurement error. 14 | *Data mining extracts meaningful information from messy, noisy data.* 15 | This is a start-to-finish process that includes gathering, cleaning, 16 | visualizing, modeling, and reporting. 17 | 18 | Programming and research best practices are a secondary focus of the 19 | mini-course, because [Python is a philosophy][zen] as well as a language. 20 | Core concepts include: writing organized, well-documented code; being a 21 | self-sufficient learner; using version control for code management and 22 | collaboration; ensuring reproducibility of results; producing concise, 23 | informative analyses and visualizations. 24 | 25 | We will meet for four weeks during the Winter 2015 quarter at the 26 | University of California, Davis. 27 | 28 | [zen]: http://legacy.python.org/dev/peps/pep-0020/ 29 | [Python]: https://www.python.org/ 30 | 31 | ## Target Audience 32 | The mini-course is open to undergraduate and graduate students from all 33 | departments. 
34 | We recommend that students have prior programming experience 35 | and a basic understanding of statistical methods, 36 | so they can follow along with the examples. 37 | For instance, completion of STA 108 and STA 141 is sufficient 38 | (but not required). 39 | 40 | ## Topics 41 | 42 | ### Core Python 43 | The mini-course will kick off with a quick introduction to the syntax of 44 | Python, including operators, data types, control flow statements, function 45 | definition, and string manipulation. 46 | Slower, in-depth coverage will be given to uniquely Pythonic features such as 47 | built-in data structures, list comprehensions, iterators, and docstrings. 48 | 49 | Authoring packages and other advanced topics may also be discussed. 50 | 51 | ### Scientific Computing 52 | Support for stable, high-performance vector operations is provided by the NumPy 53 | package. 54 | NumPy will be introduced early and used often, because it's the foundation for 55 | most other scientific computing packages. 56 | We will also cover SciPy, which extends NumPy with functions for 57 | linear algebra, optimization, and elementary statistics. 58 | 59 | Specialized packages will be discussed during the final week. 60 | 61 | ### Data Manipulation 62 | The pandas package provides tabular data structures and convenience functions 63 | for manipulating them. 64 | This includes a two-dimensional data frame similar to the one found in R. 65 | Pandas will be covered extensively, because it makes it easy to 66 | 67 | + Read and write many formats (CSV, JSON, HDF, database) 68 | + Filter and restructure data 69 | + Handle missing values gracefully 70 | + Perform group-by operations (`apply` functions) 71 | 72 | ### Data Visualization 73 | 74 | Many visualization packages are available for Python, but the mini-course will 75 | focus on Seaborn, which is a user-friendly abstraction of the venerable 76 | matplotlib package. 77 | 78 | Other packages such as ggplot2, Vincent, Bokeh, and mpld3 may also be covered. 79 | 80 | ## Programming Environment 81 | Python 3 has syntax changes and new features that break compatibility with 82 | Python 2. 83 | All of the major scientific computing packages have added support for Python 3 84 | over the last few years, so it will be our focus. 85 | We recommend the [Anaconda][] Python 3 distribution, 86 | which bundles most packages we'll use into one download. 87 | Any other packages needed can be installed using `pip` or `conda`. 88 | 89 | Python code is supported by a vast array of editors. 90 | 91 | + [Spyder IDE][Spyder], included in Anaconda, 92 | is a Python equivalent of RStudio, 93 | designed with scientific computing in mind. 94 | + [PyCharm IDE][PyCharm] and [Sublime][] provide good user interfaces. 95 | + Terminal-based text editors, such as [Vim][] and [Emacs][], are a great 96 | choice for ambitious students. They can be used with any language. 97 | See [here][Text Editors] for more details. Clark and Nick both use Vim. 98 | 99 | [Anaconda]: http://continuum.io/downloads 100 | [Spyder]: https://code.google.com/p/spyderlib/ 101 | [PyCharm]: https://www.jetbrains.com/pycharm/ 102 | [Sublime]: http://www.sublimetext.com/ 103 | [Vim]: http://www.vim.org/ 104 | [Emacs]: https://www.gnu.org/software/emacs/ 105 | [Text Editors]: http://heather.cs.ucdavis.edu/~matloff/ProgEdit/ProgEdit.html 106 | 107 | ## References 108 | 109 | No books are required, but we recommend Wes McKinney's book: 110 | 111 | + McKinney, W. (2012). 
_Python for Data Analysis: Data Wrangling with Pandas, 112 | NumPy, and IPython_. O'Reilly Media. 113 | 114 | Python and most of the packages we'll use have excellent documentation, which 115 | can be found at the following links. 116 | 117 | + [Python 3](https://docs.python.org/3/) 118 | + [NumPy](http://docs.scipy.org/doc/numpy/) 119 | + [SciPy](http://docs.scipy.org/doc/scipy/reference/) 120 | + [pandas](http://pandas.pydata.org/pandas-docs/stable/) 121 | + [matplotlib](http://matplotlib.org/contents.html) 122 | + [Seaborn](http://stanford.edu/~mwaskom/software/seaborn/tutorial.html) 123 | + [scikit-learn](http://scikit-learn.org/stable/documentation.html) 124 | + [IPython](http://ipython.org/documentation.html) 125 | 126 | Due to Python's popularity, a large number of general references are available. 127 | While these don't focus specifically on data analysis, they're helpful for 128 | learning the language and its idioms. 129 | Some of our favorites are listed below, many of which are free. 130 | 131 | + Swaroop, C. H. (2003). _[A Byte of Python][]_. ([PDF][ABoP PDF]) 132 | + Reitz, K. _[Hitchhiker's Guide to Python][Hitchhiker's Guide]_. 133 | ([PDF][HGoP PDF]) 134 | + Lutz, M. (2014). _Python Pocket Reference_. O'Reilly Media. 135 | + Beazley, D. (2009). _Python Essential Reference_. Addison-Wesley. 136 | + Pilgrim, M., & Willison, S. (2009). _[Dive Into Python 3][]_. Apress. 137 | + [Non-programmer's Tutorial for Python 3][Non] 138 | + [Beginner's Guide to Python][Beginner's Guide] 139 | + [Five Lifejackets to Throw to the New Coder][New Coder] 140 | + [Pyvideo][Pyvideo]\* 141 | + [StackOverflow][]. Please be conscious of the [rules][SO Rules]! 142 | 143 | \* Videos featuring Guido Van Rossum, Raymond Hettinger, Travis Oliphant, 144 | Fernando Perez, David Beazley, and Alex Martelli are suggested. 145 | 146 | 147 | [A Byte of Python]: http://www.swaroopch.com/notes/python/ 148 | [ABoP PDF]: http://files.swaroopch.com/python/byte_of_python.pdf 149 | 150 | [Hitchhiker's Guide]: http://docs.python-guide.org/en/latest/ 151 | [HGop PDF]: https://media.readthedocs.org/pdf/python-guide/latest/python-guide.pdf 152 | 153 | [Dive Into Python 3]: http://www.diveintopython3.net/ 154 | [Non]: http://en.wikibooks.org/wiki/Non-Programmer%27s_Tutorial_for_Python_3 155 | [Beginner's Guide]: https://wiki.python.org/moin/BeginnersGuide 156 | [New Coder]: http://newcoder.io/ 157 | [Pyvideo]: http://pyvideo.org/ 158 | [StackOverflow]: http://stackoverflow.com/questions/tagged/python 159 | [SO Rules]: http://stackoverflow.com/tour 160 | 161 | -------------------------------------------------------------------------------- /day1/git_followup.sh: -------------------------------------------------------------------------------- 1 | 2 | # More on Git 3 | # =========== 4 | # This file describes some everyday tasks in git. 5 | 6 | # Creating Repositories 7 | # ===================== 8 | # You don't have to use GitHub to create a new repository. You can create an 9 | # new local repository named NAME with: 10 | # 11 | # git init [NAME] 12 | # 13 | git init my-repo 14 | 15 | # If you want to use a repository you created locally with a remote repository, 16 | # you have to tell git where the remote repository is: 17 | # 18 | # git remote [add NAME URL] 19 | # 20 | 21 | # Add a remote called origin: 22 | git remote add origin https://github.com/nick-ulle/my-repo.git 23 | 24 | # List all remotes for the current repo: 25 | git remote 26 | 27 | # After adding the remote repository, you can push and pull as usual. 
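# For example, the first push to a newly added remote usually names the
# remote and branch explicitly; the -u flag remembers that pairing:
git push -u origin master

# After that, a plain `git push` or `git pull` is enough.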
28 | 29 | # Branching 30 | # ========= 31 | # Git supports keeping multiple working versions of a repository at once. These 32 | # are represented as branches. Every repository starts with a branch called 33 | # 'master'. 34 | # 35 | # To make a new branch named NAME, use: 36 | # 37 | # git branch [NAME] 38 | # 39 | git branch experimental 40 | 41 | # On its own, `git branch` will list the repository's branches: 42 | 43 | git branch 44 | 45 | # You can switch branches with: 46 | # 47 | # git checkout NAME 48 | # 49 | git checkout experimental 50 | 51 | # You can delete a branch with the following command: 52 | # 53 | # git branch -d NAME 54 | # 55 | git checkout master 56 | git branch -d experimental 57 | 58 | git branch 59 | 60 | # Now let's recreate that branch and switch to it. 61 | git branch experimental 62 | git checkout experimental 63 | 64 | # If we make some changes on the branch and commit them, they don't get applied 65 | # to any other branch. 66 | touch testing.py 67 | git add testing.py 68 | git commit 69 | 70 | ls 71 | git log 72 | 73 | git checkout master 74 | ls 75 | git log 76 | 77 | git branch -v 78 | 79 | # To merge commits from another branch into the current branch, use: 80 | # 81 | # git merge BRANCH 82 | # 83 | git merge experimental 84 | 85 | git status 86 | git log 87 | ls 88 | 89 | # How should you use branches? 90 | # 91 | # Branches are useful for isolating experimental changes from code you already 92 | # have working. They also make it easy to keep manage multiple drafts. 93 | # 94 | # Use branches judiciously according to your own work habits, especially for 95 | # projects where you're the only contributor. In larger, public projects, try 96 | # to follow guidelines agreed upon by all the contributors. 97 | # 98 | # A popular branching workflow is described here: 99 | # 100 | # http://nvie.com/posts/a-successful-git-branching-model/ 101 | # 102 | 103 | # Stashing Changes 104 | # ================ 105 | # You might want to switch branches, saving your work on the current branch 106 | # without committing it. The solution to this is stashing: 107 | # 108 | # git stash [pop] 109 | # 110 | 111 | # For example, suppose you add a Python file: 112 | touch foo.py 113 | git add foo.py 114 | 115 | # Then you realize you want to switch branches and work on something completely 116 | # different. Stash your current work to get it out of the way: 117 | git stash 118 | 119 | # Switch to a different branch and do other work, committing when you're done: 120 | # (If you're following along, create the branch and switch to it with 121 | # 122 | # git checkout -b branch 123 | # 124 | # ) 125 | git checkout other-branch 126 | 127 | # When you're ready to go back to work on foo.py, you can switch back to the 128 | # original branch and pop the stash: 129 | git checkout master 130 | git stash pop 131 | ls 132 | 133 | # Revising Commits 134 | # ================ 135 | # What if you forget to add a change, make a typo in the commit message, etc? 136 | # Fix it with: 137 | # 138 | # git commit --amend 139 | # 140 | vim README.md 141 | git add README.md 142 | git commit --amend 143 | 144 | # NEVER amend a commit you've already pushed to a remote repository. It will 145 | # interfere with git's normal operation! Instead, you should make a new commit 146 | # for any changes you forgot. 
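# For example, a (hypothetical) follow-up commit for a file left out of an
# already-pushed commit:
git add forgotten.py
git commit -m "Add forgotten.py, left out of the previous commit"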
147 | 148 | # Restoring Previous Commits 149 | # ========================== 150 | # Since git tracks all of the commits you make, you can easily restore the 151 | # state of the repository at an earlier point in time. You can do this with: 152 | # 153 | # git reset COMMIT [FILE] 154 | # 155 | # where COMMIT is a commit hash (listed with `git log`). You only need 156 | # to type the first few characters of the commit hash. For example: 157 | 158 | git reset f4fca 159 | 160 | # A more in-depth explanation of `git reset` is given in the git documentation. 161 | 162 | -------------------------------------------------------------------------------- /day1/git_lifejacket.sh: -------------------------------------------------------------------------------- 1 | 2 | # Git Lifejacket 3 | # ============== 4 | # Git is a distributed version control system. What does that mean? 5 | # 6 | # * Git makes it easy to share files (distributed). 7 | # * Git tracks the changes you make to files (version control system). 8 | # 9 | # Use git to... 10 | # 11 | # 1. Quickly distribute sets of files. 12 | # 2. Keep multiple versions of a file without making copies. 13 | # 3. Selectively undo changes you made 3 days or even 3 months ago. 14 | # 4. Efficiently merge work from several different people. 15 | # 5. ... 16 | # 17 | # This tutorial is a git lifejacket, meant to get you up and running. 18 | # 19 | # For more, see (try!) the git followup notes, posted online. Also check out 20 | # the git documentation at: 21 | # 22 | # http://www.git-scm.com/doc 23 | # 24 | # You can also get help with commands by appending `--help` to the end. For 25 | # example: 26 | git status --help 27 | 28 | # Git Repositories 29 | # ================ 30 | # A git repository (or 'repo') is just a set of files tracked by git. 31 | # 32 | # GitHub.com is a host for git repositories on the web, widely-used by 33 | # open-source projects. GitHub provides free hosting for public repositories. 34 | 35 | # You might want to work on a repository someone else created. Download a copy 36 | # of a repository from the web by cloning it: 37 | # 38 | # git clone URL 39 | # 40 | git clone https://github.com/nick-ulle/2015-python.git 41 | 42 | # Check the status of a repository with: 43 | # 44 | # git status 45 | # 46 | git status 47 | 48 | # You can also create new, empty repositories on GitHub, and then clone them: 49 | 50 | git clone https://github.com/nick-ulle/demo.git 51 | 52 | # Committing Changes 53 | # ================== 54 | # After changing some files, you can save a snapshot of the repository by 55 | # making a commit. This is a 2 step process. 56 | 57 | # Step 1: 58 | # 59 | # Add, or 'stage', the changes you want to save in the commit: 60 | # 61 | # git add FILE 62 | # 63 | git add README.md 64 | 65 | # To stage every file in the current repository: 66 | # 67 | # git add --all 68 | 69 | # Use `git status` to see which files are staged. 70 | 71 | # Step 2: 72 | # 73 | # Save the staged changes with the command: 74 | # 75 | # git commit 76 | # 77 | git commit 78 | 79 | # The commit command will ask you to type a message summarizing the changes. 80 | # The first line should be a short description, no more than 50 characters. 81 | # If you want to write more, insert a blank line. For example: 82 | # 83 | # Adds README.md, describing the repository. 84 | # 85 | # The added README.md also contains a classic programming phrase and the 86 | # meaning of life, the universe, and everything. 
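# For a quick one-line message, you can also pass it directly on the command
# line instead of opening an editor:
#
#   git commit -m "Short description of the change"
#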
87 | 88 | # If you examine the repository status, git no longer sees any changes. This 89 | # is because they've been committed. 90 | 91 | # When should you make a commit? 92 | # 93 | # Common advice is to commit early and often. Commits are your save points, 94 | # and you never know when you'll need to go back. You could commit every time 95 | # you finish a small piece of work, such as writing a function. 96 | 97 | # You can see a history of the last N commits with the command: 98 | # 99 | # git log [-N] 100 | # 101 | git log 102 | git log -3 103 | 104 | # Your Turn! 105 | # ========== 106 | # Clone the demo repository from 107 | # 108 | # https://www.github.com/nick-ulle/demo.git 109 | # 110 | # make a new file, type something in it, and commit the file. 111 | 112 | # Try making your GitHub account, creating a repo, cloning, and make a commit! 113 | 114 | # Working With Remote Repositories 115 | # ================================ 116 | # What if you want to share your work online (say, GitHub)? An online 117 | # repository is a 'remote' repository. 118 | 119 | # Given permission (e.g., you own the repo), you can push commits to a remote 120 | # repository with the command: 121 | # 122 | # git push [REMOTE BRANCH] 123 | # 124 | 125 | # Push commits to the remote repo 'origin' on branch 'master': 126 | 127 | git push origin master 128 | 129 | # You can also retrieve commits other people have made to a repository. Do 130 | # this with: 131 | # 132 | # git pull [REMOTE BRANCH] 133 | # 134 | git pull origin master 135 | 136 | -------------------------------------------------------------------------------- /day1/hello_world.py: -------------------------------------------------------------------------------- 1 | 2 | # To open Python... 3 | # * Windows users: find ipython (3) in the Start Menu 4 | # * All others: enter `ipython` in terminal (without quotes) 5 | 6 | print('Hello world!') 7 | 8 | # Now check that packages we'll use often were installed correctly. 9 | # All of the following import lines should run without any message. 
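# (If any import below fails with an ImportError, that package is missing;
# it can usually be installed with `conda install <package>` or
# `pip install <package>`, as described in the README.)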
10 | 11 | # Testing NumPy 12 | import numpy as np 13 | 14 | # Testing SciPy 15 | import scipy as sp 16 | 17 | # Testing Pandas 18 | import pandas as pd 19 | 20 | # Testing Matplotlib 21 | import matplotlib.pyplot as plt 22 | 23 | # Finally, check that you can make plots: 24 | plt.plot(range(5)) 25 | plt.show() 26 | 27 | -------------------------------------------------------------------------------- /day1/plots/languages.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nick-ulle/2015-python/18e324ea50435d94c72ec853eec56ea6dfa5fe05/day1/plots/languages.png -------------------------------------------------------------------------------- /day1/plots/python_comfort.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nick-ulle/2015-python/18e324ea50435d94c72ec853eec56ea6dfa5fe05/day1/plots/python_comfort.png -------------------------------------------------------------------------------- /day1/plots/topics.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nick-ulle/2015-python/18e324ea50435d94c72ec853eec56ea6dfa5fe05/day1/plots/topics.png -------------------------------------------------------------------------------- /day1/pythonintro.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nick-ulle/2015-python/18e324ea50435d94c72ec853eec56ea6dfa5fe05/day1/pythonintro.pdf -------------------------------------------------------------------------------- /day2/backup/error.json: -------------------------------------------------------------------------------- 1 | {"documentation_url": "https://developer.github.com/v3/search", "errors": [{"code": "missing", "field": "q", "resource": "Search"}], "message": "Validation Failed"} -------------------------------------------------------------------------------- /day2/backup/log.txt: -------------------------------------------------------------------------------- 1 | clear 2 | davis = {'cool': True, 3 | 'students': 35e3} 4 | davis 5 | # This is a dictionary. The whole Python language is built on dictionaries 6 | davis['cool'] 7 | davis['students'] 8 | clear 9 | davis 10 | davis.keys() 11 | davis.values() 12 | # Keys and values together are called 'items' 13 | davis.items() 14 | davis 15 | clear 16 | 'cool' in Davis 17 | 'cool' in davis 18 | # This says that 'cool' is a key in the davis dictionary 19 | davis['cool'] 20 | # To answer this you need to know the philosophy of Python 21 | import this 22 | davis 23 | davis['stats'] 24 | # Lets add stats to davis 25 | davis['stats'] = 'excellent' 26 | davis 27 | depts = ['stats', 'math', 'ecology'] 28 | clear 29 | depts 30 | depts = ['stats', 'math', 'ecology'] 31 | davis 32 | depts 33 | type(davis) 34 | type(depts) 35 | type(123) 36 | type(123.1) 37 | type('123.1') 38 | type(True) 39 | davis['cool'] 40 | type(davis['cool']) 41 | clear 42 | depts 43 | nums = [0, 10, 20, 30, 40] 44 | # We are learning how to explore the basic data structures 45 | len 46 | ?len 47 | len? 48 | nums 49 | len(nums) 50 | # how do we pick out the first two? 51 | nums[:2] 52 | # Python begins counting at 0!!! 
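# (So nums[0] is the first element, nums[1] the second, and nums[len(nums) - 1] the last.)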
53 | nums[0] 54 | nums[1] 55 | nums[2] 56 | nums 57 | nums[:2] 58 | # This is called 'slicing' 59 | nums[-1] 60 | nums[-2] 61 | nums[-3] 62 | nums 63 | nums[2:] 64 | # nums up to, but excluding the 2nd element 65 | nums[:2] 66 | # nums beginning with the 2nd element 67 | nums[2:] 68 | nums 69 | nums[:2] 70 | nums[2:] 71 | nums 72 | # What do you observe? 73 | nums[:2] + nums[2:] 74 | nums 75 | nums[:2] + nums[2:] == nums 76 | k = 3 77 | nums[:k] + nums[k:] == nums 78 | k = 10 79 | len(nums) 80 | nums[:k] + nums[k:] == nums 81 | # Can also do negative slices 82 | nums[-3:] 83 | nums 84 | nums[:k] + nums[k:] == nums 85 | # The positive side of 0 indexing is that we get this pretty identity 86 | # We have learned slicing 87 | import this 88 | import this 89 | # flat is better than nested 90 | davis 91 | depts 92 | # We've seen it in Python - now in JSON! 93 | # Ways to transfer data 94 | davis 95 | import json 96 | json.dumps? 97 | json.dumps(davis) 98 | davis 99 | davis 100 | davis['dept'] = dept 101 | davis['dept'] = depts 102 | davis 103 | davis['dept'] 104 | type(davis['dept']) 105 | json.dumps(davis) 106 | davis 107 | # We have seen that JSON looks just like Python lists and dicts 108 | # Time to fetch some live data! 109 | # import requests 110 | import requests 111 | requests.get 112 | baseurl = 'https://api.github.com/search/repositories' 113 | response = requests.get(baseurl) 114 | response 115 | response 116 | response.status_code 117 | response 118 | response.content 119 | response.json 120 | response.json() 121 | r = response.json() 122 | type(r) 123 | r.keys() 124 | r 125 | r['errors'] 126 | type(r['errors']) 127 | len(r['errors']) 128 | r['errors'] 129 | r['errors'][0] 130 | type(r['errors'][0]) 131 | r 132 | payload = {'q': 'data science', 'sort': 'stars'} 133 | ds = requests.get(baseurl, params=payload) 134 | ds 135 | ds.json() 136 | dsj = ds.json() 137 | type(dsj) 138 | len(dsj) 139 | dsj.keys() 140 | dsj['total_count'] 141 | dsj['incomplete_results'] 142 | dsji = dsji['items'] 143 | dsji = dsj['items'] 144 | clear 145 | dsji 146 | clear 147 | len(dsji) 148 | type(dsji) 149 | dsji[0] 150 | # Lets get the description for all of them! 151 | type(dsji[0]) 152 | [x['description'] for x in dsji] 153 | %run download_github.py 154 | ds 155 | %run download_github.py 156 | ds 157 | -------------------------------------------------------------------------------- /day2/backup/notes.mdown: -------------------------------------------------------------------------------- 1 | Lecture 2 2 | Thu Jan 8 19:32:53 PST 2015 3 | 4 | # Today's Goals: 5 | 6 | - Introduce basic Python 7 | - Learn two data structures- lists and dicts 8 | - See that Python lists and dicts correspond to JSON 9 | - Write a client for a REST API 10 | 11 | Why this is awesome: 12 | If you can write a client for any REST API you can programmatically 13 | access an essentially unlimited amount of data. 14 | So we're done looking at the 'iris' data :) 15 | 16 | Other names for Python dictionary: 17 | - key / value pair 18 | - dict 19 | - hash table 20 | - hash map 21 | - associative array 22 | 23 | Ways of transferring data: 24 | 25 | 1. verbally 26 | 2. look in a book 27 | 3. email 28 | 4. download CSV file from the web 29 | 5. 
query remote database or system 30 | - Direct connection to underlying database technology *Danger zone!* 31 | - Go through a REST API layer *Easier!* 32 | 33 | Use common format like JSON, XMl, CSV 34 | CSV also called 'flat file' 35 | 36 | Remote databases are very powerful 37 | 38 | 39 | Data Golf: 40 | 1 million integers in a Python list 41 | -------------------------------------------------------------------------------- /day2/day2log.txt: -------------------------------------------------------------------------------- 1 | a = 123 2 | a 3 | a 4 | # The answer to what happens when.... 5 | # Open an interpreter, try it, and see!! 6 | type(a) 7 | type(1.0) 8 | type('hello') 9 | type(True) 10 | type(None) 11 | davis = {'state': 'California', 'students': 35000} 12 | type(davis) 13 | # The whole python language is built on dictionaries (dicts) 14 | davis.keys() 15 | davis.values() 16 | davis.items() 17 | davis 18 | davis['state'] 19 | davis['state'] = 'CA' 20 | davis 21 | davis['cool'] = True 22 | davis 23 | type(davis['cool']) 24 | # that was the dictionary 25 | # Moving on to the list 26 | nums = [0, 10, 20, 30, 40] 27 | nums 28 | type(nums) 29 | ?len 30 | nums 31 | len(nums) 32 | # We want to pick elements from the list 33 | nums[0] 34 | # Python starts counting at 0!!!!!!!!!!! 35 | # Very important 36 | nums 37 | nums[0] 38 | nums[1] 39 | nums[2] 40 | nums 41 | nums[len(nums)] 42 | nums 43 | nums[-1] 44 | # In python this is called 'slicing' 45 | nums[-2] 46 | nums[0:2] 47 | # That says 'up to, but excluding the 2nd element' 48 | nums[:2] 49 | # To get all elements starting with the 2nd: 50 | nums[2:] 51 | nums[1:] 52 | clear 53 | nums[:2] 54 | nums[2:] 55 | k = 3 56 | nums[:k] 57 | nums[k:] 58 | nums[:k] + nums[k:] 59 | nums 60 | nums + 5 61 | nums + [5] 62 | nums 63 | # Suppose we want to add 5 to each element 64 | # In math notation you might write this: 65 | # {x + 5 for x in nums} 66 | [x + 5 for x in nums] 67 | # This is called a list comprehension 68 | nums 69 | davis 70 | a = [12, ['a', 'b']] 71 | a 72 | import json 73 | json.dumps? 74 | json.dumps(davis) 75 | davis 76 | davis 77 | davis['nums'] = nums 78 | davis 79 | json.dumps(davis) 80 | json.loads? 81 | json.dumps(davis) 82 | dstring = json.dumps(davis) 83 | dstring 84 | davis2 = json.loads(dstring) 85 | davis2 86 | davis 87 | davis2 88 | davis == davis2 89 | a = 'It's a beautiful day!' 90 | a = "It's a beautiful day!" 91 | a 92 | b = ''' 93 | Hello everybody. 94 | blah blah 95 | ''' 96 | b 97 | clear 98 | davis 99 | 'cool' in davis 100 | clear 101 | davis 102 | 2 ** 10 103 | 2 ** 20 104 | 2 ** 10 105 | before = 18.7 106 | x = [float(x) for x in range(int(1e6))] 107 | len(x) 108 | x[:4] 109 | x[-5:] 110 | after = 44.2 111 | before 112 | after - before 113 | x = [float(x) for x in range(4)] 114 | x 115 | a = range(int(1e12)) 116 | a 117 | b = range(10, 100) 118 | b 119 | next(b) 120 | b.step 121 | import requests 122 | requests.get? 123 | google = requests.get('http://www.google.com') 124 | google = requests.get('https://www.google.com') 125 | cd backup/ 126 | ls 127 | google = requests.get('https://www.google.com') 128 | google = requests.get('https://www.google.com') 129 | google 130 | google = requests.get('https://www.google.com') 131 | # How do we learn about unknown objects? 
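# (Besides type(), dir(obj) lists an object's attributes, and tab-completion on `obj.` works in IPython.)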
132 | # Use the `type` function 133 | type(google) 134 | google.status_code 135 | google.text 136 | import requests 137 | base = 'https://api.github.com/search/repositories' 138 | response = requests.get(base) 139 | response 140 | type(response) 141 | response.text 142 | response.json? 143 | response.json() 144 | a = response.json() 145 | a 146 | a 147 | type(a) 148 | len(a) 149 | a 150 | a.keys() 151 | payload = {'q': 'data science'} 152 | type(payload) 153 | payload['sort'] = 'stars' 154 | payload 155 | response = requests.get(base, params=payload) 156 | a = response.json() 157 | response 158 | a 159 | type(a) 160 | len(a) 161 | a.keys() 162 | a['total_count'] 163 | a['items'] 164 | len(a['items']) 165 | type(a['items']) 166 | b = a['items'] 167 | b 168 | b[0] 169 | b[0].keys() 170 | c = [[x['full_name'], x['watchers']] for x in b] 171 | c 172 | ls 173 | pwd 174 | dir() 175 | davis 176 | a = 'hello' 177 | a = 10 178 | a.real 179 | % history -f day2log.txt 180 | -------------------------------------------------------------------------------- /day2/download_github.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Script to download some interesting data from Github.com 3 | 4 | Thu Dec 18 17:13:16 PST 2014 5 | 6 | There are about 25k repositories that match the query 'data science'. 7 | We'll look at the most popular ones. 8 | ''' 9 | 10 | import json 11 | import requests 12 | 13 | 14 | def query_github(params): 15 | ''' 16 | Query Github using dict of params 17 | ''' 18 | base = 'https://api.github.com/search/repositories' 19 | response = requests.get(base, params=params) 20 | return response.json() 21 | 22 | 23 | if __name__ == '__main__': 24 | 25 | payload = {'q': 'data science', 'sort': 'stars'} 26 | ds = query_github(payload) 27 | -------------------------------------------------------------------------------- /day2/notes15jan.mdown: -------------------------------------------------------------------------------- 1 | Lecture 2 2 | 3 | Thu Jan 15 14:10:15 PST 2015 4 | 5 | # Today's Goals 6 | - Introduce basic python 7 | - Learn two important data structures lists and dicts 8 | - See that Python lists and dicts correspond to JSON 9 | - Write a client for a REST API 10 | 11 | # Why this is awesome!! 12 | You can access an *enormous* amount of data. 13 | 14 | Python dictionaries: 15 | - Key / Value Pair 16 | - Hash table 17 | - Hash map 18 | - Associative array 19 | 20 | Data golf!!!! 21 | 22 | floating point number - 8 bytes 23 | 24 | How large is a Python list with 1 million floating point numbers? 25 | A lower bound should be about 8 MB. 26 | 27 | Data transfer: 28 | - Talking 29 | - Writing stuff down in books 30 | - Email each other data in a CSV attachment 31 | - download a csv file from the web 32 | - direct machine to machine transfer from a larger backend system 33 | (database) 34 | 1. Direct connection to that technology 35 | 2. 
To use REST API (abstraction layer) 36 | 37 | Common forms for data: 38 | XML, CSV, JSON 39 | 40 | -------------------------------------------------------------------------------- /day3/day3log.txt: -------------------------------------------------------------------------------- 1 | a = [1, 2, 3] 2 | a + 5 3 | [x + 5 for x in a] 4 | import numpy 5 | import numpy as np 6 | b = np.array(a) 7 | a 8 | b 9 | type(b) 10 | np.array([1, 2, 6, 4]) 11 | c = np.arange(1e6) 12 | 41.7 - 33.8 13 | d = [float(x) for x in range(int(1e6))] 14 | 73.2 - 41.7 15 | a 16 | a + 5 17 | b + 5 18 | %timeit print('Hello world!') 19 | c 20 | %timeit c + 5 21 | d 22 | %timeit [x + 5 for x in d] 23 | type(d) 24 | len(d) 25 | b 26 | b_copy = b 27 | b 28 | b_copy 29 | b_copy[1] = 7 30 | b_copy 31 | b 32 | import this 33 | b = np.array([1, 2, 3]) 34 | b_copy = b.copy() 35 | b 36 | b_copy 37 | b_copy[1] = 7 38 | b_copy 39 | b 40 | A = np.array([[1, 2], [3, 4]]) 41 | A 42 | A = np.array([1, 2, 3, 4]).reshape(2, 2) 43 | A 44 | np.zeros(4).reshape(2, 2) 45 | I = np.eye(2) 46 | I 47 | A 48 | A * I 49 | A.dot(I) 50 | np.dot(A, I) 51 | A @ I 52 | A[1, 0] 53 | A 54 | np.random.binomial? 55 | steps = np.random.binomial(1, 0.5, 100) 56 | steps 57 | steps[steps == 0] = -1 58 | steps 59 | positions = steps.cumsum() 60 | positions 61 | import matplotlib.pyplot as plt 62 | plt.plot(positions) 63 | plt.show() 64 | import skimage.io as io 65 | img = io.imread('photo.png', as_grey=True) 66 | img 67 | img 68 | img.shape 69 | type(img) 70 | ctr = np.mean(img, axis=0) 71 | ctr.shape 72 | ctr_img = img - ctr 73 | plt.imshow(img) 74 | plt.show() 75 | plt.imshow(img).set_cmap('gray') 76 | plt.show() 77 | plt.imshow(ctr_img).set_cmap('gray') 78 | plt.show() 79 | np.linalg.svd? 80 | _, _, v = np.linalg.svd(ctr_img) 81 | v.shape 82 | v = v.T 83 | reduced_v = v[:, :100] 84 | v.shape 85 | reduced_v.shape 86 | prin_comp = ctr_img.dot(reduced_v) 87 | prin_comp.shape 88 | reconst = prin_comp.dot(reduced_v.T) + ctr 89 | plt.imshow(reconst).set_cmap('gray') 90 | plt.show() 91 | fig, ax = plt.subplots(1, 2) 92 | ax[0].imshow(img).set_cmap('gray') 93 | ax[1].imshow(reconst).set_cmap('gray') 94 | fig.show() 95 | img.size 96 | reconst = prin_comp.dot(reduced_v.T) + ctr 97 | prin_comp.size + reduced_v.size + ctr.size 98 | original_size = img.size 99 | new_size = prin_comp.size + reduced_v.size + ctr.size 100 | new_size / original_size 101 | reconst.size 102 | %history -f code_log.txt 103 | -------------------------------------------------------------------------------- /day3/day3notes.md: -------------------------------------------------------------------------------- 1 | 2 | 1 million floating point values in a list: 24 - 32 MB (should be 8 MB) 3 | 4 | 1. Lists take up too much memory! 5 | 2. We'd like to have vectorization. 6 | 3. Lists are kind of slow. 7 | 8 | NumPy provides an alternative: the n-dimensional array. 9 | 10 | DATA GOLF! 11 | How much memory does a 1 million element ndarray of floats use? 12 | 8 MB! 13 | 14 | NumPy takes 8.6 ms to add 5 to 1 million element ndarray. 15 | Python lists take 501ms! 16 | 17 | Be careful with references! 18 | 19 | We saw NumPy gives us arrays, matrices, ... But what else? 20 | 21 | * random number generation (numpy.random) 22 | * fast Fourier transforms (numpy.fft) 23 | * polynomials (numpy.polynomial) 24 | * linear algebra (numpy.linalg) 25 | * support for calling C libraries 26 | 27 | # Simple Random Walk! 28 | 29 | Random walk with 100 steps 30 | 31 | 1. 
Flip a coin 100 times--these are the steps (0 means down). 32 | 2. Take cumulative sums 33 | 34 | # PCA / SVD 35 | The SVD is a very important decomposition, especially to statistics. 36 | A great statistics example is Principal Components Analysis (PCA). 37 | What is PCA typically used for? 38 | 39 | Say X is a centered n-by-p data matrix. Then 40 | 41 | X = UDV' = λ₁u₁v₁' + ... + λₖuₖvₖ' + ... + λₚuₚvₚ' (n < p) 42 | 43 | PCA takes the first k principal components, λ₁u₁, ..., λₖuₖ. 44 | A slightly different perspective: PCA approximates the original data by 45 | 46 | X ~ λ₁u₁v₁' + ... + λₖuₖvₖ' 47 | 48 | Lossy data compression! 49 | 50 | DATA GOLF! 51 | Original image: 180,500 52 | Compressed image: 86,461 53 | -------------------------------------------------------------------------------- /day3/lecture.py: -------------------------------------------------------------------------------- 1 | 2 | # Today's Agenda: 3 | # 4 | # 1. Introduce NumPy 5 | # 2. Simple random walk 6 | # 3. Visualize SVD/PCA 7 | 8 | a = [1, 2, 3] 9 | 10 | # Python's lists are awesome, but they have a few serious limitations for 11 | # numerical computing: 12 | # 13 | # 1. They use a lot of memory (~32 MB for 1 million 64-bit floats). 14 | # 2. They're slow. 15 | # 3. They're not vectorized. We could use list comprehensions, but do you 16 | # really want to write 17 | 18 | [x + 5 for x in a] 19 | 20 | # instead of 21 | 22 | a + 5 23 | 24 | # So we need a better solution when we do numerical computing. Enter NumPy! 25 | 26 | import numpy 27 | 28 | # We're going to use NumPy all the time, so let's save some typing with an 29 | # alias: 30 | 31 | import numpy as np 32 | 33 | # Using `np` for NumPy is a common convention. 34 | # 35 | # You can convert a Python list to a NumPy array with `array()`. 36 | 37 | np.array([1, 2]) 38 | b = np.array(a) 39 | 40 | # DATA GOLF! A list of 1 million floats uses ~32 MB. 41 | # How much memory does a NumPy array of 1 million floats use? 42 | 43 | big_np_array = np.arange(1e6) 44 | big_list = [float(x) for x in range(int(1e6))] 45 | 46 | # NumPy also knows we want to do numerical work, where vectorized operations 47 | # make sense. 48 | 49 | b + 5 50 | 51 | # These vectorized operations are also faster than corresponding operations on 52 | # lists. 53 | 54 | %timeit big_np_array + 5 55 | %timeit [x + 5 for x in big_list] 56 | 57 | # Indexing and slicing uses `[ ]`, the same as Python lists. 58 | 59 | b[1:] 60 | 61 | # WARNING: Python objects, including lists and NumPy arrays, are stored as 62 | # references. Compared to R, they might not behave as you'd expect. 63 | 64 | b_copy = b 65 | b_copy[1] = 7 66 | 67 | # What are the elements of `b_copy`? What're the elements of `b`? 68 | b_copy 69 | b 70 | 71 | # ZEN: Explicit is better than implicit. 72 | # If you wanted to make a copy, do it explicitly! 73 | 74 | b = np.array(a) 75 | 76 | b_copy = b.copy() 77 | b_copy[1] = 7 78 | 79 | b_copy 80 | b 81 | 82 | # Matrices can be defined directly 83 | 84 | np.array([[1, 2], [3, 4]]) 85 | 86 | # or by reshaping an array 87 | 88 | A = np.array([1, 2, 3, 4]).reshape(2, 2) 89 | A 90 | 91 | # How do you do matrix multiplication? 92 | I = np.eye(2) 93 | I 94 | 95 | A * I 96 | 97 | # Using `*` does element-by-element multiplication. Instead, call the `dot()` 98 | # method or `dot()` function. 99 | 100 | A.dot(I) 101 | np.dot(A, I) 102 | 103 | # In Python 3.5, there will be an operator, `@`, for matrix multiplication. 104 | 105 | # What else do we get with NumPy? 
106 | # 107 | # * linear algebra (numpy.linalg) 108 | # * random number generation (numpy.random) 109 | # * Fourier transforms (numpy.fft) 110 | # * polynomials (numpy.polynomial) 111 | 112 | # Let's write a simple random walk! 113 | # 114 | # 1. Get the step (up or down) at each time. That is, take N independent 115 | # samples from a binomial(1, p) or Bernoulli(p) distribution. 116 | # 2. Transform 0s to -1s (down). 117 | # 3. Use cumulative sums to calculate the position at each time. 118 | 119 | N = 10 120 | p = 0.5 121 | steps = np.random.binomial(1, p, N) 122 | 123 | steps[steps == 0] = -1 124 | steps 125 | 126 | positions = steps.cumsum() 127 | 128 | import matplotlib.pyplot as plt 129 | 130 | # Turn on Matplotlib's interactive mode. 131 | plt.ion() 132 | 133 | plt.plot(positions) 134 | 135 | # The SVD is a very important decomposition, especially to statistics. 136 | # A great statistics example is Principal Components Analysis (PCA). 137 | # What is PCA typically used for? 138 | 139 | # Say X is a centered n-by-p data matrix. Then 140 | # 141 | # X = UDV' = λ₁u₁v₁' + ... + λₖuₖvₖ' + ... + λₚuₚvₚ' (n < p) 142 | # 143 | # PCA takes the first k principal components, λ₁u₁, ..., λₖuₖ. 144 | # A slightly different perspective: PCA approximates the original data by 145 | # 146 | # X ~ λ₁u₁v₁' + ... + λₖuₖvₖ' 147 | # 148 | # Lossy data compression! 149 | 150 | # To load an image, use SciPy, Matplotlib, or Scikit-Image. 151 | import skimage as ski, skimage.io 152 | 153 | img = ski.io.imread('photo.png', as_grey=True) 154 | 155 | plt.imshow(img).set_cmap('gray') 156 | 157 | # Center the image and take the SVD. 158 | mean = np.mean(img, axis=0) 159 | 160 | ctr_img = img - mean 161 | 162 | plt.imshow(ctr_img).set_cmap('gray') 163 | 164 | _, _, v = np.linalg.svd(ctr_img) 165 | 166 | # `linalg.svd()` returns V', not V. 167 | v = v.T 168 | 169 | # Remove some columns from V (terms in the SVD sum). 170 | v_reduced = v[:, 0:100] 171 | prin_comp = ctr_img.dot(v_reduced) 172 | 173 | # Reconstruct the data. 174 | reconst = np.dot(prin_comp, v_reduced.T) + mean 175 | 176 | # Another way: 177 | reconst = prin_comp.dot(v_reduced.T) + mean 178 | 179 | fig, axs = plt.subplots(1, 2) 180 | for ax, im in zip(axs, [img, reconst]): 181 | ax.imshow(im).set_cmap('gray') 182 | fig.show() 183 | 184 | # DATA GOLF! How much compression? 185 | # Original image: 186 | 187 | original_size = img.size 188 | 189 | # Compressed image: 190 | 191 | compressed_size = mean.size + v_reduced.size + prin_comp.size 192 | 193 | compressed_size / original_size 194 | 195 | -------------------------------------------------------------------------------- /day3/photo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nick-ulle/2015-python/18e324ea50435d94c72ec853eec56ea6dfa5fe05/day3/photo.png -------------------------------------------------------------------------------- /day4/backup/country.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Lecture 4 3 | 4 | Wed Jan 21 20:41:05 PST 2015 5 | 6 | Today's Goals: 7 | - functions 8 | - scripting 9 | - pandas 10 | 11 | A python script is just a plain text file with a .py extension. It contains 12 | a bunch of commands which will be executed in the order that they appear. 13 | 14 | Pandas gives us the DataFrame and all the power that comes with it. 15 | 16 | Data Munging: Putting your data into the correct form and type for 17 | analysis. 
This is the foundation for any analysis you wish to do. It pays 18 | to be able to do it efficiently. 19 | 20 | Glamour: * 21 | Utility: ***** 22 | 23 | Pandas makes data munging easier. 24 | 25 | Documentation and doctests are sooooo easy in Python. 26 | ''' 27 | 28 | def decade(year): 29 | ''' 30 | Return the decade of a given year 31 | 32 | >>> decade(1954) 33 | 1950 34 | 35 | ''' 36 | return 10 * (year // 10) 37 | 38 | 39 | if __name__ == '__main__': 40 | import doctest 41 | doctest.testmod() 42 | -------------------------------------------------------------------------------- /day4/backup/log.py: -------------------------------------------------------------------------------- 1 | clear 2 | # The first thing we do is import the libraries we need 3 | # This can get a little tiresome 4 | # If I want to use everything in the Numpy library: 5 | # First let's see what I have in base Python: 6 | dir() 7 | # Very little 8 | # Suppose I now want everything from the Numpy library for my interactive work: 9 | from numpy import * 10 | dir() 11 | # We just brought everything from Numpy into the workspace 12 | # More precisely- they are all global variables and functions now 13 | # This is convenient for interactive work 14 | # But not recommended for scripting / programming 15 | # Ipython makes this easy: 16 | %pylab 17 | # This imported everything from numpy and matplotlib and set up the plotting 18 | import seaborn 19 | # seaborn gives you pretty graphics :) 20 | plot([1, 2, 1, 2]) 21 | import pandas as pd 22 | import pandas as funstuff 23 | clear 24 | # Functions are a _wonderful_ thing 25 | def greeter(name): 26 | print('hello', name) 27 | greeter('Qi') 28 | ls 29 | %run country.py 30 | # read in the population data 31 | p = pd.read_csv('population.csv') 32 | type(p) 33 | p.shape 34 | p.dtypes 35 | # Date should be a timestamp, not object 36 | # We need to fix it! 37 | pd.to_datetime? 38 | p 39 | p.columns 40 | # Lets start out with the basics 41 | # Columns of a DataFrame are called Series 42 | # The name comes from time series 43 | # hint- they work really well for time series 44 | pd.DateOffset? 45 | pd.date_range? 
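# (In IPython, appending ? to a name displays its documentation -- here, the signatures of DateOffset and date_range.)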
46 | a = pd.date_range('2014-01-01') 47 | # I've just illustrated a valuable technique- 48 | If you don't know what will happen, just try it and see what error you get 49 | # If you don't know what will happen, just try it and see what error you get 50 | a = pd.date_range('2014-01-01', periods=10) 51 | a 52 | aser = pd.Series(a) 53 | aser 54 | import numpy as np 55 | a = np.ones(5) 56 | a 57 | aser = pd.Series(a) 58 | aser 59 | aser.index 60 | # Pandas is designed to preserve the structure of your data 61 | # That means the index will be maintained throughout the operations 62 | bser = pd.Series(np.ones(10)) 63 | bser 64 | aser 65 | bser 66 | aser + bser 67 | # Alignment was preserved 68 | aser 69 | aser.index[0] 70 | # This would be a crazy thing to doaser.index[0] = 27 71 | aser.index[0] = 27 72 | aser.index 73 | a = aser 74 | a 75 | a.index 76 | # Index is an attribut of a dataframe or series 77 | a.index = pd.date_range('2014-01-01', periods=5) 78 | a 79 | b = bser 80 | b.index = pd.date_range('2014-01-03', periods=10) 81 | b 82 | a 83 | b 84 | # Dataframes try hard to preserve alignment 85 | a + b 86 | a.data 87 | a.as_matrix() 88 | # Indexing is important 89 | # You get speed and data integrity 90 | p 91 | p.dtypes 92 | p.columns 93 | # Selecting columns is like key lookups in a dictionary 94 | p['Date'] 95 | p['Date'][:10] 96 | p['Date'].head() 97 | pd.to_datetime(p['Date']) 98 | # Before we had strings 99 | # These are nanosecond timestamps 100 | p['Date'] = pd.to_datetime(p['Date']) 101 | # I just updated my script with the new commands 102 | # lets make sure it worked 103 | %run country.py 104 | p.dtypes 105 | # type is correct 106 | # We have another table: 107 | c = pd.read_csv('languages.csv') 108 | c 109 | c.shape 110 | c.dtypes 111 | # Our goal is to put the language into the population table 112 | p.head() 113 | # But we don't have country names 114 | wiki = pd.read_html('http://en.wikipedia.org/wiki/ISO_3166-1') 115 | # You may expect this to be a DataFrame 116 | # But HTML pages can contain many tables 117 | type(wiki) 118 | len(wiki) 119 | wiki[0] 120 | wiki[0].head() 121 | # We only need columns 0 and 2 122 | w = wiki[0].iloc[:, [0, 2]] 123 | # df.iloc lets us do integer selction of rows, columns 124 | w.shape 125 | w.dtypes 126 | w.head() 127 | w.columns 128 | w.columns = ['country', 'ISO'] 129 | w.columns 130 | # Now for the join 131 | l 132 | dir() 133 | sh short name (upper/lower case) Alpha-3 code5 134 | c 135 | l = c 136 | l 137 | l 138 | w 139 | w.head() 140 | w.columns 141 | l.columns 142 | l.merge(w) 143 | p.columns 144 | p2 = p.merge(l.merge(w)) 145 | p2.columns 146 | p2.dtypes 147 | p2.head() 148 | # A successful join 149 | p.head() 150 | p['ISO'].unique() 151 | p.shape 152 | p.head() 153 | p.pivot? 154 | p.pivot(index='Date', columns='ISO', values='population') 155 | p3 = p.pivot(index='Date', columns='ISO', values='population') 156 | p3.dtypes 157 | p3.head() 158 | p3.iloc[:10, ] 159 | p3.iloc[10:20, ] 160 | p3 161 | p3.plot() 162 | p3.dtypes 163 | p2 164 | p2.head() 165 | p2['language'].unique() 166 | ppop = p2['population'] 167 | p2.groupby? 168 | p2.groupby('language') 169 | grouper = p2.groupby('language') 170 | type(grouper) 171 | grouper.name 172 | grouper.ngroups 173 | grouper.mean() 174 | # Let's put this together 175 | p2.groupby('language').mean() 176 | p2.groupby('language').count() 177 | grouper.mad? 178 | grouper.size? 
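# (Aggregations on a GroupBy object, such as size() or mean(), return one value per group.)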
179 | grouper.size() 180 | # Suppose we want to know the average population in each country by decade 181 | p 182 | p2 183 | p3 184 | # Suppose we want to know the average population in each country by decade 185 | # If p3 had a column with the decades we could do a groupby 186 | p3.index 187 | p3.index.year 188 | p3 189 | p3['year'] = p3.index.year 190 | p3.dtypes 191 | p3.head() 192 | # We need to get the decade from the year 193 | # ie 1954 -> 1950 194 | 1954 // 10 195 | # floor division by 10 196 | 10 * 1954 // 10 197 | 10 * (1954 // 10) 198 | %run country.py 199 | %run country.py 200 | %run country.py 201 | decade 202 | ?decade 203 | decade(2018) 204 | p2 205 | p3.dtypes 206 | p3['year'].apply(decade) 207 | p3['decade'] = p3['year'].apply(decade) 208 | p3.head() 209 | p3.groupby('decade').mean() 210 | # So what if I want a column with this? 211 | g = p3['CHN'].groupby('decade') 212 | g = p3[['CHN', 'decade']].groupby('decade') 213 | g.transform? 214 | g.transform(np.mean) 215 | p3['china10yr'] = p3[['CHN', 'decade']].groupby('decade').transform(np.mean) 216 | p3.plot() 217 | p3[['CHN', 'china10yr']].plot() 218 | a = np.arange(20) 219 | a 220 | a // 10 221 | ls 222 | a = pd.read_html('ISO_3166-1') 223 | -------------------------------------------------------------------------------- /day4/backup/simple.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Write what your program does here in the header 3 | ''' 4 | 5 | # Imports come first 6 | import json 7 | 8 | 9 | print('This runs when the script is imported') 10 | 11 | 12 | def greet(person): 13 | ''' 14 | This is a function docstring 15 | ''' 16 | print('hello', person) 17 | 18 | 19 | if __name__ == '__main__': 20 | 21 | # Put the action code here 22 | # Also a good place for tests 23 | 24 | print('the __name__ is __main__!') 25 | greet('Matt') 26 | -------------------------------------------------------------------------------- /day4/decade.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Tools for working with dates 3 | ''' 4 | 5 | #import pandas as pd 6 | 7 | def decade(year): 8 | ''' 9 | Given a year, return the decade 10 | 11 | >>> decade(1986) 12 | 1980 13 | 14 | ''' 15 | return 10 * (year // 10) 16 | 17 | 18 | if __name__ == '__main__': 19 | import doctest 20 | doctest.testmod() 21 | -------------------------------------------------------------------------------- /day4/languages.csv: -------------------------------------------------------------------------------- 1 | country,language 2 | Canada,English 3 | China,Mandarin 4 | United States of America,English 5 | -------------------------------------------------------------------------------- /day4/log.py: -------------------------------------------------------------------------------- 1 | # first lets look at what we have 2 | dir() 3 | # lets pull in everything from Numpy 4 | from numpy import * 5 | dir() 6 | %pylab 7 | plot? 
8 | # everything in matplotlib and numpy have become global variables 9 | # This is recommended interactive work- but not for scripting or programming 10 | import seaborn 11 | # Seaborn gives pretty graphics 12 | plot([0, 1, 0]) 13 | # In summary just use %pylab 14 | import pandas as pd 15 | import pandas as penguins 16 | # I could have given pandas any name I want 17 | import numpy as np 18 | # Pandas maintains data integrity across operations 19 | a = pd.Series(np.ones(10)) 20 | a 21 | a 22 | type(a) 23 | # the dataframe is like a table 24 | # the columns in a dataframe are called series 25 | a 26 | b = pd.Series(np.ones(5)) 27 | b 28 | type(b) 29 | b 30 | b.index 31 | a + b 32 | pd.date_range('2014-01-01', periods=10) 33 | a 34 | pd.date_range('2014-01-01', periods=10) 35 | a.index 36 | a.index[1] 37 | a.index[1] = 29 38 | a.index = pd.date_range('2014-01-01', periods=10) 39 | a 40 | a.index = pd.date_range('2014-01-01', periods=10) 41 | b.index = pd.date_range('2014-01-03', periods=5) 42 | a 43 | b 44 | a + b 45 | a 46 | a[:10] 47 | a[4] 48 | # Now for real data 49 | # Pandas makes it very easy to read and write data 50 | # in various formats 51 | p = pd.read_csv('population.csv') 52 | type(p) 53 | # How big is it? 54 | p.shape 55 | # We have 186 rows 56 | # and 3 columns 57 | p.dtypes 58 | p.head() 59 | p.dtypes 60 | p['Date'].head() 61 | p['Date'][0] 62 | d1 = p['Date'][0] 63 | type(d1) 64 | # Lets convert to datetime 65 | pd.to_datetime(d1['Date']) 66 | pd.to_datetime(p['Date']) 67 | p['Date'] = pd.to_datetime(p['Date']) 68 | p.dtypes 69 | # How many countries are there? 70 | p['ISO'] 71 | p['ISO'].unique() 72 | p['ISO'].head() 73 | p.head() 74 | p.dtypes 75 | p.pivot? 76 | p2 = p.pivot(index='Date', columns='ISO', values='population') 77 | p2.head() 78 | # Now we can plot 79 | p2.plot() 80 | # I have another table 81 | lang = pd.read_csv('languages.csv') 82 | lang 83 | p.head() 84 | lang 85 | wiki = pd.read_html('http://en.wikipedia.org/wiki/ISO_3166-1') 86 | type(wiki) 87 | len(wiki) 88 | wiki[0] 89 | w = wiki[0] 90 | w.head() 91 | w2 = w.iloc[1:, [0, 2]] 92 | w2.head() 93 | w2.columns 94 | p.columns 95 | lang.columns 96 | w2.columns = ['country', ISO'] 97 | w2.columns = ['country', 'ISO'] 98 | w2.head() 99 | lang.merge(w2) 100 | p.head() 101 | p.merge(lang.merge(w2)) 102 | p2 = p.merge(lang.merge(w2)) 103 | p2.columns 104 | p2.head() 105 | # given a year- how do we get the decade? 106 | % run decade.py 107 | # 1958 -> 1950 108 | 1958 // 10 109 | 10 * (1958 // 10) 110 | %run decade.py 111 | decade? 112 | import decade 113 | p2.dtypes 114 | p2['year'] = p2.index.year 115 | p2.index[:10] 116 | p2.head() 117 | p2.index = p2['Date'] 118 | p2.index.year 119 | p2['year'] = p2.index.year 120 | p2.dtypes 121 | p2.head() 122 | p2['year'].apply(decade) 123 | p2['year'].apply(decade.decade) 124 | p2['decade'] = p2['year'].apply(decade.decade) 125 | p2.head() 126 | # The question was- what was the mean population in China by decade? 
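# (The next call groups rows by country and decade; mean() then averages the numeric columns, including population, within each group.)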
127 | p2.groupby(['country', 'decade']) 128 | p2.groupby(['country', 'decade']).mean() 129 | p3 = p2.groupby(['country', 'decade']).mean() 130 | p3 131 | pd.__version__ 132 | %history -f 'day4log.py' 133 | -------------------------------------------------------------------------------- /day4/notes.mdown: -------------------------------------------------------------------------------- 1 | Lecture 4 2 | 3 | Thu Jan 22 14:09:21 PST 2015 4 | 5 | The story of Wes: 6 | * keep integrity of data 7 | * express data analysis operations cleanly and efficiently 8 | * handle time series very well 9 | 10 | Pandas - panel data analysis - DataFrame 11 | 12 | Data Munging - preparing your data 13 | Glamour - @ 14 | Utility - @@@@@ 15 | 16 | Why this is awesome: 17 | It's foundational for all subsequent analysis 18 | Looking for concise, expressive code 19 | -------------------------------------------------------------------------------- /day4/population.csv: -------------------------------------------------------------------------------- 1 | Date,ISO,population 2 | 1950-01-01 00:00:00,CHN,562.579779 3 | 1951-01-01 00:00:00,CHN,567.100762 4 | 1952-01-01 00:00:00,CHN,574.5362739999999 5 | 1953-01-01 00:00:00,CHN,584.1913539999999 6 | 1954-01-01 00:00:00,CHN,594.7248440000001 7 | 1955-01-01 00:00:00,CHN,606.729654 8 | 1956-01-01 00:00:00,CHN,619.135938 9 | 1957-01-01 00:00:00,CHN,633.214551 10 | 1958-01-01 00:00:00,CHN,646.703076 11 | 1959-01-01 00:00:00,CHN,654.3494300000001 12 | 1960-01-01 00:00:00,CHN,650.6605129999999 13 | 1961-01-01 00:00:00,CHN,644.669932 14 | 1962-01-01 00:00:00,CHN,653.302104 15 | 1963-01-01 00:00:00,CHN,674.248708 16 | 1964-01-01 00:00:00,CHN,696.064936 17 | 1965-01-01 00:00:00,CHN,715.546458 18 | 1966-01-01 00:00:00,CHN,735.903786 19 | 1967-01-01 00:00:00,CHN,755.3201190000001 20 | 1968-01-01 00:00:00,CHN,776.152777 21 | 1969-01-01 00:00:00,CHN,798.640508 22 | 1970-01-01 00:00:00,CHN,820.403282 23 | 1971-01-01 00:00:00,CHN,842.4556779999999 24 | 1972-01-01 00:00:00,CHN,863.439051 25 | 1973-01-01 00:00:00,CHN,883.019765 26 | 1974-01-01 00:00:00,CHN,901.3180649999999 27 | 1975-01-01 00:00:00,CHN,917.898537 28 | 1976-01-01 00:00:00,CHN,932.5887270000001 29 | 1977-01-01 00:00:00,CHN,946.093816 30 | 1978-01-01 00:00:00,CHN,958.835162 31 | 1979-01-01 00:00:00,CHN,972.136875 32 | 1980-01-01 00:00:00,CHN,984.73646 33 | 1981-01-01 00:00:00,CHN,997.000718 34 | 1982-01-01 00:00:00,CHN,1012.490488 35 | 1983-01-01 00:00:00,CHN,1028.3565350000001 36 | 1984-01-01 00:00:00,CHN,1042.75605 37 | 1985-01-01 00:00:00,CHN,1058.007717 38 | 1986-01-01 00:00:00,CHN,1074.522563 39 | 1987-01-01 00:00:00,CHN,1093.7257120000002 40 | 1988-01-01 00:00:00,CHN,1112.866405 41 | 1989-01-01 00:00:00,CHN,1130.729412 42 | 1990-01-01 00:00:00,CHN,1148.36447 43 | 1991-01-01 00:00:00,CHN,1163.610388 44 | 1992-01-01 00:00:00,CHN,1177.535611 45 | 1993-01-01 00:00:00,CHN,1190.761826 46 | 1994-01-01 00:00:00,CHN,1203.8017439999999 47 | 1995-01-01 00:00:00,CHN,1216.3784440000002 48 | 1996-01-01 00:00:00,CHN,1227.882189 49 | 1997-01-01 00:00:00,CHN,1238.1256640000001 50 | 1998-01-01 00:00:00,CHN,1247.502219 51 | 1999-01-01 00:00:00,CHN,1255.9924990000002 52 | 2000-01-01 00:00:00,CHN,1263.6375309999999 53 | 2001-01-01 00:00:00,CHN,1270.744232 54 | 2002-01-01 00:00:00,CHN,1277.59472 55 | 2003-01-01 00:00:00,CHN,1284.3033159999998 56 | 2004-01-01 00:00:00,CHN,1291.001804 57 | 2005-01-01 00:00:00,CHN,1297.765318 58 | 2006-01-01 00:00:00,CHN,1304.2618850000001 59 | 2007-01-01 00:00:00,CHN,1310.583544 60 | 2008-01-01 
00:00:00,CHN,1317.065677 61 | 2009-01-01 00:00:00,CHN,1323.591583 62 | 2010-01-01 00:00:00,CHN,1330.141295 63 | 2011-01-01 00:00:00,CHN, 64 | 1950-01-01 00:00:00,USA, 65 | 1951-01-01 00:00:00,USA, 66 | 1952-01-01 00:00:00,USA, 67 | 1953-01-01 00:00:00,USA, 68 | 1954-01-01 00:00:00,USA, 69 | 1955-01-01 00:00:00,USA, 70 | 1956-01-01 00:00:00,USA, 71 | 1957-01-01 00:00:00,USA, 72 | 1958-01-01 00:00:00,USA, 73 | 1959-01-01 00:00:00,USA, 74 | 1960-01-01 00:00:00,USA,180.67 75 | 1961-01-01 00:00:00,USA,183.69 76 | 1962-01-01 00:00:00,USA,186.54 77 | 1963-01-01 00:00:00,USA,189.24 78 | 1964-01-01 00:00:00,USA,191.89 79 | 1965-01-01 00:00:00,USA,194.3 80 | 1966-01-01 00:00:00,USA,196.56 81 | 1967-01-01 00:00:00,USA,198.71 82 | 1968-01-01 00:00:00,USA,200.71 83 | 1969-01-01 00:00:00,USA,202.68 84 | 1970-01-01 00:00:00,USA,205.05 85 | 1971-01-01 00:00:00,USA,207.66 86 | 1972-01-01 00:00:00,USA,209.9 87 | 1973-01-01 00:00:00,USA,211.91 88 | 1974-01-01 00:00:00,USA,213.85 89 | 1975-01-01 00:00:00,USA,215.97 90 | 1976-01-01 00:00:00,USA,218.04 91 | 1977-01-01 00:00:00,USA,220.24 92 | 1978-01-01 00:00:00,USA,222.59 93 | 1979-01-01 00:00:00,USA,225.06 94 | 1980-01-01 00:00:00,USA,227.73 95 | 1981-01-01 00:00:00,USA,229.97 96 | 1982-01-01 00:00:00,USA,232.19 97 | 1983-01-01 00:00:00,USA,234.31 98 | 1984-01-01 00:00:00,USA,236.35 99 | 1985-01-01 00:00:00,USA,238.47 100 | 1986-01-01 00:00:00,USA,240.65 101 | 1987-01-01 00:00:00,USA,242.8 102 | 1988-01-01 00:00:00,USA,245.02 103 | 1989-01-01 00:00:00,USA,247.34 104 | 1990-01-01 00:00:00,USA,250.13 105 | 1991-01-01 00:00:00,USA,253.49 106 | 1992-01-01 00:00:00,USA,256.89 107 | 1993-01-01 00:00:00,USA,260.26 108 | 1994-01-01 00:00:00,USA,263.44 109 | 1995-01-01 00:00:00,USA,266.56 110 | 1996-01-01 00:00:00,USA,269.67 111 | 1997-01-01 00:00:00,USA,272.91 112 | 1998-01-01 00:00:00,USA,276.12 113 | 1999-01-01 00:00:00,USA,279.3 114 | 2000-01-01 00:00:00,USA,282.38 115 | 2001-01-01 00:00:00,USA,285.31 116 | 2002-01-01 00:00:00,USA,288.1 117 | 2003-01-01 00:00:00,USA,290.82 118 | 2004-01-01 00:00:00,USA,293.46 119 | 2005-01-01 00:00:00,USA,296.19 120 | 2006-01-01 00:00:00,USA,299.0 121 | 2007-01-01 00:00:00,USA,302.0 122 | 2008-01-01 00:00:00,USA,304.8 123 | 2009-01-01 00:00:00,USA,307.44 124 | 2010-01-01 00:00:00,USA,309.98 125 | 2011-01-01 00:00:00,USA,312.24 126 | 1950-01-01 00:00:00,CAN, 127 | 1951-01-01 00:00:00,CAN, 128 | 1952-01-01 00:00:00,CAN, 129 | 1953-01-01 00:00:00,CAN, 130 | 1954-01-01 00:00:00,CAN, 131 | 1955-01-01 00:00:00,CAN, 132 | 1956-01-01 00:00:00,CAN, 133 | 1957-01-01 00:00:00,CAN, 134 | 1958-01-01 00:00:00,CAN, 135 | 1959-01-01 00:00:00,CAN, 136 | 1960-01-01 00:00:00,CAN,17.91 137 | 1961-01-01 00:00:00,CAN,18.27 138 | 1962-01-01 00:00:00,CAN,18.61 139 | 1963-01-01 00:00:00,CAN,18.96 140 | 1964-01-01 00:00:00,CAN,19.33 141 | 1965-01-01 00:00:00,CAN,19.68 142 | 1966-01-01 00:00:00,CAN,20.05 143 | 1967-01-01 00:00:00,CAN,20.41 144 | 1968-01-01 00:00:00,CAN,20.73 145 | 1969-01-01 00:00:00,CAN,21.03 146 | 1970-01-01 00:00:00,CAN,21.32 147 | 1971-01-01 00:00:00,CAN,21.96 148 | 1972-01-01 00:00:00,CAN,22.22 149 | 1973-01-01 00:00:00,CAN,22.49 150 | 1974-01-01 00:00:00,CAN,22.81 151 | 1975-01-01 00:00:00,CAN,23.14 152 | 1976-01-01 00:00:00,CAN,23.45 153 | 1977-01-01 00:00:00,CAN,23.73 154 | 1978-01-01 00:00:00,CAN,23.96 155 | 1979-01-01 00:00:00,CAN,24.2 156 | 1980-01-01 00:00:00,CAN,24.52 157 | 1981-01-01 00:00:00,CAN,24.82 158 | 1982-01-01 00:00:00,CAN,25.12 159 | 1983-01-01 00:00:00,CAN,25.37 160 | 1984-01-01 00:00:00,CAN,25.61 161 | 1985-01-01 
00:00:00,CAN,25.84 162 | 1986-01-01 00:00:00,CAN,26.1 163 | 1987-01-01 00:00:00,CAN,26.45 164 | 1988-01-01 00:00:00,CAN,26.79 165 | 1989-01-01 00:00:00,CAN,27.28 166 | 1990-01-01 00:00:00,CAN,27.69 167 | 1991-01-01 00:00:00,CAN,28.04 168 | 1992-01-01 00:00:00,CAN,28.37 169 | 1993-01-01 00:00:00,CAN,28.68 170 | 1994-01-01 00:00:00,CAN,29.0 171 | 1995-01-01 00:00:00,CAN,29.3 172 | 1996-01-01 00:00:00,CAN,29.61 173 | 1997-01-01 00:00:00,CAN,29.91 174 | 1998-01-01 00:00:00,CAN,30.16 175 | 1999-01-01 00:00:00,CAN,30.4 176 | 2000-01-01 00:00:00,CAN,30.69 177 | 2001-01-01 00:00:00,CAN,31.02 178 | 2002-01-01 00:00:00,CAN,31.35 179 | 2003-01-01 00:00:00,CAN,31.64 180 | 2004-01-01 00:00:00,CAN,31.94 181 | 2005-01-01 00:00:00,CAN,32.25 182 | 2006-01-01 00:00:00,CAN,32.58 183 | 2007-01-01 00:00:00,CAN,32.93 184 | 2008-01-01 00:00:00,CAN,33.32 185 | 2009-01-01 00:00:00,CAN,33.73 186 | 2010-01-01 00:00:00,CAN,34.13 187 | 2011-01-01 00:00:00,CAN,34.48 188 | -------------------------------------------------------------------------------- /day4/prepare.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import Quandl 3 | 4 | china = Quandl.get("FRED/POPTTLCNA173NUPN") 5 | usa = Quandl.get("FRED/USAPOPL") 6 | canada = Quandl.get("FRED/CANPOPL") 7 | 8 | # Convert from thousands to millions 9 | china = china / 1000 10 | 11 | p = pd.concat([china, usa, canada], axis=1) 12 | p.columns = ['china', 'usa', 'canada'] 13 | 14 | p.reset_index(inplace=True) 15 | 16 | popmelt = pd.melt(p, id_vars='Date') 17 | 18 | popmelt.columns = ['Date', 'ISO', 'population'] 19 | 20 | popmelt['ISO'][popmelt['ISO'] == 'china'] = 'CHN' 21 | popmelt['ISO'][popmelt['ISO'] == 'usa'] = 'USA' 22 | popmelt['ISO'][popmelt['ISO'] == 'canada'] = 'CAN' 23 | 24 | popmelt.to_csv('population.csv', index=False) 25 | 26 | languages = {'country': 27 | ['Canada', 'China', 'United States of America'], 28 | 'language': 29 | ['English', 'Mandarin', 'English']} 30 | 31 | ldf = pd.DataFrame(languages) 32 | ldf.to_csv('languages.csv', index=False) 33 | -------------------------------------------------------------------------------- /day5/README.md: -------------------------------------------------------------------------------- 1 | 2 | To view the map, you must download the entire `map` directory. The map is 3 | powered by [OpenStreetMap](https://www.openstreetmap.org/). 4 | 5 | The mobile food permits data & San Francisco neighborhoods GeoJSON was sourced 6 | from San Francisco's [open data collection](https://data.sfgov.org/). 7 | 8 | The vehicle mileage data was sourced from the Environmental Protection Agency's 9 | National Vehicle and Fuel Emissions Laboratory, which makes data freely 10 | available [here](http://fueleconomy.gov/feg/download.shtml). 
11 | 12 | -------------------------------------------------------------------------------- /day5/backup/truck_counts.csv: -------------------------------------------------------------------------------- 1 | Neighborhood,Trucks 2 | South of Market,88 3 | Financial District,75 4 | Mission,61 5 | Central Waterfront,47 6 | Bayview,38 7 | Mission Bay,28 8 | Potrero Hill,26 9 | Produce Market,24 10 | Downtown / Union Square,22 11 | Hunters Point,20 12 | India Basin,18 13 | Bret Harte,17 14 | Dogpatch,15 15 | Northern Waterfront,14 16 | Showplace Square,13 17 | South Beach,11 18 | Rincon Hill,9 19 | Pacific Heights,9 20 | Mission Dolores,7 21 | Silver Terrace,4 22 | Western Addition,4 23 | Bernal Heights,4 24 | Civic Center,4 25 | Little Hollywood,4 26 | Presidio Heights,4 27 | Hayes Valley,3 28 | Candlestick Point SRA,3 29 | Cow Hollow,3 30 | Castro,3 31 | Apparel City,3 32 | Panhandle,2 33 | Lower Nob Hill,2 34 | North Beach,2 35 | Parnassus Heights,2 36 | Inner Sunset,2 37 | Cathedral Hill,2 38 | Tenderloin,2 39 | Anza Vista,2 40 | Marina,2 41 | Golden Gate Heights,2 42 | Russian Hill,2 43 | Lakeshore,1 44 | Portola,1 45 | Ingleside,1 46 | Polk Gulch,1 47 | Laurel Heights / Jordan Park,1 48 | Lower Haight,1 49 | Eureka Valley,1 50 | Outer Mission,1 51 | Crocker Amazon,1 52 | Lone Mountain,1 53 | Outer Richmond,1 54 | Noe Valley,1 55 | Westwood Park,1 56 | Laguna Honda,1 57 | Alamo Square,1 58 | -------------------------------------------------------------------------------- /day5/day5log.py: -------------------------------------------------------------------------------- 1 | import matplotlib.pyplot as plt 2 | import numpy as np 3 | x = np.linspace(0, 1, 100) 4 | y_cos = np.cos(x) 5 | y_sin = np.sin(x) 6 | plt.plot(x, y_cos) 7 | plt.show() 8 | plt.ion() 9 | plt.plot(x, y_cos) 10 | plt.plot(x, y_cos, 'r-x') 11 | plt.plot? 12 | plt.plot(x, y_cos, color='red', linestyle='-', marker='x') 13 | plt.plot(x, y_cos, x, y_sin) 14 | plt.plot(x, y_cos) 15 | plt.plot(x, y_sin) 16 | # use plt.subplots() to make multiple plots. 17 | fig, ax = plt.subplots(2, 1) 18 | ax.plot(x, y_sin) 19 | ax[0].plot(x, y_sin) 20 | ax[1].plot(x, y_cos) 21 | # plt.draw() forces the plot display to update. 22 | plt.draw() 23 | # plot in XKCD style! 
24 | with plt.xkcd(): 25 | plt.plot(x, y_sin) 26 | plt.plot(x, y_cos) 27 | %history -f day5log.py 28 | -------------------------------------------------------------------------------- /day5/map/data.json: -------------------------------------------------------------------------------- 1 | [{"Potrero Hill": 26, "Alamo Square": 1, "Civic Center": 4, "Presidio Heights": 4, "Downtown / Union Square": 22, "Dogpatch": 15, "Cathedral Hill": 2, "Tenderloin": 2, "South Beach": 11, "Lower Nob Hill": 2, "India Basin": 18, "Outer Mission": 1, "Golden Gate Heights": 2, "Bayview": 38, "Cow Hollow": 3, "Showplace Square": 13, "Eureka Valley": 1, "Russian Hill": 2, "Little Hollywood": 4, "Mission": 61, "Inner Sunset": 2, "Marina": 2, "Silver Terrace": 4, "Central Waterfront": 47, "Northern Waterfront": 14, "Lone Mountain": 1, "Anza Vista": 2, "Ingleside": 1, "Noe Valley": 1, "North Beach": 2, "Castro": 3, "Lakeshore": 1, "Laurel Heights / Jordan Park": 1, "Western Addition": 4, "Financial District": 75, "Polk Gulch": 1, "Rincon Hill": 9, "Lower Haight": 1, "Outer Richmond": 1, "Mission Bay": 28, "South of Market": 88, "Laguna Honda": 1, "Bret Harte": 17, "Apparel City": 3, "Mission Dolores": 7, "Parnassus Heights": 2, "Portola": 1, "Candlestick Point SRA": 3, "Hayes Valley": 3, "Crocker Amazon": 1, "Pacific Heights": 9, "Produce Market": 24, "Panhandle": 2, "Hunters Point": 20, "Bernal Heights": 4, "Westwood Park": 1}] -------------------------------------------------------------------------------- /day5/map/map.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 37 | 38 | 39 | 40 | 41 |
42 | 43 | 151 | -------------------------------------------------------------------------------- /day5/mpg_plots.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | ''' Plots for EPA fuel economy data. 3 | ''' 4 | 5 | import pandas as pd 6 | import seaborn as sns 7 | import matplotlib.pyplot as plt 8 | 9 | def main(): 10 | ''' Run a demo of the functions in this module. 11 | ''' 12 | vehicles = pd.read_csv('vehicles.csv') 13 | 14 | make_mpg_plot(vehicles) 15 | 16 | def make_mpg_plot(vehicles): 17 | ''' Make a plot of MPG by Make and Year. 18 | ''' 19 | # Take means by make and year. 20 | groups = vehicles.groupby(['year', 'make']) 21 | avg_mpg = groups.city_mpg.mean() 22 | 23 | # Convert the indexes to columns, then pivot. 24 | avg_mpg = avg_mpg.reset_index() 25 | mpg_matrix = avg_mpg.pivot('year', 'make', 'city_mpg') 26 | 27 | # Subset down to a few brands. 28 | brands = ['Honda', 'Toyota', 'Hyundai', 'BMW', 'Mercedes-Benz', 29 | 'Ferrari', 'Lamborghini', 'Maserati', 'Fiat', 'Bentley', 30 | 'Ford', 'Chevrolet', 'Dodge', 'Jeep'] 31 | brand_matrix = mpg_matrix[brands] 32 | 33 | # Plot the data using seaborn and matplotlib. 34 | # The `subplots()` method returns a figure and an axes. 35 | fig, ax = plt.subplots(1, 1) 36 | sns.heatmap(brand_matrix, cmap='BuGn', ax=ax) 37 | ax.set_xlabel('Make') 38 | ax.set_ylabel('Year') 39 | ax.set_title('MPG by Make and Year') 40 | 41 | # Setting a tight layout ensures the entire plot fits in the plotting 42 | # window. 43 | fig.tight_layout() 44 | 45 | # Show the figure. 46 | fig.show() 47 | # Alternatively: 48 | # fig.savefig('mpg_plot.png') 49 | 50 | # Only call `main()` when the script is executed directly (not when it's 51 | # imported with `import mpg_plots`). This allows the script to be used as an 52 | # application AND as a module in other scripts. 53 | if __name__ == '__main__': 54 | main() 55 | 56 | -------------------------------------------------------------------------------- /day5/notes.md: -------------------------------------------------------------------------------- 1 | 2 | There are many different packages for plotting with Python. These are split 3 | into two ecosystems, "matplotlib" and "javascript". 4 | 5 | * The "matplotlib" ecosystem has highly-customizable tools well-suited for 6 | printed graphics and interative (offline) applications. Some members: 7 | + matplotlib - general plotting 8 | + seaborn - "pretty" statistical plotting 9 | + ggplot - grammar of graphics plotting 10 | + basemap - old maps package, using shapefiles 11 | + cartopy - new maps package, using GeoJSON and recent Python GIS libraries 12 | * The "javascript" ecosystem is less mature, but growing fast. These modules 13 | are designed for interactive web graphics. Each leverages a javascript 14 | library. Many require an HTML server to work. Some members: 15 | + bokeh (D3.js) - general plotting, especially large/streaming data 16 | + vincent (vega.js) - general plotting 17 | + folium (leaflet.js) - interactive world maps (similar to Google Maps) 18 | + kartograph (kartograph.js) - interactive maps 19 | 20 | Since it's the most mature, we'll look at matplotlib first. 21 | 22 | matplotlib can be used interactively (PyLab) or in a non-interactive style. The 23 | interactive functions are in `matplotlib.pyplot`. 24 | 25 | The basic plotting function is `plot()`. It can produce line and scatter plots. 26 | Plots can be customized quickly with format strings, or more verbosely using 27 | parameters. 
Note the `plot()` can plot several lines in one call. 28 | 29 | Other plots are available: 30 | 31 | * acorr 32 | * bar 33 | * barh 34 | * boxplot 35 | * contour 36 | * errorbar 37 | * hist 38 | * scatter 39 | * violinplot 40 | * xcorr 41 | 42 | Writing non-interactive matplotlib code gives you more control over the 43 | resulting plots. In order to understand the documentation, you need to learn 44 | some jargon: 45 | 46 | * Figure - a drawing surface 47 | * Axes - a single plot, including its axes, title, lines, etc... 48 | * Artist - a single element of a plot; for example, a line 49 | 50 | It's possible to use non-interactive methods directly at the interpreter, but 51 | easuer to use them in a script. Every Python script is a module, meaning it 52 | can be imported in other scripts. Scripts are also very easy to document using 53 | docstrings (triple-quoted strings). 54 | 55 | -------------------------------------------------------------------------------- /day5/truck_map.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | ''' Map food trucks by neighborhood in San Francisco. 3 | ''' 4 | 5 | import json 6 | import folium 7 | import pandas as pd 8 | import shapely.geometry as geom 9 | 10 | def main(): 11 | ''' Produce a map of neighborhood food trucks in San Francisco. 12 | ''' 13 | # Load the neighborhoods GeoJSON data, and extract the neighborhood names 14 | # and shapes. 15 | nhoods = load_geo_json('sf_nhoods.json') 16 | features = extract_features(nhoods) 17 | 18 | # Load the food truck data, and extract (longitude, latitude) pairs. 19 | trucks = pd.read_csv('mobile_food.csv') 20 | loc = trucks[['Longitude', 'Latitude']].dropna() 21 | 22 | # Determine the neighborhood each food truck is in. 23 | trucks['Neighborhood'] = loc.apply(get_nhood, axis=1, features=features) 24 | 25 | counts = trucks['Neighborhood'].value_counts() 26 | counts = counts.reset_index() 27 | counts.columns = pd.Index(['Neighborhood', 'Trucks']) 28 | 29 | # Create a map. 30 | my_map = folium.Map(location=[37.77, -122.45], zoom_start=12) 31 | 32 | my_map.geo_json(geo_path = 'sf_nhoods.json', 33 | key_on='feature.properties.name', 34 | data = counts, columns = ['Neighborhood', 'Trucks'], 35 | fill_color='BuGn') 36 | 37 | my_map.create_map('map.html') 38 | 39 | def get_nhood(truck, features): 40 | ''' Identify the neighborhood of a given point. 41 | ''' 42 | truck = geom.Point(*truck) 43 | for name, boundary in features: 44 | if truck.within(boundary): 45 | return name 46 | 47 | return None 48 | 49 | def load_geo_json(path): 50 | ''' Load a GeoJSON file. 51 | ''' 52 | with open(path) as file: 53 | geo_json = json.load(file) 54 | 55 | return geo_json 56 | 57 | def extract_features(geo_json): 58 | ''' Extract names and geometries from a geo_json dict. 59 | ''' 60 | features = [] 61 | for feature in geo_json['features']: 62 | name = feature['properties']['name'] 63 | geometry = geom.asShape(feature['geometry']) 64 | 65 | features.append((name, geometry)) 66 | 67 | return features 68 | 69 | if __name__ == '__main__': 70 | main() 71 | 72 | -------------------------------------------------------------------------------- /day6/analysis.py: -------------------------------------------------------------------------------- 1 | ''' 2 | We'd like to do network analysis on a body of text. 3 | 4 | Build a graph where the nodes are names, and an edge exists if the two 5 | names appear in the same sentence. 
6 | 7 | The first rule about regular expressions: 8 | Avoid them when possible!!!! 9 | 10 | Data Golf! 11 | Today it's for a prize- a book on Python 12 | Hints: 13 | - An ASCII character of text is 1 byte 14 | - A typical word is 6 characters 15 | - A typical novel has 200K words 16 | - War and Peace is 3x a normal book 17 | 18 | How big is War and Peace? 19 | 20 | regex should match capital letters not at the beginning of a sentence. 21 | ''' 22 | 23 | import re 24 | from itertools import combinations 25 | import networkx as nx 26 | 27 | names = 'Nick Clark Rick Qi'.split(' ') 28 | 29 | namepairs = set(combinations(names, 2)) 30 | 31 | pattern = re.compile(r''' 32 | [^^] # Not the beginning of a string 33 | (? FROM 65 | # 66 | # How can we tell what tables are in the database? 67 | 68 | engine.table_names() 69 | 70 | # Let's select all columns in the 'aa' table. 71 | 72 | aa = pd.read_sql('SELECT * FROM aa', engine) 73 | 74 | # Can anyone tell what this dataset is? 75 | 76 | aa.head() 77 | aa.tail() 78 | 79 | # Let's try a more complicated query using `WHERE`. 80 | 81 | pd.read_sql('SELECT * FROM aa WHERE year >= 1950 AND year < 1960', engine) 82 | 83 | # Who's the oldest person in the table? 84 | 85 | pd.read_sql('SELECT * FROM aa ORDER BY age DESC LIMIT 10', engine) 86 | 87 | # YOUR TURN: 88 | # 89 | # * Who's the youngest person in the table? 90 | # * How many Hepburns are in the table? 91 | # * How many Californians have won? 92 | # * Who has the most wins? 93 | 94 | # The trick is to take advantage of SQL and Pandas together. Generally, you 95 | # want to use SQL to reduce the data to a manageable size (i.e., so it fits in 96 | # memory) and then use Pandas to investigate the fine details. 97 | 98 | -------------------------------------------------------------------------------- /day7/day7log1.py: -------------------------------------------------------------------------------- 1 | ls 2 | import pandas as pd 3 | pd.read_sql? 
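# pd.read_sql() takes a SQL query string plus a connection or SQLAlchemy engine
# to run it against, which is what the next few lines set up. A minimal sketch,
# assuming a SQLite file named example.db containing a table called some_table:
#     engine = sa.create_engine('sqlite:///example.db')
#     pd.read_sql('SELECT * FROM some_table', engine)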
4 | import sqlalchemy as sa 5 | engine = sa.create_engine('sql 6 | engine = sa.create_engine('sqlite:///aa.db') 7 | engine 8 | engine.table_names() 9 | query = 'SELECT * FROM aa' 10 | aa = pd.read_sql(query, engine) 11 | aa.head() 12 | aa.tail() 13 | query = 'SELECT * FROM aa WHERE person LIKE "%Brando"' 14 | query 15 | pd.read_sql(query, engine) 16 | query = 'SELECT person FROM aa WHERE year >= 1920 AND year < 1930' 17 | query 18 | pd.read_sql(query, engine) 19 | query 20 | engine = sa.create_engine('sqlite:///aa.db') 21 | query = 'SELECT * FROM aa ORDER BY age DESC LIMIT 5' 22 | query 23 | pd.read_sql(query, engine) 24 | q = 'SELECT * FROM aa ORDER BY age ASC LIMIT 5' 25 | query 26 | q 27 | pd.read_sql(q, engine) 28 | q = 'SELECT * FROM aa WHERE person LIKE "%Hepburn"' 29 | q 30 | pd.read_sql(q, engine) 31 | q = 'SELECT * FROM aa WHERE birthplace == "California"' 32 | q 33 | pd.read_sql(q, engine) 34 | q = 'SELECT DISTINCT person f 35 | q = 'SELECT DISTINCT person FROM aa WHERE birthplace == "California"' 36 | q 37 | pd.read_sql(q, engine) 38 | pd.read_sql(q, engine).size 39 | aa 40 | aa.Person.value_counts() 41 | q = 'SELECT COUNT(*) FROM aa GROUP BY person' 42 | q 43 | pd.read_sql(q, engine) 44 | q 45 | q = 'SELECT person, COUNT(*) as Count FROM aa GROUP BY person ORDER BY count DESC LIMIT 5' 46 | q 47 | q 48 | pd.read_sql(q, engine) 49 | %history -f log1.py 50 | -------------------------------------------------------------------------------- /day7/day7log2.py: -------------------------------------------------------------------------------- 1 | import sqlalchemy as sa 2 | engine = sa.create_engine('sqlite:///atus.db') 3 | engine.table_names() 4 | meta = sa.MetaData(engine, reflect=True) 5 | meta.tables 6 | tables = meta.tables 7 | type(tables) 8 | tables.keys() 9 | engine.table_names() 10 | who = tables['who'] 11 | type(who) 12 | who 13 | who.columns 14 | print(who.columns) 15 | act = tables['activity'] 16 | print(act.columns) 17 | from sqlalchemy import select, literal_column, func 18 | sums_q = select([func.sum(act.c.TUACTDUR24).label('sleep')]).\ 19 | group_by(act.c.TUCASEID) 20 | print(sums_q) 21 | sums_q = select([func.sum(act.c.TUACTDUR24).label('sleep')]).\ 22 | where(act.c.TRCODE == 10101).\ 23 | group_by(act.c.TUCASEID) 24 | print(sums_q) 25 | conn = engine.connect() 26 | result = conn.execute(sums_q) 27 | for row in result: 28 | print(row) 29 | import pandas as pd 30 | pd.DataFrame.from_records(row for row in result) 31 | result = conn.execute(sums_q) 32 | pd.DataFrame.from_records(row for row in result) 33 | avg_q = select([func.avg(literal_column('Sleep').label('AVG_Sleep')]).\ 34 | select_from(sums_q) 35 | 36 | 37 | 38 | ) 39 | avg_q = select([func.avg(literal_column('Sleep')).label('AVG_Sleep')]).\ 40 | select_from(sums_q) 41 | print(avg_q) 42 | result = conn.execute(avg_q) 43 | result 44 | result = result[0] 45 | result = resultresult = conn.execute(avg_q).fetchall() 46 | result 47 | result = result[0][0] 48 | result 49 | result / 60 50 | %history log2.py 51 | %history -f log2.py 52 | -------------------------------------------------------------------------------- /day8/backup/notes.py: -------------------------------------------------------------------------------- 1 | from scipy import stats 2 | # stats is a scipy submodule, not an object 3 | 4 | import matplotlib.pyplot as plt 5 | 6 | plt.ion() 7 | # Starting out: 8 | # Statistically you might say Z ~ N(0, 1) 9 | # To denote a normal random variable 10 | Z = stats.norm(0, 1) 11 | # This is how you do it in Python 12 | 
# There is a tight correspondence between the code and the mathematical notation 13 | # :) 14 | # Both Python and Julia have had the luxury of learning from R 15 | # I find random variables much more intuitive here 16 | # For example: 17 | # What's the mean? 18 | Z.mean() 19 | Z.median() 20 | Z.interval(0.95) 21 | # A 95 percent confidence interval 22 | # The object oriented approach is quite convenient 23 | Z.cdf(2) 24 | Z.moment? 25 | Z.moment(0) 26 | Z.moment(1) 27 | Z.moment(2) 28 | Z.moment(3) 29 | Z.pdf(0) 30 | Z.ppf(0.975) 31 | # percentage point function- also called quantile 32 | # pmf is probability mass function- but continuous RV's don't have a pmf 33 | Z.pmf(0) 34 | Z.rvs(10) 35 | Z.stats() 36 | Z.std() 37 | # scipy.stats supports nearly 100 different distributions 38 | # And the all behave EXACTLY like this :) 39 | stats.zipf? 40 | Z2 = stats.zipf(4) 41 | Z2.mean() 42 | x = np.linspace(-2, 2) 43 | # Visualize it 44 | plt.plot(x, Z.pdf(x)) 45 | plt.clf() 46 | plt.plot(x, Z.pdf(x)) 47 | x = np.linspace(-4, 4) 48 | plt.plot(x, Z.pdf(x)) 49 | plt.plot(x, Z.cdf(x)) 50 | # You might ask- what's the survival function? 51 | plt.plot(x, Z.sf(x)) 52 | 53 | 54 | # Now lets look at the sleep data that Nick introduced last time 55 | # Some exploratory analysis of this distribution 56 | # We'd like to see if we can fit a statistical distribution to this data. 57 | 58 | import pandas as pd 59 | sleep = pd.read_csv('sleepminutes.csv') 60 | sleep.head() 61 | type(sleep) 62 | sleep = sleep['minutes'] 63 | # We're just working with one series here 64 | plt.hist(sleep, bins=30) 65 | # Suppose that we want to model this with an appropriate statistical distribution 66 | # The only possible range of values are between 0 and 24 * 60 67 | 24 * 60 68 | sleep.max() 69 | # So somebody slept 23.5 / 24 hours 70 | # It's possible to fit this using a beta distribution 71 | # Beta is only defined on [0, 1] 72 | sleep.min() 73 | # So we'll need to scale it 74 | # All RV's in Scipy have parameters for shape, location and scale 75 | stats.beta.fit? 76 | stats.beta.fit(sleep, floc=0, fscale = 24 * 60) 77 | # floc means 'fixed location' 78 | bparams = stats.beta.fit(sleep, floc=0, fscale = 24 * 60) 79 | # We know the shape and scale, so we'll fit using this knowledge 80 | # This is the MLE for alpha and beta parameters 81 | # Now we make a random variable 82 | sbeta = stats.beta(*bparams) 83 | sbeta 84 | sbeta.interval(1) 85 | sbeta.mean() 86 | sleep.mean() 87 | 88 | # So sbeta here is the beta distribution fitted to this data 89 | # Let's plot it and make sure 90 | # that it looks reasonable 91 | 92 | x = np.linspace(0, 60*24) 93 | h, edges = np.histogram(sleep, 30, normed=True) 94 | plt.bar(edges[:-1], h, width=np.diff(edges)) 95 | plt.plot(x, sbeta.pdf(x), linewidth=4, color='orange') 96 | 97 | plt.clf() 98 | 99 | # Titanic model building 100 | preds = rfmod.predict(X) 101 | 102 | plt.boxplot(scores) 103 | -------------------------------------------------------------------------------- /day8/backup/titanic.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Predict survivors of the Titanic 3 | ''' 4 | 5 | import pandas as pd 6 | from sklearn.cross_validation import cross_val_score 7 | from sklearn.ensemble import RandomForestClassifier 8 | 9 | 10 | ################################################## 11 | # 12 | # 1) Prepare the data 13 | # 14 | # Machine learning is regression with a sexier name. 15 | # 16 | # Fitting regression models requires clean numeric data. 
17 | # Typically we have an n x 1 vector response y and an n x p 18 | # predictor matrix X. 19 | # 20 | # Much of the art is in preparing an appropriate X. 21 | # Tasks include: 22 | # - Separating validation data set 23 | # - Joins / calculated columns 24 | # - transform categorical data and text 25 | # - impute missing data 26 | # 27 | ################################################## 28 | 29 | # Read in the data 30 | titanic = pd.read_csv('titanic.csv') 31 | 32 | y = titanic['Survived'] 33 | 34 | # Some numeric features 35 | features = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare'] 36 | 37 | X = titanic[features].copy() 38 | 39 | # Add in the gender 40 | X['Sex'], genders = pd.factorize(titanic['Sex']) 41 | 42 | # Need full rows 43 | #X.fillna(X.mean(), inplace=True) 44 | # Drop rows with missing data - although we could impute 45 | fullrows = -X.isnull().any(axis=1) 46 | X = X[fullrows] 47 | y = y[fullrows] 48 | 49 | ################################################## 50 | # 51 | # 2) Fit and evaluate models 52 | # 53 | # Fitting models is easy in scikit learn. 54 | # 55 | # Some techniques: 56 | # - Cross validation 57 | # - Grid searches over model parameter spaces 58 | # - Various ways to score 59 | # 60 | ################################################## 61 | 62 | rfmod = RandomForestClassifier() 63 | rfmod.fit(X, y) 64 | 65 | # How well did it perform? 66 | # 10 fold cross validation 67 | scores = cross_val_score(rfmod, X, y, cv=10) 68 | 69 | ################################################## 70 | # 71 | # 3) Predict 72 | # 73 | # Make predictions on new data using fitted model 74 | # 75 | ################################################## 76 | -------------------------------------------------------------------------------- /day8/betafit.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Fitting a beta distribution to the sleep data 3 | ''' 4 | 5 | import numpy as np 6 | from scipy import stats 7 | import matplotlib.pyplot as plt 8 | import pandas as pd 9 | 10 | plt.style.use('ggplot') 11 | 12 | sleep = pd.read_csv('sleepminutes.csv') 13 | sleep = sleep['minutes'] / 60 14 | 15 | # Fit the MLE of the beta distribution 16 | p = stats.beta.fit(sleep, floc=0, fscale=24) 17 | 18 | # The corresponding random variable 19 | Xsleep = stats.beta(*p) 20 | 21 | plt.hist(sleep, bins=30, normed=True) 22 | x = np.linspace(0, 24) 23 | plt.plot(x, Xsleep.pdf(x), linewidth=4) 24 | s = 'alpha: {:.3}, beta: {:.3}, loc: {}, scale: {}' 25 | plt.title(s.format(*p)) 26 | plt.xlabel('Hours of sleep') 27 | plt.savefig('sleep.pdf') 28 | -------------------------------------------------------------------------------- /day8/day8.py: -------------------------------------------------------------------------------- 1 | from scipy import stats 2 | # Random variables 3 | # Statistically: Z ~ N(0, 1) 4 | Z = stats.norm(0, 1) 5 | Z.mean() 6 | Z.interval(0.95) 7 | Z.ppf(0.975) 8 | Z.std() 9 | Z.pdf(0) 10 | import matplotlib.pyplot as plt 11 | plt.ion() 12 | # plot density of normal from -4 to 4 13 | import numpy as np 14 | x = np.linspace(-4, 4) 15 | Z 16 | Z.pdf(x) 17 | plt.plot(x, Z.pdf(x)) 18 | plt.plot(x, Z.cdf(x)) 19 | Z.sf? 
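# sf is the survival function: sf(x) = 1 - cdf(x), the probability of exceeding x.
# For the standard normal, for example, Z.sf(1.96) is roughly 0.025.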
20 | plt.plot(x, Z.sf(x)) 21 | import pandas as pd 22 | sleep = pd.read_csv('sleepminutes.csv') 23 | sleep.shape 24 | sleep.head() 25 | sleep = sleep['minutes'] 26 | type(sleep) 27 | # Lets model this with a statistical distribution 28 | plt.hist(sleep) 29 | sleep.min() 30 | sleep.max() 31 | 24 * 60 32 | # Goal- fit the MLE of the beta distribution 33 | # to our sleep data 34 | stats.beta.fit(sleep, floc=0, fscale=24*60) 35 | Xsleep = stats.beta(stats.beta.fit(sleep, floc=0, fscale=24*60)) 36 | Xsleep = stats.beta(*stats.beta.fit(sleep, floc=0, fscale=24*60)) 37 | Xsleep 38 | stats.beta.fit(sleep, floc=0, fscale=24*60)) 39 | stats.beta.fit(sleep, floc=0, fscale=24*60)) 40 | Xsleep 41 | stats.beta? 42 | sleep.mean() 43 | Xsleep.mean() 44 | # Now lets plot it 45 | # Now lets plot it 46 | h, edges = np.histogram(sleep, 30, normed=True) 47 | h 48 | plt.bar(edges[:-1], h, width=np.diff(edges)) 49 | plt.clf() 50 | plt.bar(edges[:-1], h, width=np.diff(edges)) 51 | # Now add our fitted beta distribution 52 | x = np.linspace(0, 60*24) 53 | x 54 | x = np.linspace(0, 60*24) 55 | plt.plot(x, Xsleep.pdf(x), linewidth=4, color='orange') 56 | # Now let's move on to the machine learning 57 | clear 58 | %run kaggle.py 59 | type(titanic) 60 | titanic.dtypes() 61 | titanic.dtypes 62 | clear 63 | titanic.dtypes 64 | clear 65 | %run kaggle.py 66 | X.shape 67 | pd.factorize? 68 | clear 69 | titanic.dtypes 70 | %run kaggle.py 71 | clear 72 | genders 73 | a, b = (0, 1) 74 | a 75 | b 76 | a, b = 0, 1 77 | a 78 | b 79 | clear 80 | %run kaggle.py 81 | plt.hist? 82 | clear 83 | X.isnull().sum() 84 | # Pull out stats module from Scipy 85 | from scipy import stats 86 | %run kaggle.py 87 | X.shape 88 | clear 89 | rfmod 90 | rfmod 91 | rfmod.score() 92 | # We're going to do cross validation 93 | from sklearn.cross_validation import cross_val_score 94 | scores = cross_val_score(rfmod, X, y, cv=10) 95 | scores 96 | plt.box(scores) 97 | q 98 | plt.clf() 99 | plt.box(scores) 100 | plt.boxplot(scores) 101 | scores 102 | plt.boxplot(scores) 103 | rfmod.predict(X) 104 | rfmod.predict_proba(X) 105 | a = rfmod.predict_proba(X) 106 | a.shape 107 | len(a) 108 | a.size 109 | %history -f day8.py 110 | -------------------------------------------------------------------------------- /day8/final.mdown: -------------------------------------------------------------------------------- 1 | Quick summary: 2 | 3 | Web API's - requests, JSON 4 | numerical computing- Numpy 5 | Munging -pandas 6 | DataVis, maps - matplotlib 7 | graphs, text - networkx, re 8 | databases - sqlalchemy 9 | stats / ML - scipy, sklearn 10 | 11 | You can be extremely productive with these tools. 12 | 13 | But theree's so much more!! 14 | 15 | How to continue learning: 16 | - Kaggle 17 | - Project Euler 18 | - Read good code 19 | -------------------------------------------------------------------------------- /day8/kaggle.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Titanic data set 3 | 4 | Predict who will survive 5 | ''' 6 | 7 | import pandas as pd 8 | from sklearn.ensemble import RandomForestClassifier 9 | 10 | ################################################## 11 | # 12 | # 1) Prepare the data 13 | # 14 | # Machine learning - better marketing 15 | # 16 | # Fitting regression models requires clean numeric data. 17 | # Typically we have an n x 1 vector response y and an n x p 18 | # predictor matrix X. 19 | # 20 | # Much of the art is in preparing an appropriate X. 
21 | # Tasks include: 22 | # - Separating validation data set 23 | # - Joins / calculated columns 24 | # - transform categorical data and text 25 | # - impute missing data 26 | # 27 | ################################################## 28 | 29 | titanic = pd.read_csv('titanic.csv') 30 | 31 | y = titanic['Survived'] 32 | 33 | # Features to include in the model 34 | features = ['Pclass', 'SibSp', 'Parch', 'Fare'] 35 | 36 | X = titanic[features].copy() 37 | 38 | # Add in the gender 39 | X['Sex'], genders = pd.factorize(titanic['Sex']) 40 | 41 | ################################################## 42 | # 43 | # 2) Fit and evaluate models 44 | # 45 | # Fitting models is easy in scikit learn. 46 | # 47 | # Some techniques: 48 | # - Cross validation 49 | # - Grid searches over model parameter spaces 50 | # - Various ways to score 51 | # 52 | ################################################## 53 | 54 | 55 | rfmod = RandomForestClassifier() 56 | 57 | rfmod.fit(X, y) 58 | 59 | ################################################## 60 | # 61 | # 3) Predict 62 | # 63 | # Make predictions on new data using fitted model 64 | # 65 | ################################################## 66 | -------------------------------------------------------------------------------- /exercise1/exercise1.md: -------------------------------------------------------------------------------- 1 | # Exercise 1 - Agricultural Web Scraping 2 | 3 | _Update 2_: Hint should be- 4 | 5 | 7th is Pennsylvania with $43,487,000. And there are only 8 states 6 | listed, not 10. 7 | 8 | Thanks to Rylan Schaeffer and Haomiao Meng for pointing this out. 9 | 10 | _Update 1_: Keep your key private, ie. not on Github. 11 | 12 | In this assignment we will write a client that pulls data from the 13 | US Department of Agriculture Quick Stats database. 14 | 15 | The goals of this exercise are: 16 | 17 | - Become more familiar with the basics of Python 18 | - Develop skill in interacting with web API's 19 | 20 | The documentation for this API is available 21 | [here](http://quickstats.nass.usda.gov/api). Read it carefully. 22 | 23 | We've provided a template in this folder for your use called 24 | `usda_blank.py`. After you request a key, your task is to fill in the 25 | function bodies in the template and create a working program. Some sample 26 | queries are included to get started. 27 | 28 | From Ipython you can use %run to run the sample queries and verify that the 29 | results match what you expected. Then try building your own queries. 30 | 31 | ### Task: 32 | 33 | - List the top 8 tobacco producing states in 2012 as measured by 34 | production in dollars. 
35 | -------------------------------------------------------------------------------- /exercise1/usda.py: -------------------------------------------------------------------------------- 1 | ''' 2 | A simple client for USDA statistics 3 | http://quickstats.nass.usda.gov/api 4 | 5 | Found the data set in the data.gov catalog: 6 | http://catalog.data.gov/dataset/quick-stats-agricultural-database-api 7 | 8 | Fun fact: This actually queries a database with 31 million records 9 | ''' 10 | 11 | import requests 12 | 13 | 14 | # One could also remove these lines and manually set the key via: 15 | # usda_key = 16 | with open('usda_key.txt') as f: 17 | usda_key = f.read().rstrip() 18 | 19 | 20 | def get_param_values(param, key=usda_key): 21 | ''' 22 | Returns the possible values for a single parameter 'param' 23 | 24 | >>> get_param_values('sector_desc')[:3] 25 | ['ANIMALS & PRODUCTS', 'CROPS', 'DEMOGRAPHICS'] 26 | 27 | ''' 28 | url = 'http://quickstats.nass.usda.gov/api/get_param_values' 29 | parameters = {'param': param, 'key': key} 30 | response = requests.get(url=url, params=parameters) 31 | return response.json()[param] 32 | 33 | 34 | def query(parameters, key=usda_key): 35 | ''' 36 | Returns the JSON response from the USDA agricultural database 37 | 38 | 'parameters' is a dictionary of parameters that can be referenced here: 39 | http://quickstats.nass.usda.gov/api 40 | 41 | Example: Return all the records around cattle in Tehama County 42 | 43 | >>> cowparams = {'commodity_desc': 'CATTLE', 44 | 'state_name': 'CALIFORNIA', 45 | 'county_name': 'TEHAMA'} 46 | >>> tehamacow = query(cowparams) 47 | 48 | ''' 49 | url = 'http://quickstats.nass.usda.gov/api/api_GET' 50 | parameters['key'] = key 51 | response = requests.get(url=url, params=parameters) 52 | try: 53 | return response.json()['data'] 54 | except KeyError: 55 | return response.json() 56 | 57 | 58 | if __name__ == '__main__': 59 | 60 | import operator 61 | # A few examples of usage 62 | 63 | # Possible values for 'commodity_desc' 64 | # commodity_desc = get_param_values('commodity_desc') 65 | # Expect: 66 | # ['AG LAND', 'AG SERVICES', 'AG SERVICES & RENT', 67 | # 'ALMONDS', ... 
68 | 69 | # Value of rice crops in Yolo (Davis) county since 2005 70 | riceparams = {'sector_desc': 'CROPS', 71 | 'commodity_desc': 'RICE', 72 | 'state_name': 'CALIFORNIA', 73 | 'county_name': 'YOLO', 74 | 'year__GE': '2005', 75 | 'unit_desc': '$', 76 | } 77 | 78 | # yolorice = query(riceparams) 79 | 80 | # Try using a dictionary comprehension to filter 81 | # yearvalue = {x['year']: x['Value'] for x in yolorice} 82 | # Expect: 83 | # {'2007': '26,697,000', '2012': '51,148,000'} 84 | 85 | # This was the sales, not the production 86 | tobacco = query({'commodity_desc': 'TOBACCO', 87 | 'sector_desc': 'CROPS', 88 | 'source_desc': 'CENSUS', 89 | 'domain_desc': 'TOTAL', 90 | 'agg_level_desc': 'STATE', 91 | 'year': '2012', 92 | 'freq_desc': 'ANNUAL', 93 | 'unit_desc': '$', 94 | }) 95 | 96 | # Rylan's query, which does production 97 | tprod = query({'sector_desc': 'CROPS', 98 | 'commodity_desc': 'TOBACCO', 99 | 'year': '2012', 100 | 'agg_level_desc': 'STATE', 101 | 'statisticcat_desc': 'PRODUCTION', 102 | 'short_desc': 'TOBACCO - PRODUCTION, MEASURED IN $' 103 | }) 104 | 105 | def cleanup(value): 106 | ''' 107 | Massage data into proper form 108 | ''' 109 | try: 110 | return int(value.replace(',', '')) 111 | 112 | # Some contain strings with '(D)' 113 | except ValueError: 114 | return 0 115 | 116 | t2 = [(cleanup(x['Value']), x['state_name']) for x in tobacco] 117 | t2.sort(key=operator.itemgetter(0), reverse=True) 118 | 119 | tprod2 = [(cleanup(x['Value']), x['state_name']) for x in tprod] 120 | tprod2.sort(key=operator.itemgetter(0), reverse=True) 121 | 122 | print(t2) 123 | -------------------------------------------------------------------------------- /exercise1/usda_blank.py: -------------------------------------------------------------------------------- 1 | ''' 2 | A simple client for USDA statistics 3 | http://quickstats.nass.usda.gov/api 4 | 5 | Found the data set in the data.gov catalog: 6 | http://catalog.data.gov/dataset/quick-stats-agricultural-database-api 7 | 8 | Fun fact: This actually queries a database with 31 million records 9 | ''' 10 | 11 | import requests 12 | 13 | 14 | # Posting your keys online for anyone to find and use is a BAD IDEA! 15 | 16 | # You'll need to change this to the key from the email. 17 | # Only use this technique if this script will remain private, ie. stored 18 | # just on your local computer. 19 | usda_key = 'your key here' 20 | 21 | # If you save this publicly to Github then it's better to keep your key in 22 | # a separate private plain text file called 'usda_key.txt' which is NOT 23 | # added / committed to the repository. 
24 | try: 25 | with open('usda_key.txt') as f: 26 | usda_key = f.read().rstrip() 27 | except FileNotFoundError: 28 | pass 29 | 30 | 31 | def get_param_values(param, key=usda_key): 32 | ''' 33 | Returns the possible values for a single parameter 'param' 34 | 35 | >>> get_param_values('sector_desc')[:3] 36 | ['ANIMALS & PRODUCTS', 'CROPS', 'DEMOGRAPHICS'] 37 | 38 | ''' 39 | # Your task- fill this in 40 | pass 41 | 42 | 43 | def query(parameters, key=usda_key): 44 | ''' 45 | Returns the JSON response from the USDA agricultural database 46 | 47 | 'parameters' is a dictionary of parameters that can be referenced here: 48 | http://quickstats.nass.usda.gov/api 49 | 50 | Example: Return all the records around cattle in Tehama County 51 | 52 | >>> cowparams = {'commodity_desc': 'CATTLE', 53 | 'state_name': 'CALIFORNIA', 54 | 'county_name': 'TEHAMA'} 55 | >>> tehamacow = query(cowparams) 56 | 57 | ''' 58 | # Your task- fill this in 59 | pass 60 | 61 | 62 | if __name__ == '__main__': 63 | # A few examples of usage 64 | 65 | # Possible values for 'commodity_desc' 66 | commodity_desc = get_param_values('commodity_desc') 67 | # Expect: 68 | # ['AG LAND', 'AG SERVICES', 'AG SERVICES & RENT', 69 | # 'ALMONDS', ... 70 | 71 | # Value of rice crops in Yolo (Davis) county since 2005 72 | riceparams = {'sector_desc': 'CROPS', 73 | 'commodity_desc': 'RICE', 74 | 'state_name': 'CALIFORNIA', 75 | 'county_name': 'YOLO', 76 | 'year__GE': '2005', 77 | 'unit_desc': '$', 78 | } 79 | 80 | yolorice = query(riceparams) 81 | 82 | # Try using a dictionary comprehension to filter 83 | yearvalue = {x['year']: x['Value'] for x in yolorice} 84 | # Expect: 85 | # {'2007': '26,697,000', '2012': '51,148,000'} 86 | -------------------------------------------------------------------------------- /exercise2/exercise2.md: -------------------------------------------------------------------------------- 1 | # Exercise 2 - Two-dimensional Random Walk 2 | 3 | In this exercise, you will write a two-dimensional [random walk][] simulator. 4 | The walk starts at the origin (0, 0). 5 | At each time point, the walker randomly selects a direction (up, down, left, or 6 | right) and moves one step in that direction. The walk is symmetric if all 7 | directions have equal probability. 8 | 9 | The goals of the exercise are: 10 | 11 | * Become accustomed to using Numpy's array and random number generation 12 | facilities 13 | * Learn about functions, function parameters, and default arguments 14 | * Practice incremental development 15 | 16 | You may need to consult the [Python][] and [Numpy][] documentation to complete 17 | this exercise. 18 | 19 | A template file, `random_walk.py`, has been provided. As you work through the 20 | tasks and modify the template file, you can load your work in IPython with the 21 | command 22 | 23 | %run random_walk.py 24 | 25 | and then test your code by calling the corresponding function directly. For 26 | example, 27 | 28 | x, y = random_walk(10) 29 | 30 | The template also includes a function `draw_walk()` which will animate the path 31 | traced out by the walk; it may be helpful for testing and entertainment 32 | purposes. 33 | 34 | ## Task 1 35 | 36 | Fill in the function `random_walk()`, which accepts an arbitrary number of 37 | steps to take and returns x and y coordinates of the walker at each step. These 38 | should be Numpy arrays, and should include the starting position (0, 0). The 39 | following algorithm is suggested: 40 | 41 | 1. Randomly sample the direction moved at each step. 
See the `numpy.random` 42 | submodule. 43 | 2. Convert the sampled directions into moves (e.g., up is a (0, 1) move). 44 | + Preallocate arrays for storing the moves with `numpy.zeros()`. 45 | + The `numpy.where()` function may prove useful here. 46 | + Consider storing moves in the x direction and moves in the y direction in 47 | separate arrays. 48 | 3. Use cumulative summation of the moves to calculate the x and y coordinates. 49 | 50 | ## Task 2 51 | 52 | Modify the `random_walk()` function to accept starting position parameters 53 | `x_start` and `y_start`: 54 | 55 | def random_walk(n, x_start, y_start): 56 | # < your code > 57 | return x, y 58 | 59 | Adjust the body of the function to take the starting position of the walk into 60 | account. 61 | 62 | Make the `random_walk()` function start at the origin (0, 0) when 63 | `x_start` and `y_start` are not specified by adding default arguments: 64 | 65 | def random_walk(n, x_start = 0, y_start = 0): 66 | # < your code > 67 | return x, y 68 | 69 | Test that the function still works correctly. 70 | 71 | ## Task 3 72 | 73 | Add a `p` parameter to `random_walk()`, where `p` is a vector of probabilities 74 | for up, down, left, and right. Modify the body of the function to take this 75 | parameter into account. 76 | 77 | When `p` is not specified, default to a symmetric random walk. You could use 78 | `numpy.ones()` and vectorization for this, but there are other possibilities as 79 | well. 80 | 81 | Once again, test that the function behaves as expected. 82 | 83 | ## Bonus Task 84 | 85 | Use `if` statements to check that `p` has length 4 and sums to 1. If either of 86 | these conditions is not true, [raise an exception][exception]: 87 | 88 | raise ValueError('p does not have length 4!') 89 | 90 | [random walk]: https://en.wikipedia.org/wiki/Random_walk 91 | [Python]: https://docs.python.org/3/tutorial 92 | [Numpy]: http://docs.scipy.org/doc/numpy/reference 93 | [exception]: http://stackoverflow.com/questions/2052390/how-do-i-manually-throw-raise-an-exception-in-python 94 | 95 | 96 | -------------------------------------------------------------------------------- /exercise2/random_walk.py: -------------------------------------------------------------------------------- 1 | ''' 2 | A two-dimensional random walk simulator and animator. 3 | ''' 4 | 5 | # The turtle package is part of Python's standard library. It provides some 6 | # very primitive graphics capabilities. For more details see 7 | # 8 | # https://docs.python.org/3/library/turtle.html 9 | # 10 | import turtle 11 | 12 | import numpy as np 13 | 14 | def random_walk(n): 15 | ''' Simulate a two-dimensional random walk. 16 | 17 | Args: 18 | n number of steps 19 | 20 | Returns: 21 | Two Numpy arrays containing the x and y coordinates, respectively, at 22 | each step (including the initial position). 23 | ''' 24 | 25 | # Your task: fill this in. 26 | 27 | return x, y 28 | 29 | 30 | 31 | # Notice that the documentation automatically shows up when you use ? 32 | def draw_walk(x, y, speed = 'slowest', scale = 20): 33 | ''' Animate a two-dimensional random walk. 34 | 35 | Args: 36 | x x positions 37 | y y positions 38 | speed speed of the animation 39 | scale scale of the drawing 40 | ''' 41 | # Reset the turtle. 42 | turtle.reset() 43 | turtle.speed(speed) 44 | 45 | # Combine the x and y coordinates. 46 | walk = zip(x * scale, y * scale) 47 | start = next(walk) 48 | 49 | # Move the turtle to the starting point. 50 | turtle.penup() 51 | turtle.goto(*start) 52 | 53 | # Draw the random walk.
54 | turtle.pendown() 55 | for _x, _y in walk: 56 | turtle.goto(_x, _y) 57 | 58 | -------------------------------------------------------------------------------- /exercise2/random_walk_solution.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | ''' 3 | A two-dimensional random walk simulator and animator. 4 | ''' 5 | 6 | # The turtle package is part of Python's standard library. It provides some 7 | # very primitive graphics capabilities. For more details see 8 | # 9 | # https://docs.python.org/3/library/turtle.html 10 | # 11 | import turtle 12 | 13 | import numpy as np 14 | 15 | def random_walk(n, x_start=0, y_start=0, p=np.repeat(0.25, 4)): 16 | ''' Simulate a two-dimensional random walk. 17 | 18 | Args: 19 | n number of steps 20 | x_start starting x position 21 | y_start starting y position 22 | p probabilities of (right, left, up, down) 23 | 24 | Returns: 25 | Two Numpy arrays containing the x and y coordinates, respectively, at 26 | each step (including the initial position). 27 | ''' 28 | # Sample the steps. 29 | steps = np.random.choice(4, size=n, p=p) 30 | 31 | # Preallocate position vectors. 32 | x_steps = np.zeros(n + 1, dtype='int') 33 | y_steps = np.zeros(n + 1, dtype='int') 34 | 35 | # Fill in the steps. 36 | x_steps[0], y_steps[0] = x_start, y_start 37 | 38 | x_steps[1:] += (steps == 0) 39 | x_steps[1:] -= (steps == 1) 40 | 41 | y_steps[1:] += (steps == 2) 42 | y_steps[1:] -= (steps == 3) 43 | 44 | # Compute the positions. 45 | x = x_steps.cumsum() 46 | y = y_steps.cumsum() 47 | 48 | return x, y 49 | 50 | # Notice that the documentation automatically shows up when you use ? 51 | def draw_walk(x, y, speed = 'slowest', scale = 20): 52 | ''' Animate a two-dimensional random walk. 53 | 54 | Args: 55 | x x positions 56 | y y positions 57 | speed speed of the animation 58 | scale scale of the drawing 59 | ''' 60 | # Reset the turtle. 61 | turtle.reset() 62 | turtle.speed(speed) 63 | 64 | # Combine the x and y coordinates. 65 | walk = zip(x * scale, y * scale) 66 | start = next(walk) 67 | 68 | # Move the turtle to the starting point. 69 | turtle.penup() 70 | turtle.goto(*start) 71 | 72 | # Draw the random walk. 73 | turtle.pendown() 74 | for _x, _y in walk: 75 | turtle.goto(_x, _y) 76 | 77 | -------------------------------------------------------------------------------- /exercise3/exercise3.mdown: -------------------------------------------------------------------------------- 1 | # Exercise 3 - War and Peace and Text 2 | 3 | In this exercise we'll use Python to read the book 'War and Peace' 4 | and create a graph of character relationships using `networkx`. 5 | 6 | The goal of this assignment are: 7 | - become more comfortable working with text 8 | - think about data sources in a new way 9 | - gain confidence in building graphs 10 | 11 | ### Download data 12 | 13 | Download the text for 'War and Peace' from Project Gutenberg at 14 | https://www.gutenberg.org/cache/epub/2600/pg2600.txt 15 | 16 | ### Regular Expressions 17 | 18 | Use the regular expression that we wrote in class to find all the 19 | characters / places. Note that it may need some modification so that it 20 | doesn't pick up the space in front. You'll also need to modify it to work 21 | with other sentence delimiters besides '.'. Consider '?' and '!'. 22 | 23 | Note- this doesn't have to be perfect. You're going to pick up names of 24 | places and words like 'Prince'. That's ok. 
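For example, one possible starting point is a fixed-width negative lookbehind
that skips capitalized words sitting right after a sentence delimiter. This is
only a sketch (not necessarily the exact pattern from class), and the sample
text is made up:

    import re

    name_pattern = re.compile(r'''
        (?<![.!?]\s)        # not immediately after '.', '?', or '!' plus a space
        \b([A-Z][a-z]+)\b   # a capitalized word
        ''', re.VERBOSE)

    text = "Pierre spoke. Natasha answered! Did Andrew hear?"
    name_pattern.findall(text)   # ['Pierre', 'Andrew']

Capitalized words at the very start of the text still slip through, which is
fine given that the results don't have to be perfect.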
25 | 26 | ### Count appearances 27 | 28 | Count the number of times that each character/name appears in the text. 29 | Which are the top 10 and what are their counts? 30 | 31 | ### Together in sentences 32 | 33 | Use the list of names to build an undirected graph in networkx. 34 | Weight each edge according to the number of times that the two appear in 35 | the same sentence together. For example, if 'Sonya' and 'Lisa' appear 36 | together in 20 sentences, then the weight of that edge should be 20. 37 | 38 | ## Visualize 39 | 40 | Use `networkx` to visualize the top 10 characters. Let the appearance count 41 | determine the size of the node. 42 | Use the weight of the edge to determine the aesthetics of the line. 43 | Something like this: 44 | http://networkx.github.io/documentation/networkx-1.9.1/examples/drawing/weighted_graph.html 45 | -------------------------------------------------------------------------------- /exercise3/wpgraph.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nick-ulle/2015-python/18e324ea50435d94c72ec853eec56ea6dfa5fe05/exercise3/wpgraph.pdf -------------------------------------------------------------------------------- /exercise3/wpgraph.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Building a graph of characters from the novel War and Peace 3 | ''' 4 | 5 | import re 6 | from itertools import combinations 7 | from collections import Counter 8 | 9 | import matplotlib.pyplot as plt 10 | import networkx as nx 11 | 12 | 13 | namepattern = re.compile(r''' 14 | (? 30] 78 | esmall=[(u,v) for (u,v,d) in G.edges(data=True) if d['weight'] > 10] 79 | 80 | nx.draw_networkx_edges(G, pos, edge_color='g', alpha=0.05) 81 | nx.draw_networkx_edges(G, pos, edgelist=esmall, edge_color='g', 82 | alpha=0.3) 83 | nx.draw_networkx_edges(G, pos, edgelist=elarge, edge_color='b', 84 | alpha=0.6, width=2) 85 | 86 | plt.axis('off') 87 | plt.savefig('wpgraph.pdf') 88 | -------------------------------------------------------------------------------- /exercise4/convert.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | ''' Utility to convert the supermarket data to a SQLite database. 3 | 4 | The directory structure should be 5 | 6 | convert.py 7 | tables/ 8 | +-- supermarket_distances 9 | +-- supermarket_prices 10 | +-- supermarket_purchases 11 | 12 | This script can then be executed using 13 | 14 | python3 convert.py 15 | 16 | in the same directory. 17 | ''' 18 | 19 | import os 20 | 21 | import pandas as pd 22 | import sqlalchemy as sa 23 | 24 | def main(): 25 | ''' Convert the supermarket data to a SQLite database. 26 | ''' 27 | file_dir = 'tables' 28 | files = os.listdir(file_dir) 29 | 30 | engine = sa.create_engine('sqlite:///supermarket.sqlite') 31 | 32 | for f in files: 33 | print('Writing table {}...'.format(f), end='', flush=True) 34 | 35 | # Read the data in 500,000-row chunks. 36 | data_chunker = pd.read_table(os.path.join(file_dir, f), sep=' ', 37 | chunksize=5e5) 38 | data_chunks = iter(data_chunker) 39 | 40 | # Write the first chunk to the database; append the remaining chunks. 41 | # Use 1000-row commits. 
42 | next(data_chunks).to_sql(f, engine, index=False, chunksize=1000) 43 | for chunk in data_chunks: 44 | chunk.to_sql(f, engine, if_exists='append', index=False, 45 | chunksize=1000) 46 | 47 | print('done!') 48 | 49 | if __name__ == '__main__': 50 | main() 51 | 52 | -------------------------------------------------------------------------------- /exercise4/exercise4.md: -------------------------------------------------------------------------------- 1 | 2 | # Exercise 4 - Supermarket Data Analysis 3 | 4 | In this exercise, you'll use Python to analyze data from the Italian 5 | supermarket chain [Coop][]. 6 | 7 | The goals for this exercise are: 8 | 9 | * Experience working with a large data set in Python 10 | * Examine when SQL databases are appropriate 11 | * Use `pandas` and `SQLAlchemy` 12 | 13 | This exercise is deliberately challenging and open-ended so that you can 14 | experiment with all of the tools you've learned. 15 | 16 | ## Data 17 | 18 | Download the data [here][data]. 19 | 20 | The data records the supermarket purchases of 60,366 customers between January 21 | 2007 and December 2011. More details, and a description of the data, can be 22 | found [here][desc]. This data was generously provided by the paper 23 | 24 | > Pennacchioli, D., Coscia, M., Rinzivillo, S., Pedreschi, D. and Giannotti, F. 25 | > (2013), "[Explaining the Product Range Effect in Purchase Data][paper]," Big 26 | > Data 2013 Conference. 27 | 28 | The data is distributed as three space-delimited text files, in a zip archive. 29 | After extracting the files, you're encouraged to convert these files into a 30 | SQLite database. 31 | 32 | You can do this using the provided script, `convert.py`. The header of the 33 | script includes instructions for use. It may take up to 30 minutes, and 34 | requires 1 GB of disk space. 35 | 36 | [Coop]: https://en.wikipedia.org/wiki/Coop_%28Italy%29 37 | [data]: http://michelecoscia.com/wp-content/uploads/2013/02/supermarket_data.zip 38 | [desc]: http://www.michelecoscia.com/?page_id=379 39 | [paper]: http://www.michelecoscia.com/wp-content/uploads/2013/09/geocoop.pdf 40 | 41 | ## Tasks 42 | 43 | Determine which customer spent the most, and which customer bought the most 44 | items. How much did they spend, in USD? Note that the prices are given in 45 | Euros (as a bonus, use `requests` to fetch the current exchange rate). 46 | 47 | Compute the total spending for each customer, and make a histogram. Does a 48 | gamma distribution fit well? If not, try to find a distribution that does. 49 | Keep in mind that total spending is theoretically unbounded. 50 | 51 | Analyze the network of customers and stores within short walking distance (say, 52 | less than 2500 meters). Are there any isolated customers or stores? Do spending 53 | habits differ for customers with few stores in walking distance? It may be 54 | helpful to use `networkx` here. 55 | 56 | -------------------------------------------------------------------------------- /feedback/feedback.csv: -------------------------------------------------------------------------------- 1 | Timestamp,The pace of this course was:,I would recommend this course to others.,The Python packages presented were relevant to my interests.,"Overall, this course helped me learn Python.",The teaching style of this course was effective.,What did you like about this course?,What would you change about this course?,I would like to see...,Additional feedback:,What topics would you like to see most in the final week? 
2 | 1/29/2015 14:06:44,3,5,4,5,5,,,,,"Statistical Modelling (scipy, statsmodels), Databases (pandas, sqlalchemy), Functional Programming (iterators, generators, lazy evaluation)" 3 | 1/29/2015 14:07:05,4,5,5,5,4,,,,,"More Web Scraping (lxml2, selenium), Machine Learning (pylearn, scikit-learn), Databases (pandas, sqlalchemy)" 4 | 1/29/2015 14:10:35,4,5,4,3,4,,,,,"Machine Learning (pylearn, scikit-learn), Databases (pandas, sqlalchemy), Functional Programming (iterators, generators, lazy evaluation)" 5 | 1/29/2015 14:17:33,4,4,4,5,3,,,,,"Statistical Modelling (scipy, statsmodels), Machine Learning (pylearn, scikit-learn), Databases (pandas, sqlalchemy)" 6 | 1/29/2015 14:18:03,3,5,5,5,5,,,,,"Statistical Modelling (scipy, statsmodels), Machine Learning (pylearn, scikit-learn), Databases (pandas, sqlalchemy)" 7 | 1/29/2015 14:20:37,3,5,5,4,5,"Great course! Thanks Guys. Didn't get to put as much into it as I wanted because I'm in the middle of writing thesis. But, overall, great class. 8 | 9 | It is very, very refreshing to find people willing to teach, that have passion and insight into their fields, and exercise both expertise, relatability and patience. 10 | 11 | Thanks! 12 | 13 | How about a course on big ""geodata"" ???",,,,"Databases (pandas, sqlalchemy), More Mapping (cartopy, kartograph), Cat Pictures" 14 | 1/29/2015 14:33:39,3,4,4,4,4,,,,,"More Web Scraping (lxml2, selenium), Machine Learning (pylearn, scikit-learn), Databases (pandas, sqlalchemy)" 15 | 1/29/2015 14:42:19,2,3,3,4,4,,"I hope it is a faster-paced course that goes over more packages and functions, rather than elaborating on a few of them.",,,"More Web Scraping (lxml2, selenium), Functional Programming (iterators, generators, lazy evaluation), Testing (doctest, unittest, nose)" 16 | 1/29/2015 14:46:01,3,5,5,4,5,Yes,yes,,Yes,"Machine Learning (pylearn, scikit-learn), Databases (pandas, sqlalchemy), Object-oriented Programming (classes)" 17 | 1/29/2015 14:50:48,4,4,4,4,4,,,,,"Statistical Modelling (scipy, statsmodels), Machine Learning (pylearn, scikit-learn), Databases (pandas, sqlalchemy)" 18 | 1/29/2015 15:01:52,3,5,5,5,5,,,,Thanks for doing this!,"Statistical Modelling (scipy, statsmodels), Machine Learning (pylearn, scikit-learn), More Plotting (vincent, bokeh)" 19 | 1/29/2015 15:34:02,3,3,3,3,3,,,,,"Functional Programming (iterators, generators, lazy evaluation), Object-oriented Programming (classes), Testing (doctest, unittest, nose)" 20 | 1/29/2015 15:42:56,4,5,5,5,5,,,,,"Machine Learning (pylearn, scikit-learn), Databases (pandas, sqlalchemy), More Plotting (vincent, bokeh)" 21 | 1/29/2015 16:06:07,3,5,5,4,5,The presentations were concise and interactive. ,A lot more could be covered in 12 weeks. A credited course in the stats department could offer better choices for students interested in Data Science.,,,"More Web Scraping (lxml2, selenium), Statistical Modelling (scipy, statsmodels), Databases (pandas, sqlalchemy)" 22 | 1/29/2015 16:09:49,3,5,5,5,5,"The class definitely broadened my knowledge about python libraries. I like the zen of python, by the way ^.^",I myself is a CS student so I've feel good to follow the class. 
But when it comes to some complexed part I would say it may be a bit difficult to catch for students with less experience coding.,,Thank you guys so much for organizing this workshop!,"Machine Learning (pylearn, scikit-learn), Functional Programming (iterators, generators, lazy evaluation), Testing (doctest, unittest, nose)" 23 | 1/29/2015 16:17:55,5,5,5,4,4,Interactive teaching,"Stop and ask if students have questions. Often, students are too busy typing the code instead of thinking and asking. Give them time to type the code and think. However, this can't be achieved with the given time constraint. Maybe extend to 5 weeks or discuss one topics in two days.",,,"Databases (pandas, sqlalchemy), Object-oriented Programming (classes), Testing (doctest, unittest, nose)" 24 | 1/29/2015 16:36:03,4,4,4,4,4,Examples were interesting and helpful,Sometimes it was a bit too fast,,"Wish the course was longer, would like to learn more and do more in-depth examples","Statistical Modelling (scipy, statsmodels), Machine Learning (pylearn, scikit-learn), Databases (pandas, sqlalchemy)" 25 | 1/29/2015 16:58:12,4,5,5,5,5,,,,,"Databases (pandas, sqlalchemy), Functional Programming (iterators, generators, lazy evaluation), Object-oriented Programming (classes)" 26 | 1/29/2015 19:00:55,3,5,4,4,4,the atmosphere of study,elongate the length of course,,,"More Web Scraping (lxml2, selenium), Machine Learning (pylearn, scikit-learn), Object-oriented Programming (classes)" 27 | 1/29/2015 20:13:17,4,5,4,5,4,,,,,"Machine Learning (pylearn, scikit-learn), Functional Programming (iterators, generators, lazy evaluation), Object-oriented Programming (classes)" 28 | 1/29/2015 20:23:55,3,5,5,5,5,"- We used real world examples (e.g. USDA data and food truck data), not some toy data that was handed to us. 29 | - The topics of the course seemed very useful. 30 | - Some knowledge of programming was assumed, but not too much. As a result, we didn't spend time writing a ""Hello world"" program and a program to convert between Celsius and Fahrenheit. Also, we learned about object types, Python notation, etc. by doing rather than by being told and expected to remember. ","Overall, I think the course was great. My only comment is that I think some of the stuff we've learned in the 2nd and 3rd weeks, such as the structure of scripts and functions, should have been introduced in the first week. ",,,"More Web Scraping (lxml2, selenium), Machine Learning (pylearn, scikit-learn), Databases (pandas, sqlalchemy)" 31 | 1/30/2015 9:44:06,2,4,5,4,3,,,,,"More Web Scraping (lxml2, selenium), Statistical Modelling (scipy, statsmodels), Machine Learning (pylearn, scikit-learn)" 32 | 1/30/2015 9:58:47,2,,,3,3,,,,,"Machine Learning (pylearn, scikit-learn), Functional Programming (iterators, generators, lazy evaluation), Object-oriented Programming (classes)" 33 | 1/30/2015 11:59:15,3,5,4,5,5,,,,,"Machine Learning (pylearn, scikit-learn), More Plotting (vincent, bokeh), Functional Programming (iterators, generators, lazy evaluation)" 34 | 1/30/2015 12:03:30,4,4,4,4,4,,,,,"Statistical Modelling (scipy, statsmodels), Machine Learning (pylearn, scikit-learn), Databases (pandas, sqlalchemy)" 35 | 1/30/2015 13:55:45,4,4,4,4,4,,,,,"Statistical Modelling (scipy, statsmodels), Machine Learning (pylearn, scikit-learn), Databases (pandas, sqlalchemy)" 36 | 1/30/2015 16:34:58,4,5,5,5,4,"I liked that this course was accessible and open: undergraduate students could audit the course, materials available on GitHub, video recordings of courses, Piazza forum. 
The instructors were very receptive to questions and provided a safe learning environment. There was also a variety of different tools and applications, from data viz to data scraping, so that people from different fields and with various skill-levels could approach data analysis.",I suggest break-out sections for students to work on coding together. Maybe discuss or demo a tool/topic for the first-half. Then students can break out into different tables and work on the assignment/own programs/ applications using the tool. ,,"Many thanks for offering this course! I felt greater confidence about Python and coding after this course. After exposure to all the new libraries and packages, I am inspired with all the new methods of data analysis. ","More Web Scraping (lxml2, selenium), Machine Learning (pylearn, scikit-learn), Testing (doctest, unittest, nose)" -------------------------------------------------------------------------------- /iid_talk/Makefile: -------------------------------------------------------------------------------- 1 | slides.html: slides.mdown 2 | pandoc -t revealjs -s $< -o $@ 3 | 4 | view: slides.html 5 | open $< 6 | -------------------------------------------------------------------------------- /iid_talk/iidata workshop.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Python for R Users\n", 8 | "\n", 9 | "## Clark Fitzgerald \n", 10 | "\n", 11 | "`@clarkfitzg`\n", 12 | "\n", 13 | "## UC Davis iidata\n", 14 | "\n", 15 | "May 21, 2016\n", 16 | "\n", 17 | "![](images/python-logo-master-v3-TM.png)" 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "# Basics of Python\n", 25 | "\n", 26 | "- High level dynamic language\n", 27 | "- Open Source\n", 28 | "- Used everywhere for everything :)" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "![](images/Rlogo.png)\n", 36 | "\n", 37 | "# Differences\n", 38 | "\n", 39 | "R grew from statistics -> data science\n", 40 | "\n", 41 | "Python evolved from general purpose -> data science\n", 42 | "\n", 43 | "It's good to know both!" 
44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 1, 49 | "metadata": { 50 | "collapsed": false 51 | }, 52 | "outputs": [ 53 | { 54 | "name": "stdout", 55 | "output_type": "stream", 56 | "text": [ 57 | "The Zen of Python, by Tim Peters\n", 58 | "\n", 59 | "Beautiful is better than ugly.\n", 60 | "Explicit is better than implicit.\n", 61 | "Simple is better than complex.\n", 62 | "Complex is better than complicated.\n", 63 | "Flat is better than nested.\n", 64 | "Sparse is better than dense.\n", 65 | "Readability counts.\n", 66 | "Special cases aren't special enough to break the rules.\n", 67 | "Although practicality beats purity.\n", 68 | "Errors should never pass silently.\n", 69 | "Unless explicitly silenced.\n", 70 | "In the face of ambiguity, refuse the temptation to guess.\n", 71 | "There should be one-- and preferably only one --obvious way to do it.\n", 72 | "Although that way may not be obvious at first unless you're Dutch.\n", 73 | "Now is better than never.\n", 74 | "Although never is often better than *right* now.\n", 75 | "If the implementation is hard to explain, it's a bad idea.\n", 76 | "If the implementation is easy to explain, it may be a good idea.\n", 77 | "Namespaces are one honking great idea -- let's do more of those!\n" 78 | ] 79 | } 80 | ], 81 | "source": [ 82 | "# Come for the language, stay for the community\n", 83 | "import this" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": 2, 89 | "metadata": { 90 | "collapsed": false 91 | }, 92 | "outputs": [ 93 | { 94 | "name": "stdout", 95 | "output_type": "stream", 96 | "text": [ 97 | "hello world!\n" 98 | ] 99 | } 100 | ], 101 | "source": [ 102 | "print('hello world!')" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "metadata": {}, 108 | "source": [ 109 | "## Goals:\n", 110 | "\n", 111 | "- Expose you to the language\n", 112 | "- Show you how to start\n", 113 | "\n", 114 | "First download and install the Anaconda distribution for Python 3.5: https://www.continuum.io/downloads\n", 115 | "\n", 116 | "You're looking at the Ipython Notebook" 117 | ] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "metadata": {}, 122 | "source": [ 123 | "Many skills / ideas transfer between R and Python." 
124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": 3, 129 | "metadata": { 130 | "collapsed": true 131 | }, 132 | "outputs": [], 133 | "source": [ 134 | "# A few preliminaries\n", 135 | "\n", 136 | "# Lets us plot\n", 137 | "%matplotlib inline\n", 138 | "\n", 139 | "# Import the data analysis libraries we'll use\n", 140 | "import matplotlib.pyplot as plt\n", 141 | "import numpy as np\n", 142 | "import pandas as pd" 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "metadata": {}, 148 | "source": [ 149 | "## Dots `.` are important!\n", 150 | "\n", 151 | "Dots are used for namespace lookups" 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": 4, 157 | "metadata": { 158 | "collapsed": false 159 | }, 160 | "outputs": [], 161 | "source": [ 162 | "# Look in pandas and find the read_csv function\n", 163 | "apt = pd.read_csv('/Users/clark/projects/sts98/sts98notes/discussion/clark/cl_apartments.csv')" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": 5, 169 | "metadata": { 170 | "collapsed": false 171 | }, 172 | "outputs": [ 173 | { 174 | "data": { 175 | "text/plain": [ 176 | "(18084, 22)" 177 | ] 178 | }, 179 | "execution_count": 5, 180 | "metadata": {}, 181 | "output_type": "execute_result" 182 | } 183 | ], 184 | "source": [ 185 | "apt.shape" 186 | ] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "metadata": {}, 191 | "source": [ 192 | "## Subsetting in Python\n", 193 | "\n", 194 | "Just remember to start counting at 0!" 195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": 6, 200 | "metadata": { 201 | "collapsed": false 202 | }, 203 | "outputs": [ 204 | { 205 | "data": { 206 | "text/html": [ 207 | "
\n", 208 | "
\n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | "
titledate_postedprice
0$995 / 3br - 1350ft2 - Lancaster Apartment Uni...2016-04-13 15:34:06995.0
1$1935 / 2br - 1154ft2 - No place like The Colo...2016-04-16 17:55:451935.0
2$1825 / 2br - 1056ft2 - No place like The Colo...2016-04-16 17:58:011825.0
3$650 / 1br - FURNISHED 1 BED GUEST QUARTERS, D...2016-04-17 21:26:15650.0
4$1599 / 2br - 951ft2 - Big Savings!2016-04-28 11:58:281599.0
5$1335 / 1br - 701ft2 - Looking for your next h...2016-04-28 13:12:201335.0
\n", 256 | "" 257 | ], 258 | "text/plain": [ 259 | " title date_posted \\\n", 260 | "0 $995 / 3br - 1350ft2 - Lancaster Apartment Uni... 2016-04-13 15:34:06 \n", 261 | "1 $1935 / 2br - 1154ft2 - No place like The Colo... 2016-04-16 17:55:45 \n", 262 | "2 $1825 / 2br - 1056ft2 - No place like The Colo... 2016-04-16 17:58:01 \n", 263 | "3 $650 / 1br - FURNISHED 1 BED GUEST QUARTERS, D... 2016-04-17 21:26:15 \n", 264 | "4 $1599 / 2br - 951ft2 - Big Savings! 2016-04-28 11:58:28 \n", 265 | "5 $1335 / 1br - 701ft2 - Looking for your next h... 2016-04-28 13:12:20 \n", 266 | "\n", 267 | " price \n", 268 | "0 995.0 \n", 269 | "1 1935.0 \n", 270 | "2 1825.0 \n", 271 | "3 650.0 \n", 272 | "4 1599.0 \n", 273 | "5 1335.0 " 274 | ] 275 | }, 276 | "execution_count": 6, 277 | "metadata": {}, 278 | "output_type": "execute_result" 279 | } 280 | ], 281 | "source": [ 282 | "small = apt.loc[:5, ['title', 'date_posted', 'price']]\n", 283 | "small" 284 | ] 285 | }, 286 | { 287 | "cell_type": "code", 288 | "execution_count": 8, 289 | "metadata": { 290 | "collapsed": false 291 | }, 292 | "outputs": [ 293 | { 294 | "data": { 295 | "text/plain": [ 296 | "[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]" 297 | ] 298 | }, 299 | "execution_count": 8, 300 | "metadata": {}, 301 | "output_type": "execute_result" 302 | } 303 | ], 304 | "source": [ 305 | "[x**2 for x in range(10)]" 306 | ] 307 | }, 308 | { 309 | "cell_type": "markdown", 310 | "metadata": {}, 311 | "source": [ 312 | "## Object Oriented\n", 313 | "Dots are used to call methods" 314 | ] 315 | }, 316 | { 317 | "cell_type": "code", 318 | "execution_count": 9, 319 | "metadata": { 320 | "collapsed": false 321 | }, 322 | "outputs": [ 323 | { 324 | "data": { 325 | "text/plain": [ 326 | "2338.7185471233488" 327 | ] 328 | }, 329 | "execution_count": 9, 330 | "metadata": {}, 331 | "output_type": "execute_result" 332 | } 333 | ], 334 | "source": [ 335 | "apt.price.mean()" 336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": 11, 341 | "metadata": { 342 | "collapsed": false 343 | }, 344 | "outputs": [ 345 | { 346 | "data": { 347 | "text/plain": [ 348 | "dtype('float64')" 349 | ] 350 | }, 351 | "execution_count": 11, 352 | "metadata": {}, 353 | "output_type": "execute_result" 354 | } 355 | ], 356 | "source": [ 357 | "apt.price." 
358 | ] 359 | }, 360 | { 361 | "cell_type": "code", 362 | "execution_count": 10, 363 | "metadata": { 364 | "collapsed": false 365 | }, 366 | "outputs": [ 367 | { 368 | "data": { 369 | "text/plain": [ 370 | "" 371 | ] 372 | }, 373 | "execution_count": 10, 374 | "metadata": {}, 375 | "output_type": "execute_result" 376 | }, 377 | { 378 | "data": { 379 | "image/png": [ 380 | "iVBORw0KGgoAAAANSUhEUgAAAZ0AAAEACAYAAABoJ6s/AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\n", 381 | "AAALEgAACxIB0t1+/AAAFn9JREFUeJzt3X/wXXV95/HnSyIKNYXNuA0/EgtOwy5ptWpa4251zFaL\n", 382 | "0XWATh3AKpO2Gbszseq6O1ZS7ZJ/1tHu+AOnA9MuKEFrahSlOJthiazZ8Y+FWAWNBiR0TUu+muCi\n", 383 | "BbRdm8h7/7ifr7lmvyE35Hs/N9/L8zFz5/u5n3PO/Xw+HL73lfM553tOqgpJknp42qQ7IEl66jB0\n", 384 | "JEndGDqSpG4MHUlSN4aOJKkbQ0eS1M3YQifJR5IcSLJrqO6/JLk3yVeTfCbJGUPLNibZk+S+JBcN\n", 385 | "1a9Ksqstu2ao/hlJPtnq70zy8+MaiyRpfozzSOejwNoj6m4HfrGqfhm4H9gIkGQlcDmwsm1zbZK0\n", 386 | "ba4D1lfVCmBFktnPXA883Oo/CLxvjGORJM2DsYVOVX0R+P4Rddur6vH29i5gWStfAmypqoNVtRd4\n", 387 | "AFid5GxgcVXtbOvdBFzayhcDm1v5ZuAVYxmIJGneTPKczu8B21r5HGDf0LJ9wLlz1M+0etrPBwGq\n", 388 | "6hDwSJIl4+ywJOnETCR0krwL+Keq+sQk2pckTcai3g0m+R3gNfz0dNgMsHzo/TIGRzgzHJ6CG66f\n", 389 | "3eY5wLeTLALOqKrvzdGeN5eTpCehqnLstY5P19BpFwG8A3h5Vf3foUW3Ap9I8gEG02YrgJ1VVUke\n", 390 | "TbIa2AlcCXx4aJt1wJ3A64A7jtbuOP7DnSySbKqqTZPux7g4voVrmscGT4nxjeUf7GMLnSRbgJcD\n", 391 | "z07yIHA1g6vVTgW2t4vT/ldVbaiq3Um2AruBQ8CGOnz76w3AjcBpwLaquq3V3wB8LMke4GHginGN\n", 392 | "RZI0P8YWOlX1+jmqP/IE678HeM8c9V8GnjdH/Y+Ay06kj5KkvrwjwcK3Y9IdGLMdk+7AmO2YdAfG\n", 393 | "aMekOzBmOybdgYUo0/4QtzYv+YaOTf5NVd3VsT1JmndJahznw58ioXPpY31a+9bT4X/fXPXoG/u0\n", 394 | "J0njMa7Q6X7J9GR8dnGfdq4H3uGUpSQdhV+QkqRuDB1JUjeGjiSpG0NHktSNoSNJ6sbQkSR1Y+hI\n", 395 | "kroxdCRJ3Rg6kqRuDB1JUjeGjiSpG0NHktSNoSNJ6sbQkSR1Y+hIkroxdCRJ3Rg6kqRuDB1JUjeG\n", 396 | "jiSpG0NHktSNoSNJ6sbQkSR1Y+hIkroxdCRJ3YwtdJJ8JMmBJLuG6pYk2Z7k/iS3JzlzaNnGJHuS\n", 397 | "3JfkoqH6VUl2tWXXDNU/I8knW/2dSX5+XGORJM2PcR7pfBRYe0TdVcD2qroAuKO9J8lK4HJgZdvm\n", 398 | "2iRp21wHrK+qFcCKJLOfuR54uNV/EHjfGMciSZoHYwudqvoi8P0jqi8GNrfyZuDSVr4E2FJVB6tq\n", 399 | "L/AAsDrJ2cDiqtrZ1rtpaJvhz7oZeMW8D0KSNK96n9NZWlUHWvkAsLSVzwH2Da23Dzh3jvqZVk/7\n", 400 | "+SBAVR0CHkmyZEz9liTNg4ldSFBVBdSk2pck9beoc3sHkpxVVfvb1NlDrX4GWD603jIGRzgzrXxk\n", 401 | "/ew2zwG+nWQRcEZVfW/uZjcNlde0lyRpVpI1dPhy7B06twLrGJz0XwfcMlT/iSQfYDBttgLYWVWV\n", 402 | "5NEkq4GdwJXAh4/4rDuB1zG4MOEoNs33OCRpqlTVDmDH7PskV4+jnbGFTpItwMuBZyd5EPhPwHuB\n", 403 | "rUnWA3uBywCqaneSrcBu4BCwoU2/AWwAbgROA7ZV1W2t/gbgY0n2AA8DV4xrLJKk+ZHD3+3TKUn1\n", 404 | "O3V0PfCOLVXf/+1ODUrSWCSpqsqx1zw+3pFAktSNoSNJ6sbQkSR1Y+hIkroxdCRJ3Rg6kqRuDB1J\n", 405 | "UjeGjiSpG0NHktSNoSNJ6sbQkSR1Y+hIkroxdCRJ3Rg6kqRuDB1JUjeGjiSpG0NHktSNoSNJ6sbQ\n", 406 | "kSR1Y+hIkroxdCRJ3Rg6kqRuDB1JUjeGjiSpG0NHktSNoSNJ6sbQkSR1Y+hIkroxdCRJ3UwkdJK8\n", 407 | "PcnXk+xK8okkz0iyJMn2JPcnuT3JmUPrb0yyJ8l9SS4aql/VPmNPkmsmMRZJ0ui6h06Sc4G3AKuq\n", 408 | "6nnAKcAVwFXA9qq6ALijvSfJSuByYCWwFrg2SdrHXQesr6oVwIoka7sORpJ0XCY1vbYIOD3JIuB0\n", 409 | "4NvAxcDmtnwzcGkrXwJsqaqDVbUXeABYneRsYHFV7Wzr3TS0jSTpJNQ9dKpqBng/8HcMwubvq2o7\n", 410 | "sLSqDrTVDgBLW/kcYN/QR+wDzp2jfqbVS5JOUot6N5jknzE4qjkPeAT4VJI3Dq9TVZWk5q/VTUPl\n", 411 | "Ne0lSZqVZA0dvhy7hw7wSuBbVfUwQJLPAP8K2J/krKra36bOHmrrzwDLh7ZfxuAIZ6aVh+tn5m5y\n", 412 | "0zx2X5KmT1XtAHbMvk9y9TjamcQ5nb8FXpLktHZBwCuB3cDngHVtnXXALa18K3BFklOTnA+sAHZW\n", 413 | "1X7g0SSr2+dcObSNJOkk1P1Ip6p2Jvk08BXgUPv558BiYGuS9cBe4LK2/u4kWxkE0yFgQ1XNTr1t\n", 414 | "AG4ETgO2VdVtHYciSTpOOfz9PZ0G54Z6jfF64B1bqr7/250alKSxSFJVlWOveXy8I4EkqRtDR5LU\n", 415 | "jaEjSerG0JEkdWPoSJK6MXQkSd0YOpKkbgwdSVI3xwydJM/r0RFJ0vQb5UjnuiRfSrIhyRlj75Ek\n", 416 | 
"aWodM3Sq6qXAG4DnAF9JsmX4kdGSJI1qpHM6VXU/8G7gncDLgWuSfDPJb42zc5Kk6TLKOZ1fTvJB\n", 417 | "4F7g14HXVtWFwL8BPjjm/kmSpsgojzb4MHAD8K6q+ofZyqr6dpJ3j61nkqSpM0ro/FvgH6vqxwBJ\n", 418 | "TgGeWVU/rKqbxto7SdJUGeWczucZPCRt1unA9vF0R5I0zUYJnWdW1Q9m31TVYwyCR5Kk4zJK6Pww\n", 419 | "yarZN0l+BfjH8XVJkjStRjmn8++BrUm+096fDVw+vi5JkqbVMUOnqr6U5ELgXwAFfLOqDo69Z5Kk\n", 420 | "qTPKkQ7ArwDnt/VflASvXJMkHa9jhk6SjwPPBe4Bfjy0yNCRJB2XUY50VgErq6rG3RlJ0nQb5eq1\n", 421 | "rzO4eECSpBMyypHOPwd2J9kJ/KjVVVVdPL5uSZKm0Sihs6n9LCBDZUmSjssol0zvSHIe8AtV9fkk\n", 422 | "p4+ynSRJRxrl0Qa/D3wK+LNWtQz47Dg7JUmaTqNcSPBm4KXAo/CTB7r93Ik0muTMJJ9Ocm+S3UlW\n", 423 | "J1mSZHuS+5PcnuTMofU3JtmT5L7hp5YmWZVkV1t2zYn0SZI0fqOEzo+qavYCApIs4sTP6VwDbGsP\n", 424 | "g3s+cB9wFbC9qi4A7mjvSbKSwW13VgJrgWuTzJ5bug5YX1UrgBVJ1p5gvyRJYzRK6PzPJO8CTk/y\n", 425 | "Gwym2j73ZBtMcgbwsqr6CEBVHaqqR4CLgc1ttc3Apa18CbClqg5W1V7gAWB1krOBxVW1s61309A2\n", 426 | "kqST0CihcxXwXWAX8O+AbcCJPDH0fOC7ST6a5CtJ/muSnwGWVtWBts4BYGkrnwPsG9p+H3DuHPUz\n", 427 | "rV6SdJIa5eq1HwN/3l7z1eaLgD9oNxP9EG0qbajNSjKPl2VvGiqvaS9J0qwka+jw5TjKvde+NUd1\n", 428 | "VdVzn2Sb+4B9VfWl9v7TwEZgf5Kzqmp/mzp7qC2fAZYPbb+sfcZMKw/Xz8zd5KYn2VVJemqoqh3A\n", 429 | "jtn3Sa4eRzujTK/96tDrZQwuAviLJ9tgVe0HHkxyQat6JfANBueJ1rW6dcAtrXwrcEWSU5OcD6wA\n", 430 | "drbPebRd+RbgyqFtJEknoVGm1/7PEVUfSvIV4I9PoN23AH+R5FTgb4DfBU5h8LC49cBe4LLW/u4k\n", 431 | "W4HdwCFgw9DNRzcANwKnMbga7rYT6JMkacxGmV5bxeFLpJ/G4Nk6p5xIo1X1VQZHTkd65VHWfw/w\n", 432 | "njnqvww870T6IknqZ5Tb2byfw6FziKGjEM3l71+f5PU9W6yqHHstSZq8UabX1nTox5TpeT9U80bS\n", 433 | "wjHK9Np/5P//Fv3J3aar6gPz3itJ0lQa9cmhv8rgKrIArwW+BNw/xn5JkqbQKKGzHHhRVT0GP7l2\n", 434 | "e1tVvWGsPZMkTZ1R/k7n54CDQ+8PcoJ3mZYkPTWNcqRzE7AzyWcYTK9dyuEbc0qSNLJRrl77z0lu\n", 435 | "Y/BMHYDfqaq7x9stSdI0GmV6DeB04LGqugbY125HI0nScRnlcdWbgD/k8J2gTwU+PsY+SZKm1ChH\n", 436 | "Or/J4EFqPwSoqhlg8Tg7JUmaTqM+rvrx2TftgWuSJB23UULnU0n+DDgzye8DdwDXj7dbkqRp9IRX\n", 437 | "r7Xn1HwS+JfAY8AFwB9X1fYOfZMkTZlR/k5nW1X9EnD7uDsjSZpuTzi91h6W9uUkL+7UH0nSFBvl\n", 438 | "SOclwBuT/C3tCjYGefT88XVLkjSNjho6SZ5TVX8HvIrBow18cIsk6YQ80ZHOXwEvrKq9SW6uqt/q\n", 439 | "1SlJ0nQa9TY4zx1rLyRJTwmjho4kSSfsiabXnp/ksVY+bagMgwsJfnaM/ZIkTaGjhk5VndKzI5Kk\n", 440 | "6ef0miSpG0NHktSNoSNJ6sbQkSR1Y+hIkrqZWOgkOSXJ3Uk+194vSbI9yf1Jbk9y5tC6G5PsSXJf\n", 441 | "kouG6lcl2dWWXTOJcUiSRjfJI523AbsZ3NcN4Cpge1VdwOBBcVcBJFkJXA6sBNYC17bn/ABcB6yv\n", 442 | "qhXAiiRrO/ZfknScJhI6SZYBr2HwBNLZALkY2NzKm4FLW/kSYEtVHayqvcADwOokZwOLq2pnW++m\n", 443 | "oW0kSSehSR3pfBB4B/D4UN3SqjrQygeApa18DrBvaL19wLlz1M+0eknSSWqU5+nMqySvBR6qqruT\n", 444 | "rJlrnaqqJDXXsidn01B5TXtJkma17+M1426ne+gA/xq4OMlrgGcCP5vkY8CBJGdV1f42dfZQW38G\n", 445 | "WD60/TIGRzgzrTxcPzN3k5vms/+SNHWqagewY/Z9kqvH0U736bWq+qOqWl5V5wNXAP+jqq4EbgXW\n", 446 | "tdXWAbe08q3AFUlOTXI+sALYWVX7gUeTrG4XFlw5tI0k6SQ0iSOdI81Oo70X2JpkPbAXuAygqnYn\n", 447 | "2crgSrdDwIaqmt1mA3AjcBqwrapu69hvSdJxyuHv7+k0ODfUa4zXA2+iX3sAoap8lLikeZWkxvHd\n", 448 | "4h0JJEndGDqSpG4MHUlSN4aOJKkbQ0eS1I2hI0nqxtCRJHVj6EiSujF0JEndGDqSpG4MHUlSN4aO\n", 449 | "JKkbQ0eS1I2hI0nqxtCRJHVj6EiSujF0JEndGDqSpG4MHUlSN4aOJKkbQ0eS1I2hI0nqxtCRJHVj\n", 450 | "6EiSujF0JEndGDqSpG4MHUlSN4aOJKmb7qGTZHmSLyT5RpKvJ3lrq1+SZHuS+5PcnuTMoW02JtmT\n", 451 | "5L4kFw3Vr0qyqy27pvdYJEnHZxJHOgeBt1fVLwIvAd6c5ELgKmB7VV0A3NHek2QlcDmwElgLXJsk\n", 452 | "7bOuA9ZX1QpgRZK1fYciSToe3UOnqvZX1T2t/APgXuBc4GJgc1ttM3BpK18CbKmqg1W1F3gAWJ3k\n", 453 | "bGBxVe1s6900tI0k6SQ00XM6Sc4DXgjcBSytqgNt0QFgaSufA+wb2mwfg5A6sn6m1UuSTlITC50k\n", 454 | "zwJuBt5WVY8NL6uqAmoiHZMkjc2iSTSa5OkMAudjVXVLqz6Q5Kyq2t+mzh5q9TPA8qHNlzE4wplp\n", 455 | "5eH6mblb3DRUXtNekqRZSdbQ4csxg4OKftpFAJuBh6vq7UP1f9Lq3pfkKuDMqrqqXUjwCeDFDKbP\n", 456 | 
"Pg/8QlVVkruAtwI7gf8GfLiqbjuivep30HQ98Cb6HqSFqsqx15Ok0SWpcXy3TOJI59eANwJfS3J3\n", 457 | "q9sIvBfYmmQ9sBe4DKCqdifZCuwGDgEb6nBSbgBuBE4Dth0ZOJKkk0v3I53ePNKRpOM3riMd70gg\n", 458 | "SerG0JEkdWPoSJK6MXQkSd0YOpKkbgwdSVI3ho4kqRtDR5LUjaEjSerG0JEkdWPoSJK6MXQkSd0Y\n", 459 | "OpKkbgwdSVI3ho4kqRtDR5LUjaEjSerG0JEkdWPoSJK6MXQkSd0YOpKkbgwdSVI3ho4kqRtDR5LU\n", 460 | "jaEjSepm0aQ7oBOXpHq2V1Xp2Z6k6WHoTIWemWPeSHrynF6TJHWz4EMnydok9yXZk+Sdk+6PJOno\n", 461 | "FnToJDkF+FNgLbASeH2SCyfbK82nJGsm3YdxmubxTfPYYPrHNy4LOnSAFwMPVNXeqjoI/CVwyYT7\n", 462 | "pPm1ZtIdGLM1k+7AGK2ZdAfGbM2kO7AQLfQLCc4FHhx6vw9YPaG+PGX0vloO2NS5PUljstBDZ8Qv\n", 463 | "v19/ZLzdmDVzKnBan7YmyavlJD05qer9j9b5k+QlwKaqWtvebwQer6r3Da2zcAcoSRM0jr/JW+ih\n", 464 | "swj4JvAK4NvATuD1VXXvRDsmSZrTgp5eq6pDSf4A+O/AKcANBo4knbwW9JGOJGlhWeiXTD+hhfqH\n", 465 | "o0n2JvlakruT7Gx1S5JsT3J/ktuTnDm0/sY2xvuSXDRUvyrJrrbsmkmMpfXjI0kOJNk1VDdv40ny\n", 466 | "jCSfbPV3Jvn5fqM76vg2JdnX9uHdSV49tGzBjC/J8iRfSPKNJF9P8tZWPxX77wnGNy3775lJ7kpy\n", 467 | "TxvfplY/uf1XVVP5YjDd9gBwHvB04B7gwkn3a8S+fwtYckTdnwB/2MrvBN7byivb2J7exvoAh49g\n", 468 | "dwIvbuVtwNoJjedlwAuBXeMYD7ABuLaVLwf+8iQY39XAf5hj3QU1PuAs4AWt/CwG51AvnJb99wTj\n", 469 | "m4r919o8vf1cBNzJ4M9KJrb/pvlIZ6H/4eiRV41cDGxu5c3Apa18CbClqg5W1V4G/5OsTnI2sLiq\n", 470 | "drb1bhrapquq+iLw/SOq53M8w591M4MLS7o5yvhg7uu9F9T4qmp/Vd3Tyj8A7mXw93FTsf+eYHww\n", 471 | "BfsPoKr+oRVPZRAmxQT33zSHzlx/OHruUdY92RTw+SR/neRNrW5pVR1o5QPA0lY+h8HYZs2O88j6\n", 472 | "GU6u8c/neH6yr6vqEPBIkiVj6vfxeEuSrya5YWj6YsGOL8l5DI7o7mIK99/Q+O5sVVOx/5I8Lck9\n", 473 | "DPbT7S04Jrb/pjl0FvIVEr9WVS8EXg28OcnLhhfW4Dh2IY/vp0zbeJrrgPOBFwDfAd4/2e6cmCTP\n", 474 | "YvCv2LdV1WPDy6Zh/7XxfZrB+H7AFO2/qnq8ql4ALGNw1PJLRyzvuv+mOXRmgOVD75fz00l90qqq\n", 475 | "77Sf3wU+y2Cq8ECSswDaoe5DbfUjx7mMwThnWnm4fma8PT8u8zGefUPbPKd91iLgjKr63vi6fmxV\n", 476 | "9VA1wPUM9iEswPEleTqDwPlYVd3Sqqdm/w2N7+Oz45um/Terqh4BvgC8ignuv2kOnb8GViQ5L8mp\n", 477 | "DE5w3TrhPh1TktOTLG7lnwEuAnYx6Pu6tto6YPaX/1bgiiSnJjkfWAHsrKr9wKNJVicJcOXQNieD\n", 478 | "+RjPX83xWa8D7ugxgCfSfpFn/SaDfQgLbHytLzcAu6vqQ0OLpmL/HW18U7T/nj07NZjkNOA3GJy3\n", 479 | "mtz+63kVRe8Xg+mpbzI4GbZx0v0Zsc/nM7h65B7g67P9BpYAnwfuB24Hzhza5o/aGO8DXjVUv4rB\n", 480 | "L8sDwIcnOKYtDO4Y8U8M5n5/dz7HAzwD2ArsYTAff96Ex/d7DE60fg34avuFXroQxwe8FHi8/f94\n", 481 | "d3utnZb9d5TxvXqK9t/zgK+0cewC3t3qJ7b//ONQSVI30zy9Jkk6yRg6kqRuDB1JUjeGjiSpG0NH\n", 482 | "ktSNoSNJ6sbQkSR1Y+hIkrr5f5BfUgwHbg6jAAAAAElFTkSuQmCC\n" 483 | ], 484 | "text/plain": [ 485 | "" 486 | ] 487 | }, 488 | "metadata": {}, 489 | "output_type": "display_data" 490 | } 491 | ], 492 | "source": [ 493 | "# We can plot\n", 494 | "apt.price.plot.hist()" 495 | ] 496 | }, 497 | { 498 | "cell_type": "markdown", 499 | "metadata": {}, 500 | "source": [ 501 | "## Excellent data analysis and scientific libraries\n", 502 | "\n", 503 | "### Example: Create a SQL database" 504 | ] 505 | }, 506 | { 507 | "cell_type": "code", 508 | "execution_count": 12, 509 | "metadata": { 510 | "collapsed": false 511 | }, 512 | "outputs": [], 513 | "source": [ 514 | "from sqlalchemy import create_engine\n", 515 | "\n", 516 | "engine = create_engine('sqlite:///apartment.db')\n", 517 | "apt.to_sql('apartment', engine)" 518 | ] 519 | }, 520 | { 521 | "cell_type": "markdown", 522 | "metadata": {}, 523 | "source": [ 524 | "# General Purpose data structures\n", 525 | "\n", 526 | "Many are available for your programming enjoyment." 
527 | ] 528 | }, 529 | { 530 | "cell_type": "markdown", 531 | "metadata": {}, 532 | "source": [ 533 | "## Dictionaries, Sets" 534 | ] 535 | }, 536 | { 537 | "cell_type": "code", 538 | "execution_count": 13, 539 | "metadata": { 540 | "collapsed": false, 541 | "scrolled": true 542 | }, 543 | "outputs": [ 544 | { 545 | "data": { 546 | "text/plain": [ 547 | "{'date_posted': '2016-04-13 15:34:06',\n", 548 | " 'price': 995.0,\n", 549 | " 'title': '$995 / 3br - 1350ft2 - Lancaster Apartment Unit Beautiful Remodeled 3 Bed, 2 Bath House in West Lancast (Lancaster)'}" 550 | ] 551 | }, 552 | "execution_count": 13, 553 | "metadata": {}, 554 | "output_type": "execute_result" 555 | } 556 | ], 557 | "source": [ 558 | "# Python is built on dictionaries\n", 559 | "# AKA hash tables\n", 560 | "\n", 561 | "# A dictionary from the apartment data\n", 562 | "d1 = small.iloc[0, :].to_dict()\n", 563 | "d1" 564 | ] 565 | }, 566 | { 567 | "cell_type": "code", 568 | "execution_count": 14, 569 | "metadata": { 570 | "collapsed": false 571 | }, 572 | "outputs": [ 573 | { 574 | "data": { 575 | "text/plain": [ 576 | "{'date_posted', 'price', 'title'}" 577 | ] 578 | }, 579 | "execution_count": 14, 580 | "metadata": {}, 581 | "output_type": "execute_result" 582 | } 583 | ], 584 | "source": [ 585 | "s = set(d1.keys())\n", 586 | "s" 587 | ] 588 | }, 589 | { 590 | "cell_type": "code", 591 | "execution_count": 15, 592 | "metadata": { 593 | "collapsed": false 594 | }, 595 | "outputs": [ 596 | { 597 | "data": { 598 | "text/plain": [ 599 | "True" 600 | ] 601 | }, 602 | "execution_count": 15, 603 | "metadata": {}, 604 | "output_type": "execute_result" 605 | } 606 | ], 607 | "source": [ 608 | "'price' in s" 609 | ] 610 | }, 611 | { 612 | "cell_type": "code", 613 | "execution_count": 16, 614 | "metadata": { 615 | "collapsed": false 616 | }, 617 | "outputs": [ 618 | { 619 | "data": { 620 | "text/plain": [ 621 | "{'prices': [12, 23, 43], 'title': 'davis', 'year': 2016}" 622 | ] 623 | }, 624 | "execution_count": 16, 625 | "metadata": {}, 626 | "output_type": "execute_result" 627 | } 628 | ], 629 | "source": [ 630 | "# Make a dictionary from scratch\n", 631 | "d2 = {'year': 2016, 'prices': [12, 23, 43], 'title': 'davis'}\n", 632 | "d2" 633 | ] 634 | }, 635 | { 636 | "cell_type": "markdown", 637 | "metadata": {}, 638 | "source": [ 639 | "## Tuples, Lists" 640 | ] 641 | }, 642 | { 643 | "cell_type": "code", 644 | "execution_count": 18, 645 | "metadata": { 646 | "collapsed": false 647 | }, 648 | "outputs": [ 649 | { 650 | "data": { 651 | "text/plain": [ 652 | "[12, 'a', [3, 4, 'cats']]" 653 | ] 654 | }, 655 | "execution_count": 18, 656 | "metadata": {}, 657 | "output_type": "execute_result" 658 | } 659 | ], 660 | "source": [ 661 | "l = [12, 'a', [3, 4, 'cats']]\n", 662 | "l" 663 | ] 664 | }, 665 | { 666 | "cell_type": "code", 667 | "execution_count": 19, 668 | "metadata": { 669 | "collapsed": false 670 | }, 671 | "outputs": [ 672 | { 673 | "data": { 674 | "text/plain": [ 675 | "('davis', 2016, 'iidata')" 676 | ] 677 | }, 678 | "execution_count": 19, 679 | "metadata": {}, 680 | "output_type": "execute_result" 681 | } 682 | ], 683 | "source": [ 684 | "t = ('davis', 2016, 'iidata')\n", 685 | "t" 686 | ] 687 | }, 688 | { 689 | "cell_type": "markdown", 690 | "metadata": {}, 691 | "source": [ 692 | "## More exotic" 693 | ] 694 | }, 695 | { 696 | "cell_type": "code", 697 | "execution_count": 20, 698 | "metadata": { 699 | "collapsed": false 700 | }, 701 | "outputs": [ 702 | { 703 | "data": { 704 | "text/plain": [ 705 | "deque([0, 1, 2, 3, 4, 5, 6, 
7, 8, 9])" 706 | ] 707 | }, 708 | "execution_count": 20, 709 | "metadata": {}, 710 | "output_type": "execute_result" 711 | } 712 | ], 713 | "source": [ 714 | "# Doubly linked block lists\n", 715 | "from collections import deque\n", 716 | "d = deque(x for x in range(10))\n", 717 | "d" 718 | ] 719 | }, 720 | { 721 | "cell_type": "code", 722 | "execution_count": 21, 723 | "metadata": { 724 | "collapsed": false 725 | }, 726 | "outputs": [ 727 | { 728 | "data": { 729 | "text/plain": [ 730 | "[0, 2, 89, 129, 90]" 731 | ] 732 | }, 733 | "execution_count": 21, 734 | "metadata": {}, 735 | "output_type": "execute_result" 736 | } 737 | ], 738 | "source": [ 739 | "# Heaps\n", 740 | "from heapq import heapify\n", 741 | "h = [90, 2, 89, 129, 0]\n", 742 | "heapify(h)\n", 743 | "h" 744 | ] 745 | }, 746 | { 747 | "cell_type": "markdown", 748 | "metadata": {}, 749 | "source": [ 750 | "## Iterators\n", 751 | "\n", 752 | "Lazily evaluated data streams" 753 | ] 754 | }, 755 | { 756 | "cell_type": "code", 757 | "execution_count": 31, 758 | "metadata": { 759 | "collapsed": false 760 | }, 761 | "outputs": [ 762 | { 763 | "data": { 764 | "text/plain": [ 765 | " at 0x10dddae10>" 766 | ] 767 | }, 768 | "execution_count": 31, 769 | "metadata": {}, 770 | "output_type": "execute_result" 771 | } 772 | ], 773 | "source": [ 774 | "a = (x**2 for x in range(10))\n", 775 | "a" 776 | ] 777 | }, 778 | { 779 | "cell_type": "code", 780 | "execution_count": 29, 781 | "metadata": { 782 | "collapsed": false 783 | }, 784 | "outputs": [], 785 | "source": [ 786 | "r = range(int(1e9))\n", 787 | "r\n", 788 | "t = sum(r)" 789 | ] 790 | }, 791 | { 792 | "cell_type": "code", 793 | "execution_count": 30, 794 | "metadata": { 795 | "collapsed": false 796 | }, 797 | "outputs": [ 798 | { 799 | "data": { 800 | "text/plain": [ 801 | "499999999500000000" 802 | ] 803 | }, 804 | "execution_count": 30, 805 | "metadata": {}, 806 | "output_type": "execute_result" 807 | } 808 | ], 809 | "source": [ 810 | "t" 811 | ] 812 | }, 813 | { 814 | "cell_type": "code", 815 | "execution_count": 32, 816 | "metadata": { 817 | "collapsed": false 818 | }, 819 | "outputs": [ 820 | { 821 | "data": { 822 | "text/plain": [ 823 | "4" 824 | ] 825 | }, 826 | "execution_count": 32, 827 | "metadata": {}, 828 | "output_type": "execute_result" 829 | } 830 | ], 831 | "source": [ 832 | "def f(x):\n", 833 | " \"\"\"Square x\n", 834 | " \n", 835 | " >>> f(2)\n", 836 | " 4\n", 837 | " \"\"\"\n", 838 | " return x**2\n", 839 | "\n", 840 | "f(2)" 841 | ] 842 | } 843 | ], 844 | "metadata": { 845 | "kernelspec": { 846 | "display_name": "Python 3", 847 | "language": "python", 848 | "name": "python3" 849 | }, 850 | "language_info": { 851 | "codemirror_mode": { 852 | "name": "ipython", 853 | "version": 3 854 | }, 855 | "file_extension": ".py", 856 | "mimetype": "text/x-python", 857 | "name": "python", 858 | "nbconvert_exporter": "python", 859 | "pygments_lexer": "ipython3", 860 | "version": "3.4.4" 861 | } 862 | }, 863 | "nbformat": 4, 864 | "nbformat_minor": 0 865 | } 866 | -------------------------------------------------------------------------------- /iid_talk/slides.mdown: -------------------------------------------------------------------------------- 1 | # Python for R Users 2 | 3 | Clark Fitzgerald 4 | 5 | UC Davis iidata 6 | 7 | May 21, 2016 8 | 9 | ![](images/python-logo-master-v3-TM.png) 10 | 11 | ------------------ 12 | 13 | # Goals 14 | 15 | - Basics of Python 16 | - Motivate you to learn more 17 | 18 | ``` 19 | for i in range(3): 20 | i += 2 21 | ``` 22 | 23 | ## Getting up 24 
| 25 | 26 | - Turn off alarm 27 | - Get out of bed 28 | 29 | ## Breakfast 30 | 31 | - Eat eggs 32 | - Drink coffee 33 | 34 | # In the evening 35 | 36 | ## Dinner 37 | 38 | - Eat spaghetti 39 | - Drink wine 40 | 41 | ------------------ 42 | 43 | ![picture of spaghetti](images/spaghetti.jpg) 44 | 45 | ## Going to sleep 46 | 47 | - Get in bed 48 | - Count sheep 49 | --------------------------------------------------------------------------------
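
A minimal sketch of the sentence co-occurrence weighting described in exercise 3, assuming a list `names` of character names and a list `sentences` of sentence strings (both hypothetical placeholders for whatever the full `wpgraph.py` script extracts from the novel). This is not the original script's logic, just one way the counting could be organized with `collections.Counter`, `itertools.combinations`, and `networkx`.

```python
from collections import Counter
from itertools import combinations

import networkx as nx


def cooccurrence_graph(names, sentences):
    """Build an undirected graph whose edge weights count how many
    sentences mention both endpoint names together.

    Sketch only: `names` and `sentences` are assumed to be plain lists
    of strings prepared elsewhere.
    """
    appearances = Counter()
    pair_counts = Counter()

    for sentence in sentences:
        # Which of the tracked names show up in this sentence?
        present = sorted({name for name in names if name in sentence})
        appearances.update(present)
        # Every unordered pair present in the sentence gets one count.
        pair_counts.update(combinations(present, 2))

    graph = nx.Graph()
    for name, count in appearances.items():
        graph.add_node(name, count=count)
    for (a, b), weight in pair_counts.items():
        graph.add_edge(a, b, weight=weight)
    return graph


# Usage sketch: keep the 10 most frequently appearing characters and let
# the stored 'count' attribute drive node size when drawing.
# g = cooccurrence_graph(names, sentences)
# top10 = sorted(g.nodes, key=lambda n: g.nodes[n]['count'], reverse=True)[:10]
# sub = g.subgraph(top10)
```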