├── .gitignore ├── README.md ├── day1 ├── git_followup.sh ├── git_lifejacket.sh ├── hello_world.py ├── plots │ ├── languages.png │ ├── python_comfort.png │ └── topics.png └── pythonintro.pdf ├── day2 ├── backup │ ├── datasci.json │ ├── error.json │ ├── log.txt │ └── notes.mdown ├── day2log.txt ├── download_github.py └── notes15jan.mdown ├── day3 ├── day3log.txt ├── day3notes.md ├── lecture.py └── photo.png ├── day4 ├── ISO_3166-1 ├── backup │ ├── country.py │ ├── log.py │ └── simple.py ├── decade.py ├── languages.csv ├── log.py ├── notes.mdown ├── population.csv └── prepare.py ├── day5 ├── README.md ├── backup │ └── truck_counts.csv ├── day5log.py ├── map │ ├── data.json │ ├── map.html │ └── sf_nhoods.json ├── maps.ipynb ├── mobile_food.csv ├── mpg_plots.py ├── notes.md ├── sf_nhoods.json ├── truck_map.py └── vehicles.csv ├── day6 ├── analysis.py ├── announce.py ├── babynamer.py ├── day6log.txt ├── dearyeji.txt ├── findnames.py ├── textgraph.py └── yob2013.txt ├── day7 ├── README.md ├── aa.db ├── atus.db ├── atus_blank.py ├── backup │ ├── atus.py │ └── lecture.py ├── day7log1.py └── day7log2.py ├── day8 ├── backup │ ├── notes.py │ └── titanic.py ├── betafit.py ├── day8.py ├── final.mdown ├── kaggle.py ├── sleepminutes.csv └── titanic.csv ├── exercise1 ├── exercise1.md ├── usda.py └── usda_blank.py ├── exercise2 ├── exercise2.md ├── random_walk.py └── random_walk_solution.py ├── exercise3 ├── exercise3.mdown ├── wpgraph.pdf └── wpgraph.py ├── exercise4 ├── convert.py └── exercise4.md ├── feedback ├── feedback.csv └── feedback.ipynb └── iid_talk ├── Makefile ├── iidata workshop.ipynb └── slides.mdown /.gitignore: -------------------------------------------------------------------------------- 1 | __pycache__ 2 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | [Course Videos](http://dsi.ucdavis.edu/PythonMiniCourse/) 2 | 3 | # Python for Data Mining 4 | 5 | [Python][] is a programming language designed to have clear, concise, and 6 | expressive code. 7 | An extremely popular general-purpose language, Python has been used for tasks 8 | as diverse as web development, teaching, and systems administration. 9 | This mini-course provides an introduction to Python for data mining. 10 | 11 | Messy data has an inconsistent or inconvenient format, and may have missing 12 | values. 13 | Noisy data has measurement error. 14 | *Data mining extracts meaningful information from messy, noisy data.* 15 | This is a start-to-finish process that includes gathering, cleaning, 16 | visualizing, modeling, and reporting. 17 | 18 | Programming and research best practices are a secondary focus of the 19 | mini-course, because [Python is a philosophy][zen] as well as a language. 20 | Core concepts include: writing organized, well-documented code; being a 21 | self-sufficient learner; using version control for code management and 22 | collaboration; ensuring reproducibility of results; producing concise, 23 | informative analyses and visualizations. 24 | 25 | We will meet for four weeks during the Winter 2015 quarter at the 26 | University of California, Davis. 27 | 28 | [zen]: http://legacy.python.org/dev/peps/pep-0020/ 29 | [Python]: https://www.python.org/ 30 | 31 | ## Target Audience 32 | The mini-course is open to undergraduate and graduate students from all 33 | departments. 
34 | We recommend that students have prior programming experience 35 | and a basic understanding of statistical methods, 36 | so they can follow along with the examples. 37 | For instance, completion of STA 108 and STA 141 is sufficient 38 | (but not required). 39 | 40 | ## Topics 41 | 42 | ### Core Python 43 | The mini-course will kick off with a quick introduction to the syntax of 44 | Python, including operators, data types, control flow statements, function 45 | definition, and string manipulation. 46 | Slower, in-depth coverage will be given to uniquely Pythonic features such as 47 | built-in data structures, list comprehensions, iterators, and docstrings. 48 | 49 | Authoring packages and other advanced topics may also be discussed. 50 | 51 | ### Scientific Computing 52 | Support for stable, high-performance vector operations is provided by the NumPy 53 | package. 54 | NumPy will be introduced early and used often, because it's the foundation for 55 | most other scientific computing packages. 56 | We will also cover SciPy, which extends NumPy with functions for 57 | linear algebra, optimization, and elementary statistics. 58 | 59 | Specialized packages will be discussed during the final week. 60 | 61 | ### Data Manipulation 62 | The pandas package provides tabular data structures and convenience functions 63 | for manipulating them. 64 | This includes a two-dimensional data frame similar to the one found in R. 65 | Pandas will be covered extensively, because it makes it easy to 66 | 67 | + Read and write many formats (CSV, JSON, HDF, database) 68 | + Filter and restructure data 69 | + Handle missing values gracefully 70 | + Perform group-by operations (`apply` functions) 71 | 72 | ### Data Visualization 73 | 74 | Many visualization packages are available for Python, but the mini-course will 75 | focus on Seaborn, which is a user-friendly abstraction of the venerable 76 | matplotlib package. 77 | 78 | Other packages such as ggplot2, Vincent, Bokeh, and mpld3 may also be covered. 79 | 80 | ## Programming Environment 81 | Python 3 has syntax changes and new features that break compatibility with 82 | Python 2. 83 | All of the major scientific computing packages have added support for Python 3 84 | over the last few years, so it will be our focus. 85 | We recommend the [Anaconda][] Python 3 distribution, 86 | which bundles most packages we'll use into one download. 87 | Any other packages needed can be installed using `pip` or `conda`. 88 | 89 | Python code is supported by a vast array of editors. 90 | 91 | + [Spyder IDE][Spyder], included in Anaconda, 92 | is a Python equivalent of RStudio, 93 | designed with scientific computing in mind. 94 | + [PyCharm IDE][PyCharm] and [Sublime][] provide good user interfaces. 95 | + Terminal-based text editors, such as [Vim][] and [Emacs][], are a great 96 | choice for ambitious students. They can be used with any language. 97 | See [here][Text Editors] for more details. Clark and Nick both use Vim. 98 | 99 | [Anaconda]: http://continuum.io/downloads 100 | [Spyder]: https://code.google.com/p/spyderlib/ 101 | [PyCharm]: https://www.jetbrains.com/pycharm/ 102 | [Sublime]: http://www.sublimetext.com/ 103 | [Vim]: http://www.vim.org/ 104 | [Emacs]: https://www.gnu.org/software/emacs/ 105 | [Text Editors]: http://heather.cs.ucdavis.edu/~matloff/ProgEdit/ProgEdit.html 106 | 107 | ## References 108 | 109 | No books are required, but we recommend Wes McKinney's book: 110 | 111 | + McKinney, W. (2012). 
_Python for Data Analysis: Data Wrangling with Pandas, 112 | NumPy, and IPython_. O'Reilly Media. 113 | 114 | Python and most of the packages we'll use have excellent documentation, which 115 | can be found at the following links. 116 | 117 | + [Python 3](https://docs.python.org/3/) 118 | + [NumPy](http://docs.scipy.org/doc/numpy/) 119 | + [SciPy](http://docs.scipy.org/doc/scipy/reference/) 120 | + [pandas](http://pandas.pydata.org/pandas-docs/stable/) 121 | + [matplotlib](http://matplotlib.org/contents.html) 122 | + [Seaborn](http://stanford.edu/~mwaskom/software/seaborn/tutorial.html) 123 | + [scikit-learn](http://scikit-learn.org/stable/documentation.html) 124 | + [IPython](http://ipython.org/documentation.html) 125 | 126 | Due to Python's popularity, a large number of general references are available. 127 | While these don't focus specifically on data analysis, they're helpful for 128 | learning the language and its idioms. 129 | Some of our favorites are listed below, many of which are free. 130 | 131 | + Swaroop, C. H. (2003). _[A Byte of Python][]_. ([PDF][ABoP PDF]) 132 | + Reitz, K. _[Hitchhiker's Guide to Python][Hitchhiker's Guide]_. 133 | ([PDF][HGoP PDF]) 134 | + Lutz, M. (2014). _Python Pocket Reference_. O'Reilly Media. 135 | + Beazley, D. (2009). _Python Essential Reference_. Addison-Wesley. 136 | + Pilgrim, M., & Willison, S. (2009). _[Dive Into Python 3][]_. Apress. 137 | + [Non-programmer's Tutorial for Python 3][Non] 138 | + [Beginner's Guide to Python][Beginner's Guide] 139 | + [Five Lifejackets to Throw to the New Coder][New Coder] 140 | + [Pyvideo][Pyvideo]\* 141 | + [StackOverflow][]. Please be conscious of the [rules][SO Rules]! 142 | 143 | \* Videos featuring Guido Van Rossum, Raymond Hettinger, Travis Oliphant, 144 | Fernando Perez, David Beazley, and Alex Martelli are suggested. 145 | 146 | 147 | [A Byte of Python]: http://www.swaroopch.com/notes/python/ 148 | [ABoP PDF]: http://files.swaroopch.com/python/byte_of_python.pdf 149 | 150 | [Hitchhiker's Guide]: http://docs.python-guide.org/en/latest/ 151 | [HGop PDF]: https://media.readthedocs.org/pdf/python-guide/latest/python-guide.pdf 152 | 153 | [Dive Into Python 3]: http://www.diveintopython3.net/ 154 | [Non]: http://en.wikibooks.org/wiki/Non-Programmer%27s_Tutorial_for_Python_3 155 | [Beginner's Guide]: https://wiki.python.org/moin/BeginnersGuide 156 | [New Coder]: http://newcoder.io/ 157 | [Pyvideo]: http://pyvideo.org/ 158 | [StackOverflow]: http://stackoverflow.com/questions/tagged/python 159 | [SO Rules]: http://stackoverflow.com/tour 160 | 161 | -------------------------------------------------------------------------------- /day1/git_followup.sh: -------------------------------------------------------------------------------- 1 | 2 | # More on Git 3 | # =========== 4 | # This file describes some everyday tasks in git. 5 | 6 | # Creating Repositories 7 | # ===================== 8 | # You don't have to use GitHub to create a new repository. You can create an 9 | # new local repository named NAME with: 10 | # 11 | # git init [NAME] 12 | # 13 | git init my-repo 14 | 15 | # If you want to use a repository you created locally with a remote repository, 16 | # you have to tell git where the remote repository is: 17 | # 18 | # git remote [add NAME URL] 19 | # 20 | 21 | # Add a remote called origin: 22 | git remote add origin https://github.com/nick-ulle/my-repo.git 23 | 24 | # List all remotes for the current repo: 25 | git remote 26 | 27 | # After adding the remote repository, you can push and pull as usual. 
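# For example, a first push of the local 'master' branch might look like this
# (a sketch; the `-u` flag sets origin/master as the default upstream, so
# later you can run plain `git push` and `git pull`):
git push -u origin master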
28 | 
29 | # Branching
30 | # =========
31 | # Git supports keeping multiple working versions of a repository at once. These
32 | # are represented as branches. Every repository starts with a branch called
33 | # 'master'.
34 | #
35 | # To make a new branch named NAME, use:
36 | #
37 | # git branch [NAME]
38 | #
39 | git branch experimental
40 | 
41 | # On its own, `git branch` will list the repository's branches:
42 | 
43 | git branch
44 | 
45 | # You can switch branches with:
46 | #
47 | # git checkout NAME
48 | #
49 | git checkout experimental
50 | 
51 | # You can delete a branch with the following command:
52 | #
53 | # git branch -d NAME
54 | #
55 | git checkout master
56 | git branch -d experimental
57 | 
58 | git branch
59 | 
60 | # Now let's recreate that branch and switch to it.
61 | git branch experimental
62 | git checkout experimental
63 | 
64 | # If we make some changes on the branch and commit them, they don't get applied
65 | # to any other branch.
66 | touch testing.py
67 | git add testing.py
68 | git commit
69 | 
70 | ls
71 | git log
72 | 
73 | git checkout master
74 | ls
75 | git log
76 | 
77 | git branch -v
78 | 
79 | # To merge commits from another branch into the current branch, use:
80 | #
81 | # git merge BRANCH
82 | #
83 | git merge experimental
84 | 
85 | git status
86 | git log
87 | ls
88 | 
89 | # How should you use branches?
90 | #
91 | # Branches are useful for isolating experimental changes from code you already
92 | # have working. They also make it easy to manage multiple drafts.
93 | #
94 | # Use branches judiciously according to your own work habits, especially for
95 | # projects where you're the only contributor. In larger, public projects, try
96 | # to follow guidelines agreed upon by all the contributors.
97 | #
98 | # A popular branching workflow is described here:
99 | #
100 | # http://nvie.com/posts/a-successful-git-branching-model/
101 | #
102 | 
103 | # Stashing Changes
104 | # ================
105 | # You might want to switch branches, saving your work on the current branch
106 | # without committing it. The solution to this is stashing:
107 | #
108 | # git stash [pop]
109 | #
110 | 
111 | # For example, suppose you add a Python file:
112 | touch foo.py
113 | git add foo.py
114 | 
115 | # Then you realize you want to switch branches and work on something completely
116 | # different. Stash your current work to get it out of the way:
117 | git stash
118 | 
119 | # Switch to a different branch and do other work, committing when you're done:
120 | # (If you're following along, create the branch and switch to it with
121 | #
122 | # git checkout -b other-branch
123 | #
124 | # )
125 | git checkout other-branch
126 | 
127 | # When you're ready to go back to work on foo.py, you can switch back to the
128 | # original branch and pop the stash:
129 | git checkout master
130 | git stash pop
131 | ls
132 | 
133 | # Revising Commits
134 | # ================
135 | # What if you forget to add a change, make a typo in the commit message, etc.?
136 | # Fix it with:
137 | #
138 | # git commit --amend
139 | #
140 | vim README.md
141 | git add README.md
142 | git commit --amend
143 | 
144 | # NEVER amend a commit you've already pushed to a remote repository. It will
145 | # interfere with git's normal operation! Instead, you should make a new commit
146 | # for any changes you forgot.
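# For example, if a pushed commit left something out, stage the change and
# commit again (a sketch -- `notes.md` stands in for whatever file you forgot):
git add notes.md
git commit -m "Add notes.md, left out of the previous commit"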
147 | 148 | # Restoring Previous Commits 149 | # ========================== 150 | # Since git tracks all of the commits you make, you can easily restore the 151 | # state of the repository at an earlier point in time. You can do this with: 152 | # 153 | # git reset COMMIT [FILE] 154 | # 155 | # where COMMIT is a commit hash (listed with `git log`). You only need 156 | # to type the first few characters of the commit hash. For example: 157 | 158 | git reset f4fca 159 | 160 | # A more in-depth explanation of `git reset` is given in the git documentation. 161 | 162 | -------------------------------------------------------------------------------- /day1/git_lifejacket.sh: -------------------------------------------------------------------------------- 1 | 2 | # Git Lifejacket 3 | # ============== 4 | # Git is a distributed version control system. What does that mean? 5 | # 6 | # * Git makes it easy to share files (distributed). 7 | # * Git tracks the changes you make to files (version control system). 8 | # 9 | # Use git to... 10 | # 11 | # 1. Quickly distribute sets of files. 12 | # 2. Keep multiple versions of a file without making copies. 13 | # 3. Selectively undo changes you made 3 days or even 3 months ago. 14 | # 4. Efficiently merge work from several different people. 15 | # 5. ... 16 | # 17 | # This tutorial is a git lifejacket, meant to get you up and running. 18 | # 19 | # For more, see (try!) the git followup notes, posted online. Also check out 20 | # the git documentation at: 21 | # 22 | # http://www.git-scm.com/doc 23 | # 24 | # You can also get help with commands by appending `--help` to the end. For 25 | # example: 26 | git status --help 27 | 28 | # Git Repositories 29 | # ================ 30 | # A git repository (or 'repo') is just a set of files tracked by git. 31 | # 32 | # GitHub.com is a host for git repositories on the web, widely-used by 33 | # open-source projects. GitHub provides free hosting for public repositories. 34 | 35 | # You might want to work on a repository someone else created. Download a copy 36 | # of a repository from the web by cloning it: 37 | # 38 | # git clone URL 39 | # 40 | git clone https://github.com/nick-ulle/2015-python.git 41 | 42 | # Check the status of a repository with: 43 | # 44 | # git status 45 | # 46 | git status 47 | 48 | # You can also create new, empty repositories on GitHub, and then clone them: 49 | 50 | git clone https://github.com/nick-ulle/demo.git 51 | 52 | # Committing Changes 53 | # ================== 54 | # After changing some files, you can save a snapshot of the repository by 55 | # making a commit. This is a 2 step process. 56 | 57 | # Step 1: 58 | # 59 | # Add, or 'stage', the changes you want to save in the commit: 60 | # 61 | # git add FILE 62 | # 63 | git add README.md 64 | 65 | # To stage every file in the current repository: 66 | # 67 | # git add --all 68 | 69 | # Use `git status` to see which files are staged. 70 | 71 | # Step 2: 72 | # 73 | # Save the staged changes with the command: 74 | # 75 | # git commit 76 | # 77 | git commit 78 | 79 | # The commit command will ask you to type a message summarizing the changes. 80 | # The first line should be a short description, no more than 50 characters. 81 | # If you want to write more, insert a blank line. For example: 82 | # 83 | # Adds README.md, describing the repository. 84 | # 85 | # The added README.md also contains a classic programming phrase and the 86 | # meaning of life, the universe, and everything. 
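# (Aside: for a short, one-line message you can skip the editor with the
# `-m` flag, for example:
#
# git commit -m "Adds README.md, describing the repository."
#
# )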
87 | 
88 | # If you examine the repository status, git no longer sees any changes. This
89 | # is because they've been committed.
90 | 
91 | # When should you make a commit?
92 | #
93 | # Common advice is to commit early and often. Commits are your save points,
94 | # and you never know when you'll need to go back. You could commit every time
95 | # you finish a small piece of work, such as writing a function.
96 | 
97 | # You can see a history of the last N commits with the command:
98 | #
99 | # git log [-N]
100 | #
101 | git log
102 | git log -3
103 | 
104 | # Your Turn!
105 | # ==========
106 | # Clone the demo repository from
107 | #
108 | # https://www.github.com/nick-ulle/demo.git
109 | #
110 | # then make a new file, type something in it, and commit the file.
111 | 
112 | # Try making your GitHub account, creating a repo, cloning it, and making a commit!
113 | 
114 | # Working With Remote Repositories
115 | # ================================
116 | # What if you want to share your work online (say, GitHub)? An online
117 | # repository is a 'remote' repository.
118 | 
119 | # Given permission (e.g., you own the repo), you can push commits to a remote
120 | # repository with the command:
121 | #
122 | # git push [REMOTE BRANCH]
123 | #
124 | 
125 | # Push commits to the remote repo 'origin' on branch 'master':
126 | 
127 | git push origin master
128 | 
129 | # You can also retrieve commits other people have made to a repository. Do
130 | # this with:
131 | #
132 | # git pull [REMOTE BRANCH]
133 | #
134 | git pull origin master
135 | 
136 | 
--------------------------------------------------------------------------------
/day1/hello_world.py:
--------------------------------------------------------------------------------
1 | 
2 | # To open Python...
3 | # * Windows users: find ipython (3) in the Start Menu
4 | # * All others: enter `ipython` in terminal (without quotes)
5 | 
6 | print('Hello world!')
7 | 
8 | # Now check that packages we'll use often were installed correctly.
9 | # All of the following import lines should run without any message.
10 | 11 | # Testing NumPy 12 | import numpy as np 13 | 14 | # Testing SciPy 15 | import scipy as sp 16 | 17 | # Testing Pandas 18 | import pandas as pd 19 | 20 | # Testing Matplotlib 21 | import matplotlib.pyplot as plt 22 | 23 | # Finally, check that you can make plots: 24 | plt.plot(range(5)) 25 | plt.show() 26 | 27 | -------------------------------------------------------------------------------- /day1/plots/languages.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nick-ulle/2015-python/18e324ea50435d94c72ec853eec56ea6dfa5fe05/day1/plots/languages.png -------------------------------------------------------------------------------- /day1/plots/python_comfort.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nick-ulle/2015-python/18e324ea50435d94c72ec853eec56ea6dfa5fe05/day1/plots/python_comfort.png -------------------------------------------------------------------------------- /day1/plots/topics.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nick-ulle/2015-python/18e324ea50435d94c72ec853eec56ea6dfa5fe05/day1/plots/topics.png -------------------------------------------------------------------------------- /day1/pythonintro.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nick-ulle/2015-python/18e324ea50435d94c72ec853eec56ea6dfa5fe05/day1/pythonintro.pdf -------------------------------------------------------------------------------- /day2/backup/error.json: -------------------------------------------------------------------------------- 1 | {"documentation_url": "https://developer.github.com/v3/search", "errors": [{"code": "missing", "field": "q", "resource": "Search"}], "message": "Validation Failed"} -------------------------------------------------------------------------------- /day2/backup/log.txt: -------------------------------------------------------------------------------- 1 | clear 2 | davis = {'cool': True, 3 | 'students': 35e3} 4 | davis 5 | # This is a dictionary. The whole Python language is built on dictionaries 6 | davis['cool'] 7 | davis['students'] 8 | clear 9 | davis 10 | davis.keys() 11 | davis.values() 12 | # Keys and values together are called 'items' 13 | davis.items() 14 | davis 15 | clear 16 | 'cool' in Davis 17 | 'cool' in davis 18 | # This says that 'cool' is a key in the davis dictionary 19 | davis['cool'] 20 | # To answer this you need to know the philosophy of Python 21 | import this 22 | davis 23 | davis['stats'] 24 | # Lets add stats to davis 25 | davis['stats'] = 'excellent' 26 | davis 27 | depts = ['stats', 'math', 'ecology'] 28 | clear 29 | depts 30 | depts = ['stats', 'math', 'ecology'] 31 | davis 32 | depts 33 | type(davis) 34 | type(depts) 35 | type(123) 36 | type(123.1) 37 | type('123.1') 38 | type(True) 39 | davis['cool'] 40 | type(davis['cool']) 41 | clear 42 | depts 43 | nums = [0, 10, 20, 30, 40] 44 | # We are learning how to explore the basic data structures 45 | len 46 | ?len 47 | len? 48 | nums 49 | len(nums) 50 | # how do we pick out the first two? 51 | nums[:2] 52 | # Python begins counting at 0!!! 
53 | nums[0] 54 | nums[1] 55 | nums[2] 56 | nums 57 | nums[:2] 58 | # This is called 'slicing' 59 | nums[-1] 60 | nums[-2] 61 | nums[-3] 62 | nums 63 | nums[2:] 64 | # nums up to, but excluding the 2nd element 65 | nums[:2] 66 | # nums beginning with the 2nd element 67 | nums[2:] 68 | nums 69 | nums[:2] 70 | nums[2:] 71 | nums 72 | # What do you observe? 73 | nums[:2] + nums[2:] 74 | nums 75 | nums[:2] + nums[2:] == nums 76 | k = 3 77 | nums[:k] + nums[k:] == nums 78 | k = 10 79 | len(nums) 80 | nums[:k] + nums[k:] == nums 81 | # Can also do negative slices 82 | nums[-3:] 83 | nums 84 | nums[:k] + nums[k:] == nums 85 | # The positive side of 0 indexing is that we get this pretty identity 86 | # We have learned slicing 87 | import this 88 | import this 89 | # flat is better than nested 90 | davis 91 | depts 92 | # We've seen it in Python - now in JSON! 93 | # Ways to transfer data 94 | davis 95 | import json 96 | json.dumps? 97 | json.dumps(davis) 98 | davis 99 | davis 100 | davis['dept'] = dept 101 | davis['dept'] = depts 102 | davis 103 | davis['dept'] 104 | type(davis['dept']) 105 | json.dumps(davis) 106 | davis 107 | # We have seen that JSON looks just like Python lists and dicts 108 | # Time to fetch some live data! 109 | # import requests 110 | import requests 111 | requests.get 112 | baseurl = 'https://api.github.com/search/repositories' 113 | response = requests.get(baseurl) 114 | response 115 | response 116 | response.status_code 117 | response 118 | response.content 119 | response.json 120 | response.json() 121 | r = response.json() 122 | type(r) 123 | r.keys() 124 | r 125 | r['errors'] 126 | type(r['errors']) 127 | len(r['errors']) 128 | r['errors'] 129 | r['errors'][0] 130 | type(r['errors'][0]) 131 | r 132 | payload = {'q': 'data science', 'sort': 'stars'} 133 | ds = requests.get(baseurl, params=payload) 134 | ds 135 | ds.json() 136 | dsj = ds.json() 137 | type(dsj) 138 | len(dsj) 139 | dsj.keys() 140 | dsj['total_count'] 141 | dsj['incomplete_results'] 142 | dsji = dsji['items'] 143 | dsji = dsj['items'] 144 | clear 145 | dsji 146 | clear 147 | len(dsji) 148 | type(dsji) 149 | dsji[0] 150 | # Lets get the description for all of them! 151 | type(dsji[0]) 152 | [x['description'] for x in dsji] 153 | %run download_github.py 154 | ds 155 | %run download_github.py 156 | ds 157 | -------------------------------------------------------------------------------- /day2/backup/notes.mdown: -------------------------------------------------------------------------------- 1 | Lecture 2 2 | Thu Jan 8 19:32:53 PST 2015 3 | 4 | # Today's Goals: 5 | 6 | - Introduce basic Python 7 | - Learn two data structures- lists and dicts 8 | - See that Python lists and dicts correspond to JSON 9 | - Write a client for a REST API 10 | 11 | Why this is awesome: 12 | If you can write a client for any REST API you can programmatically 13 | access an essentially unlimited amount of data. 14 | So we're done looking at the 'iris' data :) 15 | 16 | Other names for Python dictionary: 17 | - key / value pair 18 | - dict 19 | - hash table 20 | - hash map 21 | - associative array 22 | 23 | Ways of transferring data: 24 | 25 | 1. verbally 26 | 2. look in a book 27 | 3. email 28 | 4. download CSV file from the web 29 | 5. 
query remote database or system 30 | - Direct connection to underlying database technology *Danger zone!* 31 | - Go through a REST API layer *Easier!* 32 | 33 | Use common format like JSON, XMl, CSV 34 | CSV also called 'flat file' 35 | 36 | Remote databases are very powerful 37 | 38 | 39 | Data Golf: 40 | 1 million integers in a Python list 41 | -------------------------------------------------------------------------------- /day2/day2log.txt: -------------------------------------------------------------------------------- 1 | a = 123 2 | a 3 | a 4 | # The answer to what happens when.... 5 | # Open an interpreter, try it, and see!! 6 | type(a) 7 | type(1.0) 8 | type('hello') 9 | type(True) 10 | type(None) 11 | davis = {'state': 'California', 'students': 35000} 12 | type(davis) 13 | # The whole python language is built on dictionaries (dicts) 14 | davis.keys() 15 | davis.values() 16 | davis.items() 17 | davis 18 | davis['state'] 19 | davis['state'] = 'CA' 20 | davis 21 | davis['cool'] = True 22 | davis 23 | type(davis['cool']) 24 | # that was the dictionary 25 | # Moving on to the list 26 | nums = [0, 10, 20, 30, 40] 27 | nums 28 | type(nums) 29 | ?len 30 | nums 31 | len(nums) 32 | # We want to pick elements from the list 33 | nums[0] 34 | # Python starts counting at 0!!!!!!!!!!! 35 | # Very important 36 | nums 37 | nums[0] 38 | nums[1] 39 | nums[2] 40 | nums 41 | nums[len(nums)] 42 | nums 43 | nums[-1] 44 | # In python this is called 'slicing' 45 | nums[-2] 46 | nums[0:2] 47 | # That says 'up to, but excluding the 2nd element' 48 | nums[:2] 49 | # To get all elements starting with the 2nd: 50 | nums[2:] 51 | nums[1:] 52 | clear 53 | nums[:2] 54 | nums[2:] 55 | k = 3 56 | nums[:k] 57 | nums[k:] 58 | nums[:k] + nums[k:] 59 | nums 60 | nums + 5 61 | nums + [5] 62 | nums 63 | # Suppose we want to add 5 to each element 64 | # In math notation you might write this: 65 | # {x + 5 for x in nums} 66 | [x + 5 for x in nums] 67 | # This is called a list comprehension 68 | nums 69 | davis 70 | a = [12, ['a', 'b']] 71 | a 72 | import json 73 | json.dumps? 74 | json.dumps(davis) 75 | davis 76 | davis 77 | davis['nums'] = nums 78 | davis 79 | json.dumps(davis) 80 | json.loads? 81 | json.dumps(davis) 82 | dstring = json.dumps(davis) 83 | dstring 84 | davis2 = json.loads(dstring) 85 | davis2 86 | davis 87 | davis2 88 | davis == davis2 89 | a = 'It's a beautiful day!' 90 | a = "It's a beautiful day!" 91 | a 92 | b = ''' 93 | Hello everybody. 94 | blah blah 95 | ''' 96 | b 97 | clear 98 | davis 99 | 'cool' in davis 100 | clear 101 | davis 102 | 2 ** 10 103 | 2 ** 20 104 | 2 ** 10 105 | before = 18.7 106 | x = [float(x) for x in range(int(1e6))] 107 | len(x) 108 | x[:4] 109 | x[-5:] 110 | after = 44.2 111 | before 112 | after - before 113 | x = [float(x) for x in range(4)] 114 | x 115 | a = range(int(1e12)) 116 | a 117 | b = range(10, 100) 118 | b 119 | next(b) 120 | b.step 121 | import requests 122 | requests.get? 123 | google = requests.get('http://www.google.com') 124 | google = requests.get('https://www.google.com') 125 | cd backup/ 126 | ls 127 | google = requests.get('https://www.google.com') 128 | google = requests.get('https://www.google.com') 129 | google 130 | google = requests.get('https://www.google.com') 131 | # How do we learn about unknown objects? 
132 | # Use the `type` function 133 | type(google) 134 | google.status_code 135 | google.text 136 | import requests 137 | base = 'https://api.github.com/search/repositories' 138 | response = requests.get(base) 139 | response 140 | type(response) 141 | response.text 142 | response.json? 143 | response.json() 144 | a = response.json() 145 | a 146 | a 147 | type(a) 148 | len(a) 149 | a 150 | a.keys() 151 | payload = {'q': 'data science'} 152 | type(payload) 153 | payload['sort'] = 'stars' 154 | payload 155 | response = requests.get(base, params=payload) 156 | a = response.json() 157 | response 158 | a 159 | type(a) 160 | len(a) 161 | a.keys() 162 | a['total_count'] 163 | a['items'] 164 | len(a['items']) 165 | type(a['items']) 166 | b = a['items'] 167 | b 168 | b[0] 169 | b[0].keys() 170 | c = [[x['full_name'], x['watchers']] for x in b] 171 | c 172 | ls 173 | pwd 174 | dir() 175 | davis 176 | a = 'hello' 177 | a = 10 178 | a.real 179 | % history -f day2log.txt 180 | -------------------------------------------------------------------------------- /day2/download_github.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Script to download some interesting data from Github.com 3 | 4 | Thu Dec 18 17:13:16 PST 2014 5 | 6 | There are about 25k repositories that match the query 'data science'. 7 | We'll look at the most popular ones. 8 | ''' 9 | 10 | import json 11 | import requests 12 | 13 | 14 | def query_github(params): 15 | ''' 16 | Query Github using dict of params 17 | ''' 18 | base = 'https://api.github.com/search/repositories' 19 | response = requests.get(base, params=params) 20 | return response.json() 21 | 22 | 23 | if __name__ == '__main__': 24 | 25 | payload = {'q': 'data science', 'sort': 'stars'} 26 | ds = query_github(payload) 27 | -------------------------------------------------------------------------------- /day2/notes15jan.mdown: -------------------------------------------------------------------------------- 1 | Lecture 2 2 | 3 | Thu Jan 15 14:10:15 PST 2015 4 | 5 | # Today's Goals 6 | - Introduce basic python 7 | - Learn two important data structures lists and dicts 8 | - See that Python lists and dicts correspond to JSON 9 | - Write a client for a REST API 10 | 11 | # Why this is awesome!! 12 | You can access an *enormous* amount of data. 13 | 14 | Python dictionaries: 15 | - Key / Value Pair 16 | - Hash table 17 | - Hash map 18 | - Associative array 19 | 20 | Data golf!!!! 21 | 22 | floating point number - 8 bytes 23 | 24 | How large is a Python list with 1 million floating point numbers? 25 | A lower bound should be about 8 MB. 26 | 27 | Data transfer: 28 | - Talking 29 | - Writing stuff down in books 30 | - Email each other data in a CSV attachment 31 | - download a csv file from the web 32 | - direct machine to machine transfer from a larger backend system 33 | (database) 34 | 1. Direct connection to that technology 35 | 2. 
To use REST API (abstraction layer) 36 | 37 | Common forms for data: 38 | XML, CSV, JSON 39 | 40 | -------------------------------------------------------------------------------- /day3/day3log.txt: -------------------------------------------------------------------------------- 1 | a = [1, 2, 3] 2 | a + 5 3 | [x + 5 for x in a] 4 | import numpy 5 | import numpy as np 6 | b = np.array(a) 7 | a 8 | b 9 | type(b) 10 | np.array([1, 2, 6, 4]) 11 | c = np.arange(1e6) 12 | 41.7 - 33.8 13 | d = [float(x) for x in range(int(1e6))] 14 | 73.2 - 41.7 15 | a 16 | a + 5 17 | b + 5 18 | %timeit print('Hello world!') 19 | c 20 | %timeit c + 5 21 | d 22 | %timeit [x + 5 for x in d] 23 | type(d) 24 | len(d) 25 | b 26 | b_copy = b 27 | b 28 | b_copy 29 | b_copy[1] = 7 30 | b_copy 31 | b 32 | import this 33 | b = np.array([1, 2, 3]) 34 | b_copy = b.copy() 35 | b 36 | b_copy 37 | b_copy[1] = 7 38 | b_copy 39 | b 40 | A = np.array([[1, 2], [3, 4]]) 41 | A 42 | A = np.array([1, 2, 3, 4]).reshape(2, 2) 43 | A 44 | np.zeros(4).reshape(2, 2) 45 | I = np.eye(2) 46 | I 47 | A 48 | A * I 49 | A.dot(I) 50 | np.dot(A, I) 51 | A @ I 52 | A[1, 0] 53 | A 54 | np.random.binomial? 55 | steps = np.random.binomial(1, 0.5, 100) 56 | steps 57 | steps[steps == 0] = -1 58 | steps 59 | positions = steps.cumsum() 60 | positions 61 | import matplotlib.pyplot as plt 62 | plt.plot(positions) 63 | plt.show() 64 | import skimage.io as io 65 | img = io.imread('photo.png', as_grey=True) 66 | img 67 | img 68 | img.shape 69 | type(img) 70 | ctr = np.mean(img, axis=0) 71 | ctr.shape 72 | ctr_img = img - ctr 73 | plt.imshow(img) 74 | plt.show() 75 | plt.imshow(img).set_cmap('gray') 76 | plt.show() 77 | plt.imshow(ctr_img).set_cmap('gray') 78 | plt.show() 79 | np.linalg.svd? 80 | _, _, v = np.linalg.svd(ctr_img) 81 | v.shape 82 | v = v.T 83 | reduced_v = v[:, :100] 84 | v.shape 85 | reduced_v.shape 86 | prin_comp = ctr_img.dot(reduced_v) 87 | prin_comp.shape 88 | reconst = prin_comp.dot(reduced_v.T) + ctr 89 | plt.imshow(reconst).set_cmap('gray') 90 | plt.show() 91 | fig, ax = plt.subplots(1, 2) 92 | ax[0].imshow(img).set_cmap('gray') 93 | ax[1].imshow(reconst).set_cmap('gray') 94 | fig.show() 95 | img.size 96 | reconst = prin_comp.dot(reduced_v.T) + ctr 97 | prin_comp.size + reduced_v.size + ctr.size 98 | original_size = img.size 99 | new_size = prin_comp.size + reduced_v.size + ctr.size 100 | new_size / original_size 101 | reconst.size 102 | %history -f code_log.txt 103 | -------------------------------------------------------------------------------- /day3/day3notes.md: -------------------------------------------------------------------------------- 1 | 2 | 1 million floating point values in a list: 24 - 32 MB (should be 8 MB) 3 | 4 | 1. Lists take up too much memory! 5 | 2. We'd like to have vectorization. 6 | 3. Lists are kind of slow. 7 | 8 | NumPy provides an alternative: the n-dimensional array. 9 | 10 | DATA GOLF! 11 | How much memory does a 1 million element ndarray of floats use? 12 | 8 MB! 13 | 14 | NumPy takes 8.6 ms to add 5 to 1 million element ndarray. 15 | Python lists take 501ms! 16 | 17 | Be careful with references! 18 | 19 | We saw NumPy gives us arrays, matrices, ... But what else? 20 | 21 | * random number generation (numpy.random) 22 | * fast Fourier transforms (numpy.fft) 23 | * polynomials (numpy.polynomial) 24 | * linear algebra (numpy.linalg) 25 | * support for calling C libraries 26 | 27 | # Simple Random Walk! 28 | 29 | Random walk with 100 steps 30 | 31 | 1. 
Flip a coin 100 times--these are the steps (0 means down). 32 | 2. Take cumulative sums 33 | 34 | # PCA / SVD 35 | The SVD is a very important decomposition, especially to statistics. 36 | A great statistics example is Principal Components Analysis (PCA). 37 | What is PCA typically used for? 38 | 39 | Say X is a centered n-by-p data matrix. Then 40 | 41 | X = UDV' = λ₁u₁v₁' + ... + λₖuₖvₖ' + ... + λₚuₚvₚ' (n < p) 42 | 43 | PCA takes the first k principal components, λ₁u₁, ..., λₖuₖ. 44 | A slightly different perspective: PCA approximates the original data by 45 | 46 | X ~ λ₁u₁v₁' + ... + λₖuₖvₖ' 47 | 48 | Lossy data compression! 49 | 50 | DATA GOLF! 51 | Original image: 180,500 52 | Compressed image: 86,461 53 | -------------------------------------------------------------------------------- /day3/lecture.py: -------------------------------------------------------------------------------- 1 | 2 | # Today's Agenda: 3 | # 4 | # 1. Introduce NumPy 5 | # 2. Simple random walk 6 | # 3. Visualize SVD/PCA 7 | 8 | a = [1, 2, 3] 9 | 10 | # Python's lists are awesome, but they have a few serious limitations for 11 | # numerical computing: 12 | # 13 | # 1. They use a lot of memory (~32 MB for 1 million 64-bit floats). 14 | # 2. They're slow. 15 | # 3. They're not vectorized. We could use list comprehensions, but do you 16 | # really want to write 17 | 18 | [x + 5 for x in a] 19 | 20 | # instead of 21 | 22 | a + 5 23 | 24 | # So we need a better solution when we do numerical computing. Enter NumPy! 25 | 26 | import numpy 27 | 28 | # We're going to use NumPy all the time, so let's save some typing with an 29 | # alias: 30 | 31 | import numpy as np 32 | 33 | # Using `np` for NumPy is a common convention. 34 | # 35 | # You can convert a Python list to a NumPy array with `array()`. 36 | 37 | np.array([1, 2]) 38 | b = np.array(a) 39 | 40 | # DATA GOLF! A list of 1 million floats uses ~32 MB. 41 | # How much memory does a NumPy array of 1 million floats use? 42 | 43 | big_np_array = np.arange(1e6) 44 | big_list = [float(x) for x in range(int(1e6))] 45 | 46 | # NumPy also knows we want to do numerical work, where vectorized operations 47 | # make sense. 48 | 49 | b + 5 50 | 51 | # These vectorized operations are also faster than corresponding operations on 52 | # lists. 53 | 54 | %timeit big_np_array + 5 55 | %timeit [x + 5 for x in big_list] 56 | 57 | # Indexing and slicing uses `[ ]`, the same as Python lists. 58 | 59 | b[1:] 60 | 61 | # WARNING: Python objects, including lists and NumPy arrays, are stored as 62 | # references. Compared to R, they might not behave as you'd expect. 63 | 64 | b_copy = b 65 | b_copy[1] = 7 66 | 67 | # What are the elements of `b_copy`? What're the elements of `b`? 68 | b_copy 69 | b 70 | 71 | # ZEN: Explicit is better than implicit. 72 | # If you wanted to make a copy, do it explicitly! 73 | 74 | b = np.array(a) 75 | 76 | b_copy = b.copy() 77 | b_copy[1] = 7 78 | 79 | b_copy 80 | b 81 | 82 | # Matrices can be defined directly 83 | 84 | np.array([[1, 2], [3, 4]]) 85 | 86 | # or by reshaping an array 87 | 88 | A = np.array([1, 2, 3, 4]).reshape(2, 2) 89 | A 90 | 91 | # How do you do matrix multiplication? 92 | I = np.eye(2) 93 | I 94 | 95 | A * I 96 | 97 | # Using `*` does element-by-element multiplication. Instead, call the `dot()` 98 | # method or `dot()` function. 99 | 100 | A.dot(I) 101 | np.dot(A, I) 102 | 103 | # In Python 3.5, there will be an operator, `@`, for matrix multiplication. 104 | 105 | # What else do we get with NumPy? 
106 | # 107 | # * linear algebra (numpy.linalg) 108 | # * random number generation (numpy.random) 109 | # * Fourier transforms (numpy.fft) 110 | # * polynomials (numpy.polynomial) 111 | 112 | # Let's write a simple random walk! 113 | # 114 | # 1. Get the step (up or down) at each time. That is, take N independent 115 | # samples from a binomial(1, p) or Bernoulli(p) distribution. 116 | # 2. Transform 0s to -1s (down). 117 | # 3. Use cumulative sums to calculate the position at each time. 118 | 119 | N = 10 120 | p = 0.5 121 | steps = np.random.binomial(1, p, N) 122 | 123 | steps[steps == 0] = -1 124 | steps 125 | 126 | positions = steps.cumsum() 127 | 128 | import matplotlib.pyplot as plt 129 | 130 | # Turn on Matplotlib's interactive mode. 131 | plt.ion() 132 | 133 | plt.plot(positions) 134 | 135 | # The SVD is a very important decomposition, especially to statistics. 136 | # A great statistics example is Principal Components Analysis (PCA). 137 | # What is PCA typically used for? 138 | 139 | # Say X is a centered n-by-p data matrix. Then 140 | # 141 | # X = UDV' = λ₁u₁v₁' + ... + λₖuₖvₖ' + ... + λₚuₚvₚ' (n < p) 142 | # 143 | # PCA takes the first k principal components, λ₁u₁, ..., λₖuₖ. 144 | # A slightly different perspective: PCA approximates the original data by 145 | # 146 | # X ~ λ₁u₁v₁' + ... + λₖuₖvₖ' 147 | # 148 | # Lossy data compression! 149 | 150 | # To load an image, use SciPy, Matplotlib, or Scikit-Image. 151 | import skimage as ski, skimage.io 152 | 153 | img = ski.io.imread('photo.png', as_grey=True) 154 | 155 | plt.imshow(img).set_cmap('gray') 156 | 157 | # Center the image and take the SVD. 158 | mean = np.mean(img, axis=0) 159 | 160 | ctr_img = img - mean 161 | 162 | plt.imshow(ctr_img).set_cmap('gray') 163 | 164 | _, _, v = np.linalg.svd(ctr_img) 165 | 166 | # `linalg.svd()` returns V', not V. 167 | v = v.T 168 | 169 | # Remove some columns from V (terms in the SVD sum). 170 | v_reduced = v[:, 0:100] 171 | prin_comp = ctr_img.dot(v_reduced) 172 | 173 | # Reconstruct the data. 174 | reconst = np.dot(prin_comp, v_reduced.T) + mean 175 | 176 | # Another way: 177 | reconst = prin_comp.dot(v_reduced.T) + mean 178 | 179 | fig, axs = plt.subplots(1, 2) 180 | for ax, im in zip(axs, [img, reconst]): 181 | ax.imshow(im).set_cmap('gray') 182 | fig.show() 183 | 184 | # DATA GOLF! How much compression? 185 | # Original image: 186 | 187 | original_size = img.size 188 | 189 | # Compressed image: 190 | 191 | compressed_size = mean.size + v_reduced.size + prin_comp.size 192 | 193 | compressed_size / original_size 194 | 195 | -------------------------------------------------------------------------------- /day3/photo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nick-ulle/2015-python/18e324ea50435d94c72ec853eec56ea6dfa5fe05/day3/photo.png -------------------------------------------------------------------------------- /day4/backup/country.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Lecture 4 3 | 4 | Wed Jan 21 20:41:05 PST 2015 5 | 6 | Today's Goals: 7 | - functions 8 | - scripting 9 | - pandas 10 | 11 | A python script is just a plain text file with a .py extension. It contains 12 | a bunch of commands which will be executed in the order that they appear. 13 | 14 | Pandas gives us the DataFrame and all the power that comes with it. 15 | 16 | Data Munging: Putting your data into the correct form and type for 17 | analysis. 
This is the foundation for any analysis you wish to do. It pays 18 | to be able to do it efficiently. 19 | 20 | Glamour: * 21 | Utility: ***** 22 | 23 | Pandas makes data munging easier. 24 | 25 | Documentation and doctests are sooooo easy in Python. 26 | ''' 27 | 28 | def decade(year): 29 | ''' 30 | Return the decade of a given year 31 | 32 | >>> decade(1954) 33 | 1950 34 | 35 | ''' 36 | return 10 * (year // 10) 37 | 38 | 39 | if __name__ == '__main__': 40 | import doctest 41 | doctest.testmod() 42 | -------------------------------------------------------------------------------- /day4/backup/log.py: -------------------------------------------------------------------------------- 1 | clear 2 | # The first thing we do is import the libraries we need 3 | # This can get a little tiresome 4 | # If I want to use everything in the Numpy library: 5 | # First let's see what I have in base Python: 6 | dir() 7 | # Very little 8 | # Suppose I now want everything from the Numpy library for my interactive work: 9 | from numpy import * 10 | dir() 11 | # We just brought everything from Numpy into the workspace 12 | # More precisely- they are all global variables and functions now 13 | # This is convenient for interactive work 14 | # But not recommended for scripting / programming 15 | # Ipython makes this easy: 16 | %pylab 17 | # This imported everything from numpy and matplotlib and set up the plotting 18 | import seaborn 19 | # seaborn gives you pretty graphics :) 20 | plot([1, 2, 1, 2]) 21 | import pandas as pd 22 | import pandas as funstuff 23 | clear 24 | # Functions are a _wonderful_ thing 25 | def greeter(name): 26 | print('hello', name) 27 | greeter('Qi') 28 | ls 29 | %run country.py 30 | # read in the population data 31 | p = pd.read_csv('population.csv') 32 | type(p) 33 | p.shape 34 | p.dtypes 35 | # Date should be a timestamp, not object 36 | # We need to fix it! 37 | pd.to_datetime? 38 | p 39 | p.columns 40 | # Lets start out with the basics 41 | # Columns of a DataFrame are called Series 42 | # The name comes from time series 43 | # hint- they work really well for time series 44 | pd.DateOffset? 45 | pd.date_range? 
46 | a = pd.date_range('2014-01-01') 47 | # I've just illustrated a valuable technique- 48 | If you don't know what will happen, just try it and see what error you get 49 | # If you don't know what will happen, just try it and see what error you get 50 | a = pd.date_range('2014-01-01', periods=10) 51 | a 52 | aser = pd.Series(a) 53 | aser 54 | import numpy as np 55 | a = np.ones(5) 56 | a 57 | aser = pd.Series(a) 58 | aser 59 | aser.index 60 | # Pandas is designed to preserve the structure of your data 61 | # That means the index will be maintained throughout the operations 62 | bser = pd.Series(np.ones(10)) 63 | bser 64 | aser 65 | bser 66 | aser + bser 67 | # Alignment was preserved 68 | aser 69 | aser.index[0] 70 | # This would be a crazy thing to doaser.index[0] = 27 71 | aser.index[0] = 27 72 | aser.index 73 | a = aser 74 | a 75 | a.index 76 | # Index is an attribut of a dataframe or series 77 | a.index = pd.date_range('2014-01-01', periods=5) 78 | a 79 | b = bser 80 | b.index = pd.date_range('2014-01-03', periods=10) 81 | b 82 | a 83 | b 84 | # Dataframes try hard to preserve alignment 85 | a + b 86 | a.data 87 | a.as_matrix() 88 | # Indexing is important 89 | # You get speed and data integrity 90 | p 91 | p.dtypes 92 | p.columns 93 | # Selecting columns is like key lookups in a dictionary 94 | p['Date'] 95 | p['Date'][:10] 96 | p['Date'].head() 97 | pd.to_datetime(p['Date']) 98 | # Before we had strings 99 | # These are nanosecond timestamps 100 | p['Date'] = pd.to_datetime(p['Date']) 101 | # I just updated my script with the new commands 102 | # lets make sure it worked 103 | %run country.py 104 | p.dtypes 105 | # type is correct 106 | # We have another table: 107 | c = pd.read_csv('languages.csv') 108 | c 109 | c.shape 110 | c.dtypes 111 | # Our goal is to put the language into the population table 112 | p.head() 113 | # But we don't have country names 114 | wiki = pd.read_html('http://en.wikipedia.org/wiki/ISO_3166-1') 115 | # You may expect this to be a DataFrame 116 | # But HTML pages can contain many tables 117 | type(wiki) 118 | len(wiki) 119 | wiki[0] 120 | wiki[0].head() 121 | # We only need columns 0 and 2 122 | w = wiki[0].iloc[:, [0, 2]] 123 | # df.iloc lets us do integer selction of rows, columns 124 | w.shape 125 | w.dtypes 126 | w.head() 127 | w.columns 128 | w.columns = ['country', 'ISO'] 129 | w.columns 130 | # Now for the join 131 | l 132 | dir() 133 | sh short name (upper/lower case) Alpha-3 code5 134 | c 135 | l = c 136 | l 137 | l 138 | w 139 | w.head() 140 | w.columns 141 | l.columns 142 | l.merge(w) 143 | p.columns 144 | p2 = p.merge(l.merge(w)) 145 | p2.columns 146 | p2.dtypes 147 | p2.head() 148 | # A successful join 149 | p.head() 150 | p['ISO'].unique() 151 | p.shape 152 | p.head() 153 | p.pivot? 154 | p.pivot(index='Date', columns='ISO', values='population') 155 | p3 = p.pivot(index='Date', columns='ISO', values='population') 156 | p3.dtypes 157 | p3.head() 158 | p3.iloc[:10, ] 159 | p3.iloc[10:20, ] 160 | p3 161 | p3.plot() 162 | p3.dtypes 163 | p2 164 | p2.head() 165 | p2['language'].unique() 166 | ppop = p2['population'] 167 | p2.groupby? 168 | p2.groupby('language') 169 | grouper = p2.groupby('language') 170 | type(grouper) 171 | grouper.name 172 | grouper.ngroups 173 | grouper.mean() 174 | # Let's put this together 175 | p2.groupby('language').mean() 176 | p2.groupby('language').count() 177 | grouper.mad? 178 | grouper.size? 
179 | grouper.size() 180 | # Suppose we want to know the average population in each country by decade 181 | p 182 | p2 183 | p3 184 | # Suppose we want to know the average population in each country by decade 185 | # If p3 had a column with the decades we could do a groupby 186 | p3.index 187 | p3.index.year 188 | p3 189 | p3['year'] = p3.index.year 190 | p3.dtypes 191 | p3.head() 192 | # We need to get the decade from the year 193 | # ie 1954 -> 1950 194 | 1954 // 10 195 | # floor division by 10 196 | 10 * 1954 // 10 197 | 10 * (1954 // 10) 198 | %run country.py 199 | %run country.py 200 | %run country.py 201 | decade 202 | ?decade 203 | decade(2018) 204 | p2 205 | p3.dtypes 206 | p3['year'].apply(decade) 207 | p3['decade'] = p3['year'].apply(decade) 208 | p3.head() 209 | p3.groupby('decade').mean() 210 | # So what if I want a column with this? 211 | g = p3['CHN'].groupby('decade') 212 | g = p3[['CHN', 'decade']].groupby('decade') 213 | g.transform? 214 | g.transform(np.mean) 215 | p3['china10yr'] = p3[['CHN', 'decade']].groupby('decade').transform(np.mean) 216 | p3.plot() 217 | p3[['CHN', 'china10yr']].plot() 218 | a = np.arange(20) 219 | a 220 | a // 10 221 | ls 222 | a = pd.read_html('ISO_3166-1') 223 | -------------------------------------------------------------------------------- /day4/backup/simple.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Write what your program does here in the header 3 | ''' 4 | 5 | # Imports come first 6 | import json 7 | 8 | 9 | print('This runs when the script is imported') 10 | 11 | 12 | def greet(person): 13 | ''' 14 | This is a function docstring 15 | ''' 16 | print('hello', person) 17 | 18 | 19 | if __name__ == '__main__': 20 | 21 | # Put the action code here 22 | # Also a good place for tests 23 | 24 | print('the __name__ is __main__!') 25 | greet('Matt') 26 | -------------------------------------------------------------------------------- /day4/decade.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Tools for working with dates 3 | ''' 4 | 5 | #import pandas as pd 6 | 7 | def decade(year): 8 | ''' 9 | Given a year, return the decade 10 | 11 | >>> decade(1986) 12 | 1980 13 | 14 | ''' 15 | return 10 * (year // 10) 16 | 17 | 18 | if __name__ == '__main__': 19 | import doctest 20 | doctest.testmod() 21 | -------------------------------------------------------------------------------- /day4/languages.csv: -------------------------------------------------------------------------------- 1 | country,language 2 | Canada,English 3 | China,Mandarin 4 | United States of America,English 5 | -------------------------------------------------------------------------------- /day4/log.py: -------------------------------------------------------------------------------- 1 | # first lets look at what we have 2 | dir() 3 | # lets pull in everything from Numpy 4 | from numpy import * 5 | dir() 6 | %pylab 7 | plot? 
8 | # everything in matplotlib and numpy have become global variables 9 | # This is recommended interactive work- but not for scripting or programming 10 | import seaborn 11 | # Seaborn gives pretty graphics 12 | plot([0, 1, 0]) 13 | # In summary just use %pylab 14 | import pandas as pd 15 | import pandas as penguins 16 | # I could have given pandas any name I want 17 | import numpy as np 18 | # Pandas maintains data integrity across operations 19 | a = pd.Series(np.ones(10)) 20 | a 21 | a 22 | type(a) 23 | # the dataframe is like a table 24 | # the columns in a dataframe are called series 25 | a 26 | b = pd.Series(np.ones(5)) 27 | b 28 | type(b) 29 | b 30 | b.index 31 | a + b 32 | pd.date_range('2014-01-01', periods=10) 33 | a 34 | pd.date_range('2014-01-01', periods=10) 35 | a.index 36 | a.index[1] 37 | a.index[1] = 29 38 | a.index = pd.date_range('2014-01-01', periods=10) 39 | a 40 | a.index = pd.date_range('2014-01-01', periods=10) 41 | b.index = pd.date_range('2014-01-03', periods=5) 42 | a 43 | b 44 | a + b 45 | a 46 | a[:10] 47 | a[4] 48 | # Now for real data 49 | # Pandas makes it very easy to read and write data 50 | # in various formats 51 | p = pd.read_csv('population.csv') 52 | type(p) 53 | # How big is it? 54 | p.shape 55 | # We have 186 rows 56 | # and 3 columns 57 | p.dtypes 58 | p.head() 59 | p.dtypes 60 | p['Date'].head() 61 | p['Date'][0] 62 | d1 = p['Date'][0] 63 | type(d1) 64 | # Lets convert to datetime 65 | pd.to_datetime(d1['Date']) 66 | pd.to_datetime(p['Date']) 67 | p['Date'] = pd.to_datetime(p['Date']) 68 | p.dtypes 69 | # How many countries are there? 70 | p['ISO'] 71 | p['ISO'].unique() 72 | p['ISO'].head() 73 | p.head() 74 | p.dtypes 75 | p.pivot? 76 | p2 = p.pivot(index='Date', columns='ISO', values='population') 77 | p2.head() 78 | # Now we can plot 79 | p2.plot() 80 | # I have another table 81 | lang = pd.read_csv('languages.csv') 82 | lang 83 | p.head() 84 | lang 85 | wiki = pd.read_html('http://en.wikipedia.org/wiki/ISO_3166-1') 86 | type(wiki) 87 | len(wiki) 88 | wiki[0] 89 | w = wiki[0] 90 | w.head() 91 | w2 = w.iloc[1:, [0, 2]] 92 | w2.head() 93 | w2.columns 94 | p.columns 95 | lang.columns 96 | w2.columns = ['country', ISO'] 97 | w2.columns = ['country', 'ISO'] 98 | w2.head() 99 | lang.merge(w2) 100 | p.head() 101 | p.merge(lang.merge(w2)) 102 | p2 = p.merge(lang.merge(w2)) 103 | p2.columns 104 | p2.head() 105 | # given a year- how do we get the decade? 106 | % run decade.py 107 | # 1958 -> 1950 108 | 1958 // 10 109 | 10 * (1958 // 10) 110 | %run decade.py 111 | decade? 112 | import decade 113 | p2.dtypes 114 | p2['year'] = p2.index.year 115 | p2.index[:10] 116 | p2.head() 117 | p2.index = p2['Date'] 118 | p2.index.year 119 | p2['year'] = p2.index.year 120 | p2.dtypes 121 | p2.head() 122 | p2['year'].apply(decade) 123 | p2['year'].apply(decade.decade) 124 | p2['decade'] = p2['year'].apply(decade.decade) 125 | p2.head() 126 | # The question was- what was the mean population in China by decade? 
127 | p2.groupby(['country', 'decade']) 128 | p2.groupby(['country', 'decade']).mean() 129 | p3 = p2.groupby(['country', 'decade']).mean() 130 | p3 131 | pd.__version__ 132 | %history -f 'day4log.py' 133 | -------------------------------------------------------------------------------- /day4/notes.mdown: -------------------------------------------------------------------------------- 1 | Lecture 4 2 | 3 | Thu Jan 22 14:09:21 PST 2015 4 | 5 | The story of Wes: 6 | * keep integrity of data 7 | * express data analysis operations cleanly and efficiently 8 | * handle time series very well 9 | 10 | Pandas - panel data analysis - DataFrame 11 | 12 | Data Munging - preparing your data 13 | Glamour - @ 14 | Utility - @@@@@ 15 | 16 | Why this is awesome: 17 | It's foundational for all subsequent analysis 18 | Looking for concise, expressive code 19 | -------------------------------------------------------------------------------- /day4/population.csv: -------------------------------------------------------------------------------- 1 | Date,ISO,population 2 | 1950-01-01 00:00:00,CHN,562.579779 3 | 1951-01-01 00:00:00,CHN,567.100762 4 | 1952-01-01 00:00:00,CHN,574.5362739999999 5 | 1953-01-01 00:00:00,CHN,584.1913539999999 6 | 1954-01-01 00:00:00,CHN,594.7248440000001 7 | 1955-01-01 00:00:00,CHN,606.729654 8 | 1956-01-01 00:00:00,CHN,619.135938 9 | 1957-01-01 00:00:00,CHN,633.214551 10 | 1958-01-01 00:00:00,CHN,646.703076 11 | 1959-01-01 00:00:00,CHN,654.3494300000001 12 | 1960-01-01 00:00:00,CHN,650.6605129999999 13 | 1961-01-01 00:00:00,CHN,644.669932 14 | 1962-01-01 00:00:00,CHN,653.302104 15 | 1963-01-01 00:00:00,CHN,674.248708 16 | 1964-01-01 00:00:00,CHN,696.064936 17 | 1965-01-01 00:00:00,CHN,715.546458 18 | 1966-01-01 00:00:00,CHN,735.903786 19 | 1967-01-01 00:00:00,CHN,755.3201190000001 20 | 1968-01-01 00:00:00,CHN,776.152777 21 | 1969-01-01 00:00:00,CHN,798.640508 22 | 1970-01-01 00:00:00,CHN,820.403282 23 | 1971-01-01 00:00:00,CHN,842.4556779999999 24 | 1972-01-01 00:00:00,CHN,863.439051 25 | 1973-01-01 00:00:00,CHN,883.019765 26 | 1974-01-01 00:00:00,CHN,901.3180649999999 27 | 1975-01-01 00:00:00,CHN,917.898537 28 | 1976-01-01 00:00:00,CHN,932.5887270000001 29 | 1977-01-01 00:00:00,CHN,946.093816 30 | 1978-01-01 00:00:00,CHN,958.835162 31 | 1979-01-01 00:00:00,CHN,972.136875 32 | 1980-01-01 00:00:00,CHN,984.73646 33 | 1981-01-01 00:00:00,CHN,997.000718 34 | 1982-01-01 00:00:00,CHN,1012.490488 35 | 1983-01-01 00:00:00,CHN,1028.3565350000001 36 | 1984-01-01 00:00:00,CHN,1042.75605 37 | 1985-01-01 00:00:00,CHN,1058.007717 38 | 1986-01-01 00:00:00,CHN,1074.522563 39 | 1987-01-01 00:00:00,CHN,1093.7257120000002 40 | 1988-01-01 00:00:00,CHN,1112.866405 41 | 1989-01-01 00:00:00,CHN,1130.729412 42 | 1990-01-01 00:00:00,CHN,1148.36447 43 | 1991-01-01 00:00:00,CHN,1163.610388 44 | 1992-01-01 00:00:00,CHN,1177.535611 45 | 1993-01-01 00:00:00,CHN,1190.761826 46 | 1994-01-01 00:00:00,CHN,1203.8017439999999 47 | 1995-01-01 00:00:00,CHN,1216.3784440000002 48 | 1996-01-01 00:00:00,CHN,1227.882189 49 | 1997-01-01 00:00:00,CHN,1238.1256640000001 50 | 1998-01-01 00:00:00,CHN,1247.502219 51 | 1999-01-01 00:00:00,CHN,1255.9924990000002 52 | 2000-01-01 00:00:00,CHN,1263.6375309999999 53 | 2001-01-01 00:00:00,CHN,1270.744232 54 | 2002-01-01 00:00:00,CHN,1277.59472 55 | 2003-01-01 00:00:00,CHN,1284.3033159999998 56 | 2004-01-01 00:00:00,CHN,1291.001804 57 | 2005-01-01 00:00:00,CHN,1297.765318 58 | 2006-01-01 00:00:00,CHN,1304.2618850000001 59 | 2007-01-01 00:00:00,CHN,1310.583544 60 | 2008-01-01 
00:00:00,CHN,1317.065677 61 | 2009-01-01 00:00:00,CHN,1323.591583 62 | 2010-01-01 00:00:00,CHN,1330.141295 63 | 2011-01-01 00:00:00,CHN, 64 | 1950-01-01 00:00:00,USA, 65 | 1951-01-01 00:00:00,USA, 66 | 1952-01-01 00:00:00,USA, 67 | 1953-01-01 00:00:00,USA, 68 | 1954-01-01 00:00:00,USA, 69 | 1955-01-01 00:00:00,USA, 70 | 1956-01-01 00:00:00,USA, 71 | 1957-01-01 00:00:00,USA, 72 | 1958-01-01 00:00:00,USA, 73 | 1959-01-01 00:00:00,USA, 74 | 1960-01-01 00:00:00,USA,180.67 75 | 1961-01-01 00:00:00,USA,183.69 76 | 1962-01-01 00:00:00,USA,186.54 77 | 1963-01-01 00:00:00,USA,189.24 78 | 1964-01-01 00:00:00,USA,191.89 79 | 1965-01-01 00:00:00,USA,194.3 80 | 1966-01-01 00:00:00,USA,196.56 81 | 1967-01-01 00:00:00,USA,198.71 82 | 1968-01-01 00:00:00,USA,200.71 83 | 1969-01-01 00:00:00,USA,202.68 84 | 1970-01-01 00:00:00,USA,205.05 85 | 1971-01-01 00:00:00,USA,207.66 86 | 1972-01-01 00:00:00,USA,209.9 87 | 1973-01-01 00:00:00,USA,211.91 88 | 1974-01-01 00:00:00,USA,213.85 89 | 1975-01-01 00:00:00,USA,215.97 90 | 1976-01-01 00:00:00,USA,218.04 91 | 1977-01-01 00:00:00,USA,220.24 92 | 1978-01-01 00:00:00,USA,222.59 93 | 1979-01-01 00:00:00,USA,225.06 94 | 1980-01-01 00:00:00,USA,227.73 95 | 1981-01-01 00:00:00,USA,229.97 96 | 1982-01-01 00:00:00,USA,232.19 97 | 1983-01-01 00:00:00,USA,234.31 98 | 1984-01-01 00:00:00,USA,236.35 99 | 1985-01-01 00:00:00,USA,238.47 100 | 1986-01-01 00:00:00,USA,240.65 101 | 1987-01-01 00:00:00,USA,242.8 102 | 1988-01-01 00:00:00,USA,245.02 103 | 1989-01-01 00:00:00,USA,247.34 104 | 1990-01-01 00:00:00,USA,250.13 105 | 1991-01-01 00:00:00,USA,253.49 106 | 1992-01-01 00:00:00,USA,256.89 107 | 1993-01-01 00:00:00,USA,260.26 108 | 1994-01-01 00:00:00,USA,263.44 109 | 1995-01-01 00:00:00,USA,266.56 110 | 1996-01-01 00:00:00,USA,269.67 111 | 1997-01-01 00:00:00,USA,272.91 112 | 1998-01-01 00:00:00,USA,276.12 113 | 1999-01-01 00:00:00,USA,279.3 114 | 2000-01-01 00:00:00,USA,282.38 115 | 2001-01-01 00:00:00,USA,285.31 116 | 2002-01-01 00:00:00,USA,288.1 117 | 2003-01-01 00:00:00,USA,290.82 118 | 2004-01-01 00:00:00,USA,293.46 119 | 2005-01-01 00:00:00,USA,296.19 120 | 2006-01-01 00:00:00,USA,299.0 121 | 2007-01-01 00:00:00,USA,302.0 122 | 2008-01-01 00:00:00,USA,304.8 123 | 2009-01-01 00:00:00,USA,307.44 124 | 2010-01-01 00:00:00,USA,309.98 125 | 2011-01-01 00:00:00,USA,312.24 126 | 1950-01-01 00:00:00,CAN, 127 | 1951-01-01 00:00:00,CAN, 128 | 1952-01-01 00:00:00,CAN, 129 | 1953-01-01 00:00:00,CAN, 130 | 1954-01-01 00:00:00,CAN, 131 | 1955-01-01 00:00:00,CAN, 132 | 1956-01-01 00:00:00,CAN, 133 | 1957-01-01 00:00:00,CAN, 134 | 1958-01-01 00:00:00,CAN, 135 | 1959-01-01 00:00:00,CAN, 136 | 1960-01-01 00:00:00,CAN,17.91 137 | 1961-01-01 00:00:00,CAN,18.27 138 | 1962-01-01 00:00:00,CAN,18.61 139 | 1963-01-01 00:00:00,CAN,18.96 140 | 1964-01-01 00:00:00,CAN,19.33 141 | 1965-01-01 00:00:00,CAN,19.68 142 | 1966-01-01 00:00:00,CAN,20.05 143 | 1967-01-01 00:00:00,CAN,20.41 144 | 1968-01-01 00:00:00,CAN,20.73 145 | 1969-01-01 00:00:00,CAN,21.03 146 | 1970-01-01 00:00:00,CAN,21.32 147 | 1971-01-01 00:00:00,CAN,21.96 148 | 1972-01-01 00:00:00,CAN,22.22 149 | 1973-01-01 00:00:00,CAN,22.49 150 | 1974-01-01 00:00:00,CAN,22.81 151 | 1975-01-01 00:00:00,CAN,23.14 152 | 1976-01-01 00:00:00,CAN,23.45 153 | 1977-01-01 00:00:00,CAN,23.73 154 | 1978-01-01 00:00:00,CAN,23.96 155 | 1979-01-01 00:00:00,CAN,24.2 156 | 1980-01-01 00:00:00,CAN,24.52 157 | 1981-01-01 00:00:00,CAN,24.82 158 | 1982-01-01 00:00:00,CAN,25.12 159 | 1983-01-01 00:00:00,CAN,25.37 160 | 1984-01-01 00:00:00,CAN,25.61 161 | 1985-01-01 
00:00:00,CAN,25.84 162 | 1986-01-01 00:00:00,CAN,26.1 163 | 1987-01-01 00:00:00,CAN,26.45 164 | 1988-01-01 00:00:00,CAN,26.79 165 | 1989-01-01 00:00:00,CAN,27.28 166 | 1990-01-01 00:00:00,CAN,27.69 167 | 1991-01-01 00:00:00,CAN,28.04 168 | 1992-01-01 00:00:00,CAN,28.37 169 | 1993-01-01 00:00:00,CAN,28.68 170 | 1994-01-01 00:00:00,CAN,29.0 171 | 1995-01-01 00:00:00,CAN,29.3 172 | 1996-01-01 00:00:00,CAN,29.61 173 | 1997-01-01 00:00:00,CAN,29.91 174 | 1998-01-01 00:00:00,CAN,30.16 175 | 1999-01-01 00:00:00,CAN,30.4 176 | 2000-01-01 00:00:00,CAN,30.69 177 | 2001-01-01 00:00:00,CAN,31.02 178 | 2002-01-01 00:00:00,CAN,31.35 179 | 2003-01-01 00:00:00,CAN,31.64 180 | 2004-01-01 00:00:00,CAN,31.94 181 | 2005-01-01 00:00:00,CAN,32.25 182 | 2006-01-01 00:00:00,CAN,32.58 183 | 2007-01-01 00:00:00,CAN,32.93 184 | 2008-01-01 00:00:00,CAN,33.32 185 | 2009-01-01 00:00:00,CAN,33.73 186 | 2010-01-01 00:00:00,CAN,34.13 187 | 2011-01-01 00:00:00,CAN,34.48 188 |
-------------------------------------------------------------------------------- /day4/prepare.py: --------------------------------------------------------------------------------
1 | import pandas as pd
2 | import Quandl
3 |
4 | china = Quandl.get("FRED/POPTTLCNA173NUPN")
5 | usa = Quandl.get("FRED/USAPOPL")
6 | canada = Quandl.get("FRED/CANPOPL")
7 |
8 | # Convert from thousands to millions
9 | china = china / 1000
10 |
11 | p = pd.concat([china, usa, canada], axis=1)
12 | p.columns = ['china', 'usa', 'canada']
13 |
14 | p.reset_index(inplace=True)
15 |
16 | popmelt = pd.melt(p, id_vars='Date')
17 |
18 | popmelt.columns = ['Date', 'ISO', 'population']
19 |
20 | popmelt.loc[popmelt['ISO'] == 'china', 'ISO'] = 'CHN'
21 | popmelt.loc[popmelt['ISO'] == 'usa', 'ISO'] = 'USA'
22 | popmelt.loc[popmelt['ISO'] == 'canada', 'ISO'] = 'CAN'
23 |
24 | popmelt.to_csv('population.csv', index=False)
25 |
26 | languages = {'country':
27 |              ['Canada', 'China', 'United States of America'],
28 |              'language':
29 |              ['English', 'Mandarin', 'English']}
30 |
31 | ldf = pd.DataFrame(languages)
32 | ldf.to_csv('languages.csv', index=False)
33 |
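`prepare.py` reshapes the concatenated wide table into the long `Date,ISO,population` layout with `pd.melt()` (the writes to the `ISO` column use `.loc` to avoid pandas's chained-assignment pitfall). A toy sketch of the same reshape, using a two-row frame with values rounded from `population.csv`:

```python
import pandas as pd

# A small wide table: one column per country.
wide = pd.DataFrame({'Date': ['2009-01-01', '2010-01-01'],
                     'CHN': [1323.59, 1330.14],
                     'USA': [307.44, 309.98]})

# melt() stacks the country columns into (variable, value) pairs,
# repeating the id column for each.
long = pd.melt(wide, id_vars='Date')
long.columns = ['Date', 'ISO', 'population']
print(long)
```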
-------------------------------------------------------------------------------- /day5/README.md: --------------------------------------------------------------------------------
1 |
2 | To view the map, you must download the entire `map` directory. The map is
3 | powered by [OpenStreetMap](https://www.openstreetmap.org/).
4 |
5 | The mobile food permits data & San Francisco neighborhoods GeoJSON were sourced
6 | from San Francisco's [open data collection](https://data.sfgov.org/).
7 |
8 | The vehicle mileage data was sourced from the Environmental Protection Agency's
9 | National Vehicle and Fuel Emissions Laboratory, which makes data freely
10 | available [here](http://fueleconomy.gov/feg/download.shtml).
11 |
12 |
-------------------------------------------------------------------------------- /day5/backup/truck_counts.csv: --------------------------------------------------------------------------------
1 | Neighborhood,Trucks 2 | South of Market,88 3 | Financial District,75 4 | Mission,61 5 | Central Waterfront,47 6 | Bayview,38 7 | Mission Bay,28 8 | Potrero Hill,26 9 | Produce Market,24 10 | Downtown / Union Square,22 11 | Hunters Point,20 12 | India Basin,18 13 | Bret Harte,17 14 | Dogpatch,15 15 | Northern Waterfront,14 16 | Showplace Square,13 17 | South Beach,11 18 | Rincon Hill,9 19 | Pacific Heights,9 20 | Mission Dolores,7 21 | Silver Terrace,4 22 | Western Addition,4 23 | Bernal Heights,4 24 | Civic Center,4 25 | Little Hollywood,4 26 | Presidio Heights,4 27 | Hayes Valley,3 28 | Candlestick Point SRA,3 29 | Cow Hollow,3 30 | Castro,3 31 | Apparel City,3 32 | Panhandle,2 33 | Lower Nob Hill,2 34 | North Beach,2 35 | Parnassus Heights,2 36 | Inner Sunset,2 37 | Cathedral Hill,2 38 | Tenderloin,2 39 | Anza Vista,2 40 | Marina,2 41 | Golden Gate Heights,2 42 | Russian Hill,2 43 | Lakeshore,1 44 | Portola,1 45 | Ingleside,1 46 | Polk Gulch,1 47 | Laurel Heights / Jordan Park,1 48 | Lower Haight,1 49 | Eureka Valley,1 50 | Outer Mission,1 51 | Crocker Amazon,1 52 | Lone Mountain,1 53 | Outer Richmond,1 54 | Noe Valley,1 55 | Westwood Park,1 56 | Laguna Honda,1 57 | Alamo Square,1 58 |
-------------------------------------------------------------------------------- /day5/day5log.py: --------------------------------------------------------------------------------
1 | import matplotlib.pyplot as plt
2 | import numpy as np
3 | x = np.linspace(0, 1, 100)
4 | y_cos = np.cos(x)
5 | y_sin = np.sin(x)
6 | plt.plot(x, y_cos)
7 | plt.show()
8 | plt.ion()
9 | plt.plot(x, y_cos)
10 | plt.plot(x, y_cos, 'r-x')
11 | plt.plot?
12 | plt.plot(x, y_cos, color='red', linestyle='-', marker='x')
13 | plt.plot(x, y_cos, x, y_sin)
14 | plt.plot(x, y_cos)
15 | plt.plot(x, y_sin)
16 | # use plt.subplots() to make multiple plots.
17 | fig, ax = plt.subplots(2, 1)
18 | ax.plot(x, y_sin)
19 | ax[0].plot(x, y_sin)
20 | ax[1].plot(x, y_cos)
21 | # plt.draw() forces the plot display to update.
22 | plt.draw()
23 | # plot in XKCD style!
24 | with plt.xkcd():
25 |     plt.plot(x, y_sin)
26 |     plt.plot(x, y_cos)
27 | %history -f day5log.py
28 |
-------------------------------------------------------------------------------- /day5/map/data.json: --------------------------------------------------------------------------------
1 | [{"Potrero Hill": 26, "Alamo Square": 1, "Civic Center": 4, "Presidio Heights": 4, "Downtown / Union Square": 22, "Dogpatch": 15, "Cathedral Hill": 2, "Tenderloin": 2, "South Beach": 11, "Lower Nob Hill": 2, "India Basin": 18, "Outer Mission": 1, "Golden Gate Heights": 2, "Bayview": 38, "Cow Hollow": 3, "Showplace Square": 13, "Eureka Valley": 1, "Russian Hill": 2, "Little Hollywood": 4, "Mission": 61, "Inner Sunset": 2, "Marina": 2, "Silver Terrace": 4, "Central Waterfront": 47, "Northern Waterfront": 14, "Lone Mountain": 1, "Anza Vista": 2, "Ingleside": 1, "Noe Valley": 1, "North Beach": 2, "Castro": 3, "Lakeshore": 1, "Laurel Heights / Jordan Park": 1, "Western Addition": 4, "Financial District": 75, "Polk Gulch": 1, "Rincon Hill": 9, "Lower Haight": 1, "Outer Richmond": 1, "Mission Bay": 28, "South of Market": 88, "Laguna Honda": 1, "Bret Harte": 17, "Apparel City": 3, "Mission Dolores": 7, "Parnassus Heights": 2, "Portola": 1, "Candlestick Point SRA": 3, "Hayes Valley": 3, "Crocker Amazon": 1, "Pacific Heights": 9, "Produce Market": 24, "Panhandle": 2, "Hunters Point": 20, "Bernal Heights": 4, "Westwood Park": 1}]
-------------------------------------------------------------------------------- /day5/map/map.html: --------------------------------------------------------------------------------
[151 lines of folium-generated HTML/JavaScript; the markup did not survive in this copy.]
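`data.json` above wraps a single neighborhood-to-truck-count dict in a one-element list. A minimal sketch of reading it back (assuming the working directory is `day5/map`):

```python
import json

# data.json holds [{neighborhood: truck count, ...}].
with open('data.json') as f:
    counts = json.load(f)[0]

# The five neighborhoods with the most trucks.
top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:5]
print(top)
```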
-------------------------------------------------------------------------------- /day5/mpg_plots.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | ''' Plots for EPA fuel economy data.
3 | '''
4 |
5 | import pandas as pd
6 | import seaborn as sns
7 | import matplotlib.pyplot as plt
8 |
9 | def main():
10 |     ''' Run a demo of the functions in this module.
11 |     '''
12 |     vehicles = pd.read_csv('vehicles.csv')
13 |
14 |     make_mpg_plot(vehicles)
15 |
16 | def make_mpg_plot(vehicles):
17 |     ''' Make a plot of MPG by Make and Year.
18 |     '''
19 |     # Take means by make and year.
20 |     groups = vehicles.groupby(['year', 'make'])
21 |     avg_mpg = groups.city_mpg.mean()
22 |
23 |     # Convert the indexes to columns, then pivot.
24 |     avg_mpg = avg_mpg.reset_index()
25 |     mpg_matrix = avg_mpg.pivot('year', 'make', 'city_mpg')
26 |
27 |     # Subset down to a few brands.
28 |     brands = ['Honda', 'Toyota', 'Hyundai', 'BMW', 'Mercedes-Benz',
29 |               'Ferrari', 'Lamborghini', 'Maserati', 'Fiat', 'Bentley',
30 |               'Ford', 'Chevrolet', 'Dodge', 'Jeep']
31 |     brand_matrix = mpg_matrix[brands]
32 |
33 |     # Plot the data using seaborn and matplotlib.
34 |     # The `subplots()` function returns a figure and an axes.
35 |     fig, ax = plt.subplots(1, 1)
36 |     sns.heatmap(brand_matrix, cmap='BuGn', ax=ax)
37 |     ax.set_xlabel('Make')
38 |     ax.set_ylabel('Year')
39 |     ax.set_title('MPG by Make and Year')
40 |
41 |     # Setting a tight layout ensures the entire plot fits in the plotting
42 |     # window.
43 |     fig.tight_layout()
44 |
45 |     # Show the figure.
46 |     fig.show()
47 |     # Alternatively:
48 |     # fig.savefig('mpg_plot.png')
49 |
50 | # Only call `main()` when the script is executed directly (not when it's
51 | # imported with `import mpg_plots`). This allows the script to be used as an
52 | # application AND as a module in other scripts.
53 | if __name__ == '__main__':
54 |     main()
55 |
56 |
-------------------------------------------------------------------------------- /day5/notes.md: --------------------------------------------------------------------------------
1 |
2 | There are many different packages for plotting with Python. These are split
3 | into two ecosystems, "matplotlib" and "javascript".
4 |
5 | * The "matplotlib" ecosystem has highly-customizable tools well-suited for
6 |   printed graphics and interactive (offline) applications. Some members:
7 |     + matplotlib - general plotting
8 |     + seaborn - "pretty" statistical plotting
9 |     + ggplot - grammar of graphics plotting
10 |     + basemap - old maps package, using shapefiles
11 |     + cartopy - new maps package, using GeoJSON and recent Python GIS libraries
12 | * The "javascript" ecosystem is less mature, but growing fast. These modules
13 |   are designed for interactive web graphics. Each leverages a javascript
14 |   library. Many require an HTML server to work. Some members:
15 |     + bokeh (D3.js) - general plotting, especially large/streaming data
16 |     + vincent (vega.js) - general plotting
17 |     + folium (leaflet.js) - interactive world maps (similar to Google Maps)
18 |     + kartograph (kartograph.js) - interactive maps
19 |
20 | Since it's the most mature, we'll look at matplotlib first.
21 |
22 | matplotlib can be used interactively (PyLab) or in a non-interactive style. The
23 | interactive functions are in `matplotlib.pyplot`.
24 |
25 | The basic plotting function is `plot()`. It can produce line and scatter plots.
26 | Plots can be customized quickly with format strings, or more verbosely using
27 | parameters. Note that `plot()` can plot several lines in one call.
28 |
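A minimal sketch of the two customization styles mentioned above, plus several lines in one call (assuming only numpy and matplotlib):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)

# Terse: a format string ('r--o' = red, dashed, circle markers).
plt.plot(x, np.sin(x), 'r--o')

# Verbose: the same look spelled out as keyword parameters.
plt.plot(x, np.cos(x), color='red', linestyle='--', marker='o')

# Several lines in one call: x1, y1, fmt1, x2, y2, fmt2, ...
plt.plot(x, np.sin(x), 'b-', x, np.cos(x), 'g-')
plt.show()
```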
29 | Other plots are available:
30 |
31 | * acorr
32 | * bar
33 | * barh
34 | * boxplot
35 | * contour
36 | * errorbar
37 | * hist
38 | * scatter
39 | * violinplot
40 | * xcorr
41 |
42 | Writing non-interactive matplotlib code gives you more control over the
43 | resulting plots. In order to understand the documentation, you need to learn
44 | some jargon:
45 |
46 | * Figure - a drawing surface
47 | * Axes - a single plot, including its axes, title, lines, etc...
48 | * Artist - a single element of a plot; for example, a line
49 |
50 | It's possible to use non-interactive methods directly at the interpreter, but
51 | easier to use them in a script. Every Python script is a module, meaning it
52 | can be imported in other scripts. Scripts are also very easy to document using
53 | docstrings (triple-quoted strings).
54 |
-------------------------------------------------------------------------------- /day5/truck_map.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | ''' Map food trucks by neighborhood in San Francisco.
3 | '''
4 |
5 | import json
6 | import folium
7 | import pandas as pd
8 | import shapely.geometry as geom
9 |
10 | def main():
11 |     ''' Produce a map of neighborhood food trucks in San Francisco.
12 |     '''
13 |     # Load the neighborhoods GeoJSON data, and extract the neighborhood names
14 |     # and shapes.
15 |     nhoods = load_geo_json('sf_nhoods.json')
16 |     features = extract_features(nhoods)
17 |
18 |     # Load the food truck data, and extract (longitude, latitude) pairs.
19 |     trucks = pd.read_csv('mobile_food.csv')
20 |     loc = trucks[['Longitude', 'Latitude']].dropna()
21 |
22 |     # Determine the neighborhood each food truck is in.
23 |     trucks['Neighborhood'] = loc.apply(get_nhood, axis=1, features=features)
24 |
25 |     counts = trucks['Neighborhood'].value_counts()
26 |     counts = counts.reset_index()
27 |     counts.columns = pd.Index(['Neighborhood', 'Trucks'])
28 |
29 |     # Create a map.
30 |     my_map = folium.Map(location=[37.77, -122.45], zoom_start=12)
31 |
32 |     my_map.geo_json(geo_path = 'sf_nhoods.json',
33 |                     key_on='feature.properties.name',
34 |                     data = counts, columns = ['Neighborhood', 'Trucks'],
35 |                     fill_color='BuGn')
36 |
37 |     my_map.create_map('map.html')
38 |
39 | def get_nhood(truck, features):
40 |     ''' Identify the neighborhood of a given point.
41 |     '''
42 |     truck = geom.Point(*truck)
43 |     for name, boundary in features:
44 |         if truck.within(boundary):
45 |             return name
46 |
47 |     return None
48 |
49 | def load_geo_json(path):
50 |     ''' Load a GeoJSON file.
51 |     '''
52 |     with open(path) as file:
53 |         geo_json = json.load(file)
54 |
55 |     return geo_json
56 |
57 | def extract_features(geo_json):
58 |     ''' Extract names and geometries from a geo_json dict.
59 |     '''
60 |     features = []
61 |     for feature in geo_json['features']:
62 |         name = feature['properties']['name']
63 |         geometry = geom.asShape(feature['geometry'])
64 |
65 |         features.append((name, geometry))
66 |
67 |     return features
68 |
69 | if __name__ == '__main__':
70 |     main()
71 |
72 |
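`truck_map.py` assigns each truck to a neighborhood by testing point-in-polygon containment with shapely. A self-contained toy version of that test, with a made-up unit-square boundary standing in for a neighborhood shape:

```python
import shapely.geometry as geom

# A made-up square "neighborhood" and two candidate points.
boundary = geom.Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
inside = geom.Point(0.5, 0.5)
outside = geom.Point(2.0, 2.0)

print(inside.within(boundary))    # True
print(outside.within(boundary))   # False
```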
-------------------------------------------------------------------------------- /day6/analysis.py: --------------------------------------------------------------------------------
1 | '''
2 | We'd like to do network analysis on a body of text.
3 |
4 | Build a graph where the nodes are names, and an edge exists if the two
5 | names appear in the same sentence.
6 |
7 | The first rule about regular expressions:
8 | Avoid them when possible!!!!
9 |
10 | Data Golf!
11 | Today it's for a prize - a book on Python
12 | Hints:
13 | - An ASCII character of text is 1 byte
14 | - A typical word is 6 characters
15 | - A typical novel has 200K words
16 | - War and Peace is 3x a normal book
17 |
18 | How big is War and Peace?
19 |
20 | regex should match capital letters not at the beginning of a sentence.
21 | '''
22 |
23 | import re
24 | from itertools import combinations
25 | import networkx as nx
26 |
27 | names = 'Nick Clark Rick Qi'.split(' ')
28 |
29 | namepairs = set(combinations(names, 2))
30 |
31 | pattern = re.compile(r'''
32 |     [^^]        # Not the beginning of a string
33 |     (?
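`analysis.py` is cut off above, mid-pattern, and the original regex is not recoverable from this copy. A minimal sketch of the docstring's plan, using a naive sentence split in place of the original regex (an illustration, not the original implementation):

```python
import re
from itertools import combinations

import networkx as nx

names = 'Nick Clark Rick Qi'.split()
text = 'Nick met Clark. Later, Rick and Qi joined Nick.'

graph = nx.Graph()
# Naive sentence split: break on ., !, or ? followed by whitespace.
for sentence in re.split(r'[.!?]\s+', text):
    # Link every pair of names that co-occur in this sentence.
    present = [name for name in names if name in sentence]
    graph.add_edges_from(combinations(present, 2))

print(sorted(graph.edges()))
```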
[The dump is damaged here: the remainder of analysis.py and the start of a notebook, including fragments of a SQL query, are missing. Only a rendered DataFrame preview of apartment listings survives:]

|   | title                                             | date_posted         | price  |
|---|---------------------------------------------------|---------------------|--------|
| 0 | $995 / 3br - 1350ft2 - Lancaster Apartment Uni... | 2016-04-13 15:34:06 | 995.0  |
| 1 | $1935 / 2br - 1154ft2 - No place like The Colo... | 2016-04-16 17:55:45 | 1935.0 |
| 2 | $1825 / 2br - 1056ft2 - No place like The Colo... | 2016-04-16 17:58:01 | 1825.0 |
| 3 | $650 / 1br - FURNISHED 1 BED GUEST QUARTERS, D... | 2016-04-17 21:26:15 | 650.0  |
| 4 | $1599 / 2br - 951ft2 - Big Savings!               | 2016-04-28 11:58:28 | 1599.0 |
| 5 | $1335 / 1br - 701ft2 - Looking for your next h... | 2016-04-28 13:12:20 | 1335.0 |
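The preview pairs each listing title with a numeric `price` column. One plausible way to derive such a column (an illustration, not the missing notebook's actual code) is a vectorized regex extract with pandas:

```python
import pandas as pd

titles = pd.Series([
    '$995 / 3br - 1350ft2 - Lancaster Apartment Uni...',
    '$1599 / 2br - 951ft2 - Big Savings!',
])

# Pull the leading dollar amount out of each title.
price = titles.str.extract(r'^\$(\d+)', expand=False).astype(float)
print(price)
```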