├── .github └── workflows │ └── jekyll-gh-pages.yml ├── LICENSE.md ├── README.md ├── datasets.md ├── projects.md ├── r-resources.md ├── scott-davis-transcript.md └── transcripts └── scott-davis-transcript.md /.github/workflows/jekyll-gh-pages.yml: -------------------------------------------------------------------------------- 1 | # Sample workflow for building and deploying a Jekyll site to GitHub Pages 2 | name: Deploy Jekyll with GitHub Pages dependencies preinstalled 3 | 4 | on: 5 | # Runs on pushes targeting the default branch 6 | push: 7 | branches: ["master"] 8 | 9 | # Allows you to run this workflow manually from the Actions tab 10 | workflow_dispatch: 11 | 12 | # Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages 13 | permissions: 14 | contents: read 15 | pages: write 16 | id-token: write 17 | 18 | # Allow one concurrent deployment 19 | concurrency: 20 | group: "pages" 21 | cancel-in-progress: true 22 | 23 | jobs: 24 | # Build job 25 | build: 26 | runs-on: ubuntu-latest 27 | steps: 28 | - name: Checkout 29 | uses: actions/checkout@v3 30 | - name: Setup Pages 31 | uses: actions/configure-pages@v2 32 | - name: Build with Jekyll 33 | uses: actions/jekyll-build-pages@v1 34 | with: 35 | source: ./ 36 | destination: ./_site 37 | - name: Upload artifact 38 | uses: actions/upload-pages-artifact@v1 39 | 40 | # Deployment job 41 | deploy: 42 | environment: 43 | name: github-pages 44 | url: ${{ steps.deployment.outputs.page_url }} 45 | runs-on: ubuntu-latest 46 | needs: build 47 | steps: 48 | - name: Deploy to GitHub Pages 49 | id: deployment 50 | uses: actions/deploy-pages@v1 51 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | This is free and unencumbered software released into the public domain. 
2 | 3 | Anyone is free to copy, modify, publish, use, compile, sell, or 4 | distribute this software, either in source code form or as a compiled 5 | binary, for any purpose, commercial or non-commercial, and by any 6 | means. 7 | 8 | In jurisdictions that recognize copyright laws, the author or authors 9 | of this software dedicate any and all copyright interest in the 10 | software to the public domain. We make this dedication for the benefit 11 | of the public at large and to the detriment of our heirs and 12 | successors. We intend this dedication to be an overt act of 13 | relinquishment in perpetuity of all present and future rights to this 14 | software under copyright law. 15 | 16 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 17 | EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 18 | MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 19 | IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR 20 | OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, 21 | ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR 22 | OTHER DEALINGS IN THE SOFTWARE. 23 | 24 | 25 | For more information, please refer to 26 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Note from the [Editor](http://clarecorthell.org): `Take Two` 2 | 3 | In the old days of 2013, the OSDSM was born. Then, there were "little to no Data Scientists with 5 years experience, because the job simply did not exist." (_David Hardtke, Nov 2012_) Since then, history has witnessed many things, including: 4 | 5 | • Data Scientists working across industries and the world
6 | • social media manipulation disrupting many elections
7 | • BLM and #metoo and Extinction Rebellion and many other social movements
8 | • machine learning beginning to fall under the engineering domain
9 | • a pandemic
10 | • climate change disasters becoming very frequent while climate warms faster than predicted
11 | • remote work becoming common 12 | • multiple global recession shocks 13 | 14 | In that decade, Data Science has seen growth of jobs, shortfall of goals, success in many industries, abject failure in others, and nefarious use cases. In particular, [adverse consequences and complications of learning from data](http://machinebias.org) appear in too many examples: [elections undermined by psychographics](https://en.wikipedia.org/wiki/Facebook%E2%80%93Cambridge_Analytica_data_scandal), [dismal gender (Men=74%) and BIPOC diversity in the AI field](https://www.nytimes.com/2016/06/26/opinion/sunday/artificial-intelligences-white-guy-problem.html), a revived [eugenics](https://www.technologyreview.com/s/612275/sociogenomics-is-opening-a-new-door-to-eugenics/), an [explainability crisis](https://hbr.org/2018/07/when-is-it-important-for-an-algorithm-to-explain-itself), [facial recognition](https://www.theguardian.com/technology/2014/may/04/facial-recognition-technology-identity-tesco-ethical-issues) used to [identify people](https://onlinelibrary.wiley.com/doi/abs/10.1002/widm.1278) and systematically [detain them](https://www.buzzfeednews.com/article/meghara/china-new-internment-camps-xinjiang-uighurs-muslims), ["aggression" detection microphones in schools](https://features.propublica.org/aggression-detector/the-unproven-invasive-surveillance-technology-schools-are-using-to-monitor-students/), and many others. It has never been more clear that **we need to talk about the real world impacts of our work, and consider how our creations are used.** As you consider this, read a prescient [novel](https://library.oapen.org/bitstream/id/24cb1da5-a512-4de1-b24c-639b6452dbec/628778.pdf) that grapples with the consequences of birthing, of creation, of technology. 15 | 16 | Like any tool, data-driven technologies are indifferent to the morality of their ends. 
Perhaps the greatest risk of all is leaving this tool in the hands of the few expensively-educated people who cannot possibly represent all of us. To balance this, open source movements seek to lower the barriers to education for everyone. Data science and data literacy must be widespread, accessible, and leveraged for building our collective future. More than ever, we need that future to be built by members of society who are diverse and focused on generative, sustainable, resilient, [emergent](https://bookshop.org/a/2958/9781849352604) solutions. After all, the things we build are mirrors of ourselves (seriously, read Shelley's [Frankenstein](https://library.oapen.org/bitstream/id/24cb1da5-a512-4de1-b24c-639b6452dbec/628778.pdf)). 17 | 18 | > _Computers reflect the biases and belief systems of the people programming them_ -[@alicegoldfuss](https://twitter.com/alicegoldfuss/status/1016034359134941184) 19 | 20 | The OSDSM is built with the belief that **open source education makes diverse, collective, generative future-building possible.** I hope that you are one of the next people -- whether you call yourself a Data Scientist or not -- to help make better decisions with the scientific process, critical thinking, and everything else your unique perspective brings to the table. This rewritten curriculum focuses on what is needed to be successful in the entry-level role, but that is just a generic outline; truly, I hope where you take it extends far beyond that. 21 | 22 | *** 23 | 24 | Start here 👇 25 | 26 | ## The Open Source Data Science Masters 27 | 28 | The open-source curriculum for learning to be a Data Scientist. Curriculum resources from both universities and working Data Scientists focus on foundational theory and applied skills. The OSDSM is collectively-maintained and open to PRs. 
29 | 30 | The goal of this curriculum is to prepare the student for an entry-level Data Scientist role, using open source materials, at no cost but with the same caliber of materials found in the most reputable paid programs. Books not offered for free are often available through a public library; they are listed here with their current list price. The Masters is self-guided and self-accredited. To better support credibility, the structure now includes a Capstone project intended to demonstrate the student's problem solving approach, skills in execution, and communication. Upon completion, the student can award themselves a [Credential](https://help.accredible.com/add-your-credential-to-linkedin) on LinkedIn from the Open Source Data Science Masters. As with all things, the OSDSM is best played as a team sport (try finding people on [r/learndatascience](https://www.reddit.com/r/learndatascience/)). 31 | 32 | This is called a "Masters" because it is primarily concerned with "upper-level" college course material in mathematics, programming, economics, or related disciplines. Come as you are! 33 | 34 | 1. **📖 The Core** - This is a critical foundation for what is to come; don't skip the foundational lessons. 35 | 2. **❄️ Specialty** - Choose what is most interesting to you, or most relevant to the work you plan to do. 36 | 3. **🤝 Doing Data Science** - Learn about how doing science with others and for businesses can work. 37 | 4. **🧑‍💻 Capstone Project** - Choose a meaningful project or dataset to demonstrate what you've learned. 38 | 39 | 40 | ## 📖 The Core 41 | _This is a critical foundation for what is to come; don't skip!_ 42 | 43 | ### What is Data Science? 44 | One could argue that "Data Science" is a recent term for an already existing information analysis discipline. Humans instinctively search for patterns, a purpose we also see in this more digitized discipline. Read different sources (and search beyond this list) about the uses of data science. 
45 | - The Signal and The Noise / Nate Silver [Book ```$18```](https://bookshop.org/a/2958/9780143125082) -- Narrated cases of Data Science at play in the real world. 46 | - Dataclysm: Who We Are (When We Think No One's Looking) / Christian Rudder [Book ```$17```](https://bookshop.org/a/2958/9780385347396) -- From the inside of OKCupid, real examples of how data science can illustrate human behavior. 47 | - Informatics of the Oppressed / Rodrigo Ochigame [Logic Magazine](https://logicmag.io/care/informatics-of-the-oppressed/) -- _Algorithms of oppression have been around for a long time. So have radical projects to dismantle them and build emancipatory alternatives._ 48 | * A showcase of [Jupyter Python Data Analysis Notebooks](https://github.com/jupyter/jupyter/wiki) across disciplines. 49 | 50 | ### Foundations of Data Science 51 | 52 | #### Problem Solving 53 | When there are no answers in the back of the book, how do you proceed? Breaking down problems is a skill, one that can and should be learned. Follow Pólya's process, and for extra credit, seek out resources on computer science [decomposition](http://sites.fas.harvard.edu/~libs111/files/lectures/unit1-3.pdf). 54 | * Problem-Solving Heuristics "How To Solve It" George Pólya [Berkeley / Summary](https://math.berkeley.edu/~gmelvin/polya.pdf) [Book ```$18```](https://bookshop.org/a/2958/9780691164076) 55 | 56 | ### The Scientific Process & Experimentation 57 | It is crucial as a Data Scientist that you show integrity in and transparency of scientific process. Even if you've been here before, review and draw out the process diagram for the scientific method. 
58 | - [The Scientific Process](https://courses.lumenlearning.com/waymaker-psychology/chapter/reading-the-scientific-process-replace-content/) 59 | - [Research Design A Step-by-Step Guide](https://www.scribbr.com/methodology/research-design/) 60 | - [A Quick Guide to Experimental Design](https://www.scribbr.com/methodology/experimental-design/) 61 | 62 | #### Querying Data 63 | Get familiar and comfortable with manipulating data in a database with a common relational querying language. There are diverse query languages, but SQL is a widely used foundation. 64 | * SQL School [Mode Analytics / Tutorials](http://bit.ly/sqlschool) 65 | 66 | ### Math & Statistics 67 | 68 | #### Calculus 69 | * Single Variable Calculus [MIT OpenCourseWare](http://ocw.mit.edu/courses/mathematics/18-01-single-variable-calculus-fall-2006/) 70 | * Multivariable Calculus [MIT OpenCourseWare](http://ocw.mit.edu/courses/mathematics/18-02sc-multivariable-calculus-fall-2010/) 71 | 72 | #### Linear Algebra 73 | The foundational mathematics for working with large samples of data. Spend time in exercises until you feel highly confident in the key topics of Linear Algebra. It will serve you well. 
74 | * An Intuitive Guide to Linear Algebra [Better Explained / Article](https://betterexplained.com/articles/linear-algebra-guide/) 75 | * A Programmer's Intuition for Matrix Multiplication [Better Explained / Article](https://betterexplained.com/articles/matrix-multiplication/) 76 | * Vector Calculus: Understanding the Cross Product [Better Explained / Article](https://betterexplained.com/articles/cross-product/) 77 | * Vector Calculus: Understanding the Dot Product [Better Explained / Article](https://betterexplained.com/articles/vector-calculus-understanding-the-dot-product/) 78 | * Linear Algebra [Khan Academy / Videos](http://bit.ly/khanlinalg) 79 | * Linear Algebra [MIT](http://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/) 80 | 81 | #### Statistics 82 | How can we answer questions with data? Everywhere you look, you'll see methods from statistics. Spend a lot of time here! 83 | * Stats in a Nutshell [Book ```$46```](https://bookshop.org/a/2958/9781449316822) 84 | * Think Stats: Probability and Statistics for Programmers [Digital](http://bit.ly/ebook-thinkstats) & [Book ```$34```](https://bookshop.org/a/2958/9781491907337) 85 | * Think Bayes [Digital](http://bit.ly/ebook-thinkbayes) & [Book ```$25```](https://bookshop.org/a/2958/9781492089469) 86 | * Probabilistic Programming and Bayesian Methods for Hackers [Github / Tutorials](http://bit.ly/ipnb-probabilisticprogramming) 87 | 88 | ### Working in Python 89 | 90 | #### Learn Python 91 | If you're starting from scratch with Python, start with this series. 92 | * Learn Python the Hard Way [Digital](http://bit.ly/ebook-learnpyhardway) & [Book ```$37```](https://bookshop.org/a/2958/9780134692883) 93 | 94 | #### Environment & Libraries 95 | Set up your computer to use tools locally. 
96 | * Installing Basic Packages: [Python, virtualenv, NumPy, SciPy, matplotlib and IPython ](http://bit.ly/scientific-py-install) 97 | * For scientific uses: [Using Python Scientifically](http://bit.ly/lecture-scipy) & [Command Line Install Script](https://github.com/fonnesbeck/ScipySuperpack) for Scientific Python Packages 98 | 99 | #### Data Analysis 100 | Get familiar with using tools to do data analysis. Pro tip: Write out what you're going to do before you do it! When you hit a snag, return to your plan and rechart as necessary. 101 | * [pandas](http://bit.ly/py-pandas) tutorials 102 | * Pandas Cookbook [Examples](http://bit.ly/jvnspandascookbook) 103 | * Data Analysis in Python [Tutorial](http://bit.ly/mode-python-tutorials) 104 | * Big Data Analysis with Twitter [UC Berkeley / Lectures](http://bit.ly/cal-course-bigdatatwitter) 105 | * Intro to Data Science / [Course $0](https://www.udacity.com/course/intro-to-data-science--ud359) 106 | 107 | #### Python Programming + Algorithms 108 | How does a computer know what to do? Algorithms are instructions with a fancy name. Learn how instructions are encoded, how to think about structuring those instructions, and patterns for making it work in code. 109 | * Algorithms Design & Analysis I [Stanford / Coursera](http://bit.ly/coursera-algo) 110 | * [numpy Tutorial / Stanford CS231N](http://cs231n.github.io/python-numpy-tutorial/) 111 | 112 | ### Survey Courses 113 | Courses with many of the topics above included. Be sure you fill in any gaps! 
114 | - Intro to Data Science / University of Washington [Lectures](https://www.youtube.com/playlist?list=PLMiChZq0IHh1A5mz4o0T_vWXJnsi-7EY-) 115 | - (Short Survey) Doing Data Science: Straight Talk from the Frontline [O'Reilly / Book ```$50```](https://bookshop.org/a/2958/9781449358655) 116 | 117 | ## ❄️ Specialty: Choose 2 118 | _Choose what is most interesting to you, or most relevant to the work you plan to do._ 119 | 120 | ### Causation 121 | A branch of statistics that uses graphical models and specialized statistics to describe and model cause and effect. 122 | 123 | * The Book of Why [Book ```$17```](https://bookshop.org/a/2958/9781541698963) 124 | * Causal Inference in Statistics [Book ```$46```](https://bookshop.org/a/2958/9781119186847) 125 | 126 | ### Natural Language Processing 127 | The imperfect and immensely useful art (science?) of transforming human language into data. 128 | * From Languages to Information / Stanford CS124 [Materials](http://bit.ly/nlpcs124) 129 | * NLP with Python (NLTK library) [Digital](http://bit.ly/py-nltk), [Book ```$55```](https://bookshop.org/a/2958/9780596516499) 130 | * How to Write a Spelling Corrector / Norvig [Tutorial](http://norvig.com/spell-correct.html) 131 | 132 | ### Graph Analysis 133 | Human relationships can be modeled as a network or graph. Many other things suit this model, too. Working with graphs 134 | * Social and Economic Networks: Models and Analysis / [Stanford / Coursera](http://bit.ly/stanford-socialeconnetworks) 135 | * Social Network Analysis for Startups [Chapter 1](https://www.oreilly.com/library/view/social-network-analysis/9781449311377/ch01.html) & [Book ```$25```](https://bookshop.org/a/2958/9781449306465) [```networkx```](http://bit.ly/py-networkx) 136 | 137 | ### Machine Learning 138 | This is a huge space with infinite things to learn. For an advanced statistical foundation, see [The Elements of Statistical Learning](http://bit.ly/ebook-elemstatlearn). 
139 | * Intro to scikit-learn, SciPy2013 [youtube tutorials](http://bit.ly/scikit-video-tuts) 140 | * Machine Learning for Hackers [ipynb / digital book](http://bit.ly/mlforhackers) 141 | * Machine Learning [Ng Stanford / Coursera](http://bit.ly/stanford-ml) & [Stanford CS 229](http://bit.ly/stanfordcs229) 142 | * Programming Collective Intelligence [Book ```$46```](https://bookshop.org/a/2958/9780596529321) 143 | 144 | ### Visualization 145 | The most persuasive data stories are ones you can see with your own eyes. Make it visual! 146 | 147 | #### Courses 148 | * Data Visualization [University of Washington / Slides & Resources](http://bit.ly/uw-dataviz) 149 | * Rice University's Data Viz class [Rice University / Slides](http://bit.ly/riceu-viz) 150 | 151 | #### Books 152 | * Envisioning Information [Tufte / Book ```$36```](http://amzn.to/Sn0QI4) 153 | * Interactive Data Visualization for the Web / Scott Murray [Online Book](http://bit.ly/interactive-data-viz-web) & [Book `$50`](https://bookshop.org/a/2958/9781491921289) 154 | 155 | ### Linear Programming + Convex Optimization 156 | If you have interest in operations management, manufacturing, supply chains, or other real world queuing problems, dig in here. 
157 | * Linear Programming (Math 407) [University of Washington / Course](http://bit.ly/course-uw-linearprogramming) 158 | * Convex Optimization / Boyd [Stanford / Lectures](http://stanford.edu/class/ee364a/index.html) / [Book](http://stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf) 159 | 160 | ### Deep Learning / Neural Networks 161 | * Neural Networks [Andrej Karpathy / Python Walkthrough](http://bit.ly/karpathyneuralnets) 162 | * Neural Networks for Machine Learning [Geoffrey Hinton / U Toronto](https://www.youtube.com/playlist?list=PLoRl3Ht4JOcdU872GhiYWf6jwrk_SNhz9) 163 | * Deep Learning for Natural Language Processing CS224d [Stanford](http://cs224d.stanford.edu/syllabus.html) 164 | 165 | ## 🤝 Doing Data Science 166 | _Learn about how doing science with others and for businesses can work._ 167 | 168 | ### What is the job? 169 | 170 | In ideal terms, a Data Scientist advises strategic decision-making using data-backed analysis and tested hypotheses. YMMV as this depends on the company's needs and the team being supported. 171 | 172 | * What Professional Data Scientists Actually Do [Video](https://youtu.be/iPdO9MwdcLE) 173 | * The Data Science Handbook: Advice and Insights from 25 Amazing Data Scientists [Book ```$25```](http://amzn.to/1J7lILJ) 174 | * Required reading: Why might machine learning not always be the best approach? [Machine Learning: The High-Interest Credit Card of Technical Debt](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43146.pdf) 175 | 176 | ### Communication and Teamwork 177 | 178 | For a Data Scientist's work to be impactful, they must be effective at communicating their work and findings. In any setting, clear logic and effective business writing are crucial to reaching your audience. And of course, doing Data Science with a team over Zoom is different from being in person in an office. There is much more written communication and asynchronous consumption of content in the remote office environment. 
More than ever, writing and communication skills are crucial to being an effective Data Scientist for yourself and your team. 179 | 180 | * LEADERSHIP LAB: The Craft of Writing Effectively [UChicago / Video](https://buomsoo-kim.github.io/learning/2020/03/30/Craft-of-writing-effectively.md/). Recommend watching this twice and taking notes. 181 | 182 | #### The Data Scientist works in a Team 183 | In the modern organization, it is very rare that a Data Scientist works in isolation. Communicating the value of the work being done is crucial to getting buy-in from partners whose decisions and operations depend on your work. Those partners might be: 184 | - Product Management 185 | - Engineering 186 | - Design (User Experience, Research, Product) 187 | - Operations (Project Management, Customer Service Agents, Data Management) 188 | - Marketing 189 | - Finance Operations 190 | - etc. 191 | 192 | Typically, the more clearly you are able to communicate the "why", the **value of what you are doing**, the more these teams will be able to support you and your work in conversations you may not be a part of. Even if others don't understand "how" you do your work (which is very important to you and your manager!), they will be able to understand and repeat a well-communicated "why". This is why we write Specs: to get buy-in and allow for questions or input before the work starts. 193 | 194 | #### The Spec 195 | A document conveying the motives, direction, investment, and expected value of the work. 196 | 197 | * Goal / "Why" -- What is the point of this work? What decision is the organization trying to make? 198 | * Impact -- What decisions might be made differently as a result of this work? What is the expected value? 199 | * Data -- What evidence will this draw on? 200 | * Assumptions -- What evidence does not exist? What assumptions are necessary or agreed upon? 201 | * Methods / "How" -- Overview of the methods expected to be used. Analysis, with what tools? 
Experimentation, with what methodology? 202 | * Results -- (to be filled in as completed) 203 | 204 | #### Results Presentation 205 | A slide deck or document with the goal of conveying the results of the work and how the findings support an important decision(s). 206 | 207 | Best appended to the Spec, and summarized in a slide deck for easy consumption. Depending on the culture of the group, slides or a short document may be easier to look through to understand the results of the work. In the remote work era, think about how your work will be passed around and make sure your "above the fold" is easy to understand and clearly conveys the "why" and results in particular. 208 | 209 | __Example__: A particularly polished [presentation](https://medium.com/lyft-engineering/how-lyft-discovered-openstreetmap-is-the-freshest-map-for-rideshare-a7a41bf92ec) of [map quality study results](https://drive.google.com/file/d/1Sb-dOUjeP1Ljqz4ra931D3Pe8B5C3pde/view) showing higher data quality in US maps on OSM than commercially available alternatives. The impact of this work was a) increased confidence in service reliability for the company and b) enabled the company to decide against buying a commercially available annual license costing millions of dollars annually. 210 | 211 | ## 🧑‍💻 Capstone Project 212 | _Choose a meaningful project or dataset to demonstrate what you've learned._ 213 | 214 | ### Pick a dataset that you care about 215 | * The very detailed [data is plural](https://www.data-is-plural.com/) 216 | * Collection of [datasets](http://bit.ly/osdsm-datasets-link) 217 | 218 | ### Formulate a Hypothesis & Write a Spec 219 | Review the earlier reading on [The Scientific Process](https://courses.lumenlearning.com/waymaker-psychology/chapter/reading-the-scientific-process-replace-content/). Formulate a clear, concise hypothesis. This is the headliner of your [Spec](#the-spec), flesh that out. 
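One way to start fleshing out the Spec is to treat it as a small template you fill in before the analysis begins. The sketch below is only an illustration: the `Spec` class, its field names, and the example content are hypothetical and not part of the OSDSM. It mirrors the sections listed under [The Spec](#the-spec) and puts the hypothesis at the top.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Spec:
    """Hypothetical helper (not part of the OSDSM) mirroring the Spec sections."""

    hypothesis: str  # the headliner: one clear, concise statement
    goal: str        # Goal / "Why"
    impact: str      # decisions that might change, expected value
    data: str        # evidence the work will draw on
    assumptions: List[str] = field(default_factory=list)
    methods: str = "(to be decided)"
    results: str = "(to be filled in as completed)"

    def to_markdown(self) -> str:
        """Render the Spec as a Markdown document, hypothesis first."""
        assumptions = "\n".join(f"* {a}" for a in self.assumptions) or "* (none stated yet)"
        sections = [
            f"# Spec: {self.hypothesis}",
            f'## Goal / "Why"\n{self.goal}',
            f"## Impact\n{self.impact}",
            f"## Data\n{self.data}",
            f"## Assumptions\n{assumptions}",
            f'## Methods / "How"\n{self.methods}',
            f"## Results\n{self.results}",
        ]
        return "\n\n".join(sections) + "\n"


# Made-up example content, loosely based on the migration and climate project prompt.
draft = Spec(
    hypothesis="Metro areas with milder average temperatures gained more residents",
    goal="Decide whether climate belongs in a migration model.",
    impact="Changes which features are collected for the Capstone analysis.",
    data="Migration counts and average temperatures per metro area.",
    assumptions=["Temperature is a reasonable first proxy for climate."],
)
print(draft.to_markdown())
```

Keeping the Results section in the template from the start makes it easy to append findings to the same document later, which is how the Results Presentation section suggests sharing them.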
220 | 221 | ### Show your work + Explain why you chose this project 222 | Show the process you used to disprove your hypothesis, preferably in a jupyter notebook. See [examples](https://github.com/jupyter/jupyter/wiki) to get a taste of how you can showcase your work. 223 | 224 | ### Graduate! 225 | 1. Create a document or github repo showcasing the list of courses and materials you completed. Include your project materials. Also recommended: include a personal statement about why you chose this course of study and what you seek to do with it. 226 | 2. Award yourself a [Credential](https://help.accredible.com/add-your-credential-to-linkedin) on LinkedIn from _The Open Source Data Science Masters_, with a link to the documentation you created. 227 | 3. Congratulations! 🎉 228 | 229 | *** 230 | ### So Extra "Extracurriculars" 231 | * The Elements of Statistical Learning / Stanford [Digital](http://bit.ly/ebook-elemstatlearn) & [Book ```$90```](https://bookshop.org/a/2958/9780387848570) & [Study Group](http://www.reddit.com/r/eosl) 232 | * Python Data Science Handbook: Essential Tools for Working with Data [Book ```$60```](https://bookshop.org/a/2958/9781491912058) 233 | * The Manga Guide to Linear Algebra [Book ```$19```](https://bookshop.org/a/2958/9781593274139) 234 | * Mining The Social Web [Book ```$46```](https://bookshop.org/a/2958/9781491985045) 235 | * The Truthful Art: Data, Charts, and Maps for Communication [Cairo / Book ```$50```](https://bookshop.org/a/2958/9780321934079) 236 | * Exploratory Data Analysis [Tukey / Book ```$81```](http://amzn.to/1kNUEPa) [```$113```](https://bookshop.org/books/exploratory-data-analysis-classic-version/9780134995458) 237 | * Mining Massive Data Sets / Stanford [Course & Digital Textbook](http://bit.ly/mmds-course) & [Book ```$58```](https://bookshop.org/a/2958/9781108476348) 238 | * Introduction to Information Retrieval / Stanford [Digital](http://bit.ly/ebook-stanford-inforetrieval) & [Book 
```$70```](https://bookshop.org/a/2958/9780521865715) 239 | * Probabilistic Graphical Models [Stanford / Coursera](http://bit.ly/stanford-pgm) 240 | * Differential Equations in Data Science [Python Tutorial](https://web.archive.org/web/20190617023702/https://nbviewer.jupyter.org/github/URXtech/techblog/blob/master/continuousTimeMarkovChain/markovChain.ipynb) 241 | * Algorithm Design, Kleinberg & Tardos [Book ```$125```](http://amzn.to/1iMnWm5) 242 | * [Tidy Data in Python](http://www.jeannicholashould.com/tidy-data-in-python.html) 243 | * Designing, Visualizing and Understanding Deep Neural Networks [Berkeley CS294-129](https://bcourses.berkeley.edu/courses/1453965/pages/cs294-129-designing-visualizing-and-understanding-deep-neural-networks) 244 | * Python for Data Analysis [Book ```$55```](https://bookshop.org/a/2958/9781491957660) 245 | * Think Python [Digital](http://bit.ly/ebook-thinkpy) & [Book ```$45```](https://bookshop.org/a/2958/9781491939369) 246 | * The Visual Display of Quantitative Information [Tufte / Book ```$27```](http://amzn.to/1q5FB91) 247 | * Information Dashboard Design: Displaying Data for At-a-Glance Monitoring [Stephen Few / Book ```$29```](https://bookshop.org/a/2958/9781938377006) 248 | * D3 Library / Scott Murray [Blog / Tutorials](https://alignedleft.com/tutorials) 249 | * SQL Tutorials [SQLZOO / Tutorials](http://bit.ly/tut-sqlzoo) 250 | * Machine Learning [Caltech / Edx](http://bit.ly/caltech-ml) 251 | * A Course in Machine Learning [UMD / Digital Book](http://bit.ly/22WyV3N) 252 | * Designing Data Intensive Applications [Book ```$56```](https://bookshop.org/a/2958/9781449373320) 253 | 254 | *** 255 | 256 | #### `Take Two` Change Log 257 | 1. Restructured ala the [2022 Plan](https://docs.google.com/presentation/d/18iSlwSG6F57URqIl47KUmEJCtvS_x4Uxp71caY0-wp8/edit#slide=id.g1196d9e5821_0_0). 258 | 2. Pruned broken links. It's been a while, and some of these resources have moved -- or worse -- been taken down. 259 | 3. 
Pared down links to a more opinionated list. 260 | 4. Proceeds. [Bookshop.org](http://bookshop.org) links for all books, which supports independent bookshops with commissions. Since the first commits in 2014, I have donated any related commissions to [Planned Parenthood](https://www.plannedparenthood.org/), which was one of the few healthcare providers in my community growing up and is the largest single provider of reproductive health services in the US. Though donations should flow to independent bookshops from now on, my personal commitment to PP remains. 261 | 262 | Please Contribute; **this is Open Source!** 263 | 264 | Fearless Maintainer: [@clarecorthell](http://bit.ly/clarecorthelltwitter) 265 | 266 | _RIP [v1.0 commit](https://github.com/datasciencemasters/go/tree/d6cec020ac3d038cd787e9a779a3cea188c779f2)_ 267 | -------------------------------------------------------------------------------- /datasets.md: -------------------------------------------------------------------------------- 1 | ### All The Datasets 2 | 3 | #### Machine Learning 4 | 5 | * [UCI Machine Learning Dataset Repository](https://archive.ics.uci.edu/ml/datasets.html) 6 | * [Machine Learning Dataset Repository](http://mldata.org/) 7 | 8 | #### Deep Learning 9 | 10 | * [Deep Learning Datasets](http://deeplearning.net/datasets/) for benchmarking deep learning algorithms 11 | 12 | #### Clean Sample Data (for Learning New Techniques) 13 | 14 | * [Scikit-learn sample datasets](http://scikit-learn.org/stable/datasets/index.html) 15 | * [Statsmodels datasets](http://statsmodels.sourceforge.net/devel/datasets/index.html) 16 | 17 | 18 | #### Networks 19 | * [Stanford Network Analysis Project](https://snap.stanford.edu/) 20 | 21 | #### Raw Dataz 22 | 23 | * [OpenFlights Airports Database](http://openflights.org/data.html) - Airport, airline and route data; 6977 airports 24 | * [The Guardian / Datasets](http://www.theguardian.com/news/datablog/interactive/2013/jan/14/all-our-datasets-index) 25 | * 
[Quandl](http://www.quandl.com) provides a lot of interesting data with a clean API. 26 | * [Time Series Data Library](http://datamarket.com/data/list/?q=provider:tsdl) 27 | * USA Congressional Voting Records [Voteview](http://voteview.org/downloads.asp) 28 | * [Traffic data](https://github.com/graphhopper/open-traffic-collection) 29 | 30 | ### Dataset Sources 31 | 32 | * [NIPS Feature Selection](http://www.nipsfsc.ecs.soton.ac.uk/datasets/) 33 | 34 | ### Curated Lists of Datasets 35 | 36 | * [@hmason's](https://twitter.com/hmason) curated dataset list [bit.ly](https://bitly.com/bundles/hmason/1) 37 | 38 | ### Sentiment Analysis 39 | * https://github.com/acquayefrank/sanders-twitter 40 | * https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data 41 | * http://ai.stanford.edu/~amaas/data/sentiment/ 42 | -------------------------------------------------------------------------------- /projects.md: -------------------------------------------------------------------------------- 1 | _These projects enable students to get hands-on with data through interesting and impactful projects. Projects contain a thesis or prompt, strengths focus, suggested tools and data sources._ 2 | 3 | ## Types of Projects 4 | - Tools that solve common problems in Data Science (software engineering) 5 | - Analytical and predictive projects (pivot, correlation, testing) 6 | - Theory-heavy projects (causal modeling, anomaly detection) 7 | - Structure discovery (flow, graphs, topics) 8 | 9 | ## Projects 10 | 11 | ### 1) Memes and Twitter 12 | 13 | **What is the half-life of a meme over time?** This project focuses on creative data sourcing, engagement measurement, and leveraging your own beliefs about socio-cultural artifacts. 14 | 15 | **Toolkit** Python `nltk`, Twitter API 16 | 17 | - Start with trends.google.com and seek out a few "discrete cultural islands" that you find interesting. Search multiple memes and overlay them on each other. 
This visualization style may influence your own visualization styles. 18 | - Use the Twitter API to gather engagement around a handful of memes. Dig into knowyourmeme to find a few new ones. Iterate on your definition of engagement based on your intuition. Look for memes that have peaked at different points in the last ten years, then plot the engagement with each over time. 19 | - What other types of cultural media exhibit this half-life type of engagement cycle? 20 | - Is there a difference between memes that are "funded" (movies, songs, books, etc.) and non-economically-driven memes (i.e., simple, easily mutable image memes like "Good Guy Greg", "Bad Luck Brian", and "Moth meme")? 21 | - How many variations of a meme appeared? Does that relate to its popularity over time? 22 | 23 | 24 | ### 2) Neurodiversity in Literature 25 | Google Trends shows us how relevant a term is over time, but what about a phenomenon that has been named more recently? Many psychiatric phenomena have existed longer than we've "known" about them, but they may have been described long before they were named. 26 | 27 | **Toolkit** Python `nltk`, Gutenberg Texts 28 | 29 | **What fictional characters exhibit psychiatric illness?** This project focuses on semantic analysis, natural language, and natural language in vector spaces. 30 | - Search trends.google.com for "autism." When does it appear to show up? What can you learn about the point where medicine began describing the phenomenon? 31 | - Read about the history of autism, schizophrenia, depression, and bipolar disorder on Wikipedia to learn how medicine has come to understand mental illness and non-neurotypicality, and the people who experience them, over time. Pay special attention to how a non-neurotypical person was described in various time periods. 32 | - Download a sampling of literature in .txt from Gutenberg texts.
Write a Python script that prototypes a rules-based matcher for the descriptors you found in Wikipedia articles about a particular mental disorder or non-neurotypical behavior. NLTK is a great tool for iterating on this problem. 33 | - [Nate to complete suggested next steps / guide] 34 | 35 | 36 | ### 3) Migration and Climate in the United States 37 | We know two things: people migrate from place to place in the United States, and the climate in every part of the US changes over time. **How does the climate relate to migration?** This project will require you to be thoughtful about building tools to help you parse lat/longs and manipulate data without full support from libraries. 38 | 39 | **Toolkit** Python 40 | 41 | - Download Tiger migration data [Nate to help rewrite this] and average temperature of MSAs [Census Bureau?] 42 | - What seems to be the "correct" climate? Does this differ by demographic (age, income, education, ethnic affiliation)? 43 | - Where have people migrated over the last 30 years, and in what direction? Do you believe it will continue? 44 | - How do you believe people's lives have changed over time as a result of changes in climate across the country? Do you believe there is a causal connection between movement of people and climate? How might you try to prove that it is causal or not causal? 45 | - (Harder Task) What do you believe the next timestep will look like? What cities will grow? 46 | - Search for tools online that enable you (for free) to visualize your results geospatially. Figure out how to produce data in a format that can be visualized, and produce your map(s). 47 | 48 | 49 | 50 | 51 | -------------------------------------------------------------------------------- /r-resources.md: -------------------------------------------------------------------------------- 1 | _[Note: The core of The Open Source Data Science Masters focuses on programmatic problem solving in Python.
This is dedicated to Data Science resources using R, a valuable and powerful technology for investigation and analysis.]_ 2 | 3 | ### R 4 | 5 | [The R Project for Statistical Computing / Software](http://www.r-project.org/) 6 | 7 | [Learn Data Science with R / Tutorials](https://www.datacamp.com/courses) 8 | 9 | #### Basic R 10 | 11 | * R in a Nutshell [O'Reilly / Book ```$41```](http://amzn.to/1s54OBf) 12 | * Software Design: The Art of R Programming [O'Reilly / Book ```$23```](http://amzn.to/1mqzpWw) 13 | 14 | #### Basic Statistics with R 15 | 16 | * An Introduction to Statistical Learning [Book pdf](https://www.statlearning.com/) ^also a Machine Learning resource 17 | 18 | #### Data Science with R 19 | * Introduction to Data Science [Syracuse University / ebook](http://jsresearch.net/index.html) 20 | * Learn R & Become a Data Analyst [Tutorial](https://www.datacamp.com/) 21 | * Doing Data Science: Straight Talk from the Frontline [O'Reilly / Book ```$25```](http://amzn.to/1vAIscK) 22 | * Practical Data Science with R [Manning Publications / Book ```$49.99```](http://www.manning.com/zumel/) 23 | 24 | #### Statistical Learning with R 25 | 26 | * Statistical Learning [Stanford / OpenEdX Course](https://class.stanford.edu/courses/HumanitiesScience/StatLearning/Winter2014/about) 27 | 28 | #### Forecasting with R 29 | 30 | * Forecasting: Principles and Practice *(Regression, Time Series, Forecasting)* [Monash University / Book](http://otexts.com/fpp/) 31 | 32 | #### Visualization with R 33 | 34 | * Viz and Elegant Graphics in R: ggplot2 [Springer / Book ```$65```](http://amzn.to/1fZMXVd) 35 | 36 | #### Machine Learning with R 37 | 38 | * Guide to Getting Started in Machine Learning [Tutorial](http://abeautifulwww.com/2009/10/11/guide-to-getting-started-in-machine-learning/) 39 | * Machine Learning in R [Tutorial](http://blog.revolutionanalytics.com/2009/09/machine-learning-in-r-in-a-nutshell.html) 40 | 41 | #### R Libraries 42 | 43 | * Natural Language Toolkit 
[OpenNLP](http://cran.r-project.org/web/packages/openNLP/index.html) 44 | * Text Mining [tm](http://cran.r-project.org/web/packages/tm/index.html) 45 | * Basic Viz [wordcloud](http://cran.r-project.org/web/packages/wordcloud/index.html) 46 | * Network Modeling & Viz [igraph](http://cran.r-project.org/web/packages/igraph/index.html) 47 | * Basic Machine Learning [e1071](http://cran.r-project.org/web/packages/e1071/index.html) 48 | * Kernel Method [kernlab](http://cran.r-project.org/web/packages/kernlab/index.html) 49 | * Chinese Language Processing [Rwordseg](http://jliblog.com/app/rwordseg) 50 | * Chinese Weibo Analysis [Rweibo](http://jliblog.com/app/rweibo) 51 | 52 | #### R Datasets 53 | 54 | * [Rdatasets](http://vincentarelbundock.github.io/Rdatasets/) 55 | 56 | #### R Blogs & Media 57 | 58 | * [R-bloggers](http://www.r-bloggers.com/) R news and tutorials contributed by (452) R bloggers 59 | #### R Interactive Visualizations 60 | * [Shiny](http://shiny.rstudio.com/) Interactive web application framework for R 61 | -------------------------------------------------------------------------------- /scott-davis-transcript.md: -------------------------------------------------------------------------------- 1 |

Scott Davis Transcript

2 |

Open Source Data Science Masters

3 | 4 |
I'm going to have some time for independent study this year so I plan to work through as much as possible. I work in the real estate industry and we have so much data that isn't used for meaningful analysis, and the tools, though readily available, haven't caught up with the needs of real estate users. That's what I'm interested in working on. I use a lot of GIS and R, so my curriculum is tailored to follow [R](https://www.r-project.org/)/[Python](https://www.python.org) and [QGIS](https://www.qgis.org). I'm a bit of an open-source nut so I like learning much better this way. I'm looking for people to connect with, and possibly to work on projects.
5 | 6 | Want to collaborate? Get in touch: 7 | * [linkedin](http://www.linkedin.com/in/scottcdavis); 8 | * [twitter](http://www.twitter.com/scottdavisCRE); or 9 | * [email](mailto:scott@tisonadevelopment.com) 10 | 11 | 12 |

Open Source Curriculum

13 |

Base Introduction

14 | Data Science Introductions 15 | - [ ] Intro to Data Science by UW / Coursera, online course 16 | - [ ] Data Science Specialization by Johns Hopkins / Coursera 17 | - [X] [Data Scientists Toolbox](https://www.coursera.org/account/accomplishments/certificate/UY4EBM46HL) 18 | - [X] [R Programming](https://www.coursera.org/account/accomplishments/records/Va5vuEvGKyr7UyHEL) 19 | - [X] [Getting and Cleaning Data](https://www.coursera.org/account/accomplishments/records/ENSGmvNfx24sANRW) 20 | - [X] [Exploratory Data Analysis](https://www.coursera.org/account/accomplishments/records/2PPsRu2Us3sUehBQ) 21 | - [X] [Reproducible Research] 22 | - [ ] [Statistical Inference] (in progress) 23 | - [ ] [Regression Models] (in progress) 24 | - [X] [Practical Machine Learning] 25 | - [ ] [Developing Data Products] 26 | - [ ] [Data Science Capstone] 27 | - [ ] [Data Science by Harvard](http://cs109.github.io/2015/) (online course) 28 | - [ ] [Data Science with Open Source Tools](http://shop.oreilly.com/product/9780596802363.do) 29 | - [50 Years of Data Science](http://pages.cs.wisc.edu/~anhai/courses/784-fall15/50YearsDataScience.pdf) 30 | - [ ] [Datasmart](http://www.amazon.com/Data-Smart-Science-Transform-Information/dp/111866146X/ref=sr_1_1?s=books&ie=UTF8&qid=1458768727&sr=1-1&keywords=datasmart) - in Excel, but also works in LibreOffice and so much of business analytics is still in Excel. 31 | 32 | 33 |

Mathematics/Statistics

34 | - [ ] [Statistics for Spatial Data, Revised Edition](http://www.wiley.com/WileyCDA/WileyTitle/productCd-1119114616.html) 35 | - [ ] [Statistics for Spatio-Temporal Data](http://www.wiley.com/WileyCDA/WileyTitle/productCd-EHEP002348.html) 36 | - [ ] [Linear Algebra](http://www.amazon.com/Linear-Algebra-Dover-Books-Mathematics/dp/048663518X) 37 | - [ ] Problem-Solving Heuristics: [How to Solve It](http://www.amazon.com/How-Solve-It-Mathematical-Princeton/dp/069111966X) 38 | 39 |

Computing

40 | R: 41 | - [ ] [R in Action](https://www.manning.com/books/r-in-action-second-edition?a_bid=5c2b1e1d&a_aid=RiA2ed) 42 | - [ ] [R Cookbook](http://shop.oreilly.com/product/9780596809164.do) 43 | - [ ] [Forecasting: Principles and Practice](http://otexts.com/fpp/) 44 | 45 | R Libraries/Task Views 46 | * [ProjectTemplate](http://projecttemplate.net/index.html) 47 | * Spatial Data [CRAN Task View: Analysis of Spatial Data](https://cran.r-project.org/web/views/Spatial.html) 48 | * Spatio-Temporal Data [CRAN Task View: Handling and Analyzing Spatio-Temporal Data](https://cran.r-project.org/web/views/SpatioTemporal.html) 49 | * Optimization [CRAN Task View: Optimization and Mathematical Programming](https://cran.r-project.org/web/views/Optimization.html) 50 | * Finance [CRAN Task View: Empirical Finance](https://cran.r-project.org/web/views/Finance.html) 51 | 52 | Python: 53 | - [ ] [Dive Into Python](http://www.diveintopython.net/) 54 | - [ ] [Google's Python Class](code.google.com/edu/languages/google-python-class/) 55 | - [ ] [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do) 56 | - [ ] [Webscraping with Python](https://www.packtpub.com/big-data-and-business-intelligence/web-scraping-python) 57 | 58 | QGIS: 59 | - [X] [QGIS Tutorials and Tips](http://www.qgistutorials.com/en/) 60 | - [X] [Mastering QGIS](https://www.packtpub.com/application-development/mastering-qgis) 61 | - [ ] [Building Mapping Applications with QGIS](https://www.packtpub.com/application-development/building-mapping-applications-qgis) 62 | - [ ] [GIS Tutorial Workbook 1](https://esripress.esri.com/display/index.cfm?fuseaction=display&websiteID=232&moduleID=1) This is for ArcView, but you can work the examples in QGIS too 63 | - [ ] [GIS Tutorial Workbook 2: Spatial Analysis](https://esripress.esri.com/display/index.cfm?fuseaction=display&websiteID=230&moduleID=0) This is for ArcView, but you can work the examples in QGIS too 64 | - [ ] [QGIS Map 
Design](https://locatepress.com/qmd) I've just thumbed through this, but it's beautiful and belongs on any list of GIS books. 65 | 66 | MySQL: 67 | - [ ] [Learn MySQL in One Video](https://www.youtube.com/watch?v=yPu6qV5byu4) 68 | - [ ] [MySQL Workbench Starter](code.google.com/edu/languages/google-python-class/) 69 | 70 | Octave: 71 | - [ ] [GNU Octave Beginners Guide](https://www.packtpub.com/big-data-and-business-intelligence/gnu-octave-beginners-guide) 72 | - 73 | PostGIS/PostGRESQL: 74 | - [ ] [PostGIS Essentials](https://www.packtpub.com/big-data-and-business-intelligence/postgis-essentials) 75 | - [ ] [PostGRESQL Tutorial](http://www.postgresqltutorial.com/) 76 | - [ ] [PostgreSQL: Up and Running: A Practical Introduction to the Advanced Open Source Database](http://shop.oreilly.com/product/0636920032144.do) 77 | 78 |

Algorithms

79 | - [ ] [Algorithms Design & Analysis](http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=IntroToAlgorithms) Stanford openclassroom 80 | 81 |

Distributed Computing Paradigms

82 | - [ ] Intro to Hadoop and MapReduce by Cloudera and Udacity 83 | *Note: I might swap the above course with an EdX course on Apache Spark and distributed computing* 84 | 85 |

Data Mining

86 | - [ ] Mining Massive Data Sets, by Stanford and Coursera 87 | - [ ] [Clean Data](https://www.packtpub.com/big-data-and-business-intelligence/clean-data) 88 | 89 |

Machine Learning/Predictive Analytics - Foundational/Theoretical/Practical

90 | - [ ] Machine Learning, by Ng Stanford and Coursera (NB this class requires a lot of higher level math) 91 | - [ ] [An Introduction to Statistical Learning with Applications in R](http://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/) (by the authors of The Elements of Statistical Learning at Stanford.) 92 | - [ ] [Machine Learning with R](https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-r-second-edition) 93 | - [ ] [Building a Recommendation System in R](https://www.packtpub.com/big-data-and-business-intelligence/building-recommendation-system-r) 94 | - [ ] [Mastering Predictive Analytics in R](https://www.packtpub.com/application-development/mastering-predictive-analytics-r) 95 | - [ ] [Bootstrapping Machine Learning](http://www.louisdorard.com/machine-learning-book/) 96 | - [ ] [Applied Predictive Modeling](http://www.amazon.com/gp/product/1461468485?psc=1&redirect=true&ref_=oh_aui_detailpage_o08_s00) 97 | 98 |

Analysis

99 | - [ ] [Practical Data Science Cookbook](https://www.packtpub.com/big-data-and-business-intelligence/practical-data-science-cookbook) 100 | - [ ] [R Data Analysis Cookbook](http://www.amazon.com/Data-Analysis-Cookbook-Recipes-Deliver/dp/1783989068) 101 | 102 |

Spatial Analysis

103 | - [ ] [An Introduction to R for Spatial Analysis and Mapping](https://us.sagepub.com/en-us/nam/an-introduction-to-r-for-spatial-analysis-and-mapping/book241031) 104 | - [ ] [Applied Spatial Data Analysis with R](http://www.springer.com/us/book/9781461476177) 105 | 106 |

Land Use/Transport/Gravity Modeling

107 | - [ ] [Integrated Land Use and Transport Modelling: Decision Chains and Hierarchies](http://www.amazon.com/gp/product/0521022177?psc=1&redirect=true&ref_=oh_aui_detailpage_o03_s00) 108 | - [ ] [Gravity and Spatial Interaction Models (Scientific Geography Series)](http://www.amazon.com/gp/product/0803925441?psc=1&redirect=true&ref_=oh_aui_detailpage_o06_s00) 109 | - [ ] [TRANUS Model](http://www.tranus.com/tranus-english) 110 | - [ ] [Urban Sim](https://pypi.python.org/pypi/urbansim) 111 | - [ ] [Huff-tools Package in R](http://rstudio-pubs-static.s3.amazonaws.com/42357_1e6fcc5bcfec439096eb86a106ebf22e.html) 112 | - 113 |

Data Design/Data Viz

114 | - [ ] [Beautiful Evidence](http://www.edwardtufte.com/tufte/books_be) 115 | - [ ] [Semiology of Graphics](http://www.amazon.com/Semiology-Graphics-Diagrams-Networks-Maps/dp/1589482611) 116 | - [ ] [Visual Complexity Mapping Patterns of Information](http://www.visualcomplexity.com/vc/book/) 117 | - [ ] [The Visual Display of Quantitative Information](http://www.edwardtufte.com/tufte/books_vdqi) 118 | - [ ] [Design for Information](http://isabelmeirelles.com/book-design-for-information/) 119 | - [ ] [Design Elements: A Graphical Style Manual](http://www.amazon.com/Design-Elements-Graphic-Style-Manual/dp/1592532616) 120 | - [ ] [Storytelling with Data](http://www.amazon.com/gp/product/1119002257?psc=1&redirect=true&ref_=oh_aui_detailpage_o09_s00) 121 | - [ ] [Mastering Python Data Visualization](https://www.packtpub.com/big-data-and-business-intelligence/mastering-python-data-visualization) 122 | - [ ] [The Grammar of Graphics](http://www.springer.com/us/book/9780387245447) 123 | - [ ] [R Graphics Cookbook](http://shop.oreilly.com/product/9780596809164.do) 124 | 125 |

Relevant prior studies

126 | - [X] MS in Community and Regional Planning, UT-Austin 127 | - [X] BA in Liberal Arts, concentration in geography, UT-Austin 128 | 129 |

OpenSource Data Science Masters Capstone Project

130 | I'm interested in using data science approaches for better intelligence behind real estate decisions, specifically evaluating population growth, transactions and location decisions. I'd also like to evaluate statistical learning techniques to make better pricing decisions. Finally, I'd like to develop a model to optimize real estate portfolios. 131 | 132 | If you'd like to pair up for the capstone, [let me know](http://www.twitter.com/scottdavisCRE) 133 | 134 | -------------------------------------------------------------------------------- /transcripts/scott-davis-transcript.md: -------------------------------------------------------------------------------- 1 |

Scott Davis Transcript

2 |

Open Source Data Science Masters

3 | 4 | I'm going to have some time for independent study this year so I plan to work through as much as possible. I work in the real estate industry and we have so much data that isn't used for meaningful analysis, and the tools, though readily available, haven't caught up with the needs of real estate users. That's what I'm interested in working on. I use a lot of GIS and R, so my curriculum is tailored to follow [R](https://www.r-project.org/)/[Python](https://www.python.org) and [QGIS](https://www.qgis.org). I'm a bit of an open-source nut so I like learning much better this way. I'm looking for people to connect with, and possibly to work on projects. Also, this is maybe not technically purely open source, as I've used a lot of books, which I've linked to here. 5 | 6 | Want to collaborate? Get in touch: 7 | * [linkedin](http://www.linkedin.com/in/scottcdavis); 8 | * [twitter](http://www.twitter.com/scottdavisCRE); or 9 | * [email](mailto:scott@tisonadevelopment.com) 10 | 11 | 12 |

Open Source Curriculum

13 |

Base Introduction

14 | Data Science Introductions 15 | - [X] [Data Science with Open Source Tools](http://shop.oreilly.com/product/9780596802363.do) 16 | - [X] [Data Science from Scratch](http://shop.oreilly.com/product/0636920033400.do) 17 | - [X] [50 Years of Data Science](http://pages.cs.wisc.edu/~anhai/courses/784-fall15/50YearsDataScience.pdf) 18 | - [X] [Datasmart](http://www.amazon.com/Data-Smart-Science-Transform-Information/dp/111866146X/ref=sr_1_1?s=books&ie=UTF8&qid=1458768727&sr=1-1&keywords=datasmart) - This book is a thorough review of using Excel for data science tools. Every aspiring data scientist should work through this book because (1) you'll learn a lot because Excel makes you do every step and (2) you'll realize you need to learn R or python or some other way to do these analyses. 19 | - [X] [Data Science Specialization by Johns Hopkins / Coursera](https://www.coursera.org/account/accomplishments/specialization/3WN77YYQ7QK7) 20 | - [X] [Data Scientists Toolbox](https://www.coursera.org/account/accomplishments/certificate/UY4EBM46HL) 21 | - [X] [R Programming](https://www.coursera.org/account/accomplishments/records/Va5vuEvGKyr7UyHEL) 22 | - [X] [Getting and Cleaning Data](https://www.coursera.org/account/accomplishments/records/ENSGmvNfx24sANRW) 23 | - [X] [Exploratory Data Analysis](https://www.coursera.org/account/accomplishments/records/2PPsRu2Us3sUehBQ) 24 | - [X] [Reproducible Research](https://www.coursera.org/account/accomplishments/certificate/YRP8NLFYPCV9) 25 | - [X] [Statistical Inference](https://www.coursera.org/account/accomplishments/records/9733QCP94GEF) 26 | - [X] [Regression Models](https://www.coursera.org/account/accomplishments/records/PP8SKS7CPSDC) 27 | - [X] [Practical Machine Learning](https://www.coursera.org/account/accomplishments/certificate/AJJS85KTU6GZ) 28 | - [X] [Developing Data Products](https://www.coursera.org/account/accomplishments/certificate/6QREL457PPKE) 29 | - [X] [Data Science 
Capstone](https://www.coursera.org/account/accomplishments/certificate/A9M48VWHBAMT) 30 | 31 | 32 |

Mathematics/Statistics

33 | - [ ] [Statistics for Spatial Data, Revised Edition](http://www.wiley.com/WileyCDA/WileyTitle/productCd-1119114616.html) 34 | - [ ] [Statistics for Spatio-Temporal Data](http://www.wiley.com/WileyCDA/WileyTitle/productCd-EHEP002348.html) 35 | - [X] [Linear Programming: An Introduction With Applications (Second Edition)](http://www.amazon.com/Linear-Programming-Introduction-Applications-Edition/dp/1463543670?ie=UTF8&psc=1&redirect=true&ref_=oh_aui_detailpage_o01_s00) 36 | - [X] Problem-Solving Heuristics: [How to Solve It](http://www.amazon.com/How-Solve-It-Mathematical-Princeton/dp/069111966X) 37 | 38 |

Computing

39 | R: 40 | - [ ] [R in Action](https://www.manning.com/books/r-in-action-second-edition?a_bid=5c2b1e1d&a_aid=RiA2ed) 41 | - [ ] [R Cookbook](http://shop.oreilly.com/product/9780596809164.do) 42 | - [X] [Forecasting: Principles and Practice](http://otexts.com/fpp/) 43 | 44 | R Libraries/Task Views 45 | * [ProjectTemplate](http://projecttemplate.net/index.html) 46 | * Spatial Data [CRAN Task View: Analysis of Spatial Data](https://cran.r-project.org/web/views/Spatial.html) 47 | * Spatio-Temporal Data [CRAN Task View: Handling and Analyzing Spatio-Temporal Data](https://cran.r-project.org/web/views/SpatioTemporal.html) 48 | * Optimization [CRAN Task View: Optimization and Mathematical Programming](https://cran.r-project.org/web/views/Optimization.html) 49 | * Finance [CRAN Task View: Empirical Finance](https://cran.r-project.org/web/views/Finance.html) 50 | 51 | Python: 52 | - [X] [Jumpstart Python by Building 10 Apps](https://training.talkpython.fm/courses/details/python-language-jumpstart-building-10-apps) This is probably the best introduction to Python that I have seen. 53 | - [X] [Dive Into Python](http://www.diveintopython.net/) 54 | - [X] [Google's Python Class](code.google.com/edu/languages/google-python-class/) 55 | - [X] [Introduction to Python for Data Science - edx](https://courses.edx.org/courses/course-v1:Microsoft+DAT208x+2T2016/info) 56 | - [X] [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do) 57 | - [X] [Webscraping with Python](https://www.packtpub.com/big-data-and-business-intelligence/web-scraping-python) 58 | 59 | QGIS: 60 | - [X] [QGIS Tutorials and Tips](http://www.qgistutorials.com/en/) 61 | - [X] [Mastering QGIS](https://www.packtpub.com/application-development/mastering-qgis) 62 | - [X] [QGIS 2.0 Cookbook](https://www.packtpub.com/application-development/qgis-2-cookbook) Advanced data management, data visualization and spatial analysis techniques with QGIS. 
63 | - [X] [Building Mapping Applications with QGIS](https://www.packtpub.com/application-development/building-mapping-applications-qgis) 64 | - [X] [GIS Tutorial Workbook 1](https://esripress.esri.com/display/index.cfm?fuseaction=display&websiteID=232&moduleID=1) This is for ArcView, but you can work the examples in QGIS too 65 | - [X] [GIS Tutorial Workbook 2: Spatial Analysis](https://esripress.esri.com/display/index.cfm?fuseaction=display&websiteID=230&moduleID=0) This is for ArcView, but you can work the examples in QGIS too 66 | - [ ] [QGIS Python Programming Cookbook](https://www.packtpub.com/application-development/qgis-python-programming-cookbook) Automated desktop QGIS processing. 67 | - [ ] [QGIS Map Design](https://locatepress.com/qmd) I've just thumbed through this, but it's beautiful and belongs on any list of GIS books. 68 | 69 | 70 | MySQL: 71 | - [X] [Learn MySQL in One Video](https://www.youtube.com/watch?v=yPu6qV5byu4) 72 | - [X] [MySQL Explained](https://www.ostraining.com/books/mysql/about/) 73 | 74 | Octave: 75 | - [ ] [GNU Octave Beginners Guide](https://www.packtpub.com/big-data-and-business-intelligence/gnu-octave-beginners-guide) 76 | 77 | PostGIS/PostgreSQL: 78 | - [ ] [PostGIS Essentials](https://www.packtpub.com/big-data-and-business-intelligence/postgis-essentials) 79 | - [ ] [PostgreSQL Tutorial](http://www.postgresqltutorial.com/) 80 | - [ ] [PostgreSQL: Up and Running: A Practical Introduction to the Advanced Open Source Database](http://shop.oreilly.com/product/0636920032144.do) 81 | 82 |

Algorithms

83 | - [ ] Data Structures and Algorithms by UCSD / Coursera (decided not to take the balance of the specialization) 84 | - [X] [Algorithmic Toolbox](https://www.coursera.org/account/accomplishments/certificate/RUKKXTCFDAPV) (in progress) 85 | 86 |

Data Mining

87 | - [ ] [Clean Data](https://www.packtpub.com/big-data-and-business-intelligence/clean-data) 88 | 89 |

Machine Learning/Predictive Analytics - Foundational/Theoretical/Practical

90 | - [ ] [Statistical Learning with Trevor Hastie and Robert Tibshirani](http://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/) 91 | - [ ] [An Introduction to Statistical Learning with Applications in R](http://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/) (by the authors of The Elements of Statistical Learning at Stanford.) 92 | - [ ] [Machine Learning with R](https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-r-second-edition) 93 | - [ ] [Building a Recommendation System in R](https://www.packtpub.com/big-data-and-business-intelligence/building-recommendation-system-r) 94 | - [ ] [Mastering Predictive Analytics in R](https://www.packtpub.com/application-development/mastering-predictive-analytics-r) 95 | - [X] [Bootstrapping Machine Learning](http://www.louisdorard.com/machine-learning-book/) 96 | - [ ] [Applied Predictive Modeling](http://www.amazon.com/gp/product/1461468485?psc=1&redirect=true&ref_=oh_aui_detailpage_o08_s00) 97 | 98 |

Analysis

99 | - [ ] [Practical Data Science Cookbook](https://www.packtpub.com/big-data-and-business-intelligence/practical-data-science-cookbook) 100 | - [X] [R Data Analysis Cookbook](http://www.amazon.com/Data-Analysis-Cookbook-Recipes-Deliver/dp/1783989068) 101 | - [X] [Python Data Science Essentials](https://www.packtpub.com/big-data-and-business-intelligence/python-data-science-essentials) 102 | 103 |

Spatial Analysis

104 | - [ ] [An Introduction to R for Spatial Analysis and Mapping](https://uk.sagepub.com/en-gb/eur/an-introduction-to-r-for-spatial-analysis-and-mapping/book241031) 105 | - [ ] [Applied Spatial Data Analysis with R](http://www.springer.com/us/book/9781461476177) 106 | - [ ] [Geospatial Analysis - 5th Edition, 2015 - de Smith, Goodchild, Longley](http://www.spatialanalysisonline.com/HTML/index.html) 107 | - [X] [Learning Geospatial Analysis with Python](https://www.packtpub.com/application-development/learning-geospatial-analysis-python) 108 | - [X] [Python Geospatial Development - Second Edition](https://www.packtpub.com/application-development/python-geospatial-development-second-edition) 109 | 110 |

Land Use/Transport/Gravity Modeling

111 | - [ ] [Integrated Land Use and Transport Modelling: Decision Chains and Hierarchies](http://www.amazon.com/gp/product/0521022177?psc=1&redirect=true&ref_=oh_aui_detailpage_o03_s00) 112 | - [ ] [Gravity and Spatial Interaction Models (Scientific Geography Series)](http://www.amazon.com/gp/product/0803925441?psc=1&redirect=true&ref_=oh_aui_detailpage_o06_s00) 113 | - [ ] [TRANUS Model](http://www.tranus.com/tranus-english) 114 | - [ ] [Urban Sim](https://pypi.python.org/pypi/urbansim) 115 | - [ ] [Huff-tools Package in R](http://rstudio-pubs-static.s3.amazonaws.com/42357_1e6fcc5bcfec439096eb86a106ebf22e.html) 116 | 117 | 118 |

Data Design/Data Viz

119 | - [ ] [Beautiful Evidence](http://www.edwardtufte.com/tufte/books_be) 120 | - [ ] [Semiology of Graphics](http://www.amazon.com/Semiology-Graphics-Diagrams-Networks-Maps/dp/1589482611) 121 | - [ ] [Visual Complexity Mapping Patterns of Information](http://www.visualcomplexity.com/vc/book/) 122 | - [ ] [The Visual Display of Quantitative Information](http://www.edwardtufte.com/tufte/books_vdqi) 123 | - [ ] [Design for Information](http://isabelmeirelles.com/book-design-for-information/) 124 | - [ ] [Design Elements: A Graphical Style Manual](http://www.amazon.com/Design-Elements-Graphic-Style-Manual/dp/1592532616) 125 | - [X] [Storytelling with Data](http://www.amazon.com/gp/product/1119002257?psc=1&redirect=true&ref_=oh_aui_detailpage_o09_s00) 126 | - [ ] [Mastering Python Data Visualization](https://www.packtpub.com/big-data-and-business-intelligence/mastering-python-data-visualization) 127 | - [ ] [The Grammar of Graphics](http://www.springer.com/us/book/9780387245447) 128 | - [X] [R Graphics Cookbook](http://shop.oreilly.com/product/9780596809164.do) 129 | 130 |

Relevant prior studies

131 | - [X] MS in Community and Regional Planning, UT-Austin 132 | - [X] BA in Liberal Arts, concentration in geography, UT-Austin 133 | 134 |

OpenSource Data Science Masters Capstone Project

135 | I'm interested in using data science approaches for better intelligence behind real estate decisions, specifically evaluating population growth, transactions and location decisions. I'd also like to evaluate statistical learning techniques to make better pricing decisions. Finally, I'd like to develop a model to optimize real estate portfolios. 136 | 137 | If you'd like to pair up for the capstone, [let me know](http://www.twitter.com/scottdavisCRE) 138 | 139 | --------------------------------------------------------------------------------