├── LICENSE.md ├── README.md ├── analysis-technologies.md ├── basic-programming.md ├── best-of-stackoverflow.md ├── blogs-n-media.md ├── database-tech.md ├── datasets.md ├── janitorial.md ├── learn-python.md ├── machine-learning.md ├── r-resources.md ├── specializations.md └── transcripts ├── clare-corthell-2013.md ├── linda-george-transcript.md └── nick-byrne-transcript.md /LICENSE.md: -------------------------------------------------------------------------------- 1 | This is free and unencumbered software released into the public domain. 2 | 3 | Anyone is free to copy, modify, publish, use, compile, sell, or 4 | distribute this software, either in source code form or as a compiled 5 | binary, for any purpose, commercial or non-commercial, and by any 6 | means. 7 | 8 | In jurisdictions that recognize copyright laws, the author or authors 9 | of this software dedicate any and all copyright interest in the 10 | software to the public domain. We make this dedication for the benefit 11 | of the public at large and to the detriment of our heirs and 12 | successors. We intend this dedication to be an overt act of 13 | relinquishment in perpetuity of all present and future rights to this 14 | software under copyright law. 15 | 16 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 17 | EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 18 | MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 19 | IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR 20 | OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, 21 | ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR 22 | OTHER DEALINGS IN THE SOFTWARE. 
23 | 24 | 25 | For more information, please refer to 26 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | _created & maintained by [@clarecorthell](http://bit.ly/clarecorthelltwitter), founding partner of [Luminant Data Science Consulting](http://bit.ly/luminantdata)_ 2 | 3 | ## The Open-Source Data Science Masters 4 | 5 | The open-source curriculum for learning Data Science. Foundational in both theory and technologies, the OSDSM breaks down the core competencies necessary to making use of data. 6 | 7 | ### The Internet is Your Oyster 8 | 9 | With Coursera, ebooks, Stack Overflow, and GitHub -- all free and open -- how can you afford not to take advantage of an open source education? 10 | 11 | ### The Motivation 12 | 13 | We need more Data Scientists. 14 | 15 | > ...by 2018 the United States will experience a shortage of 190,000 skilled data scientists, and 1.5 million managers and analysts capable of reaping actionable insights from the big data deluge. 16 | 17 | -- [McKinsey Report Highlights the Impending Data Scientist Shortage](http://bit.ly/datascienceshortage) 23 July 2013 18 | 19 | > There are little to no Data Scientists with 5 years experience, because the job simply did not exist. 20 | 21 | -- David Hardtke "How To Hire A Data Scientist" 13 Nov 2012 22 | 23 | ### An Academic Shortfall 24 | 25 | Classic academic conduits aren't providing Data Scientists -- this talent gap will be closed differently. 26 | 27 | > **Academic credentials are important but not necessary for high-quality data science.** The core aptitudes – curiosity, intellectual agility, statistical fluency, research stamina, scientific rigor, skeptical nature – that distinguish the best data scientists are widely distributed throughout the population. 
28 | 29 | > We’re likely to see more uncredentialed, inexperienced individuals try their hands at data science, **bootstrapping their skills on the open-source ecosystem and using the diversity of modeling tools available.** Just as data-science platforms and tools are proliferating through the magic of open source, big data’s data-scientist pool will as well. 30 | 31 | > And there’s yet another trend that will alleviate any talent gap: the democratization of data science. While I agree wholeheartedly with Raden’s statement that “the crème-de-la-crème of data scientists will fill roles in academia, technology vendors, Wall Street, research and government,” I think he’s understating the extent to which **autodidacts – the self-taught, uncredentialed, data-passionate people – will come to play a significant role in many organizations’ data science initiatives.** 32 | 33 | -- James Kobielus, [Closing the Talent Gap](http://bit.ly/closingthetalentgap) 17 Jan 2013 34 | 35 | ### Ready? 36 | 37 | *** 38 | 39 | ## The Open Source Data Science Curriculum 40 | 41 | Start here. 42 | 43 | **Intro to Data Science** / UW [Videos](http://bit.ly/uwintrodatascience) 44 | * *Topics:* Python NLP on Twitter API, Distributed Computing Paradigm, MapReduce/Hadoop & Pig Script, SQL/NoSQL, Relational Algebra, Experiment design, Statistics, Graphs, Amazon EC2, Visualization. 45 | 46 | **Data Science** / Harvard [Videos](http://bit.ly/harvarddatasciencevideos) & [Course](http://bit.ly/harvarddatasciencecourse) 47 | * *Topics:* Data wrangling, data management, exploratory data analysis to generate hypotheses and intuition, prediction based on statistical methods such as regression and classification, communication of results through visualization, stories, and summaries. 
48 | 49 | **Data Science with Open Source Tools** [Book ```$27```](http://bit.ly/book-datasciencewithopensourcetools) 50 | * *Topics:* Visualizing Data, Estimation, Models from Scaling Arguments, Arguments from Probability Models, What you Really Need to Know about Classical Statistics, Data Mining, Clustering, PCA, Map/Reduce, Predictive Analytics 51 | * *Example Code in:* R, Python, Sage, C, Gnu Scientific Library 52 | 53 | ### A Note About Direction 54 | This is an introduction geared toward those with at least **a minimum understanding of programming**, and (perhaps obviously) an interest in the components of Data Science (like statistics and distributed computing). 55 | Out of personal preference and need for focus, I geared the original curriculum toward **Python tools and resources**. R resources can be found [here](http://bit.ly/osdsm-rresources). 56 | 57 | ### Math 58 | 59 | [★ What are some good resources for learning about numerical analysis? / Quora ] 60 | (http://www.quora.com/What-are-some-good-resources-for-learning-about-numerical-analysis) 61 | 62 | * **Linear Algebra & Programming** 63 | * Linear Algebra [Khan Academy / Videos](http://bit.ly/khanlinalg) 64 | * Linear Algebra / Levandosky [Stanford / Book ```$10```](http://amzn.to/1kIfmmI) 65 | * Linear Programming (Math 407) [University of Washington / Course](http://bit.ly/course-uw-linearprogramming) 66 | * The Manga Guide to Linear Algebra [Book ```$19```](http://amzn.to/1n4hM5l) 67 | 68 | * **Convex Optimization** 69 | * Convex Optimization / Boyd [Stanford / Lectures](http://stanford.edu/class/ee364a/index.html) 70 | 71 | * **Statistics** 72 | * Statistics I [Princeton / Coursera](http://bit.ly/course-princeton-stats) 73 | * Stats in a Nutshell [Book ```$29```](http://amzn.to/1iMnx2X) 74 | * Think Stats: Probability and Statistics for Programmers [Digital](http://bit.ly/ebook-thinkstats) & [Book ```$25```](http://amzn.to/RcVnTf) 75 | * Think Bayes [Digital](http://bit.ly/ebook-thinkbayes) & 
[Book ```$25```](http://amzn.to/1hmy4Cr) 76 | 77 | * **Differential Equations & Calculus** 78 | * Differential Equations in Data Science [Python Tutorial](http://bit.ly/ipynb-differentialeq) 79 | 80 | * **Problem Solving** 81 | * Problem-Solving Heuristics "How To Solve It" [Polya / Book ```$10```](http://amzn.to/1mqJRSi) 82 | 83 | ### Computing 84 | 85 | Get your environment up and running with the [Data Science Toolbox](http://bit.ly/datascitoolbox) 86 | 87 | * **Algorithms** 88 | * Algorithms Design & Analysis I [Stanford / Coursera](http://bit.ly/coursera-algo) 89 | * Algorithm Design, Kleinberg & Tardos [Book ```$125```](http://amzn.to/1iMnWm5) 90 | 91 | * **Distributed Computing Paradigms** 92 | * *See Intro to Data Science [UW / Lectures on MapReduce](http://bit.ly/uwintrodatascience) 93 | * Intro to Hadoop and MapReduce [Cloudera / Udacity Course](http://bit.ly/udacity-hadoopmapreduce) *includes select free excerpts of Hadoop: The Definitive Guide [Book ```$29```](http://amzn.to/1i7wgLv) 94 | 95 | * **Databases** 96 | * Introduction to Databases [Stanford / Online Course](https://bit.ly/introdatabases) 97 | * SQL School [Mode Analytics / Tutorials](http://bit.ly/sqlschool) 98 | * SQL Tutorials [SQLZOO / Tutorials](http://bit.ly/tut-sqlzoo) 99 | 100 | * **Data Mining** 101 | * Mining Massive Data Sets / Stanford [Coursera](https://www.coursera.org/course/mmds) & [Digital](http://bit.ly/ebook-miningmassivedata) & [Book ```$58```](http://amzn.to/1txocpo) 102 | * Mining The Social Web [Book ```$30```](http://amzn.to/1mqxAsB) 103 | * Introduction to Information Retrieval / Stanford [Digital](http://bit.ly/ebook-stanford-inforetrieval) & [Book ```$56```](http://amzn.to/1mWbnUT) 104 | 105 | _OSDSM Specialization: [Web Scraping & Crawling](https://github.com/datasciencemasters/go/blob/master/specializations.md#web-scraping--crawling)_ 106 | 107 | * **Machine Learning** 108 | 109 | _Foundational & Theoretical_ 110 | * Machine Learning [Ng Stanford / 
Coursera](http://bit.ly/stanford-ml) & [Stanford CS 229](http://bit.ly/stanfordcs229) 111 | * A Course in Machine Learning [UMD / Digital Book](http://bit.ly/22WyV3N) 112 | * The Elements of Statistical Learning / Stanford [Digital](http://bit.ly/ebook-elemstatlearn) & [Book ```$80```](http://amzn.to/1hmyKry) & [Study Group](http://www.reddit.com/r/eosl) 113 | * Machine Learning [Caltech / edX](http://bit.ly/caltech-ml) 114 | 115 | _Practical_ 116 | * Programming Collective Intelligence [Book ```$27```](http://amzn.to/1mqxYqW) 117 | * Machine Learning for Hackers [ipynb / digital book](http://bit.ly/mlforhackers) 118 | * Intro to scikit-learn, SciPy2013 [youtube tutorials](http://bit.ly/scikit-video-tuts) 119 | 120 | * **Probabilistic Modeling** 121 | * Probabilistic Programming and Bayesian Methods for Hackers [Github / Tutorials](http://bit.ly/ipnb-probabilisticprogramming) 122 | * Probabilistic Graphical Models [Stanford / Coursera](http://bit.ly/stanford-pgm) 123 | 124 | * **Deep Learning (Neural Networks)** 125 | * Neural Networks [Andrej Karpathy / Python Walkthrough](http://bit.ly/karpathyneuralnets) 126 | * Neural Networks [U Toronto / Coursera](http://bit.ly/utoronto-neuralnets) 127 | * Deep Learning for Natural Language Processing CS224d [Stanford](http://cs224d.stanford.edu/syllabus.html) 128 | 129 | * **Social Network & Graph Analysis** 130 | * Social and Economic Networks: Models and Analysis / [Stanford / Coursera](http://bit.ly/stanford-socialeconnetworks) 131 | * Social Network Analysis for Startups [Book ```$22```](http://amzn.to/1jySCCT) 132 | 133 | * **Natural Language Processing** 134 | * From Languages to Information / Stanford CS124 [Materials](http://bit.ly/nlpcs124) 135 | * NLP with Python (NLTK library) [Digital](http://bit.ly/ebook-nltk), [Book ```$36```](http://amzn.to/1iMrDIp) 136 | * How to Write a Spelling Corrector / Norvig [Tutorial](http://norvig.com/spell-correct.html) 137 | 138 | ### Data Analysis 139 | One of the "unteachable"
skills of data science is an intuition for analysis. What constitutes valuable, achievable, and well-designed analysis is extremely dependent on context and ends at hand. 140 | 141 | * Big Data Analysis with Twitter [UC Berkeley / Lectures](http://bit.ly/cal-course-bigdatatwitter) 142 | * Exploratory Data Analysis [Tukey / Book ```$81```](http://amzn.to/1kNUEPa) 143 | 144 | * **in Python** 145 | * Data Analysis in Python [Tutorial](http://bit.ly/mode-python-tutorials) 146 | * Python for Data Analysis [Book ```$24```](http://amzn.to/Q2pI5I) 147 | * An Example Data Science Process [ipynb](http://bit.ly/ipydsprocess) 148 | 149 | ### Data Communication and Design 150 | 151 | * **Visualization** 152 | 153 | _Data Visualization and Communication_ 154 | * The Truthful Art: Data, Charts, and Maps for Communication [Cairo / Book ```$21```](http://amzn.to/1UydGAc) 155 | 156 | _Theoretical Design of Information_ 157 | 158 | * Envisioning Information [Tufte / Book ```$36```](http://amzn.to/Sn0QI4) 159 | * The Visual Display of Quantitative Information [Tufte / Book ```$27```](http://amzn.to/1q5FB91) 160 | 161 | _Applied Design of Information_ 162 | * Information Dashboard Design: Displaying Data for At-a-Glance Monitoring [Stephen Few / Book ```$29```](http://amzn.to/1Vwz21v) 163 | 164 | _Theoretical Courses / Design & Visualization_ 165 | 166 | * Data Visualization [University of Washington / Slides & Resources](http://bit.ly/uw-dataviz) 167 | * Berkeley's Viz Class [UC Berkeley / Course Docs](http://bit.ly/cal-viz) 168 | * Rice University's Data Viz class [Rice University / Slides](http://bit.ly/riceu-viz) 169 | 170 | _Practical Visualization Resources_ 171 | 172 | * D3 Library / Scott Murray [Blog / Tutorials](http://bit.ly/tut-scottmurray-d3) 173 | * Interactive Data Visualization for the Web / Scott Murray [Online Book](http://bit.ly/interactive-data-viz-web) & [Book `$26`](http://amzn.to/1oK1xCN) 174 | 175 | _OSDSM Specialization: [Data 
Journalism](https://github.com/datasciencemasters/go/blob/master/specializations.md#data-journalism)_ 176 | 177 | #### **Python** (Learning) 178 | * Learn Python the Hard Way [Digital](http://bit.ly/ebook-learnpyhardway) & [Book ```$23```](http://amzn.to/1hmzGw9) 179 | * Python [Class / Google](http://bit.ly/T4j40A) 180 | * Think Python [Digital](http://bit.ly/ebook-thinkpy) & [Book ```$34```](http://amzn.to/1ktQ5ZU) 181 | 182 | #### **Python** (Libraries) 183 | Installing Basic Packages [Python, virtualenv, NumPy, SciPy, matplotlib and IPython ](http://bit.ly/scientific-py-install) & [Using Python Scientifically](http://bit.ly/lecture-scipy) 184 | 185 | [Command Line Install Script](https://github.com/fonnesbeck/ScipySuperpack) for Scientific Python Packages 186 | 187 | * [numpy Tutorial / Stanford CS231N](http://cs231n.github.io/python-numpy-tutorial/) 188 | * [Pandas Cookbook](http://bit.ly/jvnspandascookbook) (data structure library) 189 | 190 | _More Libraries can be found in the ["awesome machine learning"](https://github.com/josephmisiti/awesome-machine-learning#python) repo & in related [specializations](https://github.com/datasciencemasters/go/blob/master/specializations.md)_ 191 | 192 | * **Data Structures & Analysis Packages** 193 | * Flexible and powerful data analysis / manipulation library with labeled data structures objects, statistical functions, etc [pandas](http://bit.ly/py-pandas) & Tutorials [Python for Data Analysis / Book](http://amzn.to/Q2pI5I) 194 | 195 | * **Machine Learning Packages** 196 | * [scikit-learn](http://bit.ly/py-scikit) - Tools for Data Mining & Analysis 197 | 198 | * **Networks Packages** 199 | * [networkx](http://bit.ly/py-networkx) - Network Modeling & Viz 200 | 201 | * **Statistical Packages** 202 | * [PyMC](http://bit.ly/py-pymc) - Bayesian Inference & Markov Chain Monte Carlo sampling toolkit 203 | * [Statsmodels](http://bit.ly/py-statsmodel) - Python module that allows users to explore data, estimate statistical models, 
and perform statistical tests 204 | * [PyMVPA](http://bit.ly/py-mvpa) - Multivariate Pattern Analysis in Python 205 | 206 | * **Natural Language Processing & Understanding** 207 | * [NLTK](http://bit.ly/py-nltk) - Natural Language Toolkit 208 | * [Gensim](http://bit.ly/py-gensim) - Python library for topic modeling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community. 209 | 210 | * **Data APIs** 211 | * [twython](http://bit.ly/py-twython) - Python wrapper for the Twitter API 212 | 213 | * **Visualization Packages** 214 | * [matplotlib](http://bit.ly/matplotlib-docs) - well-integrated with analysis and data manipulation packages like numpy and pandas 215 | * [Seaborn](http://bit.ly/seaborn-python) - a high-level statistical visualization package built on top of matplotlib 216 | 217 | * **iPython Data Science Notebooks** 218 | * [Data Science in IPython Notebooks](http://bit.ly/ipynb-ds) (Linear Regression, Logistic Regression, Random Forests, K-Means Clustering) 219 | * [A Gallery of Interesting IPython Notebooks - Pandas for Data Analysis](http://bit.ly/ipyfordataanalysis) 220 | 221 | #### Datasets are now [here](http://bit.ly/osdsm-datasets-link) 222 | 223 | #### R resources are now [here](http://bit.ly/osdsm-rresources) 224 | 225 | ### Data Science as a Profession 226 | 227 | * Doing Data Science: Straight Talk from the Frontline [O'Reilly / Book ```$25```](http://amzn.to/1vAIscK) 228 | * The Data Science Handbook: Advice and Insights from 25 Amazing Data Scientists [Book ```$22```](http://amzn.to/1J7lILJ) 229 | 230 | ### Capstone Project 231 | * Capstone Analysis of Your Own Design; [Quora](http://bit.ly/quora-toyproblems)'s Idea Compendium 232 | * Healthcare Twitter Analysis [Coursolve & UW Data Science](http://bit.ly/project-healthcare-twitter-analysis) 233 | * Analyze your LinkedIn Network [Generate & Download Adjacency Matrix](http://socilab.com/) 234 
| 235 | *** 236 | ### Resources 237 | 238 | #### Read 239 | * [DataTau](http://bit.ly/datatau) - The "Hacker News" of Data Science 240 | * [Wikipedia](http://bit.ly/1kKg0gD) - The free encyclopedia 241 | * [The Signal and The Noise - Nate Silver ```$15```](http://amzn.to/1hoxQoG) - Bestseller Pop Sci 242 | * [Zipfian Academy's List of Resources](http://bit.ly/1qoF1We) 243 | * [A Software Engineer's Guide to Getting Started with Data Science](http://bit.ly/1jwgV4p) 244 | * [Data Scientist Interviews / Metamarkets](http://bit.ly/1r1tJot) 245 | * [/r/MachineLearning](http://bit.ly/1uANaEM) 246 | 247 | #### Watch 248 | * [The Life of a Data Scientist / Josh Wills](https://www.youtube.com/watch?v=h9vQIPfe2uU) 249 | * [What Data Science Is / Hilary Mason](https://www.youtube.com/watch?v=fZuDwiM1XBQ) 250 | 251 | #### Learn 252 | * [Metacademy](http://bit.ly/metacademy) - Search for a concept you want to learn 253 | * [Coursera](http://bit.ly/coursera-online-courses) - Online university courses 254 | * [Wolfram Alpha](http://bit.ly/wolframalpha-torus) - The smart number and info cruncher 255 | * [Khan Academy](http://bit.ly/khan-academy-lifeinsurance) - High quality, free learning videos 256 | 257 | *** 258 | 259 | ### Notation 260 | Non-Open-Source books, courses, and resources are noted with ```$```. 
261 | 262 | ## Contribute 263 | 264 | Please Contribute -- **this is Open Source!** 265 | 266 | [Follow me on Twitter @clarecorthell](http://bit.ly/clarecorthelltwitter) 267 | -------------------------------------------------------------------------------- /analysis-technologies.md: -------------------------------------------------------------------------------- 1 | #### **Weka (Java Framework)** 2 | 3 | * [Weka (MOOC)](http://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/) for Data Mining 4 | 5 | #### **Lua** (Libraries) 6 | * [Torch7](http://torch.ch/) scientific computing framework with wide support for machine learning algorithms 7 | 8 | #### **R** [here](r-resources.md) 9 | 10 | NB: The core curriculum centers on Python-based techniques and technologies 11 | -------------------------------------------------------------------------------- /basic-programming.md: -------------------------------------------------------------------------------- 1 | _[Please help make this better!]_ 2 | 3 | ### Basic Programming Methodology 4 | 5 | #### **Learning Programming** 6 | 7 | * Codecademy [code-right-away guided tutorials](http://www.codecademy.com/) 8 | * CS106a / Stanford Programming Methodology [Lectures](https://www.udemy.com/cs-106a-programming-methodology) - how Stanford students learn to code (Mehran Sahami is just *so* amazing and fun) 9 | * Computer Programming - Introduction to JS / Khan Academy [code-right-away guided tutorial](https://www.khanacademy.org/computing/cs/programming) 10 | * The Nature of Code [Online Book](http://natureofcode.com/book/introduction/) uses the art-and-science-world-esteemed `processing` language 11 | * Learn Python the Hard Way [Online Book & Tutorial](http://learnpythonthehardway.org/book/) - my personal favorite Pythonic starting point.
12 | 13 | #### **Startups and Software Engineering** 14 | * Startup Engineering [Stanford / Coursera](https://class.coursera.org/startup-001) _NB: This is a full-stack class that explains development from conception to deployment. A granular, stepwise course on how to build an application from scratch._ 15 | * [How Browsers Work](http://www.html5rocks.com/en/tutorials/internals/howbrowserswork/) 16 | 17 | ### Programming Tools & Key Python Resources 18 | 19 | #### **Git** (Source control) 20 | * Git tutorial [Tutorial](http://gitimmersion.com/lab_01.html) 21 | 22 | #### **Testing** 23 | * nosetests [lib docs](https://nose.readthedocs.org/en/latest/) 24 | 25 | #### **Software Engineering** 26 | * Design Patterns [Classic Book](http://amzn.to/1z5CSiz) & [Synopsis](http://sourcemaking.com/design_patterns) 27 | 28 | ### Theory 29 | 30 | #### Algorithms 31 | * [Visualizations of Sorting Algorithms](http://www.sorting-algorithms.com) 32 | -------------------------------------------------------------------------------- /best-of-stackoverflow.md: -------------------------------------------------------------------------------- 1 | ## Best of StackOverflow 2 | 3 | _I read a lot of StackOverflow. I know the rest of you do.
I'm going to get into the weeds and start listing out great answers to specific issues here._ 4 | 5 | ### Machine Learning 6 | 7 | - [Similarity between two documents](http://stackoverflow.com/questions/8897593/similarity-between-two-text-documents) (Tf-idf) 8 | -------------------------------------------------------------------------------- /blogs-n-media.md: -------------------------------------------------------------------------------- 1 | ### Aggregate Sources 2 | 3 | * [DataTau](http://www.datatau.com/) - [Hacker News](https://news.ycombinator.com/) for data scientists 4 | * [Quora - Best Blogs for Data Miners & Scientists to Read](https://www.quora.com/Data-Science/What-are-the-best-blogs-for-data-miners-and-data-scientists-to-read?share=1) 5 | 6 | ### Blogs 7 | 8 | * [Shape of Data](http://shapeofdata.wordpress.com/) 9 | * [yhat](http://blog.yhathq.com/) 10 | * [R-Bloggers](http://www.r-bloggers.com/) 11 | * [Flowing Data](https://flowingdata.com/) 12 | 13 | #### News 14 | * [KDnuggets](http://www.kdnuggets.com/) 15 | * [FiveThirtyEight](http://fivethirtyeight.com/) 16 | 17 | #### Machine Learning 18 | 19 | * [FastML](http://fastml.com/) 20 | * [Python for Machine Learning](http://pythonformachinelearning.wordpress.com/) 21 | 22 | #### Data Science as a Profession 23 | 24 | * [Data Science Weekly](http://www.datascienceweekly.org/blog) 25 | 26 | #### Learning Data Science 27 | 28 | * [A Journey into Data Science](http://ajourneyintodatascience.com/) 29 | 30 | #### Personal Blogs & Commentary 31 | 32 | * [Pythonic Perambulations](https://jakevdp.github.io/) - Musings and ramblings through the world of Python and beyond 33 | * [On the lambda](http://www.onthelambda.com/) 34 | * [Probably Overthinking It](http://allendowney.blogspot.com/) 35 | 36 | #### Podcasts 37 | * [Talking Machines](http://www.thetalkingmachines.com/) 38 | * [Partially Derivative](http://www.partiallyderivative.com/) 39 | 
-------------------------------------------------------------------------------- /database-tech.md: -------------------------------------------------------------------------------- 1 | ### Database Technologies & Management 2 | 3 | #### MongoDB 4 | 5 | * Data Wrangling with MongoDB [Udacity Course](https://www.udacity.com/course/ud032) 6 | 7 | #### PostgreSQL 8 | * [psycopg2](https://pypi.python.org/pypi/psycopg2) - PostgreSQL adapter for Python 9 | -------------------------------------------------------------------------------- /datasets.md: -------------------------------------------------------------------------------- 1 | ### All The Datasets 2 | 3 | #### Machine Learning 4 | 5 | * [UCI Machine Learning Dataset Repository](https://archive.ics.uci.edu/ml/datasets.html) 6 | * [Machine Learning Dataset Repository](http://mldata.org/) 7 | 8 | #### Deep Learning 9 | 10 | * [Deep Learning Datasets](http://deeplearning.net/datasets/) for benchmarking deep learning algorithms 11 | 12 | #### Clean Sample Data (for Learning New Techniques) 13 | 14 | * [Scikit-learn sample datasets](http://scikit-learn.org/stable/datasets/index.html) 15 | * [Statsmodels datasets](http://statsmodels.sourceforge.net/devel/datasets/index.html) 16 | 17 | 18 | #### Networks 19 | * [Stanford Network Analysis Project](https://snap.stanford.edu/) 20 | 21 | #### Raw Dataz 22 | 23 | * [OpenFlights Airports Database](http://openflights.org/data.html) - Airport, airline and route data; 6977 airports 24 | * [The Guardian / Datasets](http://www.theguardian.com/news/datablog/interactive/2013/jan/14/all-our-datasets-index) 25 | * [Quandl](http://www.quandl.com) provides a lot of interesting data with a clean API.
26 | * [Time Series Data Library](http://datamarket.com/data/list/?q=provider:tsdl) 27 | * USA Congressional Voting Records [Voteview](http://voteview.org/downloads.asp) 28 | 29 | ### Datasets Sources 30 | 31 | * [NIPS Feature Selection](http://www.nipsfsc.ecs.soton.ac.uk/datasets/) 32 | 33 | ### Curated Lists of Datasets 34 | 35 | * [@hmason's](https://twitter.com/hmason) curated dataset list [bit.ly](https://bitly.com/bundles/hmason/1) 36 | 37 | ### Sentiment Analysis 38 | * https://github.com/acquayefrank/sanders-twitter 39 | * https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data 40 | * http://ai.stanford.edu/~amaas/data/sentiment/ 41 | -------------------------------------------------------------------------------- /janitorial.md: -------------------------------------------------------------------------------- 1 | ### Utilities, Tools, and Methods for Cleaning Up Data 2 | 3 | #### **Tools** 4 | 5 | * [CSV Fingerprints](http://setosa.io/blog/2014/08/03/csv-fingerprints/) makes it easy to spot irregularities in a CSV visually. 
6 | -------------------------------------------------------------------------------- /learn-python.md: -------------------------------------------------------------------------------- 1 | ### Learn Python 2 | 3 | #### The Basics 4 | - [Learn Python the Hard Way](http://learnpythonthehardway.org/book/ex0.html) 5 | - [numpy and scipy tutorial](http://cs231n.github.io/python-numpy-tutorial/) 6 | 7 | #### Industry / Conferences 8 | - [pydata / many locations](http://pydata.org/events/) 9 | -------------------------------------------------------------------------------- /machine-learning.md: -------------------------------------------------------------------------------- 1 | 2 | ### Features 3 | 4 | * [tf-idf: text feature extraction](http://pyevolve.sourceforge.net/wordpress/?p=1589) 5 | * def: [```scikit / tf-idf```](http://stackoverflow.com/questions/8897593/similarity-between-two-text-documents) 6 | * [Vectorization and Features in sci-kit (ipynb)](http://nbviewer.ipython.org/github/bigsnarfdude/machineLearning/blob/master/Vectorizing.ipynb) 7 | 8 | ### Regression 9 | * [Linear Regression](http://alexhwoods.com/2015/07/19/guide-to-linear-regression/) 10 | 11 | ### Models & Methods 12 | 13 | * [cosine similarity for vector space models](http://pyevolve.sourceforge.net/wordpress/?p=2497) 14 | * [Bag of Words](http://en.wikipedia.org/wiki/Bag_of_words_model) (used in Bayesian Spam Filtering, document term frequency models) 15 | 16 | ### Classification 17 | * [An introduction to Classification / slides](http://www.slideshare.net/pierluca.lanzi/machine-learning-and-data-mining-10-introduction-to-classification) 18 | 19 | ### Statistical Definitions 20 | 21 | * [Accuracy & Precision](http://en.wikipedia.org/wiki/Accuracy_and_precision) (relevant to classification, measuring efficacy of learned models) 22 | -------------------------------------------------------------------------------- /r-resources.md: 
-------------------------------------------------------------------------------- 1 | _[Note: The core of The Open Source Data Science Masters focuses on programmatic problem solving in python. This is dedicated to Data Science resources using R, a valuable and powerful technology for investigation and analysis.]_ 2 | 3 | ### R 4 | 5 | [The R Project for Statistical Computing / Software](http://www.r-project.org/) 6 | 7 | [Learn Data Science with R / Tutorials](https://www.datacamp.com/courses) 8 | 9 | #### Basic R 10 | 11 | * R in a Nutshell [O'Reilly / Book ```$41```](http://amzn.to/1s54OBf) 12 | * Software Design: The Art of R Programming [O'Reilly / Book ```$23```](http://amzn.to/1mqzpWw) 13 | 14 | #### Basic Statistics with R 15 | 16 | * An Introduction to Statistical Learning [Book pdf](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf) ^also a Machine Learning resource 17 | 18 | #### Data Science with R 19 | * Introduction to Data Science [Syracuse University / ebook](http://jsresearch.net/index.html) 20 | * Learn R & Become a Data Analyst [Tutorial](https://www.datacamp.com/) 21 | * Doing Data Science: Straight Talk from the Frontline [O'Reilly / Book ```$25```](http://amzn.to/1vAIscK) 22 | * Practical Data Science with R [Manning Publications / Book ```$49.99```](http://www.manning.com/zumel/) 23 | 24 | #### Statistical Learning with R 25 | 26 | * Statistical Learning [Stanford / OpenEdX Course](https://class.stanford.edu/courses/HumanitiesScience/StatLearning/Winter2014/about) 27 | 28 | #### Forecasting with R 29 | 30 | * Forecasting: Principles and Practice *(Regression, Time Series, Forecasting)* [Monash University / Book](http://otexts.com/fpp/) 31 | 32 | #### Visualization with R 33 | 34 | * Viz and Elegant Graphics in R: ggplot2 [Springer / Book ```$65```](http://amzn.to/1fZMXVd) 35 | 36 | #### Machine Learning with R 37 | 38 | * Guide to Getting Started in Machine Learning 
[Tutorial](http://abeautifulwww.com/2009/10/11/guide-to-getting-started-in-machine-learning/) 39 | * Machine Learning in R [Tutorial](http://blog.revolutionanalytics.com/2009/09/machine-learning-in-r-in-a-nutshell.html) 40 | 41 | #### R Libraries 42 | 43 | * Natural Language Toolkit [OpenNLP](http://cran.r-project.org/web/packages/openNLP/index.html) 44 | * Text Mining [tm](http://cran.r-project.org/web/packages/tm/index.html) 45 | * Basic Viz [wordcloud](http://cran.r-project.org/web/packages/wordcloud/index.html) 46 | * Network Modeling & Viz [igraph](http://cran.r-project.org/web/packages/igraph/index.html) 47 | * Basic Machine Learning [e1071](http://cran.r-project.org/web/packages/e1071/index.html) 48 | * Kernel Method [kernlab](http://cran.r-project.org/web/packages/kernlab/index.html) 49 | * Chinese Language Processing [Rwordseg](http://jliblog.com/app/rwordseg) 50 | * Chinese Weibo Analysis [Rweibo](http://jliblog.com/app/rweibo) 51 | 52 | #### R Datasets 53 | 54 | * [Rdatasets](http://vincentarelbundock.github.io/Rdatasets/) 55 | 56 | #### R Blogs & Media 57 | 58 | * [R-bloggers](http://www.r-bloggers.com/) R news and tutorials contributed by (452) R bloggers 59 | #### R Interactive Visualizations 60 | * [Shiny](http://shiny.rstudio.com/) Interactive web application framework for R 61 | -------------------------------------------------------------------------------- /specializations.md: -------------------------------------------------------------------------------- 1 | _[Note: I'm adding this section due to the overwhelming amount of input from new Pull Requests that involve great materials that are slightly off-topic. They need a home, and that home is here.
Please help make this better!]_ 2 | 3 | ### Data Science Specializations & Use Cases 4 | 5 | #### Machine Learning 6 | 7 | * Neural Networks for Machine Learning [U Toronto / Coursera](https://www.coursera.org/course/neuralnets) 8 | * [Building Machine Learning Systems with Python](http://www.packtpub.com/building-machine-learning-systems-with-python/book) [source code](https://github.com/luispedro/BuildingMachineLearningSystemsWithPython) 9 | 10 | Packages 11 | * [mlpy](http://mlpy.sourceforge.net) Machine Learning Python 12 | * Machine Learning Toolkit [MILK](http://packages.python.org/milk/) 13 | * [MDP](https://pypi.python.org/pypi/MDP) a collection of supervised and unsupervised learning algorithms 14 | * [pyBrain](http://pybrain.org/) modular Machine Learning Library for Python 15 | * [Caffe](http://caffe.berkeleyvision.org/) framework for convolutional neural network algorithms 16 | * [Nolearn](https://pypi.python.org/pypi/nolearn) framework wrapping scikit neural networks 17 | * [OverFeat](http://cilvr.nyu.edu/doku.php?id=software:overfeat:start) Convolutional Network-based image features extractor and classifier 18 | * [Hebel](https://github.com/hannes-brt/hebel) GPU-Accelerated Deep Learning Library in Python 19 | * [neurolab](https://code.google.com/p/neurolab/) simple and powerful Neural Network Library for Python. 
Contains basic neural networks, training algorithms, and a flexible framework to create and explore other networks 20 | * [Pylearn2](http://deeplearning.net/software/pylearn2/) and [Theano](http://deeplearning.net/software/theano/) deep learning libraries 21 | 22 | #### Deep Learning 23 | 24 | [Wikipedia Definition](http://en.wikipedia.org/wiki/Deep_learning) 25 | 26 | * Deep Learning [Tutorials](http://deeplearning.net/tutorial/) 27 | * Deep Learning Course [Stanford / OpenClassroom](http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=DeepLearning) 28 | 29 | #### Web Scraping & Crawling 30 | 31 | * Introduction to Web APIs including Twitter, YouTube, Bitly, Sunlight Foundation [Codecademy](http://www.codecademy.com/tracks/apis) 32 | * Twitter Analysis tools [ScraperWiki](https://scraperwiki.com/tools/twitter) 33 | * [Six tools for web-scraping](http://www.notprovided.eu/six-tools-web-scraping-use-data-journalism-creating-insightful-content/) 34 | * Web scraping [NewCoder / Tutorial](http://newcoder.io/scrape/) 35 | * Working with Web APIs [NewCoder / Tutorial](http://newcoder.io/api/) 36 | 37 | #### Visualization 38 | 39 | * [D3.js Tutorial](https://www.dashingd3js.com/table-of-contents) 40 | 41 | #### Social Network Analysis 42 | 43 | #### Data Journalism 44 | 45 | [Data Journalism](http://en.wikipedia.org/wiki/Data_journalism) and [Data-driven Journalism](http://en.wikipedia.org/wiki/Data_driven_journalism) involve investigation of journalistic stories by way of data perspectives and statistical methods. 46 | 47 | * The Data Journalism Handbook [DataJournalism / Book](http://datajournalismhandbook.org/1.0/en/index.html) 48 | 49 | -------------------------------------------------------------------------------- /transcripts/clare-corthell-2013.md: -------------------------------------------------------------------------------- 1 | ### The Open-Source Masters 2 | 3 | I couldn't wait to go back to grad school. Literally. 
So I designed my own grad school and spent 5 months learning & hacking in great delight! 4 | 5 | ### My Background ([linkedin](http://bit.ly/clarecorthell)) 6 | 7 | I'm a Stanford-educated Engineer, previously a Front-End Developer and UX Designer on early-stage products. I'm always in hot pursuit of deeper insight to social questions! 8 | 9 | ### Goals & Motivations of the Open Source M.S. 10 | 11 | Data Science is an ideal marriage for my technical capacities, social research inquisitions, and my geekish-freakish love of statistics. 12 | 13 | ### Next Steps? 14 | 15 | I'm now a Data Scientist with an incredible team at [Mattermark](http://www.mattermark.com)! 16 | 17 | *** 18 | 19 | ## The Data Science Curriculum / April-August 2013 20 | 21 | * **Intro to Data Science** [UW / Coursera](https://www.coursera.org/course/datasci) 22 | * *Topics:* Python NLP on Twitter API, Distributed Computing Paradigm, MapReduce/Hadoop & Pig Script, SQL/NoSQL, Relational Algebra, Experiment design, Statistics, Graphs, Amazon EC2, Visualization. 
23 | 24 | ### Math 25 | * Linear Algebra / Levandosky [Stanford / Book](http://www.amazon.com/Linear-Algebra-Steven-Levandosky/dp/0536667470/ref=sr_1_1?ie=UTF8&qid=1376546498&sr=8-1&keywords=linear+algebra+levandosky#) 26 | * Statistics [Stats in a Nutshell / Book](http://shop.oreilly.com/product/9780596510497.do) 27 | * Problem-Solving Heuristics "How To Solve It" [Polya / Book](http://en.wikipedia.org/wiki/How_to_Solve_It) 28 | 29 | ### Computing 30 | * **Algorithms** 31 | * Algorithms Design & Analysis I [Stanford / Coursera](https://www.coursera.org/course/algo) 32 | * Algorithm Design [Kleinberg & Tardos / Book](http://www.amazon.com/Algorithm-Design-Jon-Kleinberg/dp/0321295358/ref=sr_1_1?ie=UTF8&qid=1376702127&sr=8-1&keywords=kleinberg+algorithms) 33 | 34 | * **Databases** 35 | * Introduction to Databases [Stanford / Coursera](https://www.coursera.org/course/db) 36 | 37 | * **Data Mining** 38 | * Mining Massive Data Sets [Stanford / Book](http://i.stanford.edu/~ullman/mmds.html) 39 | * Mining The Social Web [O'Reilly / Book](http://shop.oreilly.com/product/0636920010203.do) 40 | * Introduction to Information Retrieval [Stanford / Book](http://nlp.stanford.edu/IR-book/information-retrieval-book.html) 41 | 42 | * **Machine Learning** 43 | * Machine Learning / Ng [Stanford / Coursera](https://www.coursera.org/course/ml) 44 | * Programming Collective Intelligence [O'Reilly / Book](http://shop.oreilly.com/product/9780596529321.do) 45 | * Statistics [The Elements of Statistical Learning / Book](http://www-stat.stanford.edu/~tibs/ElemStatLearn/) *(in progress)* 46 | 47 | * **Probabilistic Graphical Models** 48 | * Probabilistic Programming and Bayesian Methods for Hackers [GitHub / Tutorials](https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers) 49 | * PGMs / Koller [Stanford / Coursera](https://www.coursera.org/course/pgm) *(in progress)* 50 | 51 | * **Natural Language Processing** 52 | * NLP with Python [O'Reilly / 
Book](http://shop.oreilly.com/product/9780596516499.do) 53 | 54 | * **Analysis** 55 | * Python for Data Analysis [O'Reilly / Book](http://www.kqzyfj.com/click-7040302-11260198?url=http%3A%2F%2Fshop.oreilly.com%2Fproduct%2F0636920023784.do&cjsku=0636920023784) 56 | * Big Data Analysis with Twitter [UC Berkeley / Lectures](http://blogs.ischool.berkeley.edu/i290-abdt-s12/) 57 | * Social and Economic Networks: Models and Analysis / [Stanford / Coursera](https://www.coursera.org/course/networksonline) 58 | * Information Visualization ["Envisioning Information" Tufte / Book](http://www.amazon.com/Envisioning-Information-Edward-R-Tufte/dp/0961392118/ref=sr_1_8?ie=UTF8&qid=1376709039&sr=8-8&keywords=information+design) 59 | 60 | * **Python** (Learning) 61 | * New To Python: [Learn Python the Hard Way](http://learnpythonthehardway.org/), [Google's Python Class](code.google.com/edu/languages/google-python-class/) 62 | 63 | * **Python** (Libraries) 64 | * Basic Packages [Python, virtualenv, NumPy, SciPy, matplotlib and IPython ](http://www.lowindata.com/2013/installing-scientific-python-on-mac-os-x/) 65 | * Bayesian Inference | [pymc](https://github.com/pymc-devs/pymc) 66 | * Labeled data structures objects, statistical functions, etc [pandas](https://github.com/pydata/pandas) (See: Python for Data Analysis) 67 | * Python wrapper for the Twitter API [twython](https://github.com/ryanmcgrath/twython) 68 | * Tools for Data Mining & Analysis [scikit-learn](http://scikit-learn.org/stable/) 69 | * Network Modeling & Viz [networkx](http://networkx.github.io/) 70 | * Natural Language Toolkit [NLTK](http://nltk.org/) 71 | 72 | ### Projects 73 | * Coursework 74 | * Sentiment analysis, trending topics, and friendship mapping with Twitter API 75 | * Joins and Matrix Manipulation in MapReduce (AWS EC2) 76 | * In-database Text analysis (SQL) 77 | * Sentiment analysis of movie tweets (Python) 78 | 79 | 80 | *** 81 | ### A Note on Tools 82 | 83 | This degree is brought to you by: "THE 
INTERNET". 84 | 85 | Information is more democratized^ now than it was at any point in history. Given a little initiative and interest, you can tailor and excel in an education of your own design. The connective web made me what I am today, growing from the child obsessed with [Number Munchers](http://en.wikipedia.org/wiki/Munchers#Number_Munchers) to an adult jaw-dropping over [DBSCAN](http://en.wikipedia.org/wiki/DBSCAN). 86 | 87 | The most valuable resources I used were: 88 | * [Coursera](http://coursera.org) 89 | * [Khan Academy](https://www.khanacademy.org/math/probability/random-variables-topic/random_variables_prob_dist/v/term-life-insurance-and-death-probability) 90 | * [Wolfram Alpha](http://www.wolframalpha.com/input/?i=torus) 91 | * [Wikipedia](http://en.wikipedia.org/wiki/List_of_cognitive_biases) 92 | * [Quora](http://www.quora.com/Programming-Challenges-1/What-are-some-good-toy-problems-in-data-science) 93 | * **Kindle .mobis** (carrying textbooks is so 90s.) 94 | * PopSci Read: [The Signal and The Noise](http://www.amazon.com/Signal-Noise-Predictions-Fail-but-ebook/dp/B007V65R54/ref=tmm_kin_swatch_0?_encoding=UTF8&sr=8-1&qid=1376699450) Nate Silver 95 | * **Friends & Family** (Impossible without their support! Special Thanks to N.S.) 96 | 97 | *^ given internet access - an issue near and dear to me.* 98 | 99 | *** 100 | 101 | 102 | ### I "Forked" this into the [Open Source Data Science Masters](http://datasciencemasters.org) Curriculum. 103 | 104 | [Follow me on Twitter @clarecorthell](http://twitter.com/clarecorthell) -------------------------------------------------------------------------------- /transcripts/linda-george-transcript.md: -------------------------------------------------------------------------------- 1 | ### Open Source Data Science 2 | 3 | Transcript for L. George 4 | (snapshot as of June 20th, 2014) 5 | 6 | ### Data Science / Analytics Coursework 7 | 8 | Data Analysis: Applied statistics course using R. Coursera/Johns Hopkins U. 
Completed 3/22/13. 9 | 10 | Computing for Data Analysis: Using R for effective data analysis. Coursera/Johns Hopkins U. Completed 4/17/13. 11 | 12 | Web Intelligence and Big Data: Search, indexing, sentiment analysis, MapReduce, classification and clustering algorithms, Bayesian inference, and feature selection. Tools: Python, SQLite. Coursera/IIT Delhi. Completed 6/6/13. 13 | 14 | Introduction to Data Science: SQL/NoSQL, Hadoop, MapReduce, statistical modeling and machine learning, sentiment analysis (via Twitter API), visualization. Tools: Python, SQLite, Tableau. Coursera/U. of Washington. Completed 6/29/13. 15 | 16 | Zipfian Academy data science training program (not open source): Q1-Q2, 2014. 17 | * Worked with a diverse range of analytic algorithms and approaches, including supervised and unsupervised machine learning, recommender systems, natural language processing, A/B testing, and methods for large-scale data storage and retrieval. Languages: Python, MySQL. 18 | * Implemented WorkVibes, a summarization tool for company reviews. WorkVibes curates rich, distinctive content from review corpora, using part-of-speech tagging and aggregate TF-IDF weights to identify relevant opinions across large numbers of reviews. The data pipeline includes data acquisition from Glassdoor.com, HTML parsing, and the use of MySQL for storing reviews. 19 | 20 | Machine Learning: Explored a range of machine learning approaches from regression to neural networks, anomaly detection, and machine learning at scale. Coursera/Stanford. Completed 6/4/14. 21 | 22 | ### Computing 23 | 24 | #### Software Engineering 25 | 26 | Introduction to Systematic Program Design: Modeling information and structuring programs in a systematic way. Coursera/U. of British Columbia. Completed 9/11/13. 27 | 28 | ### Database 29 | MySQL Crash Course: Overview of MySQL. Book/Forta. Completed 12/30/13. 30 | 31 | ### Future steps 32 | 33 | Natural Language Processing. 
Coursera/Stanford 34 | 35 | Probabilistic Graphical Models. Coursera/Stanford. 36 | 37 | Introduction to Databases. Coursera/Stanford. 38 | 39 | ### Background 40 | 41 | Social/personality psychologist with computer science background: [LinkedIn](http://www.linkedin.com/in/lindaggeorge) 42 | 43 | ### Favorite resources: 44 | 45 | * Coursera 46 | * O'Reilly books Python for Data Analysis, Natural Language Processing with Python, and Doing Data Science 47 | * ThinkPython book, A. Downey 48 | * Stack Overflow 49 | * Quora 50 | 51 | [Latest version of this transcript](http://bringdata.wordpress.com/transcript) -------------------------------------------------------------------------------- /transcripts/nick-byrne-transcript.md: -------------------------------------------------------------------------------- 1 |

Nick Byrne Transcript

2 | **Open Source Data Science Masters**
3 | I'm currently looking for people to pair with and work on a capstone project.
4 | 5 | Want to collaborate? Get in touch: 6 | * [twitter](http://www.twitter.com/byrnenick); or 7 | * [email](mailto:nick@thinkactlive.com) 8 | 9 | **OpenSource Data Science Masters Curriculum**
10 | Below is the curriculum I plan to follow. As with life, I don't expect to follow it strictly in order, and I may swap courses in and out as interesting things arise. 11 | I do plan to take at least one element from each of the recommended themes published in the OpenSource Data Science Masters. I'm favouring online courses because it's easier to stay honest about progress in a course than to read a book and claim that you know the subject matter. 12 | 13 | 

Recognised openSource curriculum

14 |

Base Introduction

15 | Data Science Introductions 16 | - [ ] Intro to Data Science by UW / Coursera, online course 17 | - [ ] Data Science by Harvard, online course 18 | - [ ] Data Science with Open Source Tools, book 19 | - [ ] Introduction to Computer Science and Programming, by MIT OpenCourseWare 20 | *Intro to CS was listed in the Python (Learning) section, but I felt it would be a good one to bring up front (despite already having a good grasp of Python). 21 | 22 | Mathematics 23 | - [ ] Linear Programming (Math 407) University of Washington 24 | - [ ] Statistics by Princeton & Coursera 25 | - [ ] Differential Equations in Data Science, Python tutorial 26 | - [ ] Problem-Solving Heuristics "How to Solve It" by Polya, Book 27 | 28 | 

Computing

29 | Algorithms 30 | - [ ] Algorithms Design & Analysis, by Stanford and Coursera 31 | 32 | Distributed Computing Paradigms 33 | - [ ] Intro to Hadoop and MapReduce by Cloudera and Udacity 34 | *Note: I might swap the above course with an edX course on Apache Spark and distributed computing* 35 | 36 | Databases 37 | - [ ] Introduction to Databases, by Stanford 38 | 39 | Data Mining 40 | - [ ] Mining Massive Data Sets, by Stanford and Coursera 41 | 42 | Machine Learning - Foundational & Theoretical 43 | - [ ] Machine Learning, by Ng, Stanford and Coursera (**in-progress**) 44 | - [ ] The Elements of Statistical Learning, by Stanford 45 | 46 | Machine Learning - Practical 47 | - [ ] Programming Collective Intelligence 48 | - [ ] Intro to scikit-learn, by SciPy2013 49 | 50 | Probabilistic Modeling 51 | - [ ] Probabilistic Graphical Models, by Stanford and Coursera 52 | 53 | Deep Learning (Neural Networks) 54 | - [ ] Neural Networks, by University of Toronto and Coursera 55 | 56 | Natural Language Processing 57 | - [ ] From Languages to Information, by Stanford 58 | - [ ] NLP with Python (NLTK library) 59 | 60 | Analysis 61 | - [ ] Big Data Analysis with Twitter, by UC Berkeley 62 | 63 | 

Data Design

64 | *To be confirmed* 65 | 66 | 67 |

Relevant prior studies

68 | - [X] Adelaide University, Mathematics 1011 69 | - [X] Adelaide University, Statistics 1001 70 | - [X] Adelaide University, Engineering Modelling and Analysis 1003 71 | - [X] Adelaide University, Mathematics 1012 72 | - [X] Adelaide University, Differential Equations and Statistical Methods 2010 73 | - [X] Adelaide University, Engineering Modelling and Analysis 2010 74 | - [X] Adelaide University, Engineering Modelling and Analysis 3009 75 | - [X] Adelaide University, Environmental Modelling, Management and Design 4987 76 | 77 | 78 | **OpenSource Data Science Masters Capstone Project** 79 | I would like to do a capstone project focused on using big data to understand workplace dynamics and to make more appropriate hiring decisions. E.g. can we use big data to better understand an employee's cultural fit? 80 | As I progress through the curriculum, I'll better define the capstone project. 81 | 82 | If you'd like to pair up for the capstone, [let me know](http://www.twitter.com/byrnenick) 83 | --------------------------------------------------------------------------------