├── LICENSE.md
└── README.md


/LICENSE.md:
--------------------------------------------------------------------------------
 1 | This is free and unencumbered software released into the public domain.
 2 | 
 3 | Anyone is free to copy, modify, publish, use, compile, sell, or
 4 | distribute this software, either in source code form or as a compiled
 5 | binary, for any purpose, commercial or non-commercial, and by any
 6 | means.
 7 | 
 8 | In jurisdictions that recognize copyright laws, the author or authors
 9 | of this software dedicate any and all copyright interest in the
10 | software to the public domain. We make this dedication for the benefit
11 | of the public at large and to the detriment of our heirs and
12 | successors. We intend this dedication to be an overt act of
13 | relinquishment in perpetuity of all present and future rights to this
14 | software under copyright law.
15 | 
16 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17 | EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18 | MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
19 | IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
20 | OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
21 | ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
22 | OTHER DEALINGS IN THE SOFTWARE.
23 | 
24 | 
25 | For more information, please refer to <http://unlicense.org/>
26 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # The Open-Source Data Science Masters
  2 | This is a fork of [datasciencemasters/go](https://github.com/datasciencemasters/go) that experiments with different curriculum topics and themes.
  3 | 
  4 | [License here](LICENSE.md).
  5 | 
  6 | ## The Open Source Data Science Curriculum
  7 | ![](http://nirvacana.com/thoughts/wp-content/uploads/2013/07/RoadToDataScientist1.png)
  8 | 
  9 | ### History
 10 | ###	Fundamentals
 11 | **Intro to Data Science** [UW / Coursera](https://www.coursera.org/course/datasci)
 12 |  * *Topics:* Python NLP on Twitter API, Distributed Computing Paradigm, MapReduce/Hadoop & Pig Script, SQL/NoSQL, Relational Algebra, Experiment design, Statistics, Graphs, Amazon EC2, Visualization.
 14 |  * Forecasting: Principles and Practice [Monash University / Book](http://otexts.com/fpp/) *uses R
 15 |  * Problem-Solving Heuristics "How To Solve It" [Polya / Book](http://en.wikipedia.org/wiki/How_to_Solve_It)
 16 |  * Think Bayes [Allen Downey / Book](http://www.greenteapress.com/thinkbayes/)
 17 |  * Capstone Analysis of Your Own Design; [Quora](http://www.quora.com/Programming-Challenges-1/What-are-some-good-toy-problems-in-data-science)'s Idea Compendium
 18 |  * [Toy Data Ideas](http://www.quora.com/Programming-Challenges-1/What-are-some-good-toy-problems-in-data-science)
 19 | 
 20 | Skills
 21 | 
 22 | 	Matrices and Linear Algebra fundamentals
 23 | 		Linear Algebra / Levandosky [Stanford / Book](http://www.amazon.com/Linear-Algebra-Steven-Levandosky/dp/0536667470/ref=sr_1_1?ie=UTF8&qid=1376546498&sr=8-1&keywords=linear+algebra+levandosky#)
 24 | 		Coding the Matrix: Linear Algebra through Computer Science Applications [Brown / Coursera](https://www.coursera.org/course/matrix)
 25 | 	Hash Functions, Binary Tree, O(n)
 26 | 	Relational Algebra
 27 | 	DB Basics
 28 | 	Inner, Outer, Cross, Theta join
 29 | 	CAP Theorem
 30 | 	Tabular data
 31 | 	Entropy
 32 | 	Data Frames and Series
 33 | 	Sharding
 34 | 	OLAP
 35 | 	Multidimensional Data Model
 36 | 		ETL
 37 | 	Reporting vs. BI vs. Analytics
 38 | 	JSON & XML
 39 | 	NoSQL
 40 | 	Regex
 41 | 	Vendor Landscape
 42 | 	Env setup
 43 | 	
 44 | ###	Maths and Stats
 45 |  * Statistics [Stats in a Nutshell / Book](http://shop.oreilly.com/product/9780596510497.do)	   Pick a dataset
 46 |  * Linear Programming (Math 407) [University of Washington / Course](http://www.math.washington.edu/~burke/crs/407/lectures/)
 47 | 
 48 | Skills
 49 | 
 50 | 	Descriptive statistics
 51 | 	Exploratory Data Analysis
 52 | 	Histograms
 53 | 	Percentiles and outliers
 54 | 	Probability theory
 55 | 	Bayes Theorem
 56 | 	Random Variables
 57 | 	Cumulative Distribution Function (CDF)
 58 | 	Continuous and Discrete Distributions (Normal/Gaussian, Poisson)
 59 | 	Skewness
 60 | 	ANOVA
 61 | 	Probability Density Functions
 62 | 
 63 | 	Central Limit Theorem
 64 | 	Monte Carlo Method
 65 | 	Hypothesis testing
 66 | 	p-value
 67 | 	Chi squared test
 68 | 	Estimation
 69 | 	Confidence intervals (CI)
 70 | 	MLE
 71 | 	Kernel Density Estimate
 72 | 	Regression
 73 | 	Covariance
 74 | 	Correlation
 75 | 	Pearson Coefficient
 76 | 	Causation
 77 | 	Least squares fit
 78 | 	Euclidean Distance
 79 | 
 80 |  * Probabilistic Programming and Bayesian Methods for Hackers [Github / Tutorials](https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers)
 81 |  * PGMs / Koller [Stanford / Coursera](https://www.coursera.org/course/pgm)
 82 | 
 83 | 
 84 | ### Computing
 85 | #### Toolbox / Programming Languages / Software stacks
 86 | Skills
 87 | 
 88 | 	Unix cli install programs and packages
 89 | 	Bash basics
 90 | 		cat, grep, wget etc
 91 | 		piping
 92 | 		understand stdio
 93 | 	Python
 94 | 	Regex
 95 | 	MS Excel w/ Analysis ToolPak
 96 | 	Java
 97 | 	R, R-studio, Rattle
 98 | 	IBM SPSS
 99 | 	Weka, Knime, RapidMiner
100 | 	Hadoop distribution of choice
101 | 	Spark, Storm
102 | 	Flume, Scribe, Chukwa
103 | 	Nutch, Talend, Scraperwiki
104 | 	Webscraper, Flume, Sqoop
105 | 	tm, RWeka, NLTK
106 | 	RHIPE
107 | 	D3.js, ggplot2, Shiny
108 | 	IBM Languageware
109 | 	Cassandra, MongoDB
110 | #### Algorithms, data structures and databases
111 | * **Algorithms**
112 |  * Algorithms Design & Analysis I [Stanford / Coursera](https://www.coursera.org/course/algo)
113 |  * Algorithm Design [Kleinberg & Tardos / Book](http://www.amazon.com/Algorithm-Design-Jon-Kleinberg/dp/0321295358/ref=sr_1_1?ie=UTF8&qid=1376702127&sr=8-1&keywords=kleinberg+algorithms)
114 | 
115 | * **Databases**
116 |  * SQL Tutorial [W3Schools / Tutorials](http://www.w3schools.com/sql/)
117 |  * Introduction to Databases [Stanford / Online Course](http://class2go.stanford.edu/db/Winter2013/)
118 | 
119 | #### Programming
120 | * **Python** (Learning)
121 |  * New To Python: [Learn Python the Hard Way](http://learnpythonthehardway.org/), [Google's Python Class](http://code.google.com/edu/languages/google-python-class/)
122 | 
123 | * **Python** (Libraries)
124 |  * Basic Packages [Python, virtualenv, NumPy, SciPy, matplotlib and IPython ](http://www.lowindata.com/2013/installing-scientific-python-on-mac-os-x/)
125 |  * [Data Science in iPython Notebooks](http://nborwankar.github.io/LearnDataScience/) (Linear Regression, Logistic Regression, Random Forests, K-Means Clustering)
126 |  * Bayesian Inference | [pymc](https://github.com/pymc-devs/pymc)
127 |  * Labeled data structures, statistical functions, etc. [pandas](https://github.com/pydata/pandas) (See: Python for Data Analysis)
128 |  * Python wrapper for the Twitter API [twython](https://github.com/ryanmcgrath/twython)
129 |  * Tools for Data Mining & Analysis [scikit-learn](http://scikit-learn.org/stable/)
130 |  * Network Modeling & Viz [networkx](http://networkx.github.io/)
131 |  * Natural Language Toolkit [NLTK](http://nltk.org/)
132 | 
133 | Skills
134 | 
135 | 	Variables
136 | 	Vectors
137 | 	Matrices
138 | 	Arrays
139 | 	Factors
140 | 	Lists
141 | 	Data Frames
142 | 	Reading CSV data
143 | 	Reading Raw data
144 | 	Manipulate Data Frames
145 | 	Functions
146 | 	Factor Analysis
147 | 
148 | ### Applied methods	
149 | #### Data Munging and integration
150 | Data munging is the art of converting or mapping data from a "raw" form into a format that is more convenient to consume, usually with the help of semi-automated tools. Expect to spend most of your workday (80% is the commonly quoted figure) on some sort of data wrangling.
151 | * [What I learned from 2 years of 'data sciencing'](http://www.quantisan.com/what-i-learned-from-2-years-of-data-sciencing/) Paul Lam
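
To make the idea of mapping "raw" data into a more convenient form concrete, here is a minimal sketch using only the Python standard library. The sample data, the column names (`name`, `city`, `age`), and the `munge` helper are all hypothetical and chosen purely for illustration; in practice a library such as pandas would likely do this work.

```python
import csv
import io

# Hypothetical raw export: stray whitespace, inconsistent case, a missing value.
RAW = """name, city ,age
 Alice ,new york,34
BOB,  Boston ,
carol,NEW YORK,29
"""


def munge(raw_text):
    """Turn messy CSV text into a list of cleaned dictionaries."""
    reader = csv.reader(io.StringIO(raw_text))
    # Normalize header names: trim whitespace and lowercase.
    header = [h.strip().lower() for h in next(reader)]
    rows = []
    for fields in reader:
        if not fields:  # skip blank lines
            continue
        record = dict(zip(header, (f.strip() for f in fields)))
        # Normalize text columns to a consistent case.
        record["name"] = record["name"].title()
        record["city"] = record["city"].title()
        # Convert age to an integer, using None for missing values.
        record["age"] = int(record["age"]) if record["age"] else None
        rows.append(record)
    return rows


if __name__ == "__main__":
    for row in munge(RAW):
        print(row)
```

Each step (trimming, case normalization, type conversion, explicit handling of missing values) corresponds to one of the skills listed below, such as Normalization, Data Scrubbing, and Handling missing values.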
152 | 
153 | Skills
154 | 
155 | 	Dimensionality & Numerosity Reduction
156 | 	Normalization
157 | 	Data Scrubbing
158 | 	Handling missing values
159 | 	Unbiased estimators
160 | 	Binning sparse values
161 | 	Feature Extraction
162 | 	Denoising
163 | 	Sampling
164 | 	Stratified Sampling
165 | 	Principal Component Analysis
166 | 	Summary of Data Formats
167 | 	Data Discovery
168 | 	Data Sources & Acquisition
169 | 	Data Integration
170 | 	Data Fusion
171 | 	Transformation and enrichment
172 | 	Data survey
173 | 	Google OpenRefine
174 | 	How Much Data
175 | 	Using ETL
176 | 
177 | #### Visualization
178 | 
179 | Skills
180 | 
181 | 	Data Exploration in R (Hist, boxplot etc)
182 |     Uni, Bi and multivariate Viz
183 | 	ggplot2
184 | 	Histogram & Pie (Uni)
185 | 	Tree and Tree Map
186 | 	Scatter Plot
187 | 	Line Charts
188 | 	Survey Plot
189 | 	Timeline
190 | 	Decision Tree
191 | 	D3.js
192 | 	InfoVis
193 | 	IBM ManyEyes
194 | 	Tableau
195 | 
196 | ### Data mining and analysis
197 |  * Mining Massive Data Sets [Stanford / Book](http://i.stanford.edu/~ullman/mmds.html)
198 |  * Mining The Social Web [O'Reilly / Book](http://shop.oreilly.com/product/0636920010203.do)
199 |  * Introduction to Information Retrieval [Stanford / Book](http://nlp.stanford.edu/IR-book/information-retrieval-book.html)
200 | * **Analysis**
201 |  * Python for Data Analysis [O'Reilly / Book](http://www.kqzyfj.com/click-7040302-11260198?url=http%3A%2F%2Fshop.oreilly.com%2Fproduct%2F0636920023784.do&cjsku=0636920023784)
202 |  * Big Data Analysis with Twitter [UC Berkeley / Lectures](http://blogs.ischool.berkeley.edu/i290-abdt-s12/)
203 |  * Social and Economic Networks: Models and Analysis / [Stanford / Coursera](https://www.coursera.org/course/networksonline)
204 |  * Information Visualization ["Envisioning Information" Tufte / Book](http://www.amazon.com/Envisioning-Information-Edward-R-Tufte/dp/0961392118/ref=sr_1_8?ie=UTF8&qid=1376709039&sr=8-8&keywords=information+design)
205 | 
206 | ###	Machine Learning
207 |  * Machine Learning / Ng [Stanford / Coursera](https://www.coursera.org/course/ml)
208 |  * A Course in Machine Learning / Hal Daumé III UMD [Online Book](http://ciml.info/)
209 |  * Programming Collective Intelligence [O'Reilly / Book](http://shop.oreilly.com/product/9780596529321.do)
210 |  * Statistics [The Elements of Statistical Learning](http://www-stat.stanford.edu/~tibs/ElemStatLearn/)
211 |  * Machine Learning / CaltechX [Caltech / Edx](https://courses.edx.org/courses/CaltechX/CS1156x/Fall2013/)
212 | 
213 | Skills
214 | 
215 | 	Numerical Var
216 | 	Categorical Var
217 | 	Supervised Learning
218 | 	Unsupervised Learning
219 | 	Concepts, Inputs and Attributes
220 | 	Training and Test Data
221 | 	Classifier
222 | 	Prediction
223 | 	Lift
224 | 	Overfitting
225 | 	Bias and variance
226 | 	Classification
227 | 		Trees and classification
228 | 		Classification rate
229 | 		Decision trees
230 | 		Boosting
231 | 		Naive Bayes Classifiers
232 | 		K-Nearest neighbour
233 | 	Regression
234 | 		Logistic regression
235 | 		Ranking
236 | 		Linear regression
237 | 		Perceptron
238 | 	Clustering
239 | 		Hierarchical clustering
240 | 		K-means clustering
241 | 	Neural Networks
242 | 	Sentiment analysis
243 | 	Collaborative Filtering
244 | 	Tagging
245 | 	   
246 | ###	Text Mining / NLP
247 |  * NLP with Python [O'Reilly / Book](http://shop.oreilly.com/product/9780596516499.do)
248 | 
249 | Skills
250 | 
251 | 	Corpus
252 | 	Named Entity Recognition
253 | 	Text Analysis
254 | 	UIMA
255 | 	Term Document Matrix
256 | 	Term Frequency and weight
257 | 	Support Vector Machines
258 | 	Association rules
259 | 	Market Basket Analysis
260 | 	Feature Extraction
261 | 	Use Mahout
262 | 	Use Weka
263 | 	Use NLTK
264 | 	Classify Text
265 | 	Vocabulary Mapping
266 | 
267 | * Healthcare Twitter Analysis [Coursolve & UW Data Science](https://www.coursolve.org/need/54)
268 | 
269 | 
270 | ### Big Data
271 | 	Map reduce fundamentals
272 | 	Hadoop
273 | 	HDFS
274 | 	Data Replication Principles
275 | 	Setup Hadoop (IBM / Cloudera / HortonWorks)
276 | 	Name & Data nodes
277 | 	Job and task tracker
278 | 	M/R Programming
279 | 	Sqoop: Loading Data in HDFS
280 | 	Flume, Scribe: For Unstructured Data
281 | 	SQL with Pig
282 | 	DWH with Hive
283 | 	Scribe, Chukwa For Weblog
284 | 	Using Mahout
285 | 	Zookeeper, Avro
286 | 	Storm: Hadoop Realtime
287 | 	RHadoop, RHIPE
288 | 	rmr
289 | 	Cassandra
290 | 	MongoDB, Neo4j
291 | 
292 | ### General Resources:
293 | * [Coursera](http://coursera.org)
294 | * [Khan Academy](https://www.khanacademy.org/math/probability/random-variables-topic/random_variables_prob_dist/v/term-life-insurance-and-death-probability)
295 | * [Wolfram Alpha](http://www.wolframalpha.com/input/?i=torus)
296 | * [Wikipedia](http://en.wikipedia.org/wiki/List_of_cognitive_biases)
297 | * Kindle .mobis
298 | * Great PopSci Read: [The Signal and The Noise](http://www.amazon.com/Signal-Noise-Predictions-Fail-but-ebook/dp/B007V65R54/ref=tmm_kin_swatch_0?_encoding=UTF8&sr=8-1&qid=1376699450) Nate Silver
299 | * Zipfian Academy's [List of Resources](http://blog.zipfianacademy.com/post/46864003608/a-practical-intro-to-data-science)
300 | * [A Software Engineer's Guide to Getting Started w Data Science](http://www.rcasts.com/2012/12/software-engineers-guide-to-getting.html)
301 | * Data Scientist Interviews [Metamarkets](http://metamarkets.com/category/data-science/)
302 | 
303 | ## Contribute
304 | Please Share and Contribute Your Ideas -- **it's Open Source!**
305 | 
306 | ###### A note on direction
307 | This is an introduction geared toward those with at least **a minimum understanding of programming**, and (perhaps obviously) an interest in the components of Data Science (like statistics and distributed computing). Out of personal preference and need for focus, the curriculum assumes and mainly uses **Python tools and resources**, except where marked as R, Java etc.
308 | 


--------------------------------------------------------------------------------