├── LICENSE.md
└── README.md

/LICENSE.md:
--------------------------------------------------------------------------------
1 | This is free and unencumbered software released into the public domain.
2 |
3 | Anyone is free to copy, modify, publish, use, compile, sell, or
4 | distribute this software, either in source code form or as a compiled
5 | binary, for any purpose, commercial or non-commercial, and by any
6 | means.
7 |
8 | In jurisdictions that recognize copyright laws, the author or authors
9 | of this software dedicate any and all copyright interest in the
10 | software to the public domain. We make this dedication for the benefit
11 | of the public at large and to the detriment of our heirs and
12 | successors. We intend this dedication to be an overt act of
13 | relinquishment in perpetuity of all present and future rights to this
14 | software under copyright law.
15 |
16 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17 | EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18 | MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
19 | IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
20 | OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
21 | ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
22 | OTHER DEALINGS IN THE SOFTWARE.
23 |
24 |
25 | For more information, please refer to
26 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # The Open-Source Data Science Masters
2 | This is a [fork of this](https://github.com/datasciencemasters/go), experimenting with different curriculum topics and themes.
3 |
4 | [License here](LICENSE.md).
5 |
6 | ## The Open Source Data Science Curriculum
7 | ![](http://nirvacana.com/thoughts/wp-content/uploads/2013/07/RoadToDataScientist1.png)
8 |
9 | ### History
10 | ### Fundamentals
11 | **Intro to Data Science** [UW / Coursera](https://www.coursera.org/course/datasci)
12 | * *Topics:* Python NLP on Twitter API, Distributed Computing Paradigm, MapReduce/Hadoop & Pig Script, SQL/NoSQL, Relational Algebra, Experiment design, Statistics, Graphs, Amazon EC2, Visualization.
13 |
14 | * Forecasting: Principles and Practice [Monash University / Book](http://otexts.com/fpp/) *uses R
15 | * Problem-Solving Heuristics "How To Solve It" [Polya / Book](http://en.wikipedia.org/wiki/How_to_Solve_It)
16 | * Think Bayes [Allen Downey / Book](http://www.greenteapress.com/thinkbayes/)
17 | * Capstone Analysis of Your Own Design; [Quora](http://www.quora.com/Programming-Challenges-1/What-are-some-good-toy-problems-in-data-science)'s Idea Compendium
18 | * [Toy Data Ideas](http://www.quora.com/Programming-Challenges-1/What-are-some-good-toy-problems-in-data-science)
19 |
20 | Skills
21 |
22 | Matrices and Linear Algebra fundamentals
23 | Linear Algebra / Levandosky [Stanford / Book](http://www.amazon.com/Linear-Algebra-Steven-Levandosky/dp/0536667470/ref=sr_1_1?ie=UTF8&qid=1376546498&sr=8-1&keywords=linear+algebra+levandosky#)
24 | Coding the Matrix: Linear Algebra through Computer Science Applications [Brown / Coursera](https://www.coursera.org/course/matrix)
25 | Hash Functions, Binary Tree, O(n)
26 | Relational Algebra
27 | DB Basics
28 | Inner, Outer, Cross, Theta join
29 | CAP Theorem
30 | Tabular data
31 | Entropy
32 | Data Frames and Series
33 | Sharding
34 | OLAP
35 | Multidimensional Data Model
36 | ETL
37 | Reporting vs. BI vs.
Analytics
38 | JSON & XML
39 | NoSQL
40 | Regex
41 | Vendor Landscape
42 | Env setup
43 |
44 | ### Maths and Stats
45 | * Statistics [Stats in a Nutshell / Book](http://shop.oreilly.com/product/9780596510497.do) Pick a dataset
46 | * Linear Programming (Math 407) [University of Washington / Course](http://www.math.washington.edu/~burke/crs/407/lectures/)
47 |
48 | Skills
49 |
50 | Descriptive statistics
51 | Exploratory Data Analysis
52 | Histograms
53 | Percentiles and outliers
54 | Probability theory
55 | Bayes' Theorem
56 | Random Variables
57 | Cumulative Distribution Function (CDF)
58 | Continuous Distributions (Normal, Poisson, Gaussian)
59 | Skewness
60 | ANOVA
61 | Probability Density Functions
62 |
63 | Central Limit Theorem
64 | Monte Carlo Method
65 | Hypothesis testing
66 | p-value
67 | Chi-squared test
68 | Estimation
69 | Confidence intervals (CI)
70 | MLE
71 | Kernel Density Estimate
72 | Regression
73 | Covariance
74 | Correlation
75 | Pearson Coefficient
76 | Causation
77 | Least squares fit
78 | Euclidean Distance
79 |
80 | * Probabilistic Programming and Bayesian Methods for Hackers [Github / Tutorials](https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers)
81 | * PGMs / Koller [Stanford / Coursera](https://www.coursera.org/course/pgm)
82 |
83 |
84 | ### Computing
85 | #### Toolbox / Programming Languages / Software stacks
86 | Skills
87 |
88 | Unix CLI: install programs and packages
89 | Bash basics
90 | cat, grep, wget etc
91 | piping
92 | understand stdio
93 | Python
94 | Regex
95 | MS Excel w/ Analysis ToolPak
96 | Java
97 | R, R-studio, Rattle
98 | IBM SPSS
99 | Weka, Knime, RapidMiner
100 | Hadoop distribution of choice
101 | Spark, Storm
102 | Flume, Scribe, Chukwa
103 | Nutch, Talend, Scraperwiki
104 | Webscraper, Flume, Sqoop
105 | tm, RWeka, NLTK
106 | RHIPE
107 | D3.js, ggplot2, Shiny
108 | IBM Languageware
109 | Cassandra, MongoDB
110 | #### Algorithms, data structures and databases
111 | *
**Algorithms**
112 | * Algorithms Design & Analysis I [Stanford / Coursera](https://www.coursera.org/course/algo)
113 | * Algorithm Design [Kleinberg & Tardos / Book](http://www.amazon.com/Algorithm-Design-Jon-Kleinberg/dp/0321295358/ref=sr_1_1?ie=UTF8&qid=1376702127&sr=8-1&keywords=kleinberg+algorithms)
114 |
115 | * **Databases**
116 | * SQL Tutorial [W3Schools / Tutorials](http://www.w3schools.com/sql/)
117 | * Introduction to Databases [Stanford / Online Course](http://class2go.stanford.edu/db/Winter2013/)
118 |
119 | #### Programming
120 | * **Python** (Learning)
121 | * New To Python: [Learn Python the Hard Way](http://learnpythonthehardway.org/), [Google's Python Class](http://code.google.com/edu/languages/google-python-class/)
122 |
123 | * **Python** (Libraries)
124 | * Basic Packages [Python, virtualenv, NumPy, SciPy, matplotlib and IPython](http://www.lowindata.com/2013/installing-scientific-python-on-mac-os-x/)
125 | * [Data Science in IPython Notebooks](http://nborwankar.github.io/LearnDataScience/) (Linear Regression, Logistic Regression, Random Forests, K-Means Clustering)
126 | * Bayesian Inference | [pymc](https://github.com/pymc-devs/pymc)
127 | * Labeled data structures, statistical functions, etc. [pandas](https://github.com/pydata/pandas) (See: Python for Data Analysis)
128 | * Python wrapper for the Twitter API [twython](https://github.com/ryanmcgrath/twython)
129 | * Tools for Data Mining & Analysis [scikit-learn](http://scikit-learn.org/stable/)
130 | * Network Modeling & Viz [networkx](http://networkx.github.io/)
131 | * Natural Language Toolkit [NLTK](http://nltk.org/)
132 |
133 | Skills
134 |
135 | Variables
136 | Vectors
137 | Matrices
138 | Arrays
139 | Factors
140 | Lists
141 | Data Frames
142 | Reading CSV data
143 | Reading Raw data
144 | Manipulate Data Frames
145 | Functions
146 | Factor Analysis
147 |
148 | ### Applied methods
149 | #### Data Munging and integration
150 | The art of converting or mapping data from one "raw" form into another format that allows for more convenient consumption of the data with the help of semi-automated tools. Expect to spend 80% of your workday doing some sort of data wrangling.
151 | * [What I learned from 2 years of 'data sciencing'](http://www.quantisan.com/what-i-learned-from-2-years-of-data-sciencing/) Paul Lam
152 |
153 | Skills
154 |
155 | Dimensionality & Numerosity Reduction
156 | Normalization
157 | Data Scrubbing
158 | Handling missing values
159 | Unbiased estimators
160 | Binning sparse values
161 | Feature Extraction
162 | Denoising
163 | Sampling
164 | Stratified Sampling
165 | Principal Component Analysis
166 | Summary of Data Formats
167 | Data Discovery
168 | Data Sources & Acquisition
169 | Data Integration
170 | Data Fusion
171 | Transformation and enrichment
172 | Data survey
173 | Google OpenRefine
174 | How Much Data
175 | Using ETL
176 |
177 | #### Visualization
178 |
179 | Skills
180 |
181 | Data Exploration in R (Hist, boxplot etc)
182 | Uni, Bi and multivariate Viz
183 | ggplot2
184 | Histogram & Pie (Uni)
185 | Tree and Tree Map
186 | Scatter Plot
187 | Line Charts
188 | Survey Plot
189 | Timeline
190 | Decision Tree
191 | D3.js
192 | InfoVis
193 | IBM ManyEyes
194 | Tableau
195 |
196 | ### Data mining and analysis
197 | * Mining Massive Data Sets [Stanford / Book](http://i.stanford.edu/~ullman/mmds.html)
198 | * Mining The Social Web [O'Reilly / Book](http://shop.oreilly.com/product/0636920010203.do)
199 | * Introduction to Information Retrieval [Stanford / Book](http://nlp.stanford.edu/IR-book/information-retrieval-book.html)
200 | * **Analysis**
201 | * Python for Data Analysis [O'Reilly / Book](http://www.kqzyfj.com/click-7040302-11260198?url=http%3A%2F%2Fshop.oreilly.com%2Fproduct%2F0636920023784.do&cjsku=0636920023784)
202 | * Big Data Analysis with Twitter [UC Berkeley / Lectures](http://blogs.ischool.berkeley.edu/i290-abdt-s12/)
203 | * Social and Economic Networks: Models and Analysis / [Stanford /
Coursera](https://www.coursera.org/course/networksonline)
204 | * Information Visualization ["Envisioning Information" Tufte / Book](http://www.amazon.com/Envisioning-Information-Edward-R-Tufte/dp/0961392118/ref=sr_1_8?ie=UTF8&qid=1376709039&sr=8-8&keywords=information+design)
205 |
206 | ### Machine Learning
207 | * Machine Learning / Ng [Stanford / Coursera](https://www.coursera.org/course/ml)
208 | * A Course in Machine Learning / Hal Daumé III UMD [Online Book](http://ciml.info/)
209 | * Programming Collective Intelligence [O'Reilly / Book](http://shop.oreilly.com/product/9780596529321.do)
210 | * Statistics [The Elements of Statistical Learning](http://www-stat.stanford.edu/~tibs/ElemStatLearn/)
211 | * Machine Learning / CaltechX [Caltech / edX](https://courses.edx.org/courses/CaltechX/CS1156x/Fall2013/)
212 |
213 | Skills
214 |
215 | Numerical Var
216 | Categorical Var
217 | Supervised Learning
218 | Unsupervised Learning
219 | Concepts, Inputs and Attributes
220 | Training and Test Data
221 | Classifier
222 | Prediction
223 | Lift
224 | Overfitting
225 | Bias and variance
226 | Classification
227 | Trees and classification
228 | Classification rate
229 | Decision trees
230 | Boosting
231 | Naive Bayes Classifiers
232 | K-Nearest neighbour
233 | Regression
234 | Logistic regression
235 | Ranking
236 | Linear regression
237 | Perceptron
238 | Clustering
239 | Hierarchical clustering
240 | K-means clustering
241 | Neural Networks
242 | Sentiment analysis
243 | Collaborative Filtering
244 | Tagging
245 |
246 | ### Text Mining / NLP
247 | * NLP with Python [O'Reilly / Book](http://shop.oreilly.com/product/9780596516499.do)
248 |
249 | Skills
250 |
251 | Corpus
252 | Named Entity Recognition
253 | Text Analysis
254 | UIMA
255 | Term Document Matrix
256 | Term Frequency and weight
257 | Support Vector Machines
258 | Association rules
259 | Market Basket Analysis
260 | Feature Extraction
261 | Use Mahout
262 | Use Weka
263 | Use NLTK
264 | Classify Text
265 | Vocabulary Mapping
266 |
267 | * Healthcare Twitter Analysis [Coursolve & UW Data Science](https://www.coursolve.org/need/54)
268 |
269 |
270 | ### Big Data
271 | MapReduce fundamentals
272 | Hadoop
273 | HDFS
274 | Data Replication Principles
275 | Setup Hadoop (IBM / Cloudera / HortonWorks)
276 | Name & Data nodes
277 | Job and task tracker
278 | M/R Programming
279 | Sqoop: Loading Data in HDFS
280 | Flume, Scribe: For Unstructured Data
281 | SQL with Pig
282 | DWH with Hive
283 | Scribe, Chukwa For Weblog
284 | Using Mahout
285 | Zookeeper Avro
286 | Storm: Hadoop Realtime
287 | RHadoop, RHIPE
288 | rmr
289 | Cassandra
290 | MongoDB, Neo4j
291 |
292 | ### General Resources:
293 | * [Coursera](http://coursera.org)
294 | * [Khan Academy](https://www.khanacademy.org/math/probability/random-variables-topic/random_variables_prob_dist/v/term-life-insurance-and-death-probability)
295 | * [Wolfram Alpha](http://www.wolframalpha.com/input/?i=torus)
296 | * [Wikipedia](http://en.wikipedia.org/wiki/List_of_cognitive_biases)
297 | * Kindle .mobis
298 | * Great PopSci Read: [The Signal and The Noise](http://www.amazon.com/Signal-Noise-Predictions-Fail-but-ebook/dp/B007V65R54/ref=tmm_kin_swatch_0?_encoding=UTF8&sr=8-1&qid=1376699450) Nate Silver
299 | * Zipfian Academy's [List of Resources](http://blog.zipfianacademy.com/post/46864003608/a-practical-intro-to-data-science)
300 | * [A Software Engineer's Guide to Getting Started w Data Science](http://www.rcasts.com/2012/12/software-engineers-guide-to-getting.html)
301 | * Data Scientist Interviews [Metamarkets](http://metamarkets.com/category/data-science/)
302 |
303 | ## Contribute
304 | Please Share and Contribute Your Ideas -- **it's Open Source!**
305 |
306 | ###### A note on direction
307 | This is an introduction geared toward those with at least **a minimum understanding of programming**, and (perhaps obviously) an interest in the components of Data Science (like statistics and distributed computing).
Out of personal preference and need for focus, the curriculum assumes and mainly uses **Python tools and resources**, except where marked as R, Java etc.
308 |
--------------------------------------------------------------------------------