├── .gitattributes
├── LICENSE
├── README.md
└── resources
    ├── analytics.md
    ├── approaching-projects.md
    ├── biases.md
    ├── inference.md
    ├── machine-learning.md
    ├── python-development.md
    └── statistics.md

/.gitattributes:
--------------------------------------------------------------------------------
1 | # Auto detect text files and perform LF normalization
2 | * text=auto
3 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | BSD 3-Clause License
2 |
3 | Copyright (c) 2019, mfarragher
4 | All rights reserved.
5 |
6 | Redistribution and use in source and binary forms, with or without
7 | modification, are permitted provided that the following conditions are met:
8 |
9 | 1. Redistributions of source code must retain the above copyright notice, this
10 | list of conditions and the following disclaimer.
11 |
12 | 2. Redistributions in binary form must reproduce the above copyright notice,
13 | this list of conditions and the following disclaimer in the documentation
14 | and/or other materials provided with the distribution.
15 |
16 | 3. Neither the name of the copyright holder nor the names of its
17 | contributors may be used to endorse or promote products derived from
18 | this software without specific prior written permission.
19 |
20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
23 | DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
30 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # curated-data-science-resources
2 | People have asked me many times for data science resources so I've curated some here about topics that I find particularly important. I'm an economist by background so my interests here also cover cognitive biases and statistical inference (more so than machine learning).
3 |
4 | This isn't an 'awesome' list... it will always be much shorter than that. It's more a curation of reference material, which I think would have a broad audience.
5 |
6 | ## Resources
7 | - 🗺 **[Approaching projects](resources/approaching-projects.md):** general resources on data science projects and workflows, e.g. reproducibility.
8 | - 📊 **[Statistics](resources/statistics.md):** resources for learning statistics across different levels of knowledge.
9 | - 📈 **[Analytics](resources/analytics.md):** e.g. coding libraries, data visualisation.
10 | - 🤖 **[Machine learning](resources/machine-learning.md):** resources for learning & applying machine learning across different levels of knowledge.
11 | - 🤔 **[Inference](resources/inference.md):** resources for making inferences from data, e.g. causal inference.
12 | - 🧠 **[Biases](resources/biases.md):** information on cognitive biases and statistical biases that can be problematic with data analysis and research.
13 | - 🐍🏗 **[Python development](resources/python-development.md):** general resources for learning the Python language and for development (e.g. Python packaging).
14 |
15 | 🚧 This is a work in progress. 🚧
16 |
--------------------------------------------------------------------------------
/resources/analytics.md:
--------------------------------------------------------------------------------
1 | **Key:**
2 | - 📃 Article
3 | - 📘 Book
4 | - 📁 Repo / set of articles
5 | - 🕸 Website / blog
6 | - 📽 Video(s)
7 | - 🎙 Podcast
8 |
9 | # Python
10 | ## Analytics libraries
11 | - Pandas: the core analytics library
12 | - NumPy and SciPy: libraries for advanced mathematics and scientific computing
13 | - Matplotlib and Seaborn: core libraries for visualisations
14 | - Statsmodels: main library for linear modelling and econometric methods. My library Appelpy is a wrapper around it which is focused more on model diagnostics. Linearmodels has advanced techniques for statisticians and econometricians.
15 |
16 | ## Pandas & plotting
17 | ### Starting off
18 | - 📽 [DataSchool.io](https://www.dataschool.io/start/): range of video tutorials for learning Pandas. A lot of free material.
19 | - 📽 _[Data analysis in Python with Pandas](https://github.com/justmarkham/pandas-videos)_: Some of the free video tutorials in a playlist for beginners.
20 | - 📘 Master Data Analysis with Python: Ted Petrou's neat introduction to learning Pandas with practical examples. Author knows the mechanics of the library very well. Even with a few years of Pandas usage I learnt a few new recipes from the book.
21 | - 📁 [Minimally Sufficient Pandas](https://github.com/tdpetrou/Minimally-Sufficient-Pandas): also by Petrou, this repo is a handy resource for narrowing down the functionality of Pandas to a few dozen methods and general good practice.
22 | - 📃 [Pandas: the two cultures](https://datapythonista.me/blog/pandas-the-two-cultures.html): why does Pandas have so many methods in the API for doing the same thing? A lot of that is due to how the library caters for two 'cultures' of users: i) statisticians; ii) machine learning practitioners. This post by Pandas maintainer Marc Garcia is a spotlight on the cultures.
23 | - 🕸 [Practical Business Python](https://pbpython.com/): blog with many applications of Pandas and the visualisation libraries for business problems.
24 | - 📃 [Real Python's Matplotlib Guide](https://realpython.com/python-matplotlib-guide): Matplotlib documentation can be daunting for beginners. This guide is a nice overview of the Matplotlib interface.
25 |
26 | ### Augmenting knowledge
27 | - 📃 [Diving into Pandas is faster than reinventing it](http://deanla.com/dont_reinvent_pandas.html): examples that show the power of method chaining.
28 | - 📁 [Chris Albon's data wrangling posts](https://chrisalbon.com/#python)
29 | - 📃 [Wes McKinney | Apache Arrow and the "10 Things I Hate About pandas"](https://wesmckinney.com/blog/apache-arrow-pandas-internals/): learnings from Wes McKinney on his time spent developing Pandas. The post touches upon the early history of Pandas and how the library wasn't developed with large datasets in mind (e.g. 100+ GB size). It was written in 2017 and there's been immense progress since then with libraries that can do sophisticated parallel computing on large datasets.
30 | - 🕸 [Tom Augspurger's blog](https://tomaugspurger.github.io/archives/), in particular the series of posts on 'modern idiomatic Pandas', e.g. [time series](https://tomaugspurger.github.io/posts/modern-7-timeseries/).
31 |
32 | # R
33 | - 📘 [R for Data Science](https://r4ds.had.co.nz/) and an [unofficial solutions manual](https://jrnold.github.io/r4ds-exercise-solutions/)
34 | - 📘 [Advanced R](http://adv-r.had.co.nz/)
35 | - 📃 [RStudio cheat sheets](https://rstudio.com/resources/cheatsheets/)
36 |
37 | # SQL
38 | - 🕸 [SQLhabit](https://www.sqlhabit.com/)
39 | - 🕸 [SQLzoo](https://sqlzoo.net/wiki/SQL_Tutorial)
40 |
41 | # Visualisation
42 | - 📘 [Storytelling with Data](http://www.storytellingwithdata.com/): reference on how to do effective 'data storytelling' with visualisations, touching upon aspects such as design and types of charts.
43 |
--------------------------------------------------------------------------------
/resources/approaching-projects.md:
--------------------------------------------------------------------------------
1 | **Key:**
2 | - 📃 Article
3 | - 📘 Book
4 | - 📁 Repo / set of articles
5 | - 🕸 Website / blog
6 | - 📽 Video(s)
7 | - 🎙 Podcast
8 |
9 | # Inspiration
10 | - 📁 [Cookiecutter Data Science](https://drivendata.github.io/cookiecutter-data-science/): excellent template to use as a structure for a data science project. After enough use of this approach, it has also inspired my thinking on how to structure a 'mono repo' that contains multiple data science projects in one codebase.
11 | - 📁 [clean-code-ml](https://github.com/davified/clean-code-ml): examples of how prototype model code on the Titanic dataset can be refactored and moved into source code.
12 | - 📘 [The Turing Way](https://the-turing-way.netlify.com/): an online handbook by the Alan Turing Institute for best practice on reproducible data science. It is aimed more at academic researchers: many of the chapters focus on topics like version control and testing. There are some useful techniques and tools covered to help data scientists make analysis reproducible, from the old (e.g. Make) to the new (e.g. Binder).
13 | - 📃 [Coding habits for data scientists](https://www.thoughtworks.com/insights/blog/coding-habits-data-scientists): blog post from Thoughtworks with tips on how to manage complexity in data science projects and do frequent refactoring of code.
14 |
15 | # Notebook tips
16 | - 📃 [Google Cloud's Jupyter Notebook Manifesto](https://cloudblog.withgoogle.com/products/ai-machine-learning/best-practices-that-can-improve-the-life-of-any-developer-using-jupyter-notebooks/amp/): eight principles for developing with Jupyter Notebooks.
17 | - 📃 [Dataquest | 28 Jupyter Notebook Tips, Tricks, and Shortcuts](https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/)
18 |
--------------------------------------------------------------------------------
/resources/biases.md:
--------------------------------------------------------------------------------
1 | **Key:**
2 | - 📃 Article
3 | - 📘 Book
4 | - 📁 Repo / set of articles
5 | - 🕸 Website / blog
6 | - 📽 Video(s)
7 | - 🎙 Podcast
8 |
9 | # Biases
10 | ## Cognitive biases
11 | Many cognitive biases are examples of statistical bias. These are some cognitive biases that can be problematic when producing (and consuming) research or data analysis.
12 | - **Confirmation bias**:
13 |   - Tendency to search for information / data in support of one's prior beliefs, to the detriment of other relevant evidence.
14 |   - It can be more tempting to try to 'confirm' a hypothesis than to 'disprove' a hypothesis.
15 |   - 📃 [The curious case of confirmation bias](https://www.psychologytoday.com/gb/blog/seeing-what-others-dont/201905/the-curious-case-confirmation-bias): article examines common claims about the prevalence of the bias and discusses how the concept has shifted to cover multiple tendencies over time (e.g. recall only of evidence that supports prior beliefs).
16 | - **Survivorship bias**:
17 |   - Fallacy of focusing on data points that pass a criterion of success, to the detriment of data points that are less visible due to their failure.
18 |   - Business & finance are common examples of domains where the bias is prominent.
19 |   - 📽 [You are missing something](https://www.youtube.com/watch?v=ZyLVIvBidIA): 5-minute summary with practical examples of 'survivors' and how decision-making can be adversely affected by the bias.
20 |   - 📃 [How the survivor bias distorts reality](https://www.scientificamerican.com/article/how-the-survivor-bias-distorts-reality/): short article mentions examples of the bias from a few different sources.
21 | - **Publication bias**
22 | - **Insensitivity to sample size**
23 | - **Availability bias**
24 | - **Clustering illusion**
25 |
26 | ## Statistical biases
27 | - **Selection bias**
28 | - **Non-response bias**
29 |
--------------------------------------------------------------------------------
/resources/inference.md:
--------------------------------------------------------------------------------
1 | **Key:**
2 | - 📃 Article
3 | - 📘 Book
4 | - 📁 Repo / set of articles
5 | - 🕸 Website / blog
6 | - 📽 Video(s)
7 | - 🎙 Podcast
8 |
9 | # Causal inference
10 | ## Introduction
11 | - 📘 [The Book of Why](http://bayes.cs.ucla.edu/WHY/): Judea Pearl's most recent book on causal inference. The book explores the historical background of causal inference and Pearl's mathematical calculus for causal reasoning.
12 | - 📽 [On the path to causal inference](https://pyvideo.org/pydata-london-2019/on-the-path-to-causal-inference.html): my primer on causal inference, which explores an application of causal inference to conversion rate modelling on hotel websites (no mathematics involved). It is an application of Pearl's causal toolkit, but I also relate it to my experience with econometric methods and try to decompose the relationships in diagrams in ways that helped me to understand the toolkit.
13 | - 🕸 [Spurious Correlations](https://www.tylervigen.com/spurious-correlations): neat website for showing how correlation plots can show strong but absurd correlations between variables. Notice also how they are prevalent with time series data, which are prone to having a trend (what would be called a non-stationary process in econometrics).
14 |
15 | ## Econometric methods
16 | Econometric methods are often used to discern causal relationships in systems that involve economic variables.
17 | - 📘 [Mastering 'Metrics: The Path from Cause to Effect](http://www.masteringmetrics.com/)
18 |
19 | ## Tools
20 | - 🕸 [DAGitty](http://www.dagitty.net/): online tool for graphing causal models
21 |
--------------------------------------------------------------------------------
/resources/machine-learning.md:
--------------------------------------------------------------------------------
1 | **Key:**
2 | - 📃 Article
3 | - 📘 Book
4 | - 📁 Repo / set of articles
5 | - 🕸 Website / blog
6 | - 📽 Video(s)
7 | - 🎙 Podcast
8 |
9 | # Intro
10 | - 📘 [The Hundred-page Machine Learning Book](http://themlbook.com/): Andriy Burkov's book is a great introduction to machine learning techniques. The book is available on a 'read first, buy later' principle.
11 | - 🕸 [R2D3](http://www.r2d3.us/): visual introduction to the variance–bias trade-off and decision trees.
12 | - 🕸 [TensorFlow's Neural Network Playground](https://playground.tensorflow.org/): interactive introduction to neural networks. See the effects of changes to network architecture and hyperparameters on model results.
13 | - 📁 [Machine Learning from Scratch](https://dafriedman97.github.io/mlbook/index.html): explanations of common ML algos (with Python code), which look deeper at their derivation.
14 |
15 | # General
16 | - 🕸 [Machine Learning Mastery](https://machinelearningmastery.com/): range of tutorials and blog posts, from beginner to advanced concepts. Blog is regularly updated.
17 | - 🎙 [Linear Digressions](http://lineardigressions.com/)
18 | - 🎙 [Lex Fridman's AI Podcast](https://lexfridman.com/ai/)
19 |
20 | # Advanced references
21 | - 📘 [An Introduction to Statistical Learning](https://www.statlearning.com/): textbook by James, Witten, Hastie and Tibshirani with applications in R. Online book is offered free. Authors Hastie and Tibshirani have over 15 hours of [lecture videos](https://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/) on YouTube to supplement the book.
22 | - 📘 [The Elements of Statistical Learning](https://web.stanford.edu/~hastie/ElemStatLearn/): textbook on statistical learning by Hastie, Tibshirani and Friedman. More advanced than the ISL textbook - assumes much more of an undergraduate-level knowledge of linear algebra, calculus and statistics.
23 | - 📘 [Machine Learning: A Probabilistic Perspective](https://probml.github.io/pml-book/)
24 |
--------------------------------------------------------------------------------
/resources/python-development.md:
--------------------------------------------------------------------------------
1 | **Key:**
2 | - 📃 Article
3 | - 📘 Book
4 | - 📁 Repo / set of articles
5 | - 🕸 Website / blog
6 | - 📽 Video(s)
7 | - 🎙 Podcast
8 |
9 | # Efficient and idiomatic code
10 | - 📃 [Beautiful Idiomatic Python](https://github.com/JeffPaine/beautiful_idiomatic_python)
11 |
12 | # Testing
13 | - 📘 [Python Testing with Pytest](https://pragprog.com/titles/bopytest/python-testing-with-pytest/)
14 | - 📘 [The Art of Unit Testing](https://www.artofunittesting.com/): a book on testing practices with examples in JavaScript, but useful info for a developer regardless of language.
15 | - Testing in Python libraries:
16 |   - [Pandas](https://pandas.pydata.org/docs/reference/testing.html)
17 |   - [Matplotlib](https://matplotlib.org/devel/testing.html)
18 |
19 | # Packaging
20 | - 📽/🕸/📁: [The Sheer Joy of Packaging](https://python-packaging-tutorial.readthedocs.io/) (with [video](https://www.youtube.com/watch?v=xiI1i525ljE) and [GitHub repo](https://github.com/python-packaging-tutorial/python-packaging-tutorial)): tutorial on Python packaging, delivered at SciPy 2018.
21 | - 🕸 [Python Packaging User Guide](https://packaging.python.org/): 'official' guide on Python packaging from the Python Packaging Authority (PyPA). A fairly comprehensive reference.
22 | - 📃 [Python project maturity checklist](http://michal.karzynski.pl/blog/2019/05/26/python-project-maturity-checklist/): a checklist that lays out the development of a Python project in neat steps, from the basics all the way to automated testing and continuous integration.
23 | - 📁 [Making a Python Package](https://kiwidamien.github.io/making-a-python-package.html) (with [repo](https://github.com/kiwidamien/Roman/)): eight-part tutorial on how to release a Python package, covering aspects such as testing, automated testing and data files.
24 | - 📃 [Clarifying PEP 518 (a.k.a. pyproject.toml)](https://snarky.ca/clarifying-pep-518/): `pyproject.toml` is a future enhancement for Python packaging (a provisionally accepted enhancement to date). This post is useful in showing how it compares to the existing standards for Python packaging.
25 |
26 | # Marketing
27 | - 📃 [OpenSourceGuides | Finding users](https://opensource.guide/finding-users/): nice examples on how to promote your project and do outreach (e.g. getting feedback on project).
28 | - 📃 [How we got our 2-year-old repo trending on GitHub in just 48 hours](https://www.freecodecamp.org/news/how-we-got-a-2-year-old-repo-trending-on-github-in-just-48-hours-12151039d78b/): creator of `flask-base` shares tips on "how to get the product out there"
29 |
--------------------------------------------------------------------------------
/resources/statistics.md:
--------------------------------------------------------------------------------
1 | **Key:**
2 | - 📃 Article
3 | - 📘 Book
4 | - 📁 Repo / set of articles
5 | - 🕸 Website / blog
6 | - 📽 Video(s)
7 | - 🎙 Podcast
8 |
9 | # Starting off
10 | - 📃 [Cassie Kozyrkov | Statistics: The Complete Mini Course](https://decision.substack.com/p/statistics-the-complete-mini-course): suggested routes for exploring statistics, from people who want the high-level ideas to those who want to go more advanced in statistics.
11 | - 🕸 [Seeing Theory](https://seeing-theory.brown.edu/index.html): highly interactive introduction to statistics concepts. A corresponding book is also in the works.
12 | - 📘 [OpenIntro Statistics](https://www.openintro.org/stat/)
13 | - 📘 [All of Statistics](https://www.stat.cmu.edu/~larry/all-of-statistics/)
14 |
15 | # Other topics
16 | - 📃 [The permutation test: A visual explanation of statistical testing](https://www.jwilber.me/permutationtest/): visual introduction to how permutation tests are calculated.
17 | - 📃 [Common statistical tests are linear models (or: how to teach stats)](https://lindeloev.github.io/tests-as-linear/): There are 100s of statistical tests and this shows how to link those to linear models. ⛓️🧠 This comes with code snippets and is recommended for people with advanced stats knowledge.
18 | - 📁 [Harvard CS109b | Bayesian Analysis](https://harvard-iacs.github.io/2020-CS109B/labs/lab04/): Harvard lab on Bayesian analysis with notebooks written with PyMC3, from the basics of Bayes' rule to hierarchical models.
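
For a rough feel of how a permutation test is calculated, here is a minimal sketch using only the Python standard library. It is not taken from the linked article: the function name `permutation_test`, the sample data and the add-one correction are all assumptions made for illustration.

```python
import random
from statistics import mean


def permutation_test(a, b, n_permutations=5_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the group labels by shuffling the pooled values,
        # then split them back into groups of the original sizes.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Made-up measurements for two groups.
treatment = [12.1, 11.8, 13.0, 12.7, 12.9, 13.4]
control = [10.9, 11.2, 10.5, 11.6, 11.0, 10.7]

p_value = permutation_test(treatment, control)
print(f"p-value: {p_value:.4f}")
```

The p-value is the share of label shufflings that produce a difference in means at least as large as the observed one, so a small value suggests the observed difference would be unusual if the group labels did not matter.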
19 |
--------------------------------------------------------------------------------