├── .DS_Store
├── Algorithms
│   └── resources.md
├── DeepLearning
│   ├── ArticlesToRead.md
│   └── NeuralNetworks.md
├── LinearAlgebra
│   ├── questions.md
│   └── study.md
├── MachineLearning
│   ├── A_B_tests.md
│   ├── Bayesian_Statistics.md
│   ├── Bias_Variance_Trade-off
│   ├── Classification.md
│   ├── Clustering.md
│   ├── DecisionTrees.md
│   ├── DimensionalityReduction.md
│   ├── ExplainML_Algorithm.md
│   ├── FeatureEngineering.md
│   ├── FeatureSelection.md
│   ├── GeneralML_Questions.md
│   ├── Hyperparameter_Tuning.md
│   ├── LinearRegression.md
│   ├── LogisticRegression.md
│   ├── ModelValidation.md
│   ├── Natural_Language_Processing.md
│   ├── Optimization.md
│   ├── Recommendation.md
│   ├── Regression.md
│   ├── Regularization.md
│   ├── SVM_Kernels
│   └── Time_Series.md
├── OpenSource
│   └── openSourceAdvice.md
├── README.md
├── SQL
│   └── SQL_question.md
├── Statistics
│   ├── ANOVA.md
│   ├── Basic_Statistics.md
│   ├── Distributions.md
│   ├── Experiment_Design.md
│   ├── README.md
│   └── Statistics2Know.md
├── ToBreakDown
│   ├── .DS_Store
│   ├── DeepLearningInterviewQuestions.pdf
│   ├── Galvanize.md
│   ├── InterviewQuestions.docx
│   ├── README.md
│   ├── ToBreakDown.md
│   ├── UCSDCSEIQ interview prep doc.docx
│   ├── deepLearningInterview.pdf
│   ├── images
│   │   ├── CRISP.png
│   │   ├── Profit curve.png
│   │   ├── anscombesquartet.png
│   │   ├── cnns.png
│   │   ├── conics.png
│   │   ├── dbscan.png
│   │   ├── hyndman_modeling_process.png
│   │   ├── ses.png
│   │   ├── timeseries.png
│   │   └── transformations.png
│   ├── interviewQuestionsLinkedIn.pdf
│   ├── interviewing.html
│   └── template.py
└── test.md
/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mGalarnyk/DataScienceInterview/d45a972390716e3f5f2a61e44c8cef92d70912bc/.DS_Store
--------------------------------------------------------------------------------
/Algorithms/resources.md:
--------------------------------------------------------------------------------
1 | # Mostly links to learn algorithms for interviews
2 | For a lot of jobs this isn't asked, but for roles that overlap with software engineering, it helps.
3 |
4 | 1. Algorithms in Python: https://github.com/keon/algorithms
5 | 2. CS9: Problem-Solving for the CS Technical Interview (Stanford): https://web.stanford.edu/class/cs9/
6 |
--------------------------------------------------------------------------------
/DeepLearning/ArticlesToRead.md:
--------------------------------------------------------------------------------
1 | 1. When NOT to use deep learning: https://www.datasciencecentral.com/profiles/blogs/when-not-to-use-deep-learning
2 |
3 |
--------------------------------------------------------------------------------
/DeepLearning/NeuralNetworks.md:
--------------------------------------------------------------------------------
1 | 1. How does a neural network with three layers (one input layer, one inner layer and one output layer) compare to a logistic regression?
2 | 2. How would you train an ANN?
3 | 3. What is back propagation?
4 | 4. What is deep learning? What is CNN (Convolution Neural Network) or RNN (Recurrent Neural Network)?
5 |
6 |
7 | ## Articles:
8 | 1. [Why You Should Use Cross-Entropy Error Instead Of Classification Error Or Mean Squared Error For Neural Network Classifier Training](https://jamesmccaffrey.wordpress.com/2013/11/05/why-you-should-use-cross-entropy-error-instead-of-classification-error-or-mean-squared-error-for-neural-network-classifier-training/)
9 |
10 |
--------------------------------------------------------------------------------
/LinearAlgebra/questions.md:
--------------------------------------------------------------------------------
1 | 1. What is an Eigenvalue and Eigenvector?
2 | 2. What do you understand by feature vectors?
3 | 3. What Is Vector Space Model and its use? (question has overlap with Machine learning question)
4 |
--------------------------------------------------------------------------------
/LinearAlgebra/study.md:
--------------------------------------------------------------------------------
1 | # Stats for Now
2 | Going to move to Jupyter and R Markdown since they handle LaTeX better than plain Markdown. They have some limitations, but that's better than nothing. Might make a bunch of links to other references for this.
3 |
4 | ## Need to Know
5 | 1. Orthogonal: dot product is zero. Reference: https://en.wikipedia.org/wiki/Orthogonality
6 | 2.
7 |
8 |
9 | ## Good to Know
10 | 1. Orthonormal Matrix: the columns are mutually orthogonal unit vectors.
11 | ```
12 | # Latex in case you want to add to Jupyter Later
13 |
14 | $ \vec{e}_i \cdot \vec{e}_j = 0$ when $j \ne i$ and $ \vec{e}_i \cdot \vec{e}_j = 1$ when $j = i$
15 | ```
16 |
17 | ## More Advanced (not important, but will move to another section later).
18 | 1. Unitary: a complex square matrix U is unitary if its conjugate transpose U' is also its inverse, that is, if U'U = UU' = I. Reference: https://en.wikipedia.org/wiki/Unitary_matrix
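A quick NumPy check of the definitions above (added for illustration, not part of the original notes). A real orthogonal matrix is also unitary, since its conjugate transpose is just its transpose:
```
import numpy as np

# A 2-D rotation matrix is orthogonal: its transpose is its inverse.
theta = np.pi / 4
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.allclose(np.dot(Q.T, Q), np.eye(2)))  # True: Q'Q = I
print(np.allclose(np.dot(Q, Q.T), np.eye(2)))  # True: QQ' = I
```
The columns of `Q` are also orthonormal, which ties back to the Good to Know item above.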
19 |
20 |
--------------------------------------------------------------------------------
/MachineLearning/A_B_tests.md:
--------------------------------------------------------------------------------
1 | 1. How will you explain an A/B test to an engineer who does not know statistics?
2 | 2. What is the goal of A/B Testing?
3 | 3. How would you prove that one improvement you’ve brought to an algorithm is really an improvement over not doing anything? How familiar are you with A/B testing?
4 | 4. How is A/B testing different from usual Hypothesis testing?
5 |
--------------------------------------------------------------------------------
/MachineLearning/Bayesian_Statistics.md:
--------------------------------------------------------------------------------
1 | 1. How would you use Naive Bayes classifier for categorical features? What if some features are numerical?
2 | 2. Is Naïve Bayes bad? If so, in what respects?
3 | 3. What do you understand by conjugate-prior with respect to Naïve Bayes?
4 | 4. What is the difference between Bayesian Estimate and Maximum Likelihood Estimation (MLE)?
5 | 5. Why is naive Bayes so ‘naive’?
6 | 6. Explain prior probability, likelihood, and marginal likelihood in the context of the Naïve Bayes algorithm.
7 |
--------------------------------------------------------------------------------
/MachineLearning/Bias_Variance_Trade-off:
--------------------------------------------------------------------------------
1 | 1. Bootstrapping - how and why is it used?
2 | 2. Define variance.
3 | 3. How does the variance of the error term change with the number of predictors, in OLS?
4 | 4. How would you control for biases?
5 | 5. What is overfitting a regression model? What are ways to avoid it?
6 |
--------------------------------------------------------------------------------
/MachineLearning/Classification.md:
--------------------------------------------------------------------------------
1 | 1. How would you deal with unbalanced binary classification?
2 | 2. State some real-life problems where classification algorithms can be used.
3 | 3. Tradeoffs between different types of classification models. How to choose the best one?
4 |
--------------------------------------------------------------------------------
/MachineLearning/Clustering.md:
--------------------------------------------------------------------------------
1 | 1. Assuming a clustering model’s labels are known, how do you evaluate the performance of the model?
2 | 2. Differentiate between partitioning method and hierarchical methods of Cluster Analysis.
3 | 3. How would you assess the quality of clustering?
4 | 4. How would you select K for K-Means?
5 | 5. What is K-Means and what is its objective?
6 | 6. What is the difference between Cluster and Systematic Sampling?
7 | 7. What is the difference between Gaussian Mixture Model and K-Means?
8 |
--------------------------------------------------------------------------------
/MachineLearning/DecisionTrees.md:
--------------------------------------------------------------------------------
1 | 1. Describe how Gradient Boosting works.
2 | 2. Describe some of the different splitting rules used by different decision tree algorithms.
3 | 3. How would you build a decision tree model?
4 | 4. How would you compare a decision tree to a logistic regression? Which is more suitable under different circumstances?
5 | 5. What are some business reasons you might want to use a decision tree model?
6 | 6. What impurity measures do you know?
7 | 7. What is pruning and why is it important?
8 | 8. What is Random Forest? Why would you prefer it to SVM?
9 | 9. Why do we combine multiple trees?
10 |
11 |
12 |
--------------------------------------------------------------------------------
/MachineLearning/DimensionalityReduction.md:
--------------------------------------------------------------------------------
1 | 1. Why standardize data? What is the difference between whitening and standardizing the data?
2 | reference: pp. 567 and 568 of https://www.springer.com/us/book/9780387310732 . Need to answer this here since others won't have access to the book.
3 | 2. Are dimensionality reduction techniques supervised or unsupervised? Are all of them one or the other?
4 | 3. Do we need to normalize data for PCA? Why?
5 | 4. Why do we need to center data for PCA and what can happen if we don’t do it?
6 | 5. What dimensionality reductions can be used for preprocessing the data?
7 | 6. Suppose you have a very sparse matrix where rows are highly dimensional. You project these rows on a random vector of relatively small dimensionality. Is it a valid dimensionality reduction technique or not?
8 | 7. What is the advantage of performing dimensionality reduction before fitting an SVM?
9 | 8. What is the purpose of dimensionality reduction and why do we need it?
10 | 9. Is PCA a linear model or not? Why?
11 | 10. What are the differences between Factor Analysis and Principal Component Analysis?
12 | 11. What is Principal Component Analysis (PCA)? Under what conditions is PCA effective? How is it related to eigenvalue decomposition (EVD)?
13 | 12. What is the relationship between Principal Component Analysis (PCA) and Linear & Quadratic Discriminant Analysis (LDA & QDA)
14 |
15 |
--------------------------------------------------------------------------------
/MachineLearning/ExplainML_Algorithm.md:
--------------------------------------------------------------------------------
1 | 1. Explain a machine learning algorithm in layman's terms. This is more challenging than most realize. I'm looking for an intuitive understanding of machine learning algorithms and clear communication skills here. Pretty much everybody picks random forest.
2 | 2. "What is the most complex mathematical theorem proof in Data Mining that you can write now for us on these papers?".
3 | 3. Explain Bayes Theorem in layman's terms
4 | 4. Name and explain 3 different dimensionality reduction strategies
5 | 5. Explain bootstrapping and its practical uses in machine learning
6 | 6. Different types of clustering and how the algorithms work (hierarchical and KMEANS).
7 | 7. You have two machine learning algorithms that have the same accuracy. How can you determine which is better?
8 | Expectation: ROC curves.
9 | 8. How do you deal with imbalanced classes?
10 | reference 1: imbalanced classes: http://www.datasciencecentral.com/profiles/blogs/dealing-with-imbalanced-datasets
11 | reference 2 (SMOTE): https://www.jair.org/media/953/live-953-2037-jair.pdf
12 | 9.
13 |
--------------------------------------------------------------------------------
/MachineLearning/FeatureEngineering.md:
--------------------------------------------------------------------------------
1 | 1. How would you derive new features from features that already exist?
2 | 2. What is Feature Engineering? Give an example where feature engineering can be very useful in predicting results from data, and explain why it is so effective in some cases.
3 |
--------------------------------------------------------------------------------
/MachineLearning/FeatureSelection.md:
--------------------------------------------------------------------------------
1 | 1. Your model considers the feature X significant, and Z is not, but you expected the opposite result. How will you explain it?
2 | 2. You have a data set containing 100K rows, and 100 columns, with one of those columns being our dependent variable for a problem we'd like to solve. How can we quickly identify which columns will be helpful in predicting the dependent variable? Identify two techniques and explain them to me as though I were 5 years old.
3 | 3. Does the model affect the choice of feature selection method?
4 | 4. How does univariate feature selection work?
5 | 5. Is feature selection a dimensionality reduction technique?
6 | 6. Is there any rule of thumb for the number of features that should be used? How do you select the best features?
7 | 7. What are some good ways for performing feature selection that do not involve exhaustive search?
8 | 8. What is feature selection and why is it important? Give examples.
9 | 9. What is the difference between feature selection and feature extraction?
10 | 10. What is variance threshold approach of feature selection?
11 | 11. What will be your approach to recursive feature elimination?
12 | 12. You fit a multiple regression to examine the effect of a particular feature. The feature comes back insignificant, but you believe it is significant. How will you explain it?
13 |
--------------------------------------------------------------------------------
/MachineLearning/GeneralML_Questions.md:
--------------------------------------------------------------------------------
1 | 1. In your opinion, which is more important when designing a machine learning model: model performance or model accuracy?
2 | 2. Choose any machine learning algorithm and describe it.
3 | 3. Give examples of algorithms that use second-order gradient information.
4 | 4. Is more data always better?
5 | 5. Is there any negative impact of using too many or too few variables?
6 | 6. What are resampling methods? Why are they useful? What are their limitations?
7 | 7. What is Gradient Descent Method?
8 |
9 |
--------------------------------------------------------------------------------
/MachineLearning/Hyperparameter_Tuning.md:
--------------------------------------------------------------------------------
1 | 1. Explain grid search and how you would use it?
2 | 2. How would you use model tuning for arriving at the best parameters?
3 | 3. You have one model and want to find the best set of parameters for this model. How would you do that?
4 |
--------------------------------------------------------------------------------
/MachineLearning/LinearRegression.md:
--------------------------------------------------------------------------------
1 | 1. Why is R2 a poor measure of model quality? Name at least two better metrics.
2 | 2. Why is linear regression called linear?
3 | 3. What are the assumptions that standard linear regression models with standard estimation techniques make? How can some of these assumptions be relaxed?
4 | 4. Explain the following parts of a linear regression: the p-value, the coefficients, and the R-squared value. What is the significance of each of these components, and what assumptions do we hold when creating a linear regression?
5 | 5. Could you explain some of the extension of linear models like Splines or LOESS/LOWESS?
6 | 6. Do you consider the models Y~X1+X2+X1X2 and Y~X1+X2+X1X2 to be linear? Why?
7 | 7. In linear regression, under what condition R^2 always equals a perfect 1?
8 | 8. What are the basic assumptions to be made for linear regression?
9 | 9. What are the constraints you need to keep in mind when using a linear regression?
10 | 10. What is heteroskedasticity and how do you address it?
11 | 11. What is the difference between logistic and linear regression? How do you avoid local minima?
12 |
13 |
14 |
15 |
--------------------------------------------------------------------------------
/MachineLearning/LogisticRegression.md:
--------------------------------------------------------------------------------
1 | 1. I know that a linear regression model is generally evaluated using Adjusted R² or F value. How would you evaluate a logistic regression model?
2 | 2. How can you assess a good logistic model?
3 | 3. How would you train a logistic regression model?
4 | 4. What is the effect on the coefficients of logistic regression if two predictors are highly correlated? What are the confidence intervals of the coefficients?
5 | 5. What relationships exist between a logistic regression’s coefficient and the Odds Ratio?
6 |
7 |
8 |
--------------------------------------------------------------------------------
/MachineLearning/ModelValidation.md:
--------------------------------------------------------------------------------
1 | 1. Do you know about Concordance or Lift?
2 | 2. What is a ROC curve? Write pseudo-code to generate the data for such a curve.
3 | 3. What is AU ROC (AUC)?
4 | 4. Which is better: too many false positives or too many false negatives?
5 | 5. How would you analyze the performance of the predictions generated by regression models versus classification models?
6 | 6. How would you assess logistic regression versus simple linear regression models?
7 | 7. How would you check if the regression model fits the data well?
8 | 8. How would you know if your model overfits?
9 | 9. I have two models of comparable accuracy and computational performance. Which one should I choose for production and why?
10 | 10. If you had a categorical dependent variable and a mixture of categorical and continuous independent variables, what algorithms, methods, or tools would you use for analysis?
11 | 11. Is it better to have too many false negatives or too many false positives?
12 | 12. What criteria would you use while selecting the best model from many different models?
13 | 13. What is the 80/20 rule, and why is it important in model validation?
14 | 14. What is the name of the matrix used to evaluate predictive models?
15 | 15. What are precision and recall?
16 | 16. Which evaluation metrics do you know, apart from accuracy?
17 | 17. Can you explain the difference between a Test Set and a Validation Set?
18 | 18. What is 10-Fold CV?
19 | 19. What is Cross-Validation?
20 | 20. What is the difference between holding out a validation set and doing 10-Fold CV?
21 |
22 |
23 |
--------------------------------------------------------------------------------
/MachineLearning/Natural_Language_Processing.md:
--------------------------------------------------------------------------------
1 | 1. What is the use of NLP in Machine Learning?
2 | 2. Split a large string into valid words and store them in a dictionary. If the string cannot be split, return false. What’s your solution’s complexity?
3 | 3. What distance and similarity measures can be used to compare documents?
4 | 4. How can unstructured text data be converted into structured data for ML models?
5 | 5. How would you develop a model to identify plagiarism?
6 | 6. What is the computational complexity of finding a document's most frequently used words?
7 | 7. Why and when are stop words removed? In which situations would you not remove them?
8 |
--------------------------------------------------------------------------------
/MachineLearning/Optimization.md:
--------------------------------------------------------------------------------
1 | 1. Describe a constrained optimization problem and how you would tackle it.
2 | 2. What are “slack variables”?
3 | 3. Do gradient descent methods always converge to the same point?
4 | 4. Give examples of some convex and non-convex optimization algorithms.
5 | 5. Is it necessary that gradient descent will always find the global minimum?
6 | 6. What do you understand by statistical power of sensitivity (recall?) and how do you calculate it?
7 | 7. What is a local optimum, and why is it important in a specific context such as k-means clustering? What are specific ways of determining whether you have a local optimum problem? What can be done to avoid local optima?
8 | 8. What is the difference between Batch Gradient Descent and Stochastic Gradient Descent?
9 |
--------------------------------------------------------------------------------
/MachineLearning/Recommendation.md:
--------------------------------------------------------------------------------
1 | 1. How would you suggest followers on Twitter?
2 | 2. What is Collaborative Filtering?
3 |
--------------------------------------------------------------------------------
/MachineLearning/Regression.md:
--------------------------------------------------------------------------------
1 | 1. When to use k-Nearest Neighbors for regression?
2 |
--------------------------------------------------------------------------------
/MachineLearning/Regularization.md:
--------------------------------------------------------------------------------
1 | 1. How would you approach a categorical feature with high-cardinality?
2 | 2. How would you deal with sparsity?
3 | 3. What are the advantages and disadvantages of using regularization methods like Ridge Regression?
4 | 4. What are the problems of large feature space? How does it affect different models, e.g. OLS? What about computational complexity?
5 | 5. What is Lasso regression? How is it different from OLS and Ridge?
6 | 6. What is Regularization?
7 | 7. What is Ridge Regression? How is it different from OLS Regression? Why do we need it?
8 | 8. What is the difference between density-sparse data and dimensionally-sparse data?
9 | 9. What is the difference between L1 and L2 regularization methods?
10 | 10. When might you want to use ridge regression instead of traditional linear regression?
11 | 11. Which problem does Regularization try to solve?
12 | 12. Why (geometrically) does LASSO produce solutions with zero-valued coefficients (as opposed to ridge)?
13 |
14 |
15 |
16 |
--------------------------------------------------------------------------------
/MachineLearning/SVM_Kernels:
--------------------------------------------------------------------------------
1 | 1. Name and describe three different kernel functions and in what situation you would use each.
2 | 2. What is a kernel? Explain the Kernel trick
3 | 3. Which kernels do you know? How to choose a kernel?
4 | 4. How would you train SVM? What about hard SVM and soft SVM?
5 | 5. How would you use SVD to perform PCA? When is SVD better than EVD for PCA?
6 | 6. Is it beneficial to perform dimensionality reduction before fitting an SVM? Why or why not?
7 | 7. What is the maximal margin classifier in an SVM? How can this margin be achieved, and why is it beneficial?
8 | 8. Why does SVM need to maximize the margin between support vectors?
9 |
--------------------------------------------------------------------------------
/MachineLearning/Time_Series.md:
--------------------------------------------------------------------------------
1 | 1. Can you use machine learning for time series analysis?
2 | 2. How can you deal with different types of seasonality in time series modelling?
3 | 3. How would you apply resampling to time series data?
4 | 4. What are some different Time Series forecasting techniques?
5 | 5. How can you make data normal using the Box-Cox transformation?
6 |
--------------------------------------------------------------------------------
/OpenSource/openSourceAdvice.md:
--------------------------------------------------------------------------------
1 | 1. Contribute to Pandas: https://towardsdatascience.com/get-stuck-in-with-contributing-to-pandas-fea87d2ac99
2 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # DataScienceInterview
2 | Interview stuff to help you get that job! This is a new repo. I'm going to allow pull requests and such later, once I'm comfortable with where the repo is.
3 |
4 | This repo was inspired by the [Data Science Interview Guide](https://towardsdatascience.com/data-science-interview-guide-4ee9f5dc778) by [Sadat Nazrul](https://github.com/snazrul1).
5 |
6 | The table below lists the folders where things are generally contained and then the subjects under them. For example, Machine Learning is a folder and one subject under it could be KMeans.
7 |
8 | Folder | Subject | Blog
9 | --- | --- | ---
10 | [Algorithms](https://github.com/mGalarnyk/DataScienceInterview/tree/master/Algorithms) | [Algorithms](https://github.com/mGalarnyk/DataScienceInterview/tree/master/Algorithms) | Coming Soon
11 | [Deep Learning](https://github.com/mGalarnyk/DataScienceInterview/tree/master/DeepLearning) | [Deep Learning](https://github.com/mGalarnyk/DataScienceInterview/tree/master/DeepLearning) | Coming Soon
12 | [Linear Algebra](https://github.com/mGalarnyk/DataScienceInterview/tree/master/LinearAlgebra)| [Linear Algebra](https://github.com/mGalarnyk/DataScienceInterview/tree/master/LinearAlgebra) | Coming Soon
13 | [Machine Learning](https://github.com/mGalarnyk/DataScienceInterview/tree/master/MachineLearning) | [Machine Learning](https://github.com/mGalarnyk/DataScienceInterview/tree/master/MachineLearning) | Coming Soon
14 |
15 | ## Contributors
16 | Github Username | Blog | Other Social Media
17 | --- | --- | ---
18 | [mGalarnyk](https://github.com/mGalarnyk) | [Blog](https://github.com/mGalarnyk) | [Michael Galarnyk](https://www.youtube.com/c/MichaelGalarnyk)
19 | [snazrul1](https://github.com/snazrul1) | [Blog](https://medium.com/@sadatnazrul) | [PyRevolution](https://www.youtube.com/channel/UCtMGQhxDihrhxoswOZ0p5oA)
20 | [Ephraim Schoenbrun](https://github.com/eschoenbrun) | [Blog](https://ephraimschoenbrun.com) | None yet
21 |
--------------------------------------------------------------------------------
/SQL/SQL_question.md:
--------------------------------------------------------------------------------
1 | ## SQL Interview Question Blogs
2 | 1. https://data36.com/sql-interview-questions-tech-screening-data-analysts
3 |
4 | ## Massive Social Media Company
5 | 1. WHERE vs. HAVING clause. (SQL)
6 |
7 | 2. SQL UNION vs. UNION ALL.
8 |
9 | 3. Find duplicates in a list (binary search or)
10 |
11 | 4. CASE statements in SQL.
12 |
--------------------------------------------------------------------------------
/Statistics/ANOVA.md:
--------------------------------------------------------------------------------
1 | 1. You applied ANOVA and it says that the means are different. How do you identify the populations where the differences are significant?
2 |
--------------------------------------------------------------------------------
/Statistics/Basic_Statistics.md:
--------------------------------------------------------------------------------
1 | 1. Are expected value and mean value different?
2 | 2. Do you know what Type-I/Type-II errors are?
3 | 3. What are p-values and confidence intervals?
4 | 4. What do you do when n is small? How do you quantify uncertainty? Pick one strategy and explain how to make decisions under uncertainty?
5 | 5. What is the Central Limit Theorem and why is it important in data science?
6 | 6. What is the distribution of p-values, in general?
7 | 7. What is the normal distribution? Give an example of some variable that follows this distribution
8 | 8. What is t-Test/F-Test/ANOVA? When to use it?
9 | 9. What summary statistics do you know?
10 | 10. How would you calculate the needed sample size?
11 | 11. How would you calculate the degrees of freedom of an interaction?
12 | 12. How would you find the median of a very large dataset?
13 | 13. How would you measure distance between data points?
14 | 14. How would you remove multicollinearity?
15 | 15. What is collinearity and what to do with it?
16 | 16. What is the difference between squared error and absolute error?
17 | 17. What is the null hypothesis? How do we state it?
18 | 18. When do we need the intercept term and when do we not?
19 |
--------------------------------------------------------------------------------
/Statistics/Distributions.md:
--------------------------------------------------------------------------------
1 | 1. Describe a non-normal probability distribution and how to apply it.
2 | 2. Do you know the Dirichlet distribution? the multinomial distribution?
3 | 3. Explain what a long-tailed distribution is and provide three examples of relevant phenomena that have long tails. Why are they important in classification and prediction problems?
4 | 4. Give examples of data that do not follow a Gaussian or log-normal distribution.
5 | 5. How would you check if a distribution is close to Normal? Why would you want to check it? What is a QQ Plot?
6 | 6. How would you find an anomaly in a distribution?
7 | 7. How would you know when a Gaussian Mixture Model is applicable?
8 | 8. What is the difference between a skewed and a uniform distribution?
9 |
--------------------------------------------------------------------------------
/Statistics/Experiment_Design.md:
--------------------------------------------------------------------------------
1 | 1. Why is randomization important in experimental design?
2 | 2. What are confounding variables?
3 | 3. What is the importance of having a selection bias?
4 | 4. When you sample, what bias are you inflicting?
5 | 5. Why do we need hypothesis testing?
6 | 6. Why do we need to sample and how?
7 | 7. How would you use resampling for hypothesis testing? Have you heard of Permutation Tests?
8 | 8. Some 3rd party organization randomly assigned people to control and experiment groups. How can you verify that the assignment truly was random?
9 | 9. What does it mean (practically) for a design matrix to be “ill-conditioned”?
10 | 10. What is an interaction?
11 | 11. What is Power analysis?
12 | 12. How would you test if two populations have the same mean? What if you have 3 or 4 populations?
13 |
--------------------------------------------------------------------------------
/Statistics/README.md:
--------------------------------------------------------------------------------
1 | ## This section will be made more Organized over time
2 |
3 | 1. Probability and statistics using Python talk from Chalmer (Dark Lord from Booz Allen/GA): https://www.youtube.com/watch?v=zzbw0JbiI6Y
4 |
--------------------------------------------------------------------------------
/Statistics/Statistics2Know.md:
--------------------------------------------------------------------------------
1 | ## Priority
2 | ### Be Able to explain and show math of how it works.
3 | 1. Explain a p-value
4 | 2. Standard deviation: https://en.wikipedia.org/wiki/Standard_deviation
5 | 3. Variance: https://en.wikipedia.org/wiki/Variance
6 | 4. Arithmetic Mean: https://en.wikipedia.org/wiki/Mean
7 | 5. What is a random variable (got asked this in an interview once): https://www.che.utah.edu/~tony/course/material/Statistics/18_rv_pdf_cdf.php#cumu
8 | ## Mathy Job
9 | ### Only necessary if you will have a very math heavy job.
10 | 1. Indicator Function: https://en.wikipedia.org/wiki/Indicator_function
11 |
--------------------------------------------------------------------------------
/ToBreakDown/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mGalarnyk/DataScienceInterview/d45a972390716e3f5f2a61e44c8cef92d70912bc/ToBreakDown/.DS_Store
--------------------------------------------------------------------------------
/ToBreakDown/DeepLearningInterviewQuestions.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mGalarnyk/DataScienceInterview/d45a972390716e3f5f2a61e44c8cef92d70912bc/ToBreakDown/DeepLearningInterviewQuestions.pdf
--------------------------------------------------------------------------------
/ToBreakDown/Galvanize.md:
--------------------------------------------------------------------------------
1 | # Python Interview
2 |
3 | realpython: primitives
4 | 1. What are python primitives (check real python)?
5 |
6 | 1. Make a text reading program with word count, average word amount, number of sentences etc.
7 |
8 | 2. Anagram detector that went through a list of words and detected which ones were anagrams.
9 |
10 | 3. Min max of an array
11 |
12 |
13 | # SQL Interview Portion
14 | 1. MySQL
15 |
16 | # Data Science Interview
17 | 1. conditional problems
18 |
19 | 2. Simple regression analysis
20 |
21 | 3. Interpreting complexity of a test/training set.
22 |
23 | 4. Interpret an ROC Graph.
24 |
25 | 5. False positives, true positives, false negatives, and true negatives in logistic regression.
26 | You're going to be shown one and asked to explain which section is which.
27 |
28 | 6. Outliers in a graph
29 |
30 | 7. Statistical knowledge, general probability, probability distributions, general stats
31 |
32 | 8. frequentist stats
33 |
34 | 9. Overfitting and things you can do to combat it.
35 |
36 | 10. How would you compare two ML models?
37 |
38 | 11. Interpreting train and test error.
39 | Asked what a ROC curve was and asked to identify true positive true negative, and so on. Also asked about machine learning and training vs test accuracy.
40 |
41 | 12. https://www.springboard.com/blog/data-science-interview-questions/
42 |
43 | 13. Frequentist and Bayesian Methods.
44 |
45 | 14. Structured and unstructured data sets (scikit-learn, numpy, scipy)
46 |
47 | 15. Distributions (binomial etc).
48 |
49 | 16. Natural language processing
50 |
51 | 17. Big Data (spark etc)
52 |
53 | 18. What is a random variable?
54 | A random variable X is an object that can be used to generate random numbers, in a way that valid probabilistic statements about the generated numbers can be made.
55 | P(X=1)= .5
56 | P(X=1) = 1/12
57 | P(X>6) =0
58 | are all probabilistic statements about a random variable X.
59 |
60 | 19. Distributions (binomial etc).
61 | Random variables
62 | * Number of heads seen in ten flips of a quarter
63 | * Number of heads seen in ten flips of a dime.
64 | have the same distribution
65 |
66 | The distribution of a random variable is the pattern of all probabilities we assign to all outcomes of the random variable. So two random variables have the same distribution if they assign the same probabilities to all possible outcomes. In this case, we say that these random variables are identically distributed.
67 |
68 | Discrete:
69 | The binomial distribution counts discrete occurrences among a fixed number of discrete trials.
70 |
71 | The Poisson distribution counts discrete occurrences over a continuous domain (typically events in time); see the sketch after this list.
72 | Continuous:
73 | Normal, Uniform, Exponential
74 |
75 | 20. SVM
76 |
77 | 21. Lasso and Ridge Regression
78 |
79 | 22. One sided vs two sided T test.
80 |
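A short sketch (not in the original notes) illustrating item 19's binomial/Poisson distinction with `scipy.stats`:
```
from scipy import stats

# Binomial: heads in 10 flips of a fair coin (a fixed number of discrete trials)
print(stats.binom.pmf(3, n=10, p=0.5))   # P(exactly 3 heads)

# Poisson: events over a continuous window, given an expected rate
print(stats.poisson.pmf(3, mu=2.0))      # P(exactly 3 events) when 2 are expected on average
```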
81 |
82 | # Interview Preparation
83 | 1. https://github.com/GalvanizeDataScience/data-science-primer
84 | 2. https://github.com/GalvanizeOpenSource/stats-shortcourse
85 |
86 |
87 |
88 |
--------------------------------------------------------------------------------
/ToBreakDown/InterviewQuestions.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mGalarnyk/DataScienceInterview/d45a972390716e3f5f2a61e44c8cef92d70912bc/ToBreakDown/InterviewQuestions.docx
--------------------------------------------------------------------------------
/ToBreakDown/README.md:
--------------------------------------------------------------------------------
1 | # Galvanize Reference #
2 |
3 | ## Table of Contents ##
4 |
5 |
6 | * [Introduction](#introduction)
7 | * Programming
8 | * [Python](#python)
9 | ** [Base Data Types](#base-data-types)
10 | ** [Built-in Functions](#built---in-functions)
11 | ** [Classes](#classes)
12 | ** [Loops and List Comprehension](#loops-and-list-comprehension)
13 | ** [File I/O](#file-i-o)
14 | ** [Lambda Functions in Map, Filter, and Reduce](#lambda-functions-in-map-filter-and-reduce)
15 | ** [Testing and Debugging](#testing-and-debugging)
16 | ** [A Note on Style](#a-note-on-style)
17 | ** [Common Errors](#common-errors)
18 | * [Python Packages](#python-packages)
19 | ** [pandas](#pandas)
20 | ** [numpy](#numpy)
21 | ** [scipy](#scipy)
22 | ** [statsmodels](#statsmodels)
23 | ** [sklearn](#sklearn)
24 | ** [boto](#boto)
25 | ** [Plotting with matplotlib and seaborn](#plotting-with-matplotlib-and-seaborn)
26 | * [SQL](#sql)
27 | * [MongoDB and Pymongo](#mongodb-and-pymongo)
28 | * [Git](#git)
29 | * [Unix](#unix)
30 | * [Development and Virtual Environments](#development-and-virtual-environments)
31 | * Linear Algebra, Probability, and Statistics
32 | * [Linear Algebra](#linear-algebra)
33 | * [Calculus](#calculus)
34 | * [Probability](#probability)
35 | ** [Set Operations and Notation](#set-operations-and-notation)
36 | ** [Combinatorics](#combinatorics)
37 | ** [Bayes Theorum](#bayes-theorum)
38 | ** [Random Variables](#random-variables)
39 | * [Statistics](#statistics)
40 | ** [Key Definitions](#key-definitions)
41 | ** [Common Distributions](#common-distributions)
42 | ** [Frequentist Statistics](#frequentist-statistics)
43 | ** [Confidence Intervals](#confidence-intervals)
44 | ** [Hypothesis and A/B Testing](#hypothesis-and-a-b-testing)
45 | ** [Bayesian Statistics](#bayesian-statistics)
46 | * Modeling
47 | * [Exploratory Data Analysis](#exploratory-data-analysis)
48 | * [Linear Regression Introduction and Univariate](#linear-regression-introduction-and-univariate)
49 | * [Multivariate Regression](#multivariate-regression)
50 | * [Logistic Regression](#logistic-regression)
51 | * [Cross-validation](#cross-validation)
52 | * [Bias/Variance Tradeoff](#bias/variance-tradeoff)
53 | * [Gradient Ascent](#gradient-ascent)
54 | * [Machine Learning](#machine-learning-(ml))
55 | * [Supervised Learning](#supervised-learning)
56 | ** [k-Nearest Neighbors (KNN)](#k-nearest-neighbors-(knn))
57 | ** [Decision Trees](#decision-trees)
58 | ** [Bagging and Random Forests](#bagging-and-random-forests)
59 | ** [Boosting](#boosting)
60 | ** [Maximal Margin Classifier, Support Vector Classifiers, Support Vector Machines](#maximal-margin-classifier,-support-vector-classifiers,-support-vector-machines)
61 | ** [Neural Networks](#neural-networks)
62 | * [Unsupervised Learning](#unsupervised-learning)
63 | ** [KMeans Clustering](#kmeans-clustering)
64 | ** [Hierarchical Clustering](#hierarchical-clustering)
65 | ** [Dimension Reduction](#dimension-reduction)
66 | ** [Principle Component Analysis](#principle-component-analysis)
67 | ** [Singular Value Decomposition](#singular-value-decomposition)
68 | ** [Non-Negative Matrix Factorization](#non---negative-matrix-factorization)
69 | * [Data Engineering](#data-engineering)
70 | * [Parallelization](#parallelization)
71 | * [Apache Hadoop](#apache-hadoop)
72 | * [Apache Spark](#apache-spark)
73 | * [Special Topics](#special-topics)
74 | * [Natural Language Processing](#natural-language-processing)
75 | * [Time Series](#time-series)
76 | * [Web-Scraping](#web---scraping)
77 | * [Profit Curves](#profit-curves)
78 | * [Imbalanced Classes](#imbalanced-classes)
79 | * [Recommender systems](#recommender-systems)
80 | * [Graph Theory](#graph-theory)
81 | * [Probabilitic Data Structures](#probabilitic-data-structures)
82 | * [Helpful Visualizations](#helpful-visualizations)
83 | * [Note on Style and Other Tools](#note-on-style-and-other-tools)
84 | * [Career Pointers](#career-pointers)
85 |
86 | ---
87 |
88 | ## Introduction ##
89 |
90 | This is designed to be a catch-all reference tool for my time as a Data Science Fellow at Galvanize. My goal is to have a living document that I can update with tips, tricks, and additional resources as I progress in my data science-ing. Others might find it of use as well.
91 |
92 | The main responsibilities of a data scientist are:
93 |
94 | 1. `Ideation` (experimental design)
95 | 2. `Importing` (especially SQL/postgres/psycopg2)
96 | * defining the ideal dataset
97 | * understanding how your data compares to the ideal
98 | 3. `Exploratory data analysis (EDA)` (especially pandas)
99 | 4. `Data munging`
100 | 5. `Feature engineering`
101 | 6. `Modeling` (especially sklearn)
102 | 7. `Presentation`
103 |
104 | While these represent the core competencies of a data scientist, the method for implementing them is best served by the Cross-Industry Standard Process for Data Mining (CRISP-DM), pictured below.
105 |
106 | 
107 |
108 | This system helps refocus our process on business understanding and business needs. Always ask what your ideal data set is before moving into the data understanding stage, then approach the data you do have in that light. Always, always, always focus on business solutions.
109 |
110 | [Data Science Use Cases](https://www.kaggle.com/wiki/DataScienceUseCases)
111 |
112 | ---
113 |
114 | ## Python ##
115 |
116 | There are many different programming paradigms, such as declarative, functional, and procedural. Object-oriented programming (OOP) languages (and Python in particular) allow for much of the functionality of these other paradigms. One of the values of OOP is encapsulation, which makes working in teams easy. Objects are a combination of state and behavior. The benefits of OOP include:
117 |
118 | * Readability
119 | * Complexity management (scoping)
120 | * Testability
121 | * Isolated modifications
122 | * DRY code (don't repeat yourself)
123 | * Confidence
124 |
125 | OOP has classes and objects. A class is a combination of state and behaviors. Python follows a philosophy where everybody can access anything, making it difficult to obfuscate code. This makes Python particularly well suited to the open source community.
126 |
127 | There are a number of topics that I won't address here such as control structures, functions, and modules.
128 |
129 | Reference:
130 | * [Free, temporary hosting of a notebook](https://tmpnb.org/)
131 | * [Visualizing a python script](http://www.pythontutor.com/)
132 | * [Dan's video on setting up a dev environment](https://www.youtube.com/watch?v=TyPGcnkkheQ&t=391s)
133 | * [CoderPad, for collaborative coding](https://coderpad.io/)
134 |
135 | ### Base Data Types ###
136 |
137 | Python has a few base datatypes whose different characteristics can be leveraged for faster, drier code. Data types based on hashing, such as dictionaries, allow us to retrieve information much more quickly because we can go directly to a value, similar to a card in a card catalogue, rather than iterating over every item.
138 |
139 | * `string`: immutable
140 | * string = 'abc'; string[-1] # note indexing is zero-based, unlike R's one-based indexing
141 | * `tuple`: immutable
142 | * tuple = (5, 3, 8); tuple[2]
143 | * tuple2 = (tuple, 2, 14) # nested tuple
144 | * `int`: immutable integer
145 | * `float`: immutable floating point number
146 | * `list`: mutable, uses append
147 | * list = [5, 5, 7, 1]; list[2]
148 | * list2 = [list, 23, list] # nested list
149 | * list.append(list2) # in place operation
150 | * list3 = list # aliases list
151 | * list3 = list[:] # copies the list (note: because the name `list` shadows the built-in here, `list(list)` would fail)
152 | * `dict`: mutable, a series of hashable key/value pairs.
153 | * dict = {'key': 'value', 'first': 1}
154 | * dict['first']
155 | * dict.keys(); dict.values() # returns keys or values
156 | * dict['newkey'] = 'value'
157 | * defaultdict (from collections) can be used to avoid key errors
158 | * Counter(dict).most_common(1) # (also from collections) orders dict values
159 | * `set`: mutable, also uses hashing. Sets are like dicts without values (similar to a mathematical set) in that they allow only one of a given value. You can do set operations on them
160 | * s = set([1, 2, 3])
161 | * s2 = {3, 4, 5} # equivalently
162 | * s & s2 # returns the intersection (3)
163 | * set.intersection(*[s, s2]) # equivalently, where * is list expansion
164 | * s.add(4) # in place operation
165 | * s.union(s2)
166 |
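To make the card-catalogue point above concrete, here is a small timing sketch (an illustrative addition): membership tests against a dict or set use hashing, while a list is scanned element by element.

    import timeit

    n = 100000
    as_list = list(range(n))
    as_dict = dict.fromkeys(as_list)

    # Checking for the last element: the list scans everything, the dict hashes straight to it.
    print(timeit.timeit(lambda: n - 1 in as_list, number=100))   # slow, linear scan
    print(timeit.timeit(lambda: n - 1 in as_dict, number=100))   # fast, constant-time lookup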
167 |
168 | ### Built-in Functions ###
169 |
170 | The following are the base functions Python offers, some more useful than others. Knowing these with some fluency makes dealing with more complex programs easier.
171 |
172 | `abs()`, `all()`, `any()`, `basestring()`, `bin()`, `bool()`, `bytearray()`, `callable()`, `chr()`, `classmethod()`, `cmp()`, `compile()`, `complex()`, `delattr()`, `dict()`, `dir()`, `divmod()`, `enumerate()`, `eval()`, `execfile()`, `file()`, `filter()`, `float()`, `format()`, `frozenset()`, `getattr()`, `globals()`, `hasattr()`, `hash()`, `help()`, `hex()`, `id()`, `input()`, `int()`, `isinstance()`, `issubclass()`, `iter()`, `len()`, `list()`, `locals()`, `long()`, `map()`, `max()`, `memoryview()`, `min()`, `next()`, `object()`, `oct()`, `open()`, `ord()`, `pow()`, `print()`, `property()`, `range()`, `raw_input()`, `reduce()`, `reload()`, `repr()`, `reversed()`, `round()`, `set()`, `setattr()`, `slice()`, `sorted()`, `staticmethod()`, `str()`, `sum()`, `super()`, `tuple()`, `type()`, `unichr()`, `unicode()`, `vars()`, `xrange()`, `zip()`, `__import__()`
173 |
174 | Note the difference between functions like `range` and `xrange` above. `range` will create a list at the point of instantiation and save it to memory. `xrange`, by contrast, will generate a new value each time it's called upon. **Generators** like this (`zip` versus `izip` from itertools is another common example) are especially powerful in long for loops.
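A small illustration of the same idea with a generator expression (an added sketch; note that in Python 3, `range` is itself lazy and `xrange`/`izip` no longer exist):

    # The list comprehension materializes a million squares before summing;
    # the generator expression yields them one at a time in constant memory.
    total_from_list = sum([i * i for i in range(1000000)])
    total_from_gen = sum(i * i for i in range(1000000))
    print(total_from_list == total_from_gen)  # True, but the generator never builds the list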
175 |
176 | ### Classes ###
177 |
178 | Python classes offer the benefits of classes with a minimum of new syntax and semantics. A class inherits qualities from other classes. In practice, we often start by extending the object class; however, in more complex architectures you can extend from other classes, especially your own.
179 |
180 | To make a class:
181 |
182 | class MapPoint(object): # Extends from the class object, uses camel case by convention
183 | def __init__(self):
184 | self.x = None
185 | self.y = None
186 |
187 | def lon(self): # This is a method, or a function w/in a class
188 | return self.x # use 'self' to access attributes
189 |
190 | def _update_lon(self): # an underscore before the method is convention for one that's only meant to be accessed internally
191 | self.x += 1
192 |
193 | Magic methods allow you to define things like printing and comparing values:
194 |
195 | def __cmp__(self, other): # Used in comparing values
196 | if self._fullness < other._fullness:
197 | return -1 # This is a convention
198 | elif self._fullness > other._fullness:
199 | return 1
200 | elif self._fullness == other._fullness:
201 | return 0
202 | else:
203 | raise ValueError("Couldn't compare {} to {}".format(self, other))
204 |
205 | A decorator wraps a function or class statement to modify how it behaves. The `@property` decorator below, for example, lets you access a method of a class as though it were an attribute rather than a function call. Decorators are a syntactic convenience that allows a Python source file to say what it is going to do with the result of a function or a class statement before, rather than after, the statement.
206 |
207 | @property
208 | def full(self):
209 | return self._fullness == self._capacity
210 |
211 | @property
212 | def empty(self):
213 | return self._fullness == 0
214 |
215 | Basic vocabulary regarding classes:
216 |
217 | * `encapsulation`: the idea that you don't need to worry about the specifics of how a given thing works
218 | * `composition`: the idea that objects can be built from, or related to, other objects (fruit in a backpack, for instance)
219 | * `inheritance`: children take on the qualities of their parents
220 | * `polymorphism`: you customize a given behavior for a child
221 |
222 | Reference:
223 | * [List of Magic Methods](http://www.rafekettler.com/magicmethods.html)
224 |
225 | ### Loops and List Comprehension ###
226 |
227 | There are two types of loops in python: for loops and while loops. List comprehension provides a more compact way to approach for loops.
228 |
229 | For loops iterate over an item, completing after they have either iterated over all of the iterable's components or reached a `break`. While this is similar to other languages, I won't belabor it. However, it is important to draw attention to functions like `enumerate`, which returns index/item pairs, and `zip`, which combines items from multiple iterables pairwise.
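For example (a short illustrative addition), `enumerate` and `zip` look like this in practice:

    fruits = ['apple', 'pear', 'banana']
    prices = [1, 2, 3]

    for i, fruit in enumerate(fruits):   # yields (index, item) pairs
        print(i, fruit)

    for fruit, price in zip(fruits, prices):   # pairs items from both lists
        print(fruit, price)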
230 |
231 | Here are a few examples of list comprehension:
232 |
233 | square_root = [np.sqrt(i) for i in x]
234 |
235 | L = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
236 | flattened_L = [item for row in L for item in row]
237 |
238 | even = ["even" if item % 2 == 0 else "odd" for item in flattened_L]
239 |
240 | ### File I/O ###
241 |
242 | Normally we use pandas to open most files. When working with .txt files, the base Python functionality can be helpful.
243 |
244 | f = open("myfile.txt")
245 | for line in f:
246 | # do something
247 | f.close() # remember to close!
248 |
249 | with open("myfile.txt") as f: # does not need f.close()
250 | # do stuff with the file
251 |
252 | `readline()` is another option for reading a single line. For outputting a file, you can use the following:
253 |
254 | f = open("out.txt", 'w') # specify 'w' for writing or 'a' for appending to the end of the file
255 | f.write("Hello!\n")
256 | f.close()
257 |
258 | ### Lambda Functions in Map, Filter, and Reduce ###
259 |
260 | Lambda functions are for defining functions without a name:
261 |
262 | L = [(2, 4), (5, 3), (6, 8), (4, 1)]
263 | L.sort(key=lambda x: x[0])
264 |
265 | map(lambda x: x**2, [1, 2, 3])
266 | map(lambda x: x * 2 if x > 0 else x, [-1, 2, -3])
267 |
268 | ages = range(30)
269 | adults = filter(lambda x: x > 18, ages)
270 |
271 | reduce(lambda total, x: total + x, [1, 2, 3])
272 |
273 |
274 |
275 | ### Testing and Debugging ###
276 |
277 | * add this in line you're examining: import pdb; pdb.set_trace()
278 | ** Resource: http://frid.github.io/blog/2014/06/05/python-ipdb-cheatsheet/
279 | * Use test_[file to test].py when labeling test files
280 | * Use `from unittest import TestCase`
281 |
282 |
283 | nosetests
284 |
285 | `from unittest import TestCase`
286 |
287 | class OOPTests(TestCase):
288 |
289 | def test_backpack(self):
290 | pack = Backpack()
291 | x = 1
292 | pack.throw_in(x)
293 | self.assertIn(1, pack._items)
294 |
295 | You need assertions in unittest in order to test things.
296 |
297 | Resources:
298 | * [Debugging exercises for pdb](http://tjelvarolsson.com/blog/five-exercises-to-master-the-python-debugger/)
299 |
300 | ### A Note on Style ###
301 |
302 | Classes are capitalized LikeThis (camel case) and functions like_this (snake case).
303 | https://www.python.org/dev/peps/pep-0008/
304 |
305 | DRY: don’t repeat yourself. WET: we enjoy typing.
306 |
307 | pep8 myfile.py # at the terminal, this will evaluate your code against pep8
308 |
309 | Use this block to run code if the .py doc is called directly but not if imported as a module:
310 | if __name__ == '__main__':
311 | [Code to run if called directly here]
312 |
313 | There's a package called flake8 which combines pep8 (the style guide) with pyflakes (code introspection, e.g. flagging use of a variable you didn't define). You should be able to install this for Atom.
314 |
315 | ### Common Errors ###
316 |
317 | Here are some common errors to avoid:
318 |
319 | * `AttributeError`: Thrown when you access an attribute an object doesn't have
320 | * `ZeroDivisionError`: Thrown when dividing by zero
321 | * `AssertionError`: Thrown when code fails an `assert` line such as `assert multiply_numbers(2, 2) == 4`
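A tiny sketch (added for illustration) that triggers each of these:

    class Thing(object):
        pass

    try:
        Thing().missing              # accessing an attribute the object doesn't have
    except AttributeError as e:
        print(e)

    try:
        1 / 0                        # dividing by zero
    except ZeroDivisionError as e:
        print(e)

    try:
        assert 2 * 2 == 5            # a failing assert
    except AssertionError:
        print('assertion failed')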
322 |
323 | ---
324 |
325 | ## Python Packages ##
326 |
327 | Python offers an array of packages instrumental in its rise as one of the leading languages for data science. Here are some useful packages, some of which will be explored in detail below:
328 |
329 | * Fundamental Libraries for Scientific Computing
330 | ** `IPython Notebook`: an alternative Python command line shell for interactive computing
331 | ** `NumPy`: the most fundamental package for efficient scientific computing through linear algebra routines
332 | ** `pandas`: a library for operating with table-like structures and data munging
333 | ** `SciPy`: one of the core packages for scientific computing routines
334 | * Math and Statistics
335 | ** `Statsmodels`: statistical data analysis mainly through linear models and includes a variety of statistical tests
336 | ** `SymPy`: symbolic mathematical computations
337 | ** `Itertools`: combinatorics generators
338 | * Machine Learning
339 | ** `Scikit-learn`: includes a broad range of different classifiers, cross-validation and other model selection methods, dimensionality reduction techniques, modules for regression and clustering analysis, and a useful data-preprocessing module
340 | ** `Shogun`: ML library focussed on large-scale kernel methods
341 | ** `PyBrain`: Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library
342 | ** `PyLearn2`: research library
343 | ** `PyMC`: Bayesian statistics library
344 | * Plotting and Visualization
345 | ** `matplotlib`: the defacto plotting library
346 | ** `seaborn`: adds features to matplotlib like violin plots and more appealing aesthetics
347 | ** `ggplot`: a port of R's ggplot2
348 | ** `plotly`: focus on interactivity
349 | ** `Bokeh`: aesthetic layouts and interactivity to produce high-quality plots for web browsers
350 | ** `d3py`: creates interactive data visualizations based on d3
351 | ** `prettyplotlib`: enhancement library for matplotlib (good for presentations)
352 | * Database Interaction
353 | ** `Psycopg2`: access to postgres databases
354 | ** `Pymongo`: access to MongoDB databases
355 | ** `sqlite3`: access to SQLite databases
356 | * Data formatting and storage
357 | ** `bs4`: defacto library for parsing content from webscraping
358 | ** `csvkit`: has some functionality beyond pandas for csv's
359 | ** `PyTables`: good for large datasets
360 |
361 | ### pandas ###
362 |
363 | Pandas is a library for data manipulation and analysis, the python equivalent of R's dplyr. Pandas objects are based on numpy arrays so **vectorized options (e.g. apply) over iterative ones offer large performance increases.**
364 |
365 | Objects:
366 |
367 | * `series`: `prices = pd.Series([1, 2, 3, 4, 5], index = ['apple', 'pear', 'banana', 'mango', 'jackfruit'])`
368 | * `prices.iloc[1:3]`: works on position, returns rows 1 and 2 (stops before 3)
369 | * `prices.loc['pear']`: works on index, returns row 'pear'
370 | * `prices.loc['pear':]`: returns rows from 'pear' onward
371 | * `prices.ix[1:4]`: similar, mixing label and position indexing (deprecated in newer pandas)
372 | * `prices[prices > 2]`: subsets based on a boolean mask
373 | * `dataframe`: `inventory = pd.DataFrame({'price': [1, 2, 3, 4, 5], 'inventory': prices.index})`
374 | * `inventory.T`: Transpose of inventory
375 | * `inventory.price`: Returns a series of that column
376 | * `inventory.drop(['pear'], axis = 0)`: deletes row 'pear'
377 | * `inventory.drop(['price'], axis = 1, inplace = True)`: deletes column 'price' and applies to current df
378 |
379 | Common functions:
380 |
381 | * EDA
382 | * `prices.describe()`
383 | * `prices.info()`
384 | * `prices.head()`
385 | * `prices.hist()`
386 | * `crosstab`: similar to table in R `pd.crosstab(inventory['inventory'], inventory['price'])`
387 | * `pd.value_counts(inventory['price'])`
388 | * Mathematical operations
389 | * `prices.mean()`
390 | * `prices.std()`
391 | * `prices.median()`
392 | * Other operations
393 | * `pd.concat(frames)`: Concats a list of objects
394 | * `pd.merge(self, right, on = 'key')`: joins df's. Can specify left_on, right_on if column names are different. You can also specify how for inner or outer.
395 |
396 | **Split-apply-combine** is a strategy for working on groups of data where you split the data based upon a given characteristic, apply a function, and combine it into a new object.
397 |
398 | inventory = pd.DataFrame({'price': [1, 2, 3, 4, 5], 'inventory': prices.index})
399 | inventory['abovemean'] = inventory.price > 2.5
400 |
401 | * `grouped = inventory.groupby('abovemean')`: creates groupby object
402 | * `grouped.aggregate(np.sum)`: aggregates the sum of those above the mean versus below the mean
403 | * `grouped.transform(lambda x: x - x.mean())`: transforms each group with the lambda function
404 | * `grouped.filter(lambda x: len(x)>2)`: filters by lambda function
405 | * `grouped.apply(sum)`: applies sum to each column by `abovemean`
406 |
407 | Working with a datetime column, you can type the `.dt` accessor and then hit tab to see the different options. Similarly, the `.str` accessor gives us all of our string functions.
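For instance (an illustrative addition with made-up data):

    import pandas as pd

    df = pd.DataFrame({'when': pd.to_datetime(['2018-01-01', '2018-06-15']),
                       'fruit': ['Apple', 'Pear']})

    print(df['when'].dt.month)       # datetime accessor: 1 and 6
    print(df['fruit'].str.lower())   # string accessor: 'apple' and 'pear'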
408 |
409 | ---
410 |
411 | ### numpy ###
412 |
413 | NumPy is arguably the most fundamental package for efficient scientific computing through linear algebra routines.
414 |
415 | Here are some general functions
416 |
417 | * `array = np.array([1, 2, 3, 4, 5])`
418 | * `array.shape`
419 | * `array.reshape(5, 1)`
420 | * `np.concatenate((foo, bar), axis = 1)` # adds a column
421 | * `np.hstack((foo, bar))` # adds a column
422 | * `np.vstack((foo, bar))` # adds a row
423 | * `foo.min(axis = 0)` # takes column mins
424 |
425 | Note: in calculating `np.std()`, be sure to specify `ddof = 1` when referring to a sample (rather than the full population).
426 |
427 | ---
428 |
429 | ### scipy ###
430 |
431 | SciPy is one of the core packages for scientific computing routines. `linalg` is particularly helpful for linear algebra. It's worth noting the following:
432 |
433 | * `loc`: what SciPy calls the mean
434 | * `scale`: what SciPy calls standard deviation
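
As a small sketch of the loc/scale convention (assuming a standard normal purely for illustration):

    from scipy import stats

    dist = stats.norm(loc = 0, scale = 1)   # loc is the mean, scale the standard deviation
    dist.mean(), dist.std()                 # (0.0, 1.0)
    dist.cdf(1.96)                          # ~0.975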
435 |
436 | ---
437 |
438 | ### statsmodels ###
439 |
440 | Statsmodels is the de facto library for performing regression tasks in Python.
441 |
442 | import statsmodels.api as sm
443 |
444 | from statsmodels.regression.linear_model import OLS
445 | # unlike sklearn, this will provide a summary view of your model
446 |
447 | Note that logistic regression in statsmodels will fail to converge if the data are perfectly separable. Logistic regression in sklearn regularizes by default (penalizing large betas), so it will converge.
448 |
449 |
450 | ### sklearn ###
451 |
452 | Sklearn does not accept categorical (string) features directly; everything must be encoded as a float (e.g. via one-hot or label encoding).
453 |
454 | Splitting test/training data:
455 |
456 | from sklearn.model_selection import train_test_split
457 | from sklearn.cross_validation import train_test_split # Older version
458 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
459 |
460 |
461 | from sklearn.linear_model import LinearRegression
462 | from sklearn.linear_model import LogisticRegression
463 |
464 | from sklearn.neighbors import KNeighborsClassifier
465 |
466 | from sklearn.tree import DecisionTreeRegressor
467 |
468 | from sklearn.ensemble import RandomForestClassifier
469 | # `max_features`: for classification start with sqrt(p) and for regression p/3
470 | # `min_sample_leaf`: start with None and try others
471 | # `n_jobs`: -1 will make it run on the max # of processors
472 |
473 | from sklearn.ensemble import AdaBoostClassifier
474 | from sklearn.ensemble import GradientBoostingRegressor
475 | # You can use `staged_predict` to access the boosted steps, which is especially useful in plotting error rate over time
476 |
477 | from sklearn.svm import SVC # sklearn calls its SVM classifier SVC (support vector classifier)
478 | # By default, SVC uses a radial basis function (RBF) kernel, which implicitly enlarges your feature space with higher-order terms
479 |
480 |
481 | A few other fun tools:
482 |
483 | from sklearn.model_selection import GridSearchCV
484 | # Searches over designated parameters to tune a model
485 | # GridSearch can parallelize jobs so set `n_jobs = -1`
486 |
487 | from sklearn.pipeline import Pipeline
488 | # Note that pipeline is helpful for keeping track of changes
489 |
490 | from sklearn.preprocessing import LabelEncoder
491 | # This tool will transform classes into numerical values
492 |
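As a minimal sketch tying a few of these tools together (the dataset, estimator, and parameter grid here are illustrative assumptions, not prescriptions):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # scale the features, then fit a classifier; the pipeline keeps both steps together
    pipe = Pipeline([('scale', StandardScaler()), ('clf', LogisticRegression(max_iter=1000))])

    # search over the regularization strength of the 'clf' step, parallelized with n_jobs=-1
    grid = GridSearchCV(pipe, param_grid={'clf__C': [0.1, 1, 10]}, cv=5, n_jobs=-1)
    grid.fit(X_train, y_train)
    print(grid.best_params_, grid.score(X_test, y_test))
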
493 | ---
494 |
495 | ### boto ###
496 |
497 | Boto is the python interface to Amazon Web Services. Let's walk through an example of saving files to s3, assuming that the kinks of setting up AWS credentials have already been worked out.
498 |
499 | import boto
500 | conn = boto.connect_s3() # make sure boto is up to date for this to work on ec2
501 | print(conn) # confirms connection
502 |
503 | conn.get_all_buckets() # prints all buckets
504 | new_bucket = conn.create_bucket('new_bucket') # creates new bucket
505 | new_key = new_bucket.new_key('new_file.txt')
506 | new_key.set_contents_from_string('file content here')
507 | new_bucket.get_all_keys() # lists all files
508 |
509 | for key in new_bucket.get_all_keys(): # can only delete empty buckets
510 | key.delete()
511 | new_bucket.delete()
512 |
513 |
514 | ---
515 |
516 | ### Plotting with matplotlib and seaborn ###
517 |
518 | Matplotlib is the de facto choice for plotting in python. There's also Plotly, Bokeh, Seaborn, Pandas, and ggplot. Seaborn and Pandas plotting were both built on matplotlib.
519 |
520 | There are three levels at which plotting can be accessed:
521 |
522 | 1. `plt`: minimal, fast interface
523 | 2. `OO interface w/ pyplot`: fine-grained control over figure, axes, etc
524 | 3. `pure OO interface`: embed plots in GUI applications (will probably never use)
525 |
526 | plt.figure()  # assumes `import matplotlib.pyplot as plt` and `import numpy as np`
527 | x_data = np.arange(0, 4, .011)
528 | y_data = np.sin(x_data)
529 | plt.plot(x_data, y_data)
530 | plt.show()
531 |
532 | pd.plotting.scatter_matrix(df, alpha=0.2, figsize=(6, 6), diagonal='kde') # scatter matrix (pd.scatter_matrix in older pandas)
533 |
534 | The objects involved are the figure and axes. We can call individual axes, but normally we deal with them together. The figure defines the area on which we can draw. The axes is how we plot specific data.
535 |
536 | fig, ax_list = plt.subplots(4, 2)
537 | for ax, flips in zip(ax_list.flatten(), value):   # value: one parameter per subplot
538 |     x_value = [data changed by flips]              # placeholder pseudocode
539 |     y_value = [data changed by flips]
540 |     ax.plot(x_value, y_value)
541 |
542 | Plotting predictions can be done using `np.linspace()` to generate X or Y values for your plot.
543 |
544 | Useful plots from seaborn are the heatmap and the violin plot (which is a kde mirrored about its axis):
545 |
546 | agg = cars.groupby(['origin', 'year']).size()  # e.g. count of cars for each origin/year pair
547 | ax = sns.heatmap(agg.unstack(level = 'year'), annot = True) # Be sure to annotate so you know what your values are
548 |
549 | fig, axes = plt.subplots(3, 2)
550 | for ax, var in zip(axes.ravel(), num_vars):
551 | sns.violinplot(y = var, data = cars, ax = ax)
552 |
553 | ---
554 |
555 | ## SQL ##
556 |
557 | SQL (Structured Query Language) is a special-purpose programming language designed for managing data held in relational database management systems (RDBMS). In terms of market share, the following database engines are most popular: Oracle, MySQL, Microsoft SQL Server, and PostgreSQL (or simply Postgres, the open-source system we will be working with). psql is the command-line program which can be used to enter SQL queries directly or execute them from a file.
558 |
559 | Steps to setting up a local database using psql (you can also access SQL using sqlite3 [filename.sql]):
560 |
561 | 1. Open Postgres
562 | 2. Type `psql` at the command line
563 | 3. Type `CREATE DATABASE [name];`
564 | 4. Quit psql (`\q`)
565 | 5. Navigate to where the .sql file is stored
566 | 6. Type `psql [name] < [filename.sql]`
567 | 7. Type `psql [name]` to connect to the database (e.g. `psql readychef`)
568 |
569 | Basic commands:
570 |
571 | * `\d`: returns the schema of all tables in the database
572 | * `\d [table]`: returns table schema
573 | * `;`: executes a query
574 | * `\help` or `\?`: help
575 | * `\q`: quit
576 |
577 | **To understand SQL, you must understand SELECT statements.** All SQL queries have three main ingredients:
578 |
579 | 1. `SELECT`: What data do you want?
580 | 2. `FROM`: Where do you want that data from?
581 | 3. `WHERE`: Under what conditions?
582 |
583 | The order of evaluation of a SQL SELECT statement is as follows:
584 |
585 | 1. `FROM + JOIN`: first the product of all tables is formed
586 | 2. `WHERE`: the where clause filters rows that do not meet the search condition
587 | 3. `GROUP BY + (COUNT, SUM, etc)`: the rows are grouped using the columns in the group by clause and the aggregation functions are applied on the grouping
588 | 4. `HAVING`: like the WHERE clause, but can be applied after aggregation
589 | 5. `SELECT`: the targeted list of columns are evaluated and returned
590 | 6. `DISTINCT`: duplicate rows are eliminated
591 | 7. `ORDER BY [value] DESC`: the resulting rows are sorted
592 | 8. `CASE WHEN gender = 'F' THEN 'female' ELSE 'male' END AS gender_r`: this is SQL's if/else construction (an expression rather than an evaluation step). It is good for recoding values and dealing with null values
593 |
594 | Here are some common commands on SELECT statements:
595 |
596 | * `*`
597 | * `COUNT`: `COUNT(*)` counts all rows, while `COUNT(column)` counts only the non-null values in that column; `COUNT(1)` is equivalent to `COUNT(*)` in most engines
598 | * `MAX or MIN`
599 | * `DISTINCT`
600 | * `SUM`
601 |
602 | Joins are used to query across multiple tables using foreign keys. **Every join has two segments: the tables to join and the columns to match.** There are three types of joins that should be imagined as a venn diagram:
603 |
604 | 1. `INNER JOIN:` joins based on rows that appear in both tables. This is the default for saying simply JOIN and would be the center portion of the venn diagram.
605 | * SELECT * FROM TableA INNER JOIN TableB ON TableA.name = TableB.name;
606 |     * SELECT c.id, v.created_at FROM customers as c, visits as v WHERE c.id = v.customer_id; # joins w/o specifying it's a join
607 | 2. `LEFT OUTER JOIN:` joins based on all rows on the left table even if there are no values on the right table. A right join is possible too, but is only the inverse of a left join. This would be the left two sections of a venn diagram.
608 | 3. `FULL OUTER JOIN:` joins all rows from both tables even if there are some in either the left or right tables that don't match. This would be all three sections of a venn diagram.
609 |
610 | While data scientists are mostly accessing data, it's also useful to know how to create tables.
611 |
612 | CREATE TABLE table_name
613 | (
614 | column_name data_type(size),
615 | column_name data_type(size),
616 | column_name data_type(size)
617 | );
618 |
619 | Data types include varchar, integer, decimal, date, etc. The size specifies the maximum length of the values stored in the column.
620 |
621 | Use `CREATE INDEX` (or `ALTER TABLE ... ADD CONSTRAINT`, which creates one implicitly) to add indexes to an existing table.
622 |
623 | In **Relational Database Modeling** it's important to have normalized data. Think of this saying: "The key, the whole key, and nothing but the key, so help me Codd." First normal form requires atomic values and a key. Second normal form requires that each non-key value depend on the whole key (which may be a composite of several columns). Third normal form requires that non-key values depend on nothing but the key.
624 |
625 | Resources:
626 | * [SQLZoo](http://sqlzoo.net/wiki/SELECT_basics)
627 | * [Illustrated Example of Joins](https://blog.codinghorror.com/a-visual-explanation-of-sql-joins/)
628 | * [Databases by market share](http://db-engines.com/en/ranking)
629 | * [SQL practice](https://pgexercises.com/)
630 | * [SQL data types](http://www.w3schools.com/sql/sql_datatypes.asp)
631 | * [Normalization](http://www.bkent.net/Doc/simple5.htm)
632 |
633 | ### SQL using Pandas and Psycopg2 ###
634 |
635 | You can make connections to Postgres databases using the python package psycopg2, which lets you create cursors for SQL queries, commit SQL actions, and close the cursor and connection. **Commits are all or nothing: if they fail, no changes will be made to the database.** You can set `autocommit = True` to automatically commit after each query. A cursor points at the resulting output of a query and can only read each observation once. If you want to see a previously read observation, you must rerun the query.
636 |
637 | **Beware of SQL injection, where user input gets executed as code.** Use `%s` placeholders where possible; psycopg2 will escape the values to make sure you're not injecting anything. You can also use an ORM, like the one in Ruby on Rails or its Python equivalent Django, to take care of SQL injection for you.
638 |
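As a minimal sketch of that workflow (the connection parameters are assumptions for illustration, and the table/column names reuse the joins example above):

    import psycopg2

    conn = psycopg2.connect(dbname='readychef', user='postgres', host='localhost')
    cur = conn.cursor()

    # the %s placeholder is escaped by psycopg2, guarding against SQL injection
    cur.execute('SELECT * FROM visits WHERE customer_id = %s;', (42,))
    rows = cur.fetchall()   # a cursor reads each row only once

    conn.commit()           # commits are all or nothing
    cur.close()
    conn.close()
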
639 | Resources:
640 | * [pgAdmin for connecting to databases](https://www.pgadmin.org/)
641 | * [Amazon Relational Database Service for setting up databases](https://aws.amazon.com/rds/)
642 |
643 | ---
644 |
645 | ## MongoDB and Pymongo ##
646 |
647 | MongoDB is an open-source, cross-platform, document-oriented database program, classified as a NoSQL database. Instead of traditional tables, it stores dynamic-schema, JSON-like documents. It doesn't require us to define a schema, but it also can't do joins (its main limitation). It's great for semi-structured data though suboptimal for complicated queries. You search it with key-value pairs, like a dictionary.
648 |
649 | 1. A MongoDB instance can contain zero or more databases (the equivalent of a schema)
650 | 2. Databases can have zero or more collections (i.e. tables)
651 | 3. Collections can be made up of zero or more documents (i.e. rows)
652 | 4. A document is made up of one or more fields (i.e. variables)
653 | When you ask MongoDB for data, it returns a pointer to the result called a **cursor**. Cursors can do things such as counting or skipping ahead before pulling the data. The cursor's execution is delayed until necessary.
654 |
655 | The difference between a document and a table is that relational databases define columns at the table level whereas a document-oriented database defines its fields at the document level. Each document within a collection has its own fields.
656 |
657 | sudo mongod # starts the MongoDB server
658 |
659 | **Pymongo**, similar to psycopg2, is the python interface to MongoDB. You can also use the javascript client (the mongo shell, shown below). Every document you insert has an ObjectId to ensure that you can distinguish between identical objects.
660 |
661 | use my_new_database
662 |
663 | Inserting data
664 |
665 | db.users.insert({name: 'Jon', age: 45, friends: ['Henry', 'Ashley'] })
666 | db.users.insert({name: 'Ashley', age: 37, friends: ['Jon', 'Henry'] })
667 | db.users.insert({name: 'Frank', age: 17, friends: ['Billy'], car: 'Civic'})
668 | db.getCollectionNames(); db.users.find()   // `show dbs` lists databases
669 |
670 | Querying data
671 |
672 | db.users.find({'name': 'Jon'}) # find by single field
673 | db.users.find({'name': 'Jon', 'sex': 'male'}) # multi-field query
674 | db.users.find({}, { name: true }) # field selection (only return name from all documents). Also known as projection
675 | db.users.find_one({'name': 'Jon'}) # only returns the first response
676 |
677 | db.users.find({'age': {'$gt': 25}}) # age > 25 (see below for inequality operators)
678 | db.users.find({'car': {'$exists' : True}}) # find by presence of field
679 | db.users.find({'name': {'$regex' : "[Cc]ar | [Tt]im"}}) # regular expressions (regex)
680 | db.users.find({'birthyear': {'$in': [1988, 1991, 2002]}}).count() # number of people born in one of those years
681 | db.users.find({'cities_lived': {'$all': ['Portland', 'Houston']}}) # unlike $in, must match all values in the array
682 | db.users.find({'user.name': 'tim'}) # dot notation queries into the subfield name in user
683 |
684 |
685 | You use this nested structure, similar to JSON. Use `.pretty()` at the end of a query for a more readable format.
686 |
687 | Updating data
688 |
689 | db.users.update({name: "Jon"}, { $set: {friends: ["Phil"]}}) # replaces friends array
690 | db.users.update({name: "Jon"}, { $push: {friends: "Susie"}}) # adds to friends array
691 | db.users.update({name: "Stevie"}, { $push: {friends: "Nicks"}}, true) # upsert - create user if it doesn’t exist
692 | db.users.update({}, { $set: { activated : false } }, false, true) # multiple updates
693 |
694 | Deleting data
695 |
696 | db.users.remove({})
697 |
698 | Inequality operators:
699 | * `$gt`: greater than
700 | * `$lt`: less than
701 | * `$gte`: greater than or equal to
702 | * `$lte`: less than or equal to
703 | * `$ne`: not equal
704 |
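As a minimal pymongo sketch of the same ideas (the connection URI is an assumption; the database and collection names mirror the shell examples above):

    from pymongo import MongoClient

    client = MongoClient('mongodb://localhost:27017/')   # assumes a local mongod is running
    db = client['my_new_database']

    db.users.insert_one({'name': 'Jon', 'age': 45, 'friends': ['Henry', 'Ashley']})

    for user in db.users.find({'age': {'$gt': 25}}):     # find() returns a lazy cursor
        print(user['name'])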
705 |
706 | Reference:
707 | * [Good reference](http://openmymind.net/mongodb.pdf)
708 | * [Cheatsheet](https://blog.codecentric.de/files/2012/12/MongoDB-CheatSheet-v1_0.pdf)
709 | ---
710 |
711 | ## Git ##
712 |
713 | Git is version control software; Github builds a UI around git. A repo is a selection of files (usually code, not data; *storing data on github is not a preferred practice*). You can fork a repo so that it becomes your copy of the repo. Then you can clone that repo, making a copy of it on your local file system. This puts it on your computer. You now have three copies: the main repo, your repo, and your clone on your computer. You should commit the smallest changes to the code.
714 |
715 | * `git add [file name]`: add new files; can also use '.' instead of specific file name
716 | * `git commit -m "message"`: you have to add a message for a git commit. If you forget the message, it will open vim, so type ":q" to get out of it. This adds a commit to the code on the local file system. You can go back if need be.
717 | * `git commit -am “message”`: commits all new files with the same message
718 | * `git push`: This pushes all commits to your repo on the cloud. These won’t make it back to the original repo
719 | * `git clone [url]`: clones repo to your machine
720 | * `git status`: check the commit status
721 | * `git log`: shows the log of commits
722 | * `git fetch upstream`: fetches upstream changes
723 |
724 | Adding stages changes to be included in a commit. Ideally a commit would be a whole new feature/version. *It's only with commands that have the word 'force' that you risk losing data.* When you make a pull request, it means you have a new version of a repo and you want your changes to become part of the original repo.
725 |
726 | Here's a workflow:
727 |
728 | 1. One person creates a repository (and adds others as collaborators)
729 | 2. Everyone clones the repository
730 | 3. User A makes and commits a small change. Like, really small.
731 | 4. User A pushes change.
732 | 5. User B makes and commits a small change.
733 | 6. User B tries to push, but gets an error.
734 | 7. User B pulls User A's changes.
735 | 8. User B resolves any possible merge conflicts (this is why you keep commits small)
736 | 9. User B pushes.
737 | 10. Repeat
738 |
739 | With merge conflicts, you'll have to write a merge message followed by `esc`, `:wq`, and then `enter`.
740 |
741 | To keep your fork up to date (courtesy of [CristinaSolana](https://gist.github.com/CristinaSolana/1885435)):
742 |
743 | Clone and add remote from original repo in your forked repo:
744 |
745 | git clone git@github.com:YOUR-USERNAME/YOUR-FORKED-REPO.git
746 | git remote add upstream git://github.com/ORIGINAL-DEV-USERNAME/REPO-YOU-FORKED-FROM.git
747 | git fetch upstream
748 |
749 | Update your fork from the original repo:
750 |
751 | git pull upstream master
752 |
753 | Resources:
754 | * [Git School](https://try.github.io/levels/1/challenges/1)
755 | * [Centralized Git Workflow](https://www.atlassian.com/git/tutorials/comparing-workflows/)
756 |
757 | ---
758 |
759 | ## Unix ##
760 |
761 | * `ls`: list files in current directory
762 | * `ls -la`: list all files including hidden directories
763 | * `cd directory`: change directories to directory
764 | * `cd ..`: navigate up one directory
765 | * `mkdir new-dir new-dir2`: create a directory called new-dir and new-dir2
766 | * `rm some-file`: remove some-file
767 | * `rm -rf some-dir`: remove some-dir and its contents recursively, without prompting
768 | * `rmdir some-dir`: remove some-dir (only if it is empty)
769 | * `man some-cmd`: pull up the manual for some-cmd
770 | * `pwd`: find the path of the current directory
771 | * `mv path/to/file new/path/to/file`: move a file or directory (also used for renaming)
772 | * `cp path/to/file new/path/to/file`: copy a file or directory (use -r for recursive when copying a directory)
773 | * `cat`: reads a file
774 | * `cat > text.txt`: opens file to input text (close with CTRL+D); use `>>` to append to end of the text
775 | * `head`: reads the first few lines of a file (use the `-n` flag to set the number of lines to show)
776 | * `tail`: reads the last few lines of a file
777 | * `diff`: looks at the difference between two files
778 | * `wc`: word count
779 | * `uniq`: unique values
780 | * `grep`: search
781 | * `cat text.txt | sort > text2.txt`: pipes the output of cat through sort and saves it to text2.txt
782 | * `ls */file`: * is a wildcard; this could be used to find a file
783 | * `cut`
784 | * `paste`
785 | * `find . -name blah`: find files in the current directory (and children) that have blah in their name
786 | * To jump to beginning of line: __CTRL__ + __a__
787 | * To jump to end of line: __CTRL__ + __e__
788 | * To cycle through previous commands: __UP ARROW__ / __DOWN ARROW__
789 | * `python -i oop.py`: opens python and imports oop.py
790 | * `which python`: tells you the path of the executable that's run by the command `python`
791 | * `Ctrl-Z`: suspends the current process (resume it in the background with `bg`).
792 | * `fg`: brings the suspended process to the foreground
793 | * `ps waux`: shows you all current processes (helpful for finding open databases)
794 | * `top`: gives you the processes too
795 | * `xcode-select --install`: updates command line tools (often needed after an OS update)
796 | * `wget [url]`: downloads a given file
797 | * `curl -v -X GET [website]`: makes a GET HTTP request
798 | * `df`: checks disk usage
799 | * `kill`: kills a process. Use a signal number flag (e.g. `kill -9` sends SIGKILL, which cannot be ignored)
800 |
801 | You can also access your bash profile with `atom ~/.bash_profile`
802 |
803 | ---
804 |
805 | ## Development and Virtual Environments ##
806 |
807 | Here are a few tools for setting up a developing environment and creating virtual environments.
808 |
809 | [Virtual Environments using Conda](https://uoa-eresearch.github.io/eresearch-cookbook/recipe/2014/11/20/conda/) is a good starting point for using conda. Here are a few commands:
810 |
811 | * `conda create -n yourenvname python=x.x anaconda`
812 | * `source activate yourenvname`
813 | * `conda install -n yourenvname [package]`
814 | * `source deactivate`
815 | * `conda remove -n yourenvname --all`
816 |
817 | [Virtualenv](https://virtualenv.pypa.io/en/stable/) is a tool to accomplish similar outcomes.
818 |
819 | [Oh-my-zsh](https://github.com/robbyrussell/oh-my-zsh) is a tool for zsh config and command line themes.
820 |
821 | Vagrant is also a tool for virtual environments. Install [Virtual Box](https://www.virtualbox.org/wiki/Downloads) first with the gist [code here](https://gist.github.com/learncodeacademy/5f84705f2229f14d758d). You can follow this [tutorial](https://www.youtube.com/watch?v=PmOMc4zfCSw).
822 |
823 | Docker provides a container for a given application. A container is different from a virtual machine in that it virtualizes at the operating-system level, sharing the host kernel rather than emulating hardware, while still isolating containers from one another.
824 |
825 | ---
826 |
827 |
828 | ## Linear Algebra ##
829 |
830 | Linear algebra is about being able to solve equations in a more efficient manner. A **scalar** (denoted by a lower-case, unbolded letter) has only magnitude. A **vector** (denoted by a lower-case, bold letter, with magnitude written ||a||) is a one-dimensional array that has both magnitude and direction. **Distance** is the total length of the path traveled while **displacement** is the straight-line difference between your start and end points. A **matrix** is an m row by n column brick of numbers.
831 |
832 | You can initialize matrices and vectors using NumPy as follows:
833 |
834 | mat = np.array([[4, -5], [-2, 3]])
835 | vect = np.array([-13, 9])
836 | column_vect = np.array([[13], [9]])
837 | np.ones((2, 4)) # Creates a 2 x 4 matrix of ones
838 | np.zeros((3, 2))
839 | np.arange(2, 20) # Creates a row vector from 2:19
840 | mat.shape # Returns the dimensions (an attribute, not a method)
841 | vect.reshape(2, 1) # Reshapes vect into a 2 x 1 column vector
842 | vect.reshape(2, -1) # Returns the same (-1 infers the remaining dimension)
843 | mat.T # View of the transpose (switches columns and rows) of mat
844 | np.transpose(mat) # Copies the transpose
845 | mat[1, 0] # returns -2
846 | mat[1] # returns row
847 | mat[:,1] # returns column
848 | np.concatenate((mat, vect.reshape(1, -1))) # adds vect to mat as a new row (use axis = 1 with a column vector to add a column)
849 | mat + 1 # scalar operation (element-wise addition); can do w/ same-size matrices too
850 |
851 | Matrix multiplication can only happen when *the number of columns of the first matrix equals the number of rows in the second*. The inner product, or **dot product**, for vectors is the summation of the corresponding entities of the two sequences of numbers (returning a single, scalar value). This can be accomplished with `np.dot(A, B)`. The **outer product** of a 4-dimensional column vector and a 4-dimensional row vector is a 4x4 matrix where each value is the product of the corresponding column/vector value.
852 |
853 | **Matrix-matrix multiplication** is a series of vector-vector products and is not commutative (meaning A*B != B*A). *A 2x3 matrix times a 3x2 matrix gives a 2x2 result* where entry 1,1 of the result is the dot product of the first row of the first matrix and the first column of the second matrix:
854 |
855 | A = [1, 2, 3]
856 | [4, 5, 6]
857 | B = [7, 8]
858 | [9, 10]
859 | [11, 12]
860 | AB = [1*7+2*9+3*11, 1*8+2*10+3*12] = [58, 64]
861 | [4*7+5*9+6*11, 4*8+5*10+6*12] [139, 154]
862 |
863 | A = np.arange(1, 7).reshape(2, 3)
864 | B = np.arange(7, 13).reshape(3, 2)
865 | np.dot(A, B)
866 |
867 | An **identity matrix** is a square matrix with 1's along the diagonal and 0's everywhere else. If you multiply any matrix by an identity matrix of the same size, you get the same matrix back. There is no matrix division. The **inverse** of a matrix A is the matrix A^-1 such that A multiplied by A^-1 gives the identity matrix; multiplying by an inverse plays the role of division (it is not the element-wise reciprocal). A **transpose** is where the rows are exchanged for columns.
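
A short numpy illustration of these ideas, reusing the `mat` defined above:

    np.eye(2)                   # 2 x 2 identity matrix
    inv = np.linalg.inv(mat)    # inverse (only defined for invertible square matrices)
    np.dot(mat, inv)            # ~identity matrix, up to floating point error
    mat.T                       # transpose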
868 |
869 | Aggregations can be computed over the whole matrix or **axis-wise**. For example, `A.mean()` returns the mean of the whole matrix, while `A.mean(axis = 0)` returns a mean for every column. **Rank** is defined as the number of linearly independent rows or columns in a matrix; a column that's the sum of two others or a multiple of another is linearly dependent and does not add to the rank. A **feature matrix** is a matrix where each column represents a variable and each row a datapoint. By convention, we use `X` to be our feature matrix and `y` to be our dependent variable.
870 |
871 | There are a few important types of matrices that we will discuss in more detail below. An **orthogonal matrix** is important for statement of preference surveys. **Eigenvectors** and **eigenvalues** are good for page rank analysis. The **stochastic matrix** is central to the Markov process, or a square matrix specifying the probabilities of going from one state to another such that every column of the matrix sums to 1.
872 |
873 |
874 | ---
875 |
876 | ## Calculus ##
877 |
878 | There are two main operations of calculus and therefore two essential tools needed in data science: differentiation and integration.
879 |
880 | In graphing a straight line, the slope is often thought of as rise over run: in `f(x) = mx + b`, `m` is the slope of the line. The **slope** of a function is its rate of change at a given point. The **derivative** of a function is a function that gives the slope of the tangent line at each point; it is defined via a limit and is also known as the **instantaneous rate of change**. The derivative is also denoted by `dy/dx`. A **second derivative** measures how the rate of change of a quantity is itself changing.
881 |
882 | An **integral** assigns numbers to functions in a way that can describe displacement, area, volume, and other concepts that arise by combining infinitesimal data. There are two main types of integrals. An **indefinite integral** simply reverses a derivative. A **definite integral** has an upper and lower bound.
883 |
884 | ---
885 |
886 | ## Probability ##
887 |
888 | **Probability** is the measure of the likelihood that an event will occur written as the number of successes over the number of trials. **Odds** are the number of successes over the number of failures. Probability takes a value between 0 and 1 and odds usually take the form `successes:failures`.
889 |
890 | ### Set Operations and Notation ###
891 |
892 | A set is a range of all possible outcomes or events, also called the sampling space. It can be discrete or continuous. It is useful to think of set operations in the context of a venn diagram of A and B: the union is everything in either circle, the intersection is the overlapping center, the difference A ∖ B is the part of A outside the overlap, and the complement of A is everything outside A's circle.
893 |
894 | * `Union`: A ∪ B = {x: x ∈ A ∨ x ∈ B} - The union of sets A and B is x such that x is in A or x is in B.
895 | * `Intersection`: A ∩ B = {x: x ∈ A ∧ x ∈ B } - The intersection is x such that x is in A and in B.
896 | * `Difference`: A ∖ B = {x: x ∈ A ∧ x ∉ B } - The difference is x such that x is in A and not in B.
897 | * `Complement`: Aᶜ = {x: x ∉ A} - The complement of A is x such that x is not in A.
898 | * `Null (empty) set`: ∅
899 |
900 | DeMorgan's Law converts and's to or's. The tetris-looking symbol (¬) is for 'not.'
901 |
902 | * ¬(A ∨ B) ⟺ ¬A ∧ ¬B - Not (A or B) is equal to (not A) and (not B)
903 | * ¬(A ∧ B) ⟺ ¬A ∨ ¬B - Not (A and B) is equal to (not A) or (not B)
904 |
905 | Events A and B are independent (A ⊥ B) if P(A ∩ B) = P(A)P(B) or (equivalently) P(A|B) = P(A). Remember to think of a venn diagram in conceptualizing this; P(A|B) is a conditional probability.
906 |
907 | ### Combinatorics ###
908 |
909 | Combinatorics is the mathematics of ordering and choosing sets. There are three basic approaches:
910 |
911 | 1. `Factorial`: Take the factorial of n to determine the total possible ordering of the items given that all the items will be used.
912 | 2. `Combinations`: The number of ways to choose k things given n options and that the order doesn't matter.
913 | 3. `Permutations`: The number of ways to choose k things given n options and that the order does matter.
914 |
915 | Combinations: `n! / ((n-k)! * k!)`
916 |
917 | from itertools import combinations
918 | comb = [i for i in combinations('abcd', 2)] # n = 4; k = 2
919 | len(comb) # returns 6 possible combinations
920 |
921 | This would be spoken as `n choose k` or `4 choose 2` in this example.
922 |
923 | Permutations: `n! / (n-k)!`
924 |
925 | from itertools import permutations
926 | perm = [i for i in permutations('abcd', 2)] # n = 4; k = 2
927 | len(perm) # returns 12 possible permutations
928 |
929 | This would be spoken as `n permutations of size k` or `4 permutations of size 2` in this example. It is twice the length of the combinations of the same string because of the fact that the ordering creates more possible outputs.
930 |
931 | ### Bayes' Theorem ###
932 |
933 | Bayes' theorem states that P(B|A) = P(A|B)P(B) / P(A). The denominator here is a normalizing constant computed by summing over all the possible ways that A could happen. Let's take the following example:
934 |
935 | The probability of a positive test result from a drug test given that one has doped is .99. The probability of a positive test result given that they haven't doped is .05. The probability of having doped is .005. What is the probability of having doped given a positive test result?
936 |
937 | P(+|doped) - .99
938 | P(+|clean) - .05
939 | P(doped) - .005
940 |
941 | P(doped|+) = P(+|doped) * P(doped) / P(+)
942 |            = P(+|doped) * P(doped) / (P(doped) * P(+|doped) + P(clean) * P(+|clean))
943 |            = (.99 * .005) / (.005 * .99 + (1 - .005) * .05)
944 |            ≈ .09
945 |
946 | It's helpful to write this out as a decision tree. The denominator is the sum of all the possible ways you can get A, which means it's the sum of each final branch of the tree. In this case, it's all the ways you can have a positive test result: the probability of having doped times the probability of a positive result given doping, plus the probability of being clean times the probability of a positive result given that you're clean.
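
A quick numeric check of the calculation above:

    p_pos_given_doped = 0.99
    p_pos_given_clean = 0.05
    p_doped = 0.005

    p_pos = p_doped * p_pos_given_doped + (1 - p_doped) * p_pos_given_clean
    round(p_pos_given_doped * p_doped / p_pos, 2)   # 0.09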
947 |
948 | The **base rate fallacy** is the effect of a small population that has a disease on your ability to accurately predict it. If only a small population has a given disease, your likelihood of accurately predicting a case of the disease goes down. For rare diseases, multiple tests must be done in order to accurately evaluate if a person has the disease due to this fallacy. This is also why testing is only conducted when you are already in a population that is more likely to have a disease, such as testing people who have come in contact with ebola.
949 |
950 | ### Random Variables ###
951 |
952 | A **random variable** is a function that maps events in our sample space to some numerical quantity. There are three general types of these functions in applied probability:
953 |
954 | 1. `Probability mass function (PMF)`: Used for discrete random variables, the PMF returns the probability of observing a specific value.
955 | 2. `Probability density function (PDF)`: Used exclusively for continuous random variables, the PDF gives a relative likelihood; integrating it over a range returns the probability that a value falls in that range
956 | 3. `Cumulative distribution function (CDF)`: tells us the probability that any draw will be less than a specific quantity (its complement is the survivor function). The CDF always stays constant or increases as x increases since it refers to the likelihood of a value less than it. The CDF is defined for both discrete and continuous variables.
957 |
958 | We can compute the **covariance** of two different variables using the following: `Cov[X,Y] = E[(x−E[X]) (y−E[Y])]`. This is related to the **correlation**, which is the covariance divided by the product of their standard deviations: `Cov(X,Y) / σ(X)σ(Y)`. Correlation puts covariance on a -1 to 1 scale, allowing you to see proportion. Both capture only linear relationships.
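
A quick numpy illustration (the x and y arrays are made-up toy data):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.0, 4.1, 6.3, 7.9, 10.2])

    np.cov(x, y)[0, 1]        # sample covariance of x and y (uses n - 1 by default)
    np.corrcoef(x, y)[0, 1]   # Pearson correlation: covariance rescaled to [-1, 1]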
959 |
960 | **Marginal distributions** take a possibly not-independent multivariate distribution and consider only a single dimension. Given a subset of a collection of random variables, the marginal distribution is the probability distribution of the variables contained in that subset: it gives the probabilities of various values of those variables without reference to the values of the other variables. By looking at the marginal distribution, you are able to **marginalize out** variables that have little covariance. We always need to be thinking about the histograms of the two variables we're comparing as well as their intersection.
961 |
962 | This contrasts with a conditional distribution, which gives the probabilities contingent upon the values of the other variables. The **conditional distribution** is the joint distribution divided by the marginal distribution evaluated at a given point. The conditional distribution answers: given that we know a height, what's the distribution of weight at that height? In data science, this is *the* thing we want to know.
963 |
964 | The case of **Anscombe's quartet** shows how poorly summary statistics can account for the structure of the data. Correlation captures direction, not non-linearity, steep slopes, etc. This is why we want to know the conditional distribution, not just summary stats.
965 |
966 | ![Anscombes_quartet](https://github.com/conorbmurphy/galvanizereference/blob/master/images/anscombesquartet.png)
967 |
968 | **Pearson correlation** evaluates linear relationships between two continuous variables. The **Spearman correlation** evaluates the monotonic relationship between two continuous or ordinal variables without assuming the linearity of the variables.
969 |
970 | ## Statistics ##
971 |
972 | Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data. There are two general camps: frequentist and Bayesian. Bayesians are allowed to impose prior knowledge onto their analysis. The difference between these two camps largely boils down to what is fixed versus what is not. Frequentists think that data are a repeatable random sample where the underlying parameters remain constant. Bayesians, by contrast, believe that the data, not the underlying parameters, are fixed. There is an argument that the two camps are largely similar if no prior is imposed on the analysis. Bayesian statistics require a lot of computationally intense programming.
973 |
974 | * Frequentist:
975 | * Point estimates and standard errors
976 | * Deduction from P(data|H0), by setting α in advance
977 | * P-value determines acceptance of H0 or H1
978 | * Bayesian:
979 | * Start with a prior π(θ) and calculate the posterior π(θ|data)
980 | * Broad descriptions
981 |
982 | ### Key Definitions ###
983 |
984 | | | Population | Sample |
985 | |--------------------|:-----------:|---------:|
986 | | size | N | n |
987 | | mean | μ | x̄ ('x-bar') |
988 | | variance           | σ²          | s²        |
989 | | standard deviation | σ | s |
990 | | proportion | π | p ('little p') |
991 |
992 | Capital letters refer to random variables; lowercase refers to a specific realization. Variables with a hat (^) often refer to a predicted value. `X` refers to all possible things that can happen in the population; `x` refers to draws from X.
993 |
994 | Other vocabulary:
995 |
996 | * `S`: sample space, or a range of all possible outcomes (discrete or continuous)
997 | * `i.i.d.`: independent, identically distributed - refers to when draws from X are not dependent on previous draws and fall into the same distribution
998 | * `α`: threshold for rejecting a hypothesis
999 | * `β`: the probability of Type II error in any hypothesis test–incorrectly concluding no statistical significance. `β` is also used for your regression coefficients
1000 | * `1 - β`: power, derived from the probability of type II error
1001 | * `λ`: can mean many things, including the mean and variance for a poisson distribution (which are equal)
1002 |
1003 | ### Common distributions ###
1004 |
1005 | Rules for choosing a good distribution:
1006 |
1007 | * Is data discrete or continuous?
1008 | * Is data symmetric?
1009 | * What limits are there on possible values for the data?
1010 | * How likely are extreme values?
1011 |
1012 | * `Discrete`:
1013 | * `Bernoulli`: Model one instance of a success or failure trial (p)
1014 | * `Binomial`: Number of successes out of a number of trials (n), each with probability of success (p)
1015 | * `Poisson`: Model the number of events occurring in a fixed interval and events occur at an average rate (lambda) independently of the last event
1016 | * `Geometric`: Sequence of Bernoulli trials until first success (p)
1017 | * `Continuous`:
1018 | * `Uniform`: Any of the values in the interval of a to b are equally likely
1019 | * `Gaussian`: Commonly occurring distribution shaped like a bell curve, often comes up because of the Central Limit Theorem. Also known as a normal distribution
1020 | * `Gamma`: A two-parameter family of continuous distributions, used for cases such as rainfall and size of insurance claims
1021 | * `Exponential`: Model time between Poisson events where events occur continuously and independently. This is a special case of the gamma distribution
1022 |
1023 | ### Frequentist Statistics ###
1024 |
1025 | In frequentist statistics, there are four standard methods of estimation and sampling, to be explored below. Central to frequentist statistics is the **Central Limit Theorem (CLT)**, which states that the distribution of the sample mean approaches a normal distribution (centered on the true mean) as the sample size increases; by the law of large numbers, the sample mean also converges on the true mean. *The variance of the sample mean decreases as the sample size increases.*
1026 |
1027 | #### Method of Moments (MOM) ####
1028 |
1029 | MOM has three main steps:
1030 |
1031 | 1. Assume the underlying distribution (e.g. Poisson, Gamma, Exponential)
1032 | 2. Compute the relevant sample moments (e.g. mean, variance)
1033 | 3. Plug those sample moments into the PMF/PDF of the assumed distribution
1034 |
1035 | There are four main moments we're concerned about, each raised to a different power:
1036 |
1037 | 1. `Mean/Expected value`: power 1 - the central tendency of a distribution or random variable
1038 | 2. `Variance`: power 2 - the expectation of the squared deviation of a random variable from its mean
1039 | 3. `Skewness`: power 3 - a measure of asymmetry of a probability distribution about its mean. Since it’s to the 3rd power, we care about whether it’s positive or negative.
1040 | 4. `Kurtosis`: power 4 - a measure of the "tailedness" of the probability distribution
1041 |
1042 | Variance is calculated as the squared deviation of the mean:
1043 |
1044 | var(x) = E[(x - μ)**2]
1045 |
1046 | σ² and s² are different in that s² is multiplied by 1/(n-1) because n-1 is considered to be the degrees of freedom and a sample tends to understate a true population variance. Because you have a smaller sample, you expect the true variance to be larger. The number of **degrees of freedom** is the number of values in the final calculation of a statistic that are free to vary. The sample variance has N-1 degrees of freedom, since it is computed from N random scores minus the 1 parameter estimated as an intermediate step, which is the sample mean.
1047 |
1048 | Example: your visitor log shows the following number of visits for each of the last seven days: [6, 4, 7, 4, 9, 3, 5]. What's the probability of having zero visits tomorrow?
1049 |
1050 | import scipy.stats as scs
1051 | lam = np.mean([6, 4, 7, 4, 9, 3, 5])  # 'lambda' is a reserved word in Python
1052 | scs.poisson.pmf(0, lam)  # probability of zero visits tomorrow
1053 |
1054 | #### Maximum Likelihood Estimation (MLE) ####
1055 |
1056 | **Maximum Likelihood Estimation (MLE)** is a method of estimating the parameters of a statistical model given n observations by finding the parameter values that maximize the likelihood of making the observations given the parameters. In other words, this is the probability of observing the data we received knowing that they were drawn from a distribution with known parameters. We're going to be testing which hypothesis maximizes a given likelihood. For instance, if we saw 52 heads in 100 coin flips, we can evaluate the likelihood of a fair coin.
1057 |
1058 | MLE has three main steps:
1059 |
1060 | 1. Assume the underlying distribution (e.g. Poisson, Gamma, Exponential)
1061 | 2. Define the likelihood function for observing the data under different parameters
1062 | 3. Choose the parameter set that maximizes the likelihood function
1063 |
1064 | Our function is the data we received given some known parameter. We assume that our data is i.i.d.
1065 |
1066 | ƒ(x1, x2, ..., xn | θ) = ƒ(x1 | θ)ƒ(x2 | θ)...ƒ(xn | θ)
1067 | np.product(ƒ(xi | θ)) # for every sample in n
1068 |
1069 | We find the θ^ to maximize the log-likelihood function. We want the argument that maximizes the log-likelihood equation:
1070 |
1071 | θ^mle = arg max log( L(θ | x1, x2, ..., xn) )
1072 |
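A small sketch of the coin example above, evaluating the log-likelihood over a grid of candidate values of p (a grid search rather than a closed-form solution, purely for illustration):

    import numpy as np
    from scipy import stats

    p_grid = np.linspace(0.01, 0.99, 99)              # candidate values of p
    log_lik = stats.binom.logpmf(52, 100, p_grid)     # log-likelihood of 52 heads in 100 flips
    p_grid[np.argmax(log_lik)]                        # MLE is ~0.52
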
1073 | #### Maximum a Posteriori (MAP) ####
1074 |
1075 | This is a Bayesian method that I mention here because it is the opposite of MLE in that it looks at your parameters given that you have a certain data set. MAP is proportionate to MLE with information on what you thought going into the analysis. More on this under the Bayesian section.
1076 |
1077 | #### Kernel Density Estimation (KDE) ####
1078 |
1079 | Using **nonparametric** techniques allows us to model data that does not follow a known distribution. KDE is a nonparametric technique that allows you to estimate the PDF of a random variable, building up a smooth density estimate by summing kernel functions (curves) centered at each data point instead of using histogram boxes. In plotting with this method, there is a bias versus variance trade-off, so choosing the best representation of the data is relatively subjective.
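
A minimal sketch with scipy's `gaussian_kde` (the sample here is simulated for illustration):

    import numpy as np
    from scipy.stats import gaussian_kde

    data = np.random.normal(size=200)     # toy sample
    kde = gaussian_kde(data)              # bandwidth chosen automatically (Scott's rule)
    xs = np.linspace(-4, 4, 100)
    density = kde(xs)                     # estimated PDF evaluated at xs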
1080 |
1081 | Resources:
1082 | * [Kernel Density Estimation example](http://glowingpython.blogspot.com/2012/08/kernel-density-estimation-with-scipy.html)
1083 |
1084 | #### Confidence Intervals ####
1085 |
1086 | Assuming normality, you would use the following for your 95% confidence interval for the population mean with a known standard deviation:
1087 |
1088 | x̄ +- 1.96 * (s / sqrt(n))
1089 |
1090 | This is your sample mean +/- your critical value multiplied by the standard error. Typical critical values for a two-sided test (known as z*) are:
1091 |
1092 | * 99%: 2.576
1093 | * 98%: 2.326
1094 | * 95%: 1.96
1095 | * 90%: 1.645
1096 |
1097 | For an unknown standard deviation, t* is used instead of z* as the critical value. This can be found in a t-distribution table. The degrees of freedom is calculated by subtracting 1 from n.
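
A short sketch of a t-based confidence interval (reusing the visit counts from the earlier example):

    import numpy as np
    import scipy.stats as scs

    data = np.array([6, 4, 7, 4, 9, 3, 5])
    mean, se = data.mean(), scs.sem(data)                         # sem uses ddof=1 by default
    scs.t.interval(0.95, len(data) - 1, loc=mean, scale=se)       # 95% CI for the mean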
1098 |
1099 | **Bootstrapping** estimates the sampling distribution of an estimator by sampling with replacement from the original sample. Bootstrapping is often used to estimate the standard errors and confidence intervals of an unknown parameter, but can also be used for your beta coefficients or other values as well. We bootstrap when the theoretical distribution of the statistical parameter is complicated or unknown (like wanting a confidence interval on a median or correlation), when n is too small, and when we favor accuracy over computational costs. It comes with almost no assumptions. A short sketch follows the steps below.
1100 |
1101 | 1. Start with your dataset of size n
1102 | 2. Sample from your dataset with replacement to create a bootstrap sample of size n
1103 | 3. Repeat step 2 a number of times (50 is good, though you often see 1k-10k)
1104 | 4. Each bootstrap sample can then be used as a separate dataset for estimation and model fitting (often using percentile instead of standard error)
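
A minimal sketch of a percentile bootstrap for a median (the sample reuses the visit counts above; 1,000 resamples is an arbitrary choice):

    import numpy as np

    sample = np.array([6, 4, 7, 4, 9, 3, 5])
    boot_medians = []
    for _ in range(1000):
        resample = np.random.choice(sample, size=len(sample), replace=True)
        boot_medians.append(np.median(resample))

    np.percentile(boot_medians, [2.5, 97.5])   # percentile-based 95% confidence interval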
1105 |
1106 | ### Hypothesis and A/B Testing ###
1107 |
1108 | A hypothesis test is a method of statistical inference where, most commonly, two statistical data sets are compared or data from a sample is compared to a synthetic data set from an idealized model. The steps are as follows:
1109 |
1110 | 1. State the null (H0) hypothesis and the alternative (H1)
1111 | 2. Choose the level of significance (alpha)
1112 | 3. Choose an appropriate statistical test and find the test statistic
1113 | 4. Compute the p-value and either reject or fail to reject the H0
1114 |
1115 | The following maps out type I and II errors.
1116 |
1117 | | | H0 is true | H0 is false |
1118 | | ------------------------------ |:----------------:| ----------------:|
1119 | | Fail to reject H0 | correctly accept | Type II error/beta |
1120 | | Reject H0 | Type I error/alpha | correctly reject* |
1121 |
1122 | \* This is 1 - beta (sometimes written as π); **this is the domain of power**
1123 |
1124 | The court of law worries about type I error while in medicine we worry about type II error. Tech generally worries about Type I error (especially in A/B testing) since we don't want a worse product. The **power** of a test is the probability of rejecting the null hypothesis given that it is false. If you want a smaller type II error, you're going to get it at the expense of a larger type I error. Power is the complement of beta.
1125 |
1126 | A **t-test** is any statistical hypothesis test in which the test statistic follows a **Student's t-distribution** under the null hypothesis. It can be used to determine if two sets of data are significantly different from each other.
1127 |
1128 | **Z-tests** are generally reserved for larger sample sizes (n > 30), where the CLT has come into effect, while t-tests can be conducted for smaller samples.
1129 |
1130 | Use a **T test** when sigma is unknown and n < 30. If you're not sure, just use a T test. Scipy assumes that you're talking about a population. You must set `ddof=1` for a sample. A **Z test** is used for estimating a proportion. **Welch's T Test** can be used when the variance is not equal (if unsure, set `equal_var=False` since it will only have a nominal effect on the result).
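
A short sketch of a two-sample test with scipy (the two samples are made-up toy data):

    import scipy.stats as scs

    a = [12.1, 11.8, 12.4, 12.0, 12.3]
    b = [11.4, 11.9, 11.6, 11.8, 11.5]

    t_stat, p_value = scs.ttest_ind(a, b, equal_var=False)   # Welch's t-test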
1131 |
1132 | The **Bonferroni Correction** reduces the alpha value we use based upon the test that we're correcting. We divide alpha by the number of tests. It is a conservative estimate.
1133 |
1134 | **Chi-squared tests** estimate whether two random variables are independent and estimate how closely an observed distribution matches an expected distribution (known as a goodness-of-fit test). `chisquare()` assumes a goodness-of-fit test while `chi2_contingency()` assumes a contingency table.
1135 |
1136 | **Experimental design** must address confounding factors that might also account for the variance. You want to minimize confounding factors, but there is such a thing as controlling for too many of them. You can get to the point where you can't have any data because you've over-controlled: for instance, women in SF, in tech, at start-ups, post-series-B funding, etc.
1137 |
1138 | Resources:
1139 | * [Power visualized](http://rpsychologist.com/d3/NHST/)
1140 |
1141 | ### Bayesian Statistics ###
1142 |
1143 | The first step is always to specify a probability model for unknown parameter values that includes some prior knowledge about the parameters if available. We then update this knowledge about the unknown parameters by conditioning this probability model on observed data. You then evaluate the fit of the model to the data. If you don't specify a prior, you will likely get a very similar result to the frequentists.
1144 |
1145 | When we look at the posterior distribution, the denominator is just a normalizing constant. The posterior is the new belief given the data we've seen. Priors come from published research, a researcher's intuition, an expert opinion, or a non-informative prior.
1146 |
1147 | #### Bayesian A/B Testing ####
1148 |
1149 | In frequentist A/B testing, you can only reject or fail to reject. You can't amass evidence for another hypothesis. In Bayesian A/B testing, you can use a uniform distribution for an uninformative prior. Depending on the study you're doing, an uninformative prior (which gives equal probability to all possible values) can be effectively the same as a frequentist approach. If you use a bad prior, it will take longer to converge on the true value. You can draw samples from the posterior for each variant, perform an element-wise comparison, and take the mean to see how often one beats the other. To test for a 5% improvement, add .05 element-wise before comparing (see the sketch below).
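
As a minimal sketch with a uniform Beta(1, 1) prior (the conversion counts are made-up for illustration):

    import numpy as np

    conv_a, n_a = 50, 1000    # conversions and trials for variant A
    conv_b, n_b = 65, 1000    # conversions and trials for variant B

    # with a Beta(1, 1) prior, the posterior is Beta(1 + successes, 1 + failures)
    post_a = np.random.beta(1 + conv_a, 1 + n_a - conv_a, size=100000)
    post_b = np.random.beta(1 + conv_b, 1 + n_b - conv_b, size=100000)

    (post_b > post_a).mean()          # estimated probability that B beats A
    (post_b > post_a + 0.05).mean()   # probability that B beats A by at least .05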
1150 |
1151 | The **multi-armed bandit** is the question of which option you take given prior knowledge. There are two operative terms. **Exploitation** leverages your current knowledge in order to get the highest expected reward at that time. **Exploration** is testing other options to determine how good each one is. Multi-armed bandit is now a big part of reinforcement learning (a branch of AI) more than it is part of stats. You can use this for dynamic A/B testing, budget allocation amongst competing projects, clinical trials, adaptive routing for networks minimizing delays, and reinforcement learning.
1152 |
1153 | **Regret** is the difference between the maximal reward mean and the reward at time t. You can never know what our actual regret is. We don't know the true mean of a click-through rate, for instance. Regret can be seen as how often you choose the suboptimal bandit (a cost function to minimize).
1154 |
1155 | There are four main multi-armed bandit algorithms:
1156 |
1157 | 1. `Epsilon-Greedy`: Epsilon is the percent of time that we explore, frequently set at 10%. Think of epsilon as how often you try a new restaurant. Normally, you eat at your favorite spot but you want to choose a new one sometimes. You try a new place, don’t like it, but it has low regret.
1158 | 2. `UCB1`: Part of a set of algorithms optimized by upper confidence. The UCB1 greedily chooses the bandit with the highest expected payout but with a clever factor that automatically balances exploration and exploitation by calculating exploration logarithmically. This is a zero-regret strategy.
1159 | 3. `Softmax`: creates a probability distribution over all the bandits in proportion to how good we think each lever is. It’s a multi-dimensional sigmoid (a “squashing” function). This is a probability matching algorithm. Annealing is where you vary tau as you go, similar to lowering the temperature slowly when blowing glass or refining steel.
1160 | 4. `Bayesian bandit`: where softmax has one distribution governing the process, here we maintain a posterior distribution for each bandit, sample from each, and choose the bandit with the highest sampled value.
1161 |
1162 | ---
1163 |
1164 | ## Modeling ##
1165 |
1166 | ## Exploratory Data Analysis ##
1167 |
1168 | EDA is the first cut analysis where you evaluate the following points:
1169 |
1170 | * What are the feature names and types?
1171 | * Are there missing values?
1172 | * Are the data types correct?
1173 | * Are there outliers?
1174 | * Which features are continuous and which are categorical?
1175 | * What is the distribution of the features?
1176 | * What is the distribution of the target?
1177 | * How do the variables relate to one another?
1178 |
1179 | Some common functions to do this are `pd.head()`, `.describe()`, `.info()`, and `pd.crosstab()`. If you see 'object' in the info result where you should have numbers, it is likely that you have a string hidden in the column somewhere. When removing NA's, be sure that you don't drop any data that might be included in your final analysis.
1180 |
1181 | There are a few options for dealing with NA's:
1182 |
1183 | * Drop that data
1184 | * Fill those values with:
1185 | * Column averages
1186 | * 0 or other neutral number
1187 | * Fill forwards or backwards (especially for time series)
1188 | * Impute the values with a prediction (e.g. mean, mode)
1189 | * Ignore them and hope your model can handle them
1190 |
1191 | One trick for imputation is to impute the variable and add another binary column indicating whether the value was imputed. Using this method, your model can unlearn the imputation if you shouldn't have done it.
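
A small pandas sketch of that trick (the dataframe and its `age` column are hypothetical):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'age': [34.0, np.nan, 52.0, 41.0]})   # hypothetical data with a missing value

    df['age_imputed'] = df['age'].isnull()            # indicator: was this value imputed?
    df['age'] = df['age'].fillna(df['age'].mean())    # impute with the column mean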
1192 |
1193 | Exploratory plots, such as a scatter matrix, are also useful here.
1194 |
1195 | ---
1196 |
1197 |
1198 | ### Linear Regression Introduction and Univariate ###
1199 |
1200 | Linear regression is essentially fitting lines to data, originally arising from trying to predict child height from parent height. In univariate linear regression, we are investigating the following equation:
1201 |
1202 | yˆ = βˆ0 + βˆ1x + ε
1203 |
1204 | Where ε is an error term that's iid and normally distributed with a mean of 0.
1205 |
1206 | **Ordinary Least Squares** is the most common method for estimating the unknown parameters in a linear regression with the goal of minimizing the sum of the squares of the differences between observed and predicted values. The difference between the ith observed response and the ith response value that is predicted by our linear model (ei = yi − yˆi) is called the **residual**. The **residual sum of squares (RSS)** is a measure of model misfit given by RSS = e1**2 + e2**2 + ··· + en**2, or RSS = (y1 − βˆ0 − βˆ1x1)**2 + (y2 − βˆ0 − βˆ1x2)**2 + ··· + (yn − βˆ0 − βˆ1xn)**2.
1207 |
1208 | Just like with estimated values, we can look at the standard error of our regression line to compute the range that the true population regression line likely falls within. The **residual standard error (RSE)** is given by the formula RSE = sqrt(RSS/(n − 2)). Using this SE, we can calculate the confidence interval, or, in other words, the bounds within which the true value likely falls. We can use SE's to perform hypothesis tests on the coefficients. Most commonly, we test the null hypothesis that there is no relationship between X and Y versus the alternative that there is a relationship (or H0: B1 = 0 versus H1: B1 != 0).
1209 |
1210 | **Mean squared error** takes the average of the squared distance between our predicted and actual values. It is a good error metric but is not comparable across different datasets since it depends on the scale of y.
1211 |
1212 | Roughly speaking, we interpret the **p-value** as follows: a small p-value indicates that it is unlikely to observe such a substantial association between the predictor and the response due to chance, in the absence of any real association between the predictor and the response. Hence, if we see a small p-value, then we can infer that there is an association between the predictor and the response. We reject the null hypothesis—that is, we declare a relationship to exist between X and Y —if the p-value is small enough.
1213 |
1214 | The **F-statistic** allows us to compare nested models to see if the additional terms are warranted. The F-test can also be used generally to see if the model is useful beyond just predicting the mean.
1215 |
1216 | A linear model can also capture non-linear relations between X and Y by adding polynomial terms (e.g. x**2); the model remains linear in its coefficients.
1217 |
1218 | **Assessing linear model fit** is normally done through two related terms: the RSE and the R**2 statistic. The RSE estimates the standard deviation of ε, our error term. It is roughly the average amount that our response will deviate from our regression line. It is computed as sqrt((1/(n-2)) * RSS). RSE is a measure of lack of model fit.
1219 |
1220 | Since RSE is in the units of Y, it is not always clear what constitutes a good value. **R-squared** (R**2) takes the form of a proportion of variance explained by the model, between 0 and 1. The R**2 formula is 1 - (RSS/TSS), where TSS is the sum of (yi − y¯)**2, the **total sum of squares.** In simple regression, it's possible to show that the squared correlation between X and Y and the R**2 value are identical; however, this doesn't extend to multivariate regression.
1221 |
1222 |
1223 |
1224 | ### Multivariate Regression ###
1225 |
1226 | Multivariate regression takes the following form:
1227 |
1228 | Y = β0 + β1X1 + β2X2 + ··· + βpXp + ε
1229 |
1230 | While univariate regression uses ordinary least squares to estimate the coefficients, multivariate regression uses multiple least squares streamlined with matrix algebra. To do this in python, use the following:
1231 |
1232 | est = sm.OLS(y, X) # from statsmodels; use X = sm.add_constant(X) first to include an intercept
1233 | est = est.fit()
1234 | est.summary()
1235 |
This printout tells you the dependent variable (y), the type of model, and the method. The F-statistic tests whether all of the coefficients are equal to 0 (i.e., whether at least one predictor is useful). AIC and BIC should be minimized. We want the t-statistics to be larger than about 2 in absolute value, and we then exclude the variables with the largest p-values. **Backwards stepwise** selection moves backwards through the model by removing the variables with the largest p-values one at a time. Skew and kurtosis are also important in reading a linear model because, if they're drastic, they violate the normality assumption behind our t-tests.
1237 |
1238 | The assumptions of a linear model are as follows:
1239 |
1240 | 1. `Linearity`
2. `Constant variance (homoscedasticity)`: heteroscedasticity can often be rectified with a log transformation
1242 | 3. `Independence of errors`
1243 | 4. `Normality of errors`
1244 | 5. `Lack of multicollinearity`
1245 |
1246 | **Residual plots** allow us to visualize whether there's a pattern not accounted for by our model, testing the assumptions of our model.
1247 |
A **leverage point** is an observation with an unusual X value. We calculate leverage using the **hat matrix**, whose diagonal entries give each observation's leverage. **Studentized residuals** are the most common way of quantifying outliers.
1249 |
**Variance Inflation Factors (VIF)** quantify multicollinearity as the ratio of the variance of β̂j when fitting the full model to the variance of β̂j when fit on its own. The smallest possible VIF is 1, which indicates a complete absence of multicollinearity. The rule of thumb is that a VIF over 10 is problematic. *Multicollinearity only affects the standard errors of your estimates, not your predictive power.*
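
A rough sketch of how you might check VIFs in practice with statsmodels (the predictor DataFrame and its column names here are hypothetical):

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # hypothetical predictor DataFrame
    X = pd.DataFrame({'tv': [230, 44, 17, 151, 180],
                      'radio': [38, 39, 46, 41, 11],
                      'newspaper': [69, 45, 69, 58, 58]})

    X_const = sm.add_constant(X)   # VIFs are usually computed with an intercept included
    vifs = {col: variance_inflation_factor(X_const.values, i)
            for i, col in enumerate(X_const.columns) if col != 'const'}
    print(vifs)                    # values near 1 mean little multicollinearity; over 10 is problematic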
1251 |
**QQ plots** allow you to test normality by dividing the normal curve into n + 1 sections and plotting those theoretical quantiles against your sample's quantiles, giving you a visual check for normality. **Omitted variable bias** occurs when a relevant variable left out of the model biases (often inflates) the coefficients of the included variables that are correlated with it.
1253 |
**Categorical variables** take a non-numeric value such as gender. When using a categorical variable, you encode it with dummy variables against a baseline constant of all ones (for example, dropping one ethnicity as the baseline and adding indicator variables for the other ethnicities). To vary the slope, you can add an **interaction term** such as income multiplied by whether the person is a student. An **interaction effect** would be, for instance, when radio advertising combined with TV advertising has a more pronounced effect than either does separately. This can be dealt with by multiplying the two variables.
1255 |
1256 | Here are some potential transformations:
1257 |
1258 | | Model | Transformation(s) | Regression equation | Predicted value (y^) |
1259 | | ---------- |:-----------------:| ----------------------:| --------------------:|
1260 | | Standard linear | none | y = β0 + β1x | y^ = β0 + β1x |
1261 | | Exponential | Dependent variable = log(y) | log(y) = β0 + β1x | y^ = 10**(β0 + β1x) |
| Quadratic | Dependent variable = sqrt(y) |sqrt(y) = β0 + β1x | y^ = (β0 + β1x)**2 |
1263 | | Reciprocal | Dependent variable = 1/y|1/y = β0 + β1x | y^ = 1 / (β0 + β1x) |
1264 | | Logarithmic | Independent variable = log(x) |y = β0 + β1log(x) | y^ = β0 + β1log(x) |
1265 | | Power | Dep and Ind variables = log(y) and log(x) |log(y) = β0 + β1log(x) | y^ = 10**(β0 + β1log(x)) |
1266 |
Note that a logarithm is the inverse operation to exponentiation: the logarithm of a number is the exponent to which a fixed base must be raised to produce that number. In stats and math we often assume the natural log; in computer science we often assume log base 2.
1268 |
![Transformations](https://github.com/conorbmurphy/galvanizereference/blob/master/images/transformations.png)
1270 |
1271 | This is closely related to **conic sections**. A conic is the intersection of a plane and a right circular cone. The four basic types of conics are parabolas, ellipses, circles, and hyperbolas.
1272 |
1273 | 
1274 |
1275 | Reference:
1276 | * [Interpreting Q-Q plots](http://emp.byui.edu/brownd/stats-intro/dscrptv/graphs/qq-plot_egs.htm)
1277 | * [Introduction to Conics](http://www.sparknotes.com/math/precalc/conicsections/section1.rhtml)
1278 |
1279 |
1280 | ### Logistic Regression ###
1281 |
Linear regression is a good choice when your dependent variable is continuous and your independent variable(s) are either continuous or categorical. *Logistic regression, a generalized linear model, is used when your dependent variable is categorical.* Examples of when to use logistic regression include customer churn, species extinction, patient outcome, prospective buyer purchase, eye color, and spam email. Since linear regression is not bounded to our discrete outcome, we need something that takes a continuous input and produces an output between 0 and 1, has an intuitive transition, and has interpretable coefficients. The **logit (sigmoid) function** asymptotically approaches 0 and 1, coming from the sigmoid family. The model is fit via maximum likelihood.
1283 |
If your results say "optimization not terminated successfully," do not use them, as it's likely an issue with multicollinearity. We look at the difference between LL-Null and Log-Likelihood, which tell us how far a perfect model could go (toward 0) and how far we actually went (Log-Likelihood). A likelihood-ratio test will return the p-value for one model being better than the other when one model is the other with some parameters removed (i.e., nested).
1285 |
*The coefficients for logistic regression are in log odds*: exponentiating a coefficient gives the multiplicative change in the odds for a one-unit increase in that predictor. The predicted probability itself is:
1287 |
1288 | exp(β0 + β1X1 + β2X2 + ··· + βpXp)
1289 | ----------------------------------
1290 | 1 + exp(β0 + β1X1 + β2X2 + ··· + βpXp)
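
A minimal sketch of fitting a logistic regression with statsmodels and reading the coefficients as log odds (the small x and y arrays here are hypothetical):

    import numpy as np
    import statsmodels.api as sm

    # hypothetical data: one feature, binary outcome
    X = sm.add_constant(np.array([1., 2., 3., 4., 5., 6., 7., 8.]))
    y = np.array([0, 0, 0, 0, 1, 0, 1, 1])

    model = sm.Logit(y, X).fit()        # fit by maximum likelihood
    print(model.summary())              # log-likelihood, LL-Null, coefficients in log odds
    print(np.exp(model.params))         # exponentiate to get odds ratios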
1291 |
1292 | Similar to hypothesis testing, a **confusion matrix** gives you your true and false positive and negatives:
1293 |
1294 | | | Predicted positive | Predicted negative |
1295 | |--------------------|:-----------:|---------:|
1296 | | Actually positive | True positive | False negative |
1297 | | Actually negative | False positive | True negative |
1298 |
When you fill in this matrix, you can calculate accuracy (true positives and negatives over n), the misclassification rate (accuracy's complement), the true positive rate or sensitivity (true positives over actual positives), the true negative rate or specificity, precision (true positives over predicted positives), etc.
1300 |
The **Receiver Operating Characteristic (ROC) curve** plots the true positive rate (sensitivity) on the y axis against the false positive rate on the x axis as you sweep the classification threshold. After you run your regression model, you get a score, which is a probability; each threshold on that score gives a different confusion matrix. You then plot the resulting curves to see which model is better, possibly deciding whether you're more worried about false positives or false negatives.
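
A minimal sketch of building a confusion matrix and an ROC curve from predicted probabilities with scikit-learn (the `y_true` and `y_scores` arrays are hypothetical):

    import numpy as np
    from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

    y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])                    # actual labels
    y_scores = np.array([.1, .4, .35, .8, .2, .7, .55, .9])        # model probabilities

    # confusion matrix at a 0.5 threshold
    print(confusion_matrix(y_true, y_scores >= 0.5))

    # ROC curve: false positive rate (x axis) vs true positive rate (y axis)
    fpr, tpr, thresholds = roc_curve(y_true, y_scores)
    print(fpr, tpr)
    print(roc_auc_score(y_true, y_scores))                         # area under the curve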
1302 |
1303 |
1304 |
1305 | ### Cross-validation ###
1306 |
1307 | Cross-validation is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent dataset. The biggest mistake you can make in terms of validating a model is to keep all of your data together. Do not evaluate yourself on your training set. The **train/test split** divides your data into separate sets for you to train and evaluate your model. When talking about splits, convention is to flip our dataset on its side to have variables as rows. You can:
1308 |
1309 | * Split the data evenly into two equal-size groups
1310 | * Split the data into n groups (each data point in its own group)
* `Leave-one-out cross-validation (LOOCV)`: Train a completely new model on all the data minus one data point and evaluate it on that held-out point; repeat for every point and average the errors. At the end, you retrain the model on the whole dataset.
* `k-fold cross-validation`: Split the data into k folds (10 is the common convention), train on k − 1 folds, evaluate on the held-out fold, and rotate through all folds, averaging the results
1313 |
When tuning hyperparameters, set aside an additional validation subset of the data, keeping a final test set for evaluation at the end.
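
A minimal sketch of a train/test split plus k-fold cross-validation with scikit-learn (using a bundled toy dataset so it runs as-is):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.linear_model import LogisticRegression

    X, y = load_breast_cancer(return_X_y=True)

    # hold out a test set for the final evaluation
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model = LogisticRegression(max_iter=5000)
    scores = cross_val_score(model, X_train, y_train, cv=10)   # 10-fold CV on the training set
    print(scores.mean())

    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))                         # final check on the held-out test set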
1315 |
Scikit-learn's cross-validation documentation: http://scikit-learn.org/stable/modules/cross_validation.html
1317 | Time series cross-validation example: http://robjhyndman.com/hyndsight/tscvexample/
1318 |
1319 | ### Bias/Variance Tradeoff ###
1320 |
1321 | The error of any model is the function of three things: the **variance** (the amount that the function would change if trained on a different dataset) plus **bias** (the error due to simplified approximation) plus the irreducible error (epsilon). Your bias goes up as your variance goes down and vice versa. **Underfitting** is when the model does not fully capture the signal in X, being insufficiently flexible. **Overfitting** is when the model erroneously interprets noise as signal, being too flexible.
1322 |
In linear regression, we minimize the RSS by choosing the right betas. You want to avoid very large betas because they tend to mean high variance. **Ridge (or L2) regression** and **lasso (L1) regression** avoid high variance by penalizing large betas. In ridge, you add a penalty term that sums the squared betas: as the betas increase, the sum of squared betas increases as well. Since you multiply that summation by a hyperparameter lambda, the bigger the lambda the more large betas are penalized. This shrinks the coefficients and smooths the fit. Since large betas are penalized, we have to standardize the predictors.
1324 |
Lasso regression is almost identical to ridge except the penalty uses the magnitude (absolute value) of the betas. When you plot the penalty, lasso looks like a V while ridge looks like a parabola. Lasso keeps penalizing coefficients even when they're close to 0, so it can shrink them all the way to zero, which is helpful for feature selection and sparser models. *Many consider lasso better than ridge, but it depends on the situation.*
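
A minimal sketch of ridge and lasso with standardized predictors in scikit-learn (note that scikit-learn calls the penalty weight `alpha` rather than lambda, and the diabetes dataset here is just a stand-in):

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Ridge, Lasso
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline

    X, y = load_diabetes(return_X_y=True)

    # standardize first, since the penalty depends on the size of the betas
    ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
    lasso = make_pipeline(StandardScaler(), Lasso(alpha=1.0)).fit(X, y)

    print(ridge.named_steps['ridge'].coef_)   # shrunk toward zero
    print(lasso.named_steps['lasso'].coef_)   # some coefficients exactly zero (feature selection)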
1326 |
1327 | ### Gradient Ascent ###
1328 |
Minimization algorithms are optimization techniques. In linear regression, we minimized the sum of squared errors with respect to our betas. In logistic regression, we maximized the likelihood function. **Gradient ascent** is built on the idea that if we want to find the maximum of a function, the best way to move is in the direction of the gradient. Gradient descent is the same idea except we subtract the gradient so as to minimize a function.
1330 |
It strictly works on only a small subset of problems but is used everywhere because we don't have many general-purpose optimization methods. Here are some characteristics:
1332 |
1333 | * It is like a derivative but with many dimensions
1334 | * It works on "smooth functions" with a gradient
1335 | * Must be some kind of convex function that has a minimum
1336 |
1337 | If the above criteria are met, it will get asymptotically close to the global optimum. The biggest concern is getting stuck in a local minimum where the derivative is zero. In a broad sense, it works by taking a point on a curve and varying it slightly to see if its value has increased or decreased. You then use that information to find the direction of the minimum. **Alpha** is the size of the step you take towards that minimum.
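
A minimal sketch of plain gradient descent on a simple convex function, f(x) = (x − 3)**2, whose gradient is 2(x − 3); the starting point and alpha here are arbitrary:

    def gradient(x):
        return 2 * (x - 3)          # derivative of f(x) = (x - 3)**2

    x = 10.0                        # arbitrary starting point
    alpha = 0.1                     # step size
    for _ in range(100):
        x = x - alpha * gradient(x) # step against the gradient to minimize
    print(x)                        # converges toward the minimum at x = 3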
1338 |
**Stochastic gradient descent (SGD)** saves computation by taking one observation from the data and calculating the gradient based on that. The path is more chaotic; however, it will converge as long as there's no local minimum to get stuck in. **Minibatch SGD** takes a subset of your data and computes the gradient based on that. Relative to SGD, it takes a more direct path to the optimum while still keeping us from having to calculate the gradient on the whole dataset.
1340 |
In terms of wall-clock time (not necessarily number of iterations), SGD is often fastest, ahead of minibatch and then standard gradient descent. SGD often gets an accurate enough answer and is good for big data as it converges faster on average and can work online (updating with new data and sunsetting old observations by occasionally pushing itself away from the optimum to check for change).
1342 |
The **Newton-Raphson method** effectively chooses the learning rate (alpha) for us in gradient descent by using curvature information: at each step it jumps to where the tangent line of the derivative crosses the x axis, which naturally takes smaller steps as we get close to the minimum.
1344 |
1345 | https://www.wolframalpha.com/
1346 |
1347 | ### Machine Learning (ML) ###
1348 |
**Machine learning (ML)** is the "field of study that gives computers the ability to learn without being explicitly programmed" (Arthur Samuel). There are three broad categories of ML depending on the feedback available to the system:
1350 |
1351 | 1. `Supervised learning`: The algorithm is given example inputs and outputs with the goal of mapping the two together. k-Nearest Neighbors and decision trees are non-parametric, supervised learning algorithms.
1352 | 2. `Unsupervised learning`: No labels are given to the algorithm, leaving it to find structure within the input itself. Discovering hidden patterns or feature engineering is often the goal of this approach
1353 | 3. `Reinforcement learning`: A computer interacts with a dynamic environment in which it must perform a goal (like driving a car) without being explicitly told if it has come close to its goal
1354 |
1355 | The best model to choose often depends on computational costs at training time versus at prediction time in addition to sample size. There are a number of possible outputs of ML:
1356 |
1357 | * `Classification`: results are classified into one or more classes, such as spam filters. For this, we have logistic regression, decision trees, random forest, KNN, SVM's, and boosting
1358 | * `Regression`: provides continuous outputs.
1359 | * `Clustering`: returns a set of groups where, unlike classification, the groups are not known beforehand
1360 | * `Density estimations`: finds the distribution of inputs in some space
1361 | * `Dimensionality reduction`: simplifies inputs by mapping them into a lower-dimensional space
1362 |
An **ensemble** leverages the idea that many predictors combined can make a better model than any one predictor independently, combining them with simple or weighted averages. *Often the best supervised pipelines have unsupervised learning as a pre-processing step, such as doing dimensionality reduction, then k-means, and finally running kNN against the centroids from k-means.* There are a few types of ensembles:
1364 |
1365 | 1. `Committees`: this is the domain of random forest where regressions have an unweighted average and classification uses a majority
1366 | 2. `Weighted averages`: these give more weight to better predictors (the domain of boosting)
1367 | 3. `Predictor of predictors`: Treats predictors as features in a different model. We can do a linear regression as a feature and random forest as another (part of the MLXtend package)
1368 |
1369 | References: https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
1370 |
### Supervised Learning ###
1372 |
Which algorithm to use when depends on trade-offs discussed with each model below (accuracy, interpretability, training versus prediction cost, and data size).
1374 |
1375 | ### k-Nearest Neighbors (KNN) ###
1376 |
KNN is a highly accurate ML algorithm that is insensitive to outliers and makes no assumptions about the data. You can also do online updates easily (you just store another data point), use as many classes as you want, and learn a complex function with no demands on relationships between variables (like linearity). The downside is that it is computationally very expensive because it's **IO bound** (you have to read every data point to make a prediction) and noise can affect results. Categorical variables make feature interpretation tricky. It works with numeric and nominal values.
1378 |
1379 | KNN is a classifier algorithm that works by comparing a new piece of data to every piece of data in a training set. We then look at the top k most similar pieces of data and take a majority vote, the winner of which will get its label assigned to the new data. Your error rate is the number of misclassifications over the number of total classifications.
1380 |
1381 | The method:
1382 |
1383 | 1. Collect: any method
1384 | 2. Prepare: numeric values are needed for a distance calculation; structured data is best. We need to feature scale in order to balance our distances.
1385 | 3. Analyze: any method
1386 | 4. Train: notable about KNN is that there is no training step. You can change k and your distance metric, but otherwise there's not much to adapt
1387 | 5. Test: calculate the error rate
1388 |
Your prediction is the majority vote (in the case of classification) or the average (in the case of regression) of the k nearest points. Defining k is a challenge. When k is 1, you get localized neighborhoods around what could be outliers (high variance), as well as 100% accuracy if you evaluate yourself on your training data. If k is n then you're only guessing the most common class (high bias). A good place to start is `k = sqrt(n)`.
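
A minimal KNN sketch with scikit-learn, using the sqrt(n) heuristic for k and feature scaling to balance the distances (the iris dataset is just a stand-in):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    k = int(np.sqrt(len(X_train)))                 # k = sqrt(n) starting point
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    knn.fit(X_train, y_train)                      # "training" just stores the data
    print(1 - knn.score(X_test, y_test))           # error rate = misclassifications / total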
1390 |
1391 | Distance metrics can include any of the following:
1392 |
1393 | * `Euclidean`: a straight line (and a safe default)
* `Manhattan`: the sum of the absolute differences along each dimension, like the distance a taxi drives following the blocks of a city grid
* `Cosine`: uses the angle between the two points as seen from the origin, so magnitude (distance from the origin) doesn't matter, only direction. *Cosine distance is key in NLP.*
1396 | * `Custom`: you can always design your own distance metric as well
1397 |
1398 | The **curse of dimensionality** is that as dimensionality increases, the performance of KNN commonly decreases. This starts to affect you when you have p (number of dimensions) as 5 or greater. The nearest neighbors are no longer nearby neighbors. Adding useful features (that are truly associated with the response) is generally helpful but noise features increase dimensionality without an upside. The more dimensions you have (p), the smaller the amount of space you have in your non-outlier region. *This is the kryptonite of KNN*. You'll often have to reduce dimensions to use it, especially with NLP. **Locality-sensitive hashing** is a way to reduce the time to lookup each datapoint.
1399 |
1400 | KNN can be used for classification, regression (neighbors averaged to give continuous value), imputation (replace missing data), and anomaly detection (if the nearest neighbor is far away, it might be an outlier).
1401 |
1402 |
1403 | ### Decision Trees ###
1404 |
Decision trees use **information theory** (the science of splitting data) to classify data into different sets, subsetting those sets further as needed. One benefit of decision trees over KNN is that they're *incredibly interpretable*. They are computationally inexpensive to predict with, feature interaction is already built in, and they can handle mixed data (discrete, continuous, and categorical). The downside is that they are prone to overfitting. They are also terrible at extrapolation (e.g. if I make $10 for 1 hour of work and $20 for 2 hours of work, they'll still predict $20 for 10 hours of work). Like KNN, they work with numeric and nominal values; however, numeric values have to be translated into nominal splits. They are computationally challenging at the training phase and inexpensive at the prediction phase, and they can deal with irrelevant features and NAs (the opposite of KNN).
1406 |
Trees consist of **nodes**. The **root** is the node at the top of the tree (botanically inaccurate, since real roots are at the bottom). A **stump** is a tree with only a single split at the root. The **leaves** are the terminal nodes at the ends of the tree.
1408 |
**Information gain** is the difference between the pre-split and post-split entropy. The **entropy** of a set is a measure of its amount of disorder. We want to create splits that minimize the entropy on each side of the split. **Cross-entropy** is a measure of node purity using the log, and the **Gini index** does the same with a slightly different formula (both are effectively equivalent in practice). You can do regression decision trees by using RSS against the mean value of each leaf instead of cross-entropy or Gini. You can also use the variance before and after the split.
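
A minimal sketch of computing entropy and the information gain of a candidate split (the class-label arrays here are hypothetical):

    import numpy as np

    def entropy(labels):
        """Shannon entropy of a set of class labels."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])    # mixed node
    left   = np.array([1, 1, 1, 0])                # one side of a candidate split
    right  = np.array([1, 0, 0, 0])                # the other side

    weighted_child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
    info_gain = entropy(parent) - weighted_child   # pre-split minus post-split entropy
    print(info_gain)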
1410 |
1411 | The method:
1412 |
1413 | 1. Collect: any method
1414 | 2. Prepare: any continuous values need to be quantized into nominal ones
1415 | 3. Analyze: any method, but trees should be examined after they're built
4. Train: construct a tree by calculating the information gain for every possible split and select the split that has the highest information gain. Splits for categorical features look like `variable = value` or `variable != value` and for continuous variables `variable <= threshold`
1417 | 5. Test: calculate the error rate
1418 |
1419 | With *continuous variables*, decision trees divide the space and use the mean from that given space. For continuous variables, there are three general split methods for decision tree regression:
1420 |
1421 | 1. Minimize RSS after the split
1422 | 2. Reduce the weighted variance after each split (weighted by the number of each side of split over the total number of values going to that split)
1423 | 3. Reduce weighted std after the split
1424 |
1425 |
1426 | Decision trees are high variance since they are highly dependent on the training data. We can ease up on the variance by **pruning**, which is necessary whenever you make trees. **Prepruning** is when you prune as you build the tree. You can do this with leaf size (stopping when there's few data points at a node), depth (stop when a tree gets too deep), class mix (stop when some percent of data points are the same class), and error reduction (stop when the information gains are small). For **postpruning**, you can build a full tree, then cut off some leaves (meaning merge them) using a formula similar to ridge regression. You will likely extend your trees with bagging, random forests, or boosting.
1427 |
1428 | Algorithms for splitting data include ID3, C4.5, and CART.
1429 |
1430 | ### Bagging and Random Forests ###
1431 |
1432 | **Bagging** is Bootstrap AGGregatING where you take a series of bootstrapped samples from your data following this method:
1433 |
1434 | 1. Draw a random bootstrap sample of size n
1435 | 2. Grow a decision tree using the same split criteria at each node by maximizing the information gain (no pruning)
1436 | 3. Repeat steps 1 & 2 k times
1437 | 4. Aggregate the prediction by each tree to assign the class label by majority vote (for classification) or average (for regression)
1438 |
1439 | Bagging starts with high variance and averages it away. Boosting does the opposite, starting with high bias and moving towards higher variance.
1440 |
**Random forest** is the same as bagging but with one new step: in step 2 it randomly selects a subset of d features to consider at each split (further decorrelating the trees). We can do this process in parallel to save computation. Start with 100-500 trees and plot the number of trees versus your error. You can then grow many more trees at the end of your analysis. Generally you don't need interaction variables, though there may be times when you need them.
1442 |
Random forest offers **feature importance**, or the relative importance of the variables in your data. This makes your forest more interpretable and gives you free feature selection. Note that you are only interested in rank, not magnitude, and that multicollinearity will inflate the importance of certain variables. There are two common ways of calculating it:
1444 |
1. Keep an array with one entry per feature and accumulate the information gain contributed by each variable (weighted by the number of points at each split) across the forest.
2. Calculate the OOB error for a given tree by evaluating predictions on the observations that were not used in building that base learner. Then scramble one feature (e.g. give it random values between its min and max) and measure how much worse that makes the OOB predictions.
1447 |
1448 | One downside of random forest is that it sacrifices the interpretability of individual trees. *Explain or predict, don't do both*. Some models have to be explained to stakeholders. Others just need high predictive accuracy. Try to separate these two things whenever possible.
1449 |
**Out of Bag (OOB) Error** pertains to bootstrap samples. Since each bootstrap sample leaves some observations out, you can use those OOB samples for validation using `oob_score=True`. *You will rarely cross-validate a random forest because OOB error effectively acts as the cross-validation*, meaning that you only need a train/test split, not a separate validation set.
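
A minimal random forest sketch showing the OOB score and feature importances with scikit-learn (the breast cancer dataset is just a stand-in):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier

    data = load_breast_cancer()
    rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
    rf.fit(data.data, data.target)

    print(rf.oob_score_)                     # OOB accuracy stands in for cross-validation
    importances = sorted(zip(rf.feature_importances_, data.feature_names), reverse=True)
    print(importances[:5])                   # rank matters more than magnitude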
1451 |
1452 | For categorical data, strings need to be converted to numeric. If possible, convert to a continuous variable (e.g. S, M, L into a weight and height).
1453 |
Galit Shmueli's paper "To Explain or to Predict?"
1455 |
1456 | ### Boosting ###
1457 |
1458 | **Boosting** is generally considered to be the best out of box supervised learning algorithm for classification and regression. In practice, you often get similar results from bagging and random forests while boosting generally gives you better performance. Boosting does well if you have more features than sample size, and is especially helpful for website data where we're collecting more and more information.
1459 |
While it's most natural to think of boosting in the context of trees, it applies to other 'weak learners' including linear regression. It does not apply to strong learners like SVMs. While random forest and bagging create a number of trees independently and look for consensus amongst them, boosting trains its trees sequentially: each new tree relies on the ones before it, so it must be done in series rather than in parallel (making it slower to train). Boosting does not involve bootstrap sampling: each tree is fit on the error of the previous model.
1461 |
The first predicting function does practically nothing, so the error (residual) is basically everything. Here's the key terminology:
1463 |
1464 | * `B`: your number of trees, likely in the hundreds or thousands. Since boosting can overfit (though it's rare), pick this with cross-validation
1465 | * `D`: your depth control on your trees (also known as the interaction depth since it controls feature interaction order/degree). Often D is 1-3 where 1 is known as stumps.
1466 | * `λ`: your learning rate. It is normally a small number like .1, .01, or .001. Use CV to pick it. λ is in tension with B because learning more slowly means needing more trees
* `r`: the residuals. Note that you're fitting each new tree to r instead of y.
1468 |
1469 |
1470 | For every B, you fit a tree with d splits to r. You update that with a version of the tree shrunken with λ and update the residuals. The boosted model is the sum of all your functions multiplied by λ. At each step, you upweight the residuals you got wrong. We control depth and learning rate to avoid a 'hard fit' where we overfit the data. This is the opposite of random forest where we make complex models and then trim them back.
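
A minimal boosting sketch using scikit-learn's GradientBoostingClassifier (one possible implementation), mapping B, D, and λ onto `n_estimators`, `max_depth`, and `learning_rate`:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)

    gb = GradientBoostingClassifier(n_estimators=500,   # B: number of trees
                                    max_depth=1,        # D: stumps
                                    learning_rate=0.1)  # lambda: shrinkage
    print(cross_val_score(gb, X, y, cv=5).mean())       # pick B and lambda with cross-validation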
1471 |
1472 | There are many different types of boosting with the best option depending on computation/speed, generality, flexibility and nomenclature:
1473 |
* `AdaBoost`: 'adaptive boosting' upweights the points you're getting wrong to focus more on them, and weighs each individual weak learner based on its performance. This is a special case of gradient boosting, which was discovered after AdaBoost. AdaBoost is more nuanced in how it shrinks each function since the shrinkage is not just a constant lambda. You have two sets of weights: d, the per-observation weights (updated as you go so misclassified points count more), and alpha, the weight given to each weak learner (based on its error rate). *Anything better than random is valuable.* Alpha flips sign at an error rate of .5, meaning that if a given learner is wrong more often than it's right, we predict the opposite of what it says.
1475 | * `Gradient Boosting`: improves the handling of loss functions for robustness and speed. It needs differentiable loss functions
  * `XGBoost`: an optimized implementation of gradient boosting, created by a graduate student and made famous through Kaggle competitions
1477 |
1478 | To *compare our models*, lets imagine you want a copy of a painting. Bagging would be the equivalent of asking a bunch of painters to observe the painting from the same location and then go paint it from memory. Random forests would be the same except these painters could stand anywhere around it (further de-correlating their results). In both of these cases, you would average the results of all the painters. Boosting, by contrast, would be the equivalent of asking one single painter to sit by the painting, waiting for each stroke to dry, and then painting the 'error' between what they've already painted and the true painting.
1479 |
1480 | A particularly helpful visualization: https://github.com/zipfian/DSI_Lectures/blob/master/boosting/matt_drury/Boosting-Presentation-Galvanize.ipynb
1481 |
1482 | ### Maximal Margin Classifier, Support Vector Classifiers, Support Vector Machines ###
1483 |
Support Vector Machines (SVMs) are typically thought of as a classification algorithm, however they can be used for regression as well. SVMs used to be as heavily researched as neural networks are today and are considered to be one of the best "out of the box" classifiers. Handwriting recognition in post offices was done with SVMs. They are useful when you want the sparsity of solutions achievable with an l1 penalty and when you know that kernels and margins can be an effective way to approach your data.
1485 |
In p-dimensional space, a **hyperplane** is a flat affine subspace of dimension p-1. With a data set of an n x p matrix, we have n points in p-dimensional space. Points of different classes, if completely separable, can be separated by an infinite number of hyperplanes. The **maximal margin hyperplane** is the separating hyperplane that is farthest from the training observations. The **margin** is the orthogonal distance from the hyperplane to the closest training points of the two classes (when the inner product of two vectors is 0, they're orthogonal). The **maximal margin classifier**, then, classifies a test observation based on which side of the maximal margin hyperplane it lies.
1487 |
This margin relies on just a handful of points, those closest to the hyperplane. These points are called the **support vectors**.
1489 |
1490 | While a maximal margin classifier relies on a perfect division of classes, a **support vector classifier** (also known as a **soft margin classifier**) is a generalization to the non-separable case. Using the term **C** we control the number and severity of violations to the margin. In practice, we evaluate this tuning parameter at a variety of values using cross-validation. C is how we control the bias/variance trade-off for this model, as it increases we become more tolerant of violations, giving us more bias and less variance.
1491 |
1492 | Finally, **support vector machines (SVM)** allow us to address the problem of possibly non-linear boundaries by enlarging the feature space using quadratic, cubic, and higher-order polynomial functions as our predictors. **Kernels** are a computationally efficient way to enlarge your feature space as they rely only on the inner products of the observations, not the observations themselves. There are a number of different kernels that can be used.
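
A minimal SVM sketch with an RBF kernel in scikit-learn; note that scikit-learn's `C` behaves inversely to the budget-style C described above (smaller values tolerate more margin violations), and the features are scaled first:

    from sklearn.datasets import load_breast_cancer
    from sklearn.svm import SVC
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)

    # scale features, then fit a soft-margin SVM with a non-linear RBF kernel
    svm = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0))
    print(cross_val_score(svm, X, y, cv=5).mean())      # tune C (and gamma) with cross-validation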
1493 |
The slack variables allow us to relax the margin constraint for a given observation. Class imbalance is an issue for SVMs (when one class has far more observations than the other). Since the penalty depends on the size of the betas, you should scale your features. SVMs encode yi as (-1, 1) to denote which side of the hyperplane an observation is on, a change from logistic regression where it's encoded as (0, 1).
1495 |
1496 | If there are more than two classes of data, there are two options to approach the problem:
1497 |
1. `One versus the rest`: train k models, one per class, and choose the class whose model predicts the highest score or probability
2. `One versus one`: train a model for every pair of classes and predict the class that wins the most pairwise votes
1500 |
There are many *differences between logistic regression and SVMs*. A logistic regression only asymptotically approaches 0 or 1. Perfectly separable data is a problem for logistic regression, where it won't converge. SVMs give you a class without a probability; logistic regression assigns a probability for a given class.
1502 |
1503 | MIT lecture on SVM's: https://www.youtube.com/watch?v=_PwhiWxHK8o
1504 |
1505 | ### Neural Networks ###
1506 |
1507 | Neural networks, initially inspired to emulate the function of the brain, have a lot of flexibility in modeling different data types. They are particularly adept at multi-class classification. Neural nets have very high variance, have long training time, and are not great with little data.
1508 |
Let's take linear regression as a jumping off point. There, we put weights on each of our features to predict y-hat and use gradient descent on our loss function, following the derivative (and therefore the direction of the slope) to update the coefficients. Instead of the weights feeding into one output node like in linear regression, neural nets have multiple nodes, each of which combines all of the inputs in a slightly different way. A sigmoid (or other activation) squashes each node's weighted sum before the values are sent on to the next layer; this is what adds non-linearity.
1510 |
**Back propagation** is an application of the chain rule. The forward pass over the neurons produces a prediction from the current weights. The backward pass takes the gradient of the loss and pushes it back through the nodes, using the chain rule to calculate partial gradients for each weight. A **densely (fully) connected layer** is one where every input connects to every node in the layer.
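
A minimal numpy sketch of the chain rule for a single sigmoid neuron (all values are hypothetical): a forward pass, a squared-error loss, then a backward pass multiplying local derivatives together to get the weight gradient:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x, y = np.array([0.5, -1.0, 2.0]), 1.0     # one input example and its target
    w, b = np.array([0.1, 0.2, -0.1]), 0.0     # weights and bias

    # forward pass
    z = np.dot(w, x) + b
    y_hat = sigmoid(z)
    loss = 0.5 * (y_hat - y) ** 2

    # backward pass (chain rule): dL/dw = dL/dy_hat * dy_hat/dz * dz/dw
    dL_dyhat = y_hat - y
    dyhat_dz = y_hat * (1 - y_hat)
    grad_w = dL_dyhat * dyhat_dz * x
    grad_b = dL_dyhat * dyhat_dz

    w, b = w - 0.1 * grad_w, b - 0.1 * grad_b  # one gradient descent step
    print(loss, grad_w)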
1512 |
You can choose among a number of *activation functions*, such as sigmoids; however, [ReLU](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) has been shown to give faster training times by taking the max of 0 and x. This helps prevent 'dead neurons' whose weights go to zero and become inactive. A reasonably accurate network might be 5 to 6 layers deep. If training fails, remove one layer at a time and add width instead.
1514 |
1515 | There are a number of different types of neural nets including:
1516 |
1517 | * `Recurrent`: flexible, often used in NLP
* `Recursive`: operates over hierarchical/tree structures (e.g. parse trees)
1519 | * `Convolutional`: designed for image data
* `Adversarial (GANs)`: two networks, a generator and a discriminator, trained against each other
1521 |
**Convolutional neural nets** take their name from the filters that are passed over the data. They take a smaller filter and shift it over the image, multiplying it element-wise against the image's values and summing the response. This process, called **convolution**, extracts local features (a blur is one example of such a filter) and reduces dimensions. We used to design these filters by hand; now neural nets learn them automatically through back propagation. Pooling is another approach to reducing dimensionality. An **epoch** is when a neural net has seen every point in your data set once.
1523 |
1524 | 
1525 |
1526 | Stanford course: http://cs231n.stanford.edu/
1527 |
TensorFlow is the most prominent package, with high-level APIs like Keras wrapping it (Lasagne plays a similar role for Theano). ImageNet and MNIST are standard sources of machine vision data.
1529 |
1530 | http://www.deeplearningbook.org/
1531 | http://neuralnetworksanddeeplearning.com/
1532 | https://gym.openai.com/
1533 |
1534 | TensorFlow Tutorials:
1535 | * https://www.tensorflow.org/tutorials/mnist/beginners/
1536 | * https://www.tensorflow.org/tutorials/mnist/pros/
1537 |
1538 | ## Unsupervised Learning ##
1539 |
Unsupervised learning is of more interest to computer science than to statistics. It uses a lot of computing power to find the underlying structure of a data set. In supervised learning, you cross-validate *with y (our target) as the supervisor*. In unsupervised learning, we don't have a y and therefore can't cross-validate in the same way. There are two common and contrasting unsupervised techniques: principal component analysis and clustering.
1541 |
1542 | ### KMeans Clustering ###
1543 |
**Kmeans clustering** aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster called the **centroid**. Kmeans is helpful when dealing with space (such as in mapping) or in clustering customer profiles together. Kmeans can work well with kNN, where you run kNN against the centroids from kmeans. Limitations include that new points far from every cluster are still assigned to the nearest centroid (the cluster regions extend outward indefinitely) and that it is often difficult to find a distance metric for categorical data.
1545 |
1546 | In essence, points in your data set are assigned one of k clusters. The centroid is then calculated as the center of the points in its cluster. The points are then reassigned to their closest centroid. This continues until a maximum number of iterations or convergence.
1547 |
1548 | There are a number of ways to initialize kmeans. One is to initialize randomly by assigning every point to a cluster at random. Another is to pick k random points and call them centroids. **Kmeans++** picks one centroid at random and then improves the probability of getting another centroid as you get farther from that point, allowing us to converge faster. It's worth using Kmeans++ whenever possible as it improves run time (though it doesn't make much difference in the final solution).
1549 |
The within-group sum of squares tells you the spread of your clusters. This both indicates convergence and helps pick the best number of clusters. The **elbow method** involves plotting the sum of squares on the y axis and k on the x axis, where the elbow normally shows the point of diminishing returns for k. Beyond this method, the **silhouette coefficient** asks how tightly your group is clustered compared to its distance from the next nearest group: if points are close to their own cluster and far from the next cluster, that indicates a good clustering.
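
A minimal k-means sketch with the elbow and silhouette checks in scikit-learn (`inertia_` is the within-cluster sum of squares; the blobs are synthetic stand-in data):

    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

    for k in range(2, 8):
        km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0).fit(X)
        print(k, km.inertia_, silhouette_score(X, km.labels_))
        # look for the elbow in inertia_ and the highest silhouette score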
1551 |
You can use standard distance metrics like Euclidean, Manhattan, and cosine similarity. Other options like **DBSCAN** or **OPTICS** do the clustering while also dealing with noise.
1553 |
1554 | See also *Locality-sensitive hashing* for a comparable clustering method.
1555 |
1556 | 
1557 |
1558 | Resources:
1559 | https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
1560 |
1561 | ### Hierarchical Clustering ###
1562 |
**Hierarchical clustering** is a method of cluster analysis which seeks to build a hierarchy of clusters, resulting in a **dendrogram**. There are two main types of hierarchical clustering: agglomerative (a bottom-up approach) and divisive (a top-down approach). The main drawback to this method is that it asserts that there is a hierarchy within the data set. For instance, there could be men and women in a data set as well as different nationalities, where none of these traits are truly hierarchical. This is also the slowest of the clustering approaches listed above by a large margin.
1564 |
In this method, you decide two things: your **linkage**, or how you measure the dissimilarity between clusters, and your distance metric. The four linkage methods:
1566 |
1567 | * `Complete`: This is the maximal intercluster dissimilarity. You calculate all the dissimilarities between the observations in cluster A and cluster B and use the largest of these.
1568 | * `Single`: This is minimal intercluster dissimilarity, or the opposite of the above.
1569 | * `Average`: This is the mean intercluster dissimilarity. Compute all pairwise dissimilarities between the two clusters and record the average of these.
1570 | * `Centroid`: This is the dissimilarity between the centroids of clusters A and B, which can result in undesirable inversions.
1571 |
1572 | Take for instance grouping customers by online purchases. If one customer buys many pairs of socks and one computer, the computer will not show up as significant when in reality it could be the most important purchase. This can be corrected by subtracting the mean and scaling by the standard deviation (see pg. 399 of ISLR).
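
A minimal hierarchical-clustering sketch with scipy, scaling first and then building an agglomerative tree with complete linkage (the blobs are synthetic stand-in data):

    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
    from sklearn.datasets import make_blobs
    from sklearn.preprocessing import StandardScaler

    X, _ = make_blobs(n_samples=30, centers=3, random_state=0)
    X = StandardScaler().fit_transform(X)                  # subtract the mean, scale by the std

    Z = linkage(X, method='complete', metric='euclidean')  # agglomerative, complete linkage
    labels = fcluster(Z, t=3, criterion='maxclust')        # cut the dendrogram into 3 clusters
    print(labels)
    # dendrogram(Z) would draw the tree with matplotlib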
1573 |
1574 | Resources:
1575 | * [Maximally Informative Hierarchical Representations of High-Dimensional Data](http://www.jmlr.org/proceedings/papers/v38/versteeg15.pdf)
1576 |
1577 | ### Dimension Reduction ###
1578 |
1579 | So far, we've used the following techniques for reducing dimensionality:
1580 |
1581 | * `Lasso`: adds a regularization parameter that penalizes us for large betas such that many betas are reduced to zero
* `Stepwise selection`: in backwards stepwise, we put all of our features into the model and then remove features one at a time, checking whether the fit (e.g. adjusted R**2 or the p-values) degrades
1583 | * `Relaxed Lasso`: This is a combination of lasso and stepwise where you run a regression and then lasso, removing the variables that were reduced with lasso and then re-running the regression without them
1584 |
1585 | Reducing dimension allows us to visualize data in 2D, remove redundant and/or correlated features, remove features that we don't know what to do with, and remove features for storage needs. Take a small image for instance. A 28x28 pixel image is 784 dimensions, making it challenging to process.
1586 |
### Principal Component Analysis ###
1588 |
**Principal component analysis (PCA)** uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The general intuition is that we rotate the data set in the direction of the most variation. It is good for visualization (since it can condense many features into two), as a pre-processing step before doing kNN on high-dimensional data, and for working with images when you can't use neural nets. Otherwise it is not as widely used since there are often better ways to deal with dimension reduction.
1590 |
1591 | More specifically, we do the following:
1592 |
1593 | * Create a centered design matrix of n rows and p features (centered on its mean)
* Calculate the covariance matrix: a pxp square matrix with the variances on the diagonal and the covariances elsewhere. *Note that if features aren't well correlated (roughly .3 or above) then PCA won't do well*
* The principal components are the eigenvectors of the covariance matrix, ordered on an orthogonal basis capturing the most-to-least variance of the data
1596 |
A **scree plot** shows the eigenvalues in non-increasing order, with lambda on the y axis and the principal components on the x axis. Looking for the elbow tells us how many components we need to capture a given amount of the variance.
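
A minimal PCA sketch with scikit-learn; `explained_variance_ratio_` gives the values you would plot on a scree plot (the breast cancer dataset is just a stand-in):

    from sklearn.datasets import load_breast_cancer
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X, _ = load_breast_cancer(return_X_y=True)
    X = StandardScaler().fit_transform(X)      # center (and scale) the design matrix

    pca = PCA(n_components=10).fit(X)
    print(pca.explained_variance_ratio_)       # scree plot values: variance captured per component
    X_reduced = pca.transform(X)               # data projected onto the principal components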
1598 |
1599 | PCA weakness: https://gist.github.com/lemonlaug/976543b650e53db24ab2
1600 |
1601 | ### Singular Value Decomposition ###
1602 |
**Singular value decomposition (SVD)** is a factorization of a real or complex matrix. We can use SVD to determine what we call **latent features**. Every matrix X has such a decomposition, and SVD returns three matrices:
1604 |
1605 | * U: mxk in shape - representing your observations
1606 | * sigma: kxk in shape - eigenvalues of the concept space (your latent features). This is where your feature importance comes from
* V transpose: k x n in shape - representing your variables. This gives you a notion of the features from your original dataset.
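
A minimal SVD sketch with numpy showing the three factors (the small ratings-style matrix here is hypothetical):

    import numpy as np

    X = np.array([[5, 4, 0, 1],
                  [4, 5, 0, 0],
                  [0, 1, 5, 4],
                  [1, 0, 4, 5]], dtype=float)   # hypothetical m x n matrix

    U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
    print(U.shape, sigma.shape, Vt.shape)        # (m, k), (k,), (k, n)
    print(sigma)                                 # singular values: importance of each latent feature

    # keep only the top 2 latent features for a low-rank approximation
    k = 2
    X_approx = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]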
1608 |
1609 | ### Non-Negative Matrix Factorization ###
1610 |
1611 | **Non-Negative Matrix Factorization** gives a similar result to SVD where we factorize our matrix into two separate matrices but with a constraint that the values in our two matrices (we get two, not three in NMF) be non-negative. In SVD, we could have a negative value for one of our themes (if we’re using SVD for themes). We care about NMF because it has different properties than SVDs. In NMF, we're going to identify latent features but a given observation can either express it or not--it can't negatively express it like SVD. One downside is that NMF can't take new data without rerunning OLS on it. Use NMF over PCA whenever possible.
1612 |
1613 | Let's start with the example of baking a cake. Can we figure out the protein/fat/carbs in a cake based on its ingredients?
1614 |
1615 | Our response comes in form of X = WH. Using an example of NLP, here's how we would interpret these matrices:
1616 |
1617 | * X is mxn: documents by word counts
* W is mxk: documents by latent features. Each coefficient k links up with H. Each row m shows how much that document expressed each of the latent features k
1619 | * H is kxn: Each column is a column n that is a latent feature that matches using k to W. This could be themes of texts.
1620 |
1621 | Here is the process:
1622 |
1623 | 1. Initialize W and H using a given value (this can be done with random values). Kmeans can be used for a strategic initialization
1624 | 2. Solve for H holding W constant (minimize OLS) and clip negative values by rounding them to zero
1625 | 3. Solve for W holding H constant and clip negative values
1626 | 4. Repeat 2 and 3 until they converge
1627 |
**Alternating least squares (ALS)** allows us to find the ingredients and proportions given the final product. ALS is biconvex since it alternates between solving for the two matrices. You still have to define your **k value**, which limits the columns of W and the rows of H. You can plot the reconstruction error to evaluate k at different levels.
1629 |
1630 | You can use this strategy for a number of things such as theme analysis by having movies as your columns and users as your rows with the intersect being their rating. You can also have pixel values flattened on the columns and the image as the row. Part of the original motivation behind NMF was wanting to know the parts that contribute to an image where PCA would give a result but you couldn't see recognizable features in the result. This is good for textual analysis too because it can't have anti-values (e.g. this text is negative eggplant).
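
A minimal NMF sketch with scikit-learn on a small hypothetical document-by-word-count matrix; W is documents-by-latent-features and H is latent-features-by-words:

    import numpy as np
    from sklearn.decomposition import NMF

    # hypothetical document x word-count matrix X (m documents, n words)
    X = np.array([[2, 3, 0, 0],
                  [1, 4, 0, 1],
                  [0, 0, 5, 3],
                  [0, 1, 4, 4]], dtype=float)

    nmf = NMF(n_components=2, init='nndsvd', max_iter=500)
    W = nmf.fit_transform(X)          # how much each document expresses each latent feature
    H = nmf.components_               # how much each latent feature loads on each word
    print(nmf.reconstruction_err_)    # plot this against n_components to choose k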
1631 |
1632 |
1633 | ---
1634 | ## Data Engineering
1635 | ---
1636 |
1637 | ### Parallelization ###
1638 |
CPython, the C implementation of Python and the most common installation, uses a **Global Interpreter Lock (GIL)**, a mechanism used in interpreters to synchronize the execution of threads so that only one native thread can execute Python bytecode at a time. Without it, a global variable accessed by multiple threads could be updated incorrectly. This is a substantial limitation for Python. Given this limitation, there are three options for processes:
1640 |
1641 | 1. `Serial`: processes happen in serial succession, preventing conflicts between threads.
2. `Multi-threaded`: operating out of shared memory, multi-threaded tasks can be pictured as sub-tasks of a single process farmed out to different threads. Because of the GIL, threads won't speed up CPU-bound Python code (and even without the GIL you can only usefully run as many threads as you have cores). If you're I/O bound, threading can help you run many things at once (e.g. webscraping if you're not rate limited)
1643 | 3. `Multi-processing`: operating out of distributed memory, this submits multiple processes to completely separate memory locations meaning each process will run completely independently from the next. This has the overhead of communicating between multiple processes. If you're CPU-bound, this will significantly increase your speed. It isn't helpful for I/O bound problems.
1644 |
1645 | A simple example of threading:
1646 |
    import threading

    def worker(num):
        """Thread worker function"""
        print('Worker: %s' % num)

    threads = []
    for i in range(5):
        t = threading.Thread(target=worker, args=(i,))
        threads.append(t)
        t.start()
1657 |
1658 | A simple example of multiprocessing:
1659 |
    import multiprocessing as mp

    def worker(num):
        print('Worker:', num)

    # (wrap in `if __name__ == '__main__':` when using the spawn start method)
    processes = [mp.Process(target=worker, args=(i,)) for i in range(4)]  # define processes
    for p in processes:
        p.start()   # start each process
    for p in processes:
        p.join()    # wait for them to finish
1666 |
A simpler way than maintaining a list of processes is to use the `Pool` class:
1668 |
    def square(x):
        return x * x

    pool = mp.Pool(processes=4)
    results = pool.map(square, range(1, 7))                                     # parallel map over the range
    single = pool.apply(square, (3,))                                           # apply once (blocking)
    async_results = [pool.apply_async(square, args=(x,)) for x in range(1, 7)]  # asynchronous application
    print(results, single, [r.get() for r in async_results])                    # .get() blocks until each result is ready
1674 |
1675 |
“Premature optimization is the root of all evil” - Donald Knuth
1677 |
1678 | Resources:
1679 | * [Multiprocessing in Python](https://www.youtube.com/watch?v=X2mO1O5Nuwg)
1680 | * [Intro to Multiprocessing](http://sebastianraschka.com/Articles/2014_multiprocessing.html)
1681 | * [Threading](https://pymotw.com/2/threading/)
1682 |
1683 | ---
1684 |
1685 | ### Apache Hadoop ###
1686 |
Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. It is the open-source implementation of the MapReduce paradigm, which came from a paper published by Google and was developed at Yahoo. The core of Apache Hadoop consists of a storage system, known as the **Hadoop Distributed File System (HDFS)**, and a processing paradigm called **MapReduce**. This framework addresses both the storage and processing problems. **MRJob** is a Python package, originally from Yelp, that allows you to write and run MapReduce jobs against Hadoop.
1688 |
Hadoop defaults to splitting data into 64MB blocks replicated across 3 machines at random. The NameNode knows how to put all of these blocks back together. While this used to be a single point of failure, there is now at least one secondary NameNode that can be used in case the main node fails.
1690 |
1691 | MapReduce is commonly summarized by the phrase *bring the code to the data.* MapReduce has four stages:
1692 |
1. `Mapping`: Maps a function on given nodes
2. `Combining`: Combines results on those given nodes
3. `Shuffle/sort`: Moves results between nodes
4. `Reducing`: Reduces to the final result
1697 |
1698 | Use `hadoop fs -ls` to see your data. You can run jobs with `hs mapper_script.py reducer_script.py destination_directory`
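
A minimal MRJob sketch of the classic word count (the file names are hypothetical; you could run it locally with something like `python word_count.py input.txt`, or against a cluster with the `-r hadoop` runner):

    from mrjob.job import MRJob

    class MRWordCount(MRJob):

        def mapper(self, _, line):
            for word in line.split():        # map: emit (word, 1) for every word on the line
                yield word.lower(), 1

        def combiner(self, word, counts):    # combine: partial sums on each node
            yield word, sum(counts)

        def reducer(self, word, counts):     # reduce: total count per word
            yield word, sum(counts)

    if __name__ == '__main__':
        MRWordCount.run()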
1699 |
1700 | ---
1701 |
1702 | ### Apache Spark ###
1703 |
1704 | Apache Spark is a cluster computing platform designed to be fast and general-purpose. Spark extends the popular MapReduce model to efficiently support more types of computations, including interactive queries and stream processing. Spark is designed to cover a wide range of workloads that previously required separate distributed systems, including batch applications, iterative algorithms, interactive queries, and streaming. It powers multiple higher-level components specialized for various workloads, such as SQL or machine learning. Spark can sit on top of hadoop. Spark is significantly faster than hadoop because it is saving computation in memory rather than to disk and thanks to its understanding of directed, acyclic tasks.
1705 |
1706 | `pyspark` allows for interaction with Spark within python. The Spark stack is as follows:
1707 |
* `core`: Spark Core is home to the API that defines resilient distributed datasets (RDDs) and DataFrames
1709 | * `Spark SQL`: a package for working with structured data. It puts a schema on RDDs, enabling you to use SQL and DataFrame syntax.
1710 | * `Streaming`: enables the processing of live streams of data
1711 | * `MLlib`: common machine learning functionality designed to scale
1712 | * `GraphX`: library for manipulating graphs
1713 | * `Cluster managers`: Spark can run over a variety of cluster managers such as Hadoop YARN and Apache Mesos
1714 |
1715 | **Resilient distributed datasets (RDDs)** are Spark’s main programming abstraction. RDDs represent a collection of items distributed across many compute nodes that can be manipulated in parallel. RDDs are immutable. Spark Core provides many APIs for building and manipulating these collections. Use RDDs when you need low-level transformations and actions on your data and are working with unstructured or schema-less data. Think of RDDs like a set of instructions rather than stores of data.
1716 |
1717 | There are two types of operations on your data. A **transformation** like `filter` or `map` return a new RDD. Its evaluation is **lazy** so it does not take place immediately. Rather, an **action** triggers the execution of your transformations. This includes operations such as `take` or `collect`. If you do not persist your data using `cache` or `persist` (an alias for `cache` with memory only), you will have to recalculate your RDD entirely for each new action.
1718 |
1719 | Here's some starter code. Creating RDDs:
1720 |
1721 | rDD = sc.parallelize(data, 4) # Creates an RDD from data with 4 partitions.
    rDD2 = sc.textFile('readme.md', 4) # creates an RDD from a text file. Can also use s3, hadoop hdfs, etc.
1723 |
1724 | Some transformations:
1725 |
1726 | rDD.map(lambda x: x*2) # maps a function to the elements
1727 | rDD.flatMap(lambda x: [x, x+5]) # computes the same as map but flattens the result
    rDD.flatMapValues(f) # passes each value in a key-value pair through f, which returns an iterable (often used for tokenization)
1729 | rDD.filter(lambda x: x % 2 == 0) # returns the element if it evaluates to True
1730 | rDD.distinct() # returns unique elements--this is very expensive
1731 | rDD.sample() # samples with or without replacement
    rDD.reduceByKey() # combine values with the same key. This sends less data over the network than groupByKey()
1733 | rDD.groupByKey() # be careful since this can move a lot of data
1734 | rDD.combineByKey() # combine values with the same key
1735 | rDD.sortByKey()
1736 | rDD.keys() # returns just the keys
1737 | rDD.values() # returns just the values
1738 |
1739 | Set operations (transformations):
1740 |
1741 | rDD.union(rdd2)
1742 | rDD.intersection(rdd2)
1743 | rDD.subtract(rdd2)
    rDD.subtractByKey(rdd2)
1745 | rDD.cartesian(rdd2)
1746 |
1747 | Joins (transformations):
1748 |
1749 | rDD.join(rDD2)
1750 | rDD.rightOuterJoin(rDD2)
1751 | rDD.leftOuterJoin(rDD2)
1752 | rDD.cogroup(rDD2) # group data from both RDD's sharing the same value
1753 |
1754 | Aggregations (transformations):
1755 |
1756 | rDD.fold() / rDD.foldByKey()
1757 | rDD.aggregate()
1758 | rDD.reduce() / rDD.reduceByKey() # using `ByKey` runs parallel operations for each key and a built-in combiner step
1759 |
1760 | Some actions:
1761 |
    rDD.reduce(lambda a, b: a * b) # aggregates the dataset by taking two elements and returning one (here, the product of all elements). The function must be commutative and associative
1763 | rDD.take(n) # returns first n results
1764 | rDD.top(n) # returns top n results
1765 | rDD.collect() # returns all elements (be careful with this)
1766 | rDD.count()
1767 | rDD.countByValue()
1768 | rDD.takeOrdered(n, key=func) # returns n elements based on provided ordering
1769 |
1770 |
1771 | The other main abstraction is a Spark **DataFrame**, which offers speed-ups over RDDs. A **Dataset** is a distributed collection of data. A DataFrame is a Dataset organized into named columns. There are some optimizations that are built into DataFrames that are not built into RDDs. Spark is moving more towards DataFrames than RDDs. This will get you started with DataFrames:
1772 |
1773 | df = spark.read.json("examples/src/main/resources/people.json")
1774 | df.show()
1775 | df.printSchema()
1776 | df.select(df['name'], df['age'] + 1).show()
1777 | df.filter(df['age'] > 21).show()
1778 | df.groupBy("age").count().show()
1779 |
A **wide transformation** shuffles data across nodes while a **narrow transformation** keeps the work on a given node. Using `reduceByKey` is comparable to a shuffle-sort where you reduce on a node before sending data across nodes. When there is data loss, Spark will analyze the **Directed Acyclic Graph (DAG)** to trace the analysis back through its dependencies and see where a given partition was lost. The DAG also allows Spark to run operations in parallel.
1781 |
**Caching** is an important tool in Spark: you can speed up your analysis by reading data out of memory instead of off disk. **Broadcast variables** allow peer-to-peer transfer to speed up moving data across nodes. Note that a **BroadcastHashJoin** is a way to join tables when one table is very small; it is more efficient than a **ShuffleHashJoin**. *Most latency issues can find their roots in shuffling too much data across the network.*
1783 |
**Spark Streaming** uses **DStreams, or discretized streams**, for stream processing. A new abstraction here is a **receiver** that receives data from any source that can stream data, like Kinesis, Kafka, or Twitter, and creates a DStream. Basic sources for stream processing include file systems, socket connections, and Akka. More advanced sources are Kafka, Flume, Kinesis, Twitter, etc. A reliable receiver will send an acknowledgement of data receipt whereas an unreliable one will not. *On DStreams, you can conduct transformations (both stateless and stateful) and output operations.* The former are lazily evaluated and the latter produce output. **Window transformations** operate on (e.g. count) the elements within a given time window.
1785 |
1786 | **Structured Streaming** aims to relieve the user from having to reason about streaming directly. Among other things, it allows you to combine static and streaming datasets. As of Spark 2.1 it is still in alpha.
1787 |
1788 | Tools like **Ganglia** help monitor EMR activity. Apache also has a tool called **Oozie** for queuing Spark jobs, though **Airflow** is generally preferred over Oozie. AWS also provides a queuing solution called **Data Pipeline**.
1789 |
1790 | References
1791 | * [RDD Transformations](http://spark.apache.org/docs/0.7.3/api/pyspark/pyspark.rdd.RDD-class.html)
1792 | * [Everyday I'm Shuffling (Tips for writing better Spark jobs)](https://www.youtube.com/watch?v=Wg2boMqLjCg)
1793 | * [Spark Streaming Videos from Datastax](https://academy.datastax.com/resources/getting-started-apache-spark?unit=spark-streaming-discretized-stream)
1794 | * [Databricks on Structured Streaming](https://www.slideshare.net/databricks/a-deep-dive-into-structured-streaming)
1795 |
1796 | ---
1797 |
1798 | ## Special Topics ##
1799 |
1800 | ---
1801 |
1802 | ### Natural Language Processing ###
1803 |
1804 | **Natural language processing (NLP)** is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. Computational linguistics (now NLP) grew up side by side with linguistics.
1805 |
1806 | A collection of documents is a **corpus**. Certain **parallel corpora**, like the Canadian parliamentary proceedings, have helped the field advance in languages other than English. A **document** is a collection of **tokens**. A token is difficult to define: it could be a word, words minus articles, particular versions of a word, **n-grams** (such as a **unigram**, **bigram**, or **trigram**), etc. A **stop word** is a word that provides little insight, such as 'the' or 'and'. How can we effectively tag different parts of speech? How can we segment a sentence?
1807 |
1808 | Computation allowed the field to go from a rule-based view to a probabilistic one. We often **normalize** by removing stop words and making all words lower case. We can also **stem** words by crudely chopping them down to a stem (e.g. 'rationalize' to 'ration') and **lemmatize** them by mapping them to their dictionary root (e.g. 'better' to 'good').
1809 |
1810 | A document represented as a vector of word counts is called a **bag of words.** The problem is that bags of words are naive: raw counts over-emphasize words from longer documents, and every word gets equal footing.
1811 |
1812 | The text featurization pipeline is as follows:
1813 |
1814 | 1. Tokenization
1815 | 2. Lower case conversion
1816 | 3. Stop words removal
1817 | 4. Stemming/Lemmatization
1818 | 5. Bag of words/N-grams
1819 |
1820 | **Term Frequency Inverse Document Frequency (TF-IDF)** builds on term frequency to identify which words matter, offering improvements over a plain bag of words. We look at the number of occurrences of a term t in a document, and also at how common the term is in general (the number of documents containing t over the total number of documents). *The intuition is that we normalize term frequency by document frequency.*
1821 |
1822 | Term Frequency:
1823 |
1824 | tf(t,d) = total count of term t in document d / total count of all terms in document d
1825 |
1826 | Inverse document frequency:
1827 |
1828 | idf(t,D) = log(total count of documents in corpus D / count of documents containing term t)
1829 |
1830 | Term frequency, inverse document frequency:
1831 |
1832 | tf-idf(t, d, D) = tf(t, d) * idf(t, D)
1833 |
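The formulas above can be computed directly. Here is a hand-rolled sketch on a made-up toy corpus (in practice a library vectorizer such as sklearn's `TfidfVectorizer` adds smoothing and normalization on top of this):

    import math

    docs = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"], ["a", "dog", "barked"]]

    def tf(term, doc):
        # count of term in the document over the total count of terms in the document
        return doc.count(term) / len(doc)

    def idf(term, corpus):
        # log of (documents in the corpus / documents containing the term)
        n_containing = sum(1 for d in corpus if term in d)
        return math.log(len(corpus) / n_containing)

    def tf_idf(term, doc, corpus):
        return tf(term, doc) * idf(term, corpus)

    print(tf_idf("sat", docs[1], docs))   # frequent in this doc, but appears in 2 of 3 docs
    print(tf_idf("dog", docs[1], docs))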
1834 | For Bayes' theorem, the posterior is the prior times the likelihood over the evidence. How do we turn this theorem into a classifier? For each additional feature we're looking at the probability of that feature given all previous features. We make a naive assumption that the features are independent variables. This is technically wrong, but it works out quite well in practice. The score for a class becomes the prior P(c) multiplied by the probability of each feature given that class, P(F1 | c). We now have the pseudo-probability that the result is spam, for instance (pseudo because we don't have the denominator). The denominator stays the same for all classes, so rather than calculating it (especially since it's not clear how to calculate it), you ignore it and compare numerators to find the most likely class. Using **Naive Bayes** like this is an effective multi-class classifier.
1835 |
1836 | In this approach, we take the log transformation of our probabilities since we could otherwise have a **floating point underflow problem** where the numbers are so small the computer treats them as zero. Since each probability is less than 1, we switch everything into log space and add the terms (adding logs is the same as multiplying in non-log space).
1837 |
1838 | Also note that **Laplace smoothing** (additive smoothing) is necessary: you act as if you've seen every feature in every class one more time than you actually have. This smoothing constant keeps your model from assigning zero probability to unseen words.
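To make the last few paragraphs concrete, here is a minimal sketch of a log-space Naive Bayes score with add-one smoothing on a toy spam/ham corpus; the documents and the alpha value are made up for illustration.

    import math
    from collections import Counter

    spam = ["win money now".split(), "win a prize".split()]
    ham = [["meeting", "at", "noon"], ["lunch", "tomorrow"]]
    vocab = {w for doc in spam + ham for w in doc}
    alpha = 1  # Laplace ("add-one") smoothing constant

    def log_score(doc, class_docs):
        counts = Counter(w for d in class_docs for w in d)
        total = sum(counts.values())
        score = math.log(len(class_docs) / (len(spam) + len(ham)))   # log prior P(c)
        for w in doc:
            # log P(word | class) with add-one smoothing; unseen words get a small, nonzero probability
            score += math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
        return score

    test = "win a meeting".split()
    print("spam" if log_score(test, spam) > log_score(test, ham) else "ham")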
1839 |
1840 | Naive Bayes is great for wide data (p > n), is fast to train, is good at online learning/streaming data, is simple to implement, and handles multi-class problems. The cons are that correlated features aren't handled well and it is sometimes outperformed by other models (especially with medium-sized data).
1841 |
1842 | Here are a few helpful tools:
1843 |
1844 | from unidecode import unidecode  # for dealing with unicode issues
1845 | from collections import Counter  # Counter() makes a feature dictionary with words as keys and counts as values
1846 | import string                    # useful for punctuation removal
1847 | tok = ''.join([c for c in tok if c not in string.punctuation])
1848 |
1849 | *Speech and Language Processing* by Jurafsky and Martin
1850 | http://blog.christianperone.com/2011/09/machine-learning-text-feature-extraction-tf-idf-part-i/
1851 | http://blog.christianperone.com/2011/10/machine-learning-text-feature-extraction-tf-idf-part-ii/
1852 | http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/
1853 |
1854 | ### Time Series ###
1855 |
1856 | Time series data is a sequence of observations of some quantity collected over time, not a number of discrete events. A time series is a balance of two forces: **momentum** is the idea that things will follow the trend they are on, while **mean reversion** is the idea that things will regress to the mean. *The essence of forecasting is balancing these two forces.* The general rule of thumb is to have 3-4 times the history of the horizon you're predicting. There's no explicit penalty for too much history, although AIC could penalize you on an ARIMA model.
1857 |
1858 | You could model time series with polynomials; however, the higher the degree, the more they shoot off to infinity at the end of your dataset. This captures momentum but not mean reversion. You can also use some ML techniques, especially to win Kaggle competitions, though this is still somewhat unexplored. Clustering time periods in new ways, along with other ML techniques, is a path forward for time series.
1859 |
1860 | We're interested in y(t+1) (the next time period), y(t+2), ..., y(t+h), where *h* is our time horizon. We assume discrete time sampling at regular intervals. This is hard to model because we only observe one realization of the path of the process.
1861 |
1862 | The components of a time series are:
1863 |
1864 | 1. `Trend`: the longterm structure (could be linear or exponential)
1865 | 2. `Seasonality`: something that repeats at regular intervals. This could be seasons, months, weeks, days, etc. This is anything that demonstrates relatively fixed periodicity between events
1866 | 3. `Cyclic`: fluctuations over multiple steps that are not periodic (i.e. not at a specified seasonality)
1867 | 4. `Shock`: a sudden change of state
1868 | 5. `Noise`: everything not in the above is viewed to be white noise without a discernible structure
1869 |
1870 | 
1871 |
1872 | When we remove the trend, seasonality, and cyclic effects, we should be left with white noise. It is possible to see a growing amplitude as time advances; this can be handled by modeling it as a multiplicative effect.
1873 |
1874 | #### ARIMA Models ####
1875 |
1876 | **An autoregressive integrated moving average (ARIMA)** model is a generalization of an **autoregressive moving average (ARMA)** model. Both of these models are fitted to time series data either to better understand the data or to predict future points in the series (forecasting). There are three parts to ARIMA:
1877 |
1878 | 1. `AR` = p. The autoregressive part: the current value is modeled as a linear combination of the previous p values plus an error term (the error term has an expectation of 0)
1879 | 2. `I` = d. This is your integration, or order of differencing; ARMA is forecasting without this element. To difference, you subtract the previous value from each y value. You rarely need to difference more than twice (compute with np.diff(y, n=d))
1880 | 3. `MA` = q. The moving-average part: the current value is modeled on the last q forecast errors. This helps remove seasonality and decreases variance (a short fitting sketch follows this list)
1881 |
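The fitting sketch referenced above, assuming the `statsmodels` package is available; the toy series and the (1, 1, 1) order are purely illustrative:

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    # Toy series: a linear trend plus noise; order=(p, d, q) mirrors the AR / I / MA parts above.
    y = pd.Series(0.5 * np.arange(100) + np.random.normal(scale=2, size=100))
    print(np.diff(y, n=1)[:5])            # manual first difference, as described in the I step
    fit = ARIMA(y, order=(1, 1, 1)).fit()
    print(fit.summary())
    print(fit.forecast(steps=5))          # forecast the next h = 5 periods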
1882 | This reverse-engineers the time series. Variance captures the up-and-down movement. **Autocovariance** measures how much a value at one lag co-varies with the value at another lag; its normalized form, autocorrelation, is a dimensionless measure of the influence of one lag upon another.
1883 |
1884 | For EDA, you want to plot the time series, the ACF, and the PACF. You want hypotheses, which include seasonality. You then want to aggregate so that you have an appropriate grain (balance granularity with generality so you can see your signal).
1885 |
1886 | 
1887 |
1888 | Estimating ARIMA models using Box-Jenkins: here's the main Box-Jenkins methodology:
1889 |
1890 | 1. Exploratory data analysis (EDA):
1891 | ** plot time series, ACF, PACF
1892 | ** identify hypotheses, models, and data issues
1893 | ** aggregate to an appropriate grain
1894 | 2. Fit model(s)
1895 | ** Difference until stationary (possibly at different seasonalities!)
1896 | ** Test for a unit root (Augmented Dickey-Fuller (ADF)): if found, it is evidence the data still has a trend
1897 | ** However: too much differencing causes other problems
1898 | ** Transform until variance is stable
1899 | 3. Examine residuals: are they white noise?
1900 | 4. Test and evaluate on out of sample data
1901 | 5. Worry about:
1902 | ** structural breaks
1903 | ** forecasting for large h with limited data needs a "panel of experts"
1904 | ** seasonality, periodicity
1905 |
1906 | An Autocorrelation Function (ACF) is a function of lag, or a difference in time. The autocorrelation starts at 1 because everything is self-correlated, and we then put a confidence interval over it. You want to address this seasonality in your model. A Partial Autocorrelation Function (PACF) looks at what wasn't picked up by the ACF. The ACF goes with MA; the PACF goes with AR. *In a certain sense, all we're doing is taking care of the ACF and PACF until we just see noise, then we run it through ARIMA.* You deal with trends first through linear models, clustering on segments, or whatever other method. Then you deal with periodicity using the ACF and PACF. You then deal with cycles until you see only noise.
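A quick diagnostic sketch, assuming `statsmodels` and `matplotlib` are available; the random-walk-style series is a stand-in for real data:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

    y = pd.Series(np.random.normal(size=200)).cumsum()  # toy random-walk series
    plot_acf(y, lags=40)    # slow decay suggests differencing; spikes hint at the MA order q
    plot_pacf(y, lags=40)   # a sharp cutoff after lag p hints at the AR order p
    plt.show()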
1907 |
1908 | You can predict against other forecasts. For instance, a Fed prediction is a benchmark to which the market reacts, so you can use it as a predictor.
1909 |
1910 |
1911 | #### Exponential Smoothing ####
1912 |
1913 | **Exponential Smoothing (ETS)** is the Bayesian analogue to ARIMA, developed in the late 50's. The basic idea is that you weight more recent data more heavily than older data. Forecasts using this method are weighted averages of past observations, with the weights decaying exponentially as the observations get older. This is different from a moving average, for instance, that weights past values equally. With ETS, you're essentially modeling the state and then building a model that moves it forward. The 'state space model' is a black box that takes in data and spits out responses.
1914 |
1915 | **Alpha** is your smoothing parameter that determines how much you weight recent information, where 0 <= α <= 1. Below, you can see how a smaller alpha relates to a delayed incorporation of the real data. The sum of the weights for any given α will be approximately 1. The smaller the α, the more weight is placed on more distant past observations.
1916 |
1917 | 
1918 |
1919 | *This is the bias/variance tradeoff for ETS.* An alpha of 0.2 takes longer to adjust because each new observation only receives a weight of 0.2.
1920 |
1921 | The simplest ETS method is called **simple exponential smoothing (SES)**. This is suitable for forecasting data with no trend or seasonal pattern. By contrast we can talk about two methods: a naive one and an average one. The former would predict the last value for the next, essentially weighing the last value above all others. The average method would predict the average of all values. SES is a combination of these two approaches where weights decrease exponentially as observations are more distant. Here is the basic formula:
1922 |
1923 | ŷ(T+1|T) = α*y(T) + α(1−α)*y(T−1) + α(1−α)^2*y(T−2) + ...
1924 |
1925 | * `α`: smoothing parameter
1926 | * `ℓ`: level (or the smoothed value) of the series at time t
1927 | * `b`: trend
1928 | * `β*`: smoothing parameter for the trend
1929 | * `ϕ`: dampening parameter
1930 | * `s`: seasonality
1931 |
1932 | The unknown parameters and initial values for any exponential smoothing method can be estimated by minimizing the SSE. For SES, ℓ and α should be set this way.
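A minimal SES sketch, assuming `statsmodels`; the series and the fixed alpha of 0.2 are illustrative:

    import pandas as pd
    from statsmodels.tsa.holtwinters import SimpleExpSmoothing

    y = pd.Series([30.0, 32, 31, 35, 34, 38, 36, 40])
    fixed = SimpleExpSmoothing(y).fit(smoothing_level=0.2, optimized=False)  # hold alpha at 0.2
    auto = SimpleExpSmoothing(y).fit()          # let the fit choose alpha by minimizing the SSE
    print(fixed.forecast(3))                    # next three smoothed forecasts
    print(auto.params["smoothing_level"])       # the alpha the optimizer chose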
1933 |
1934 | `Holt's linear trend` method extends SES to allow forecasting data with a trend. Two separate smoothing equations (one for the level and one for the trend/slope) are included in this approach. A variation called the `exponential trend` method allows the level and the slope to be multiplied rather than added.
1935 |
1936 | Empirically these two methods (linear and exponential trends) tend to over-estimate as they hold the trend indefinitely into the future. To compensate for this, the `damped trend` method dampens the trend to a flat line in the distant future. This is a very popular method, especially when trends are projected far into the future. ϕ dampens the trend such that it approaches a constant in the future. There is also a `multiplicative damped trend` variation.
1937 |
1938 |
1939 |
1940 | ARIMA features & benefits:
1941 |
1942 | * Benchmark model for almost a century
1943 | * Much easier to estimate with modern computational resources
1944 | * Easy to diagnose models graphically
1945 | * Easy to fit using Box-Jenkins methodology
1946 |
1947 | ETS features & benefits:
1948 |
1949 | * Can handle non-linear and non-stationary processes
1950 | * Can be computed with limited computational resources
1951 | * Not always a subset of ARIMA
1952 | * Easier to explain to non-technical stakeholders
1953 |
1954 | We have not covered Fourier series. There are also other ways to combine models, like the Kalman filter.
1955 |
1956 |
1957 |
1958 | Two separate, but not mutually exclusive, approaches to time series exist: the time domain approach and the frequency domain approach.
1959 |
1960 | A couple of helpful references, in order of difficulty:
1961 | * Hyndman & Athanasopoulos: *Forecasting: Principles and Practice* (free online, last published 2012)
1962 | * Shumway & Stoffer: *Time Series Analysis and Its Applications: With R Examples* (4th ed. due out later in 2016).
1963 |   Enough material for a whole-year masters course; the first chapters are included in our readings
1964 | * Box & Jenkins: *Time Series Analysis: Forecasting and Control* (2008).
1965 |   A recently updated edition by the originators of the famous methodology. It also includes an older focus on process control, still very important even as recent texts are more likely to emphasize finance
1966 | * Hamilton: *Time Series Analysis* (1994). Older and huge, but a classic
1967 |
1968 | Specialization in economics/finance:
1969 | * Tsay: *Analysis of Financial Time Series* (2010).
1970 |   Advanced, but the next entry is a more introductory text by the same author (with a greater focus on R):
1971 | * Tsay: *Introduction to Analysis of Financial Data* (2012)
1972 | * Enders: *Applied Econometric Time Series*
1973 | * Elliott & Timmermann: *Economic Forecasting*
1974 |
1975 | https://www.analyticsvidhya.com/blog/2015/12/complete-tutorial-time-series-modeling/
1976 |
1977 | ### Web-Scraping ###
1978 |
1979 | The general rule of thumb for web-scraping is that if you can see it, you can scrape it. **HTTP** is the hypertext transfer protocol, the protocol for transferring documents across the internet. There are other protocols like git, smtp (for email), ftp, etc. HTTP itself can't do much: it's a stateless, dumb protocol, and it's hard for a site to know the specifics of your traffic. There are four main HTTP actions (verbs), with little preconceived notion of how they must work:
1980 |
1981 | * `GET`: think of it as to read. This is most of what you do
1982 | * `PUT`: think of it like an edit, such as editing a picture
1983 | * `POST`: think of this like creating new content, like adding a new photo to instagram
1984 | * `DELETE`: deletes content; be careful with this one
1985 |
1986 | The workflow of web-scraping is normally as follows:
1987 |
1988 | 1. `Scraping`
1989 | 2. `Storage`: normally with MongoDB (it's reasonable to write all the HTML code to your hard drive for small projects). Always store everything: don't do pre-parsing
1990 | 3. `Parsing`
1991 | 4. `Storage`: moving parsed content into pickle/CSV/SQL/etc
1992 | 5. `Prediction`
1993 |
1994 | **CSS** (a cascading style sheet) is a separate document that styles a webpage; there are often several of them linked at the top of the HTML page. CSS enables the separation of document content from presentation, and it is "cascading" because the most specific rule wins. CSS selectors are useful for scraping because the same rules that style the areas we're looking for let us target and pull them from the page.
1995 |
1996 | *requests* is the best option for scraping a site. *urllib* can be used as a backup if needed (it also could be necessary for local sites). *BeautifulSoup* is good for HTML parsing.
1997 |
1998 | import requests
1999 | from IPython.core.display import HTML  # in a notebook
2000 | 
2001 | z = requests.get('http://galvanize.com')
2002 | z.content       # the raw content of the response
2003 | HTML(z.content) # renders the page in the notebook
2004 |
2005 | HTTP Basic authentication is when you have a popup window asking you to sign in, built into HTTP. With requests you can add an `auth` tuple with the username and password. Most sites use their own authentication mechanisms instead.
2006 |
2007 | Use developer tools when logging into a site, focusing on Form Data for extra parameters like step, rt, and rp. This will help you tweak your request.
2008 |
2009 | When making bare requests, there's no way to track state from request to request (the equivalent of opening each page in a new incognito window). You want to use a session, which carries cookies, so that you don't have to sign in each time. Do this with the following:
2010 |
2011 | import requests
2012 | s = requests.Session() # makes new session
2013 |
2014 | form_data = {'step':'confirmation',
2015 | 'p': 0,
2016 | 'rt': '',
2017 | 'rp': '',
2018 | 'inputEmailHandle':'isaac.laughlin@gmail.com',
2019 | 'inputPassword':clist_pwd}
2020 |
2021 | headers = {"Host": "accounts.craigslist.org",
2022 | "Origin": "https://accounts.craigslist.org",
2023 | "Referer": "https://accounts.craigslist.org/login",
2024 | 'User-Agent':"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.130 Safari/537.36"}
2025 |
2026 | s.headers.update(headers)
2027 | z = s.post('https://accounts.craigslist.org/login', data=form_data)
2028 | z
2029 |
2030 | For parsing, you can use the following example using BeautifulSoup:
2031 |
2032 | from bs4 import BeautifulSoup
2033 | soup = BeautifulSoup(z.content, 'html.parser', from_encoding='UTF-8')
2034 | t = soup.find_all('table')          # all tables on the page
2035 | listings = t[1]                     # the table holding the listings
2036 | rows = listings.find_all('tr')      # every row in that table
2037 | data = [[x.text for x in row.find_all('td')] for row in rows]  # cell text, row by row
2038 | data
2039 | [row.find_all('td') for row in rows]  # or keep the raw <td> tags instead of text
2040 |
2041 | You can also use `pd.read_html()` to turn a table into a pandas dataframe.
2042 |
2043 | When JavaScript goes out to load data (shown by the spinning loading wheel), you can see the underlying requests in the XHR tab of your browser's developer tools.
2044 |
2045 | You can use **Selenium** to automate the browser itself for sites with more advanced guards or heavy JavaScript. **Tor** can be used to hide your identity as needed.
2046 |
2047 | Selenium: http://www.seleniumhq.org/
2048 | CSS selection game: flukeout.github.io
2049 |
2050 |
2051 |
2052 | ### Profit Curves ###
2053 |
2054 | A **profit curve** lets us choose the most profitable model by assigning costs and benefits to the cells of our confusion matrix. A profit matrix looks like this:
2055 |
2056 | | | Predicted positive | Predicted negative |
2057 | |--------------------|:-------------------:|---------------------:|
2058 | | Actually positive | Benefit of true positive| Cost of false negative |
2059 | | Actually negative  | Cost of false positive | Benefit of true negative |
2060 |
2061 | In the case of fraud, for example, the benefit of accurately predicting fraud when there is fraud saves a lot of money, while the cost of a false positive is relatively low.
2062 |
2063 | An ROC curve gives our estimated true positive rate versus false positive rate. Similarly, for a profit curve you walk through all the possible thresholds in your dataset, make a confusion matrix at each one, and use that confusion matrix to estimate profit. On the x axis you plot the percentage of test instances, or the percent of data points that you predict positive; the y axis is profit.
2064 |
2065 | 
2066 |
2067 | You choose the model and threshold based upon the point that maximizes the profit. This can be applied to abstract cases as well, like a somewhat abstract quantification for how bad it is to have spam misclassified versus accurately classified.
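Here is a sketch of that threshold walk with a hypothetical cost-benefit matrix and made-up scores (the payoff values, labels, and probabilities are all illustrative):

    import numpy as np

    benefit_tp, cost_fn, cost_fp, benefit_tn = 100, -40, -10, 0   # hypothetical payoffs
    y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
    y_prob = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])

    thresholds = np.sort(y_prob)[::-1]            # walk from strictest to loosest threshold
    profits = []
    for thresh in thresholds:
        y_pred = (y_prob >= thresh).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        profits.append(tp * benefit_tp + fn * cost_fn + fp * cost_fp + tn * benefit_tn)
    best = int(np.argmax(profits))
    print("best threshold:", thresholds[best], "profit:", profits[best])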
2068 |
2069 | ### Imbalanced Classes ###
2070 |
2071 | There are a number of ways to deal with imbalanced classes. We could weight the classes, saying that everything from the positive class is worth 100 times more than the negative class; this is common with SVMs, though we could do it with random forests as well. A second way is to balance by adding or subtracting data. A third way is the **elbow method**, where you look at an ROC curve and choose the elbow of the curve. This is your best option if you have only one model.
2072 |
2073 | There are three common techniques for adding or subtracting data:
2074 |
2075 | 1. `Undersampling`: randomly discard majority class observations. This makes calculations faster but entails losing data
2076 | 2. `Oversampling`: replicate minority class observations by duplicating points (you duplicate points, not bootstrap them). The upside is that you don't discard info, while the downside is that you'll likely overfit.
2077 | 3. `Synthetic Minority Oversampling Technique (SMOTE)`: this uses KNN to create data that you would be likely to see. This generally performs better than the other approaches. Be sure to use a random distance between the two points you use. It takes three inputs:
2078 | ** `T`: number of minority class samples T
2079 | ** `N%`: amount of SMOTE
2080 | ** `k`: number of nearest neighbors (normally 5)
2081 |
2082 | Always test the three methods and throw them on an ROC curve. SMOTE generally works well; when it doesn't work, it really doesn't work.
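A resampling sketch, assuming the third-party `imbalanced-learn` (imblearn) package is installed; the 95/5 class split and feature dimensions are made up:

    import numpy as np
    from imblearn.over_sampling import SMOTE

    X = np.random.normal(size=(200, 3))
    y = np.array([0] * 190 + [1] * 10)                 # 95/5 imbalance
    X_res, y_res = SMOTE(k_neighbors=5).fit_resample(X, y)
    print(np.bincount(y), "->", np.bincount(y_res))    # minority class synthetically boosted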
2083 |
2084 | You can also change your cost function so that predicting the minority class wrong is taxed more. **Stratified k-fold sampling** ensures that each fold has the same proportion of the classes as the full training set. If you have a significant minority, this will help keep you from fitting your model to only one class. F1 as your metric gives you the **harmonic mean** of precision and recall; unlike the arithmetic mean, the harmonic mean also penalizes a large gap between the two. The curve-based approaches above are generally better than F1, but if you're automatically selecting models then F1 can be a good metric, especially as a replacement for accuracy. If you're comparing a few models, use curves. If you're comparing a large number of them, use F1.
2085 |
2086 | ### Recommender Systems ###
2087 |
2088 | Recommender systems can be used to recommend a person, thing, experience, company, employee, travel, purchase, etc. This problem sits at the intersection of computer science, AI, human-computer interaction, and cognitive psychology. You want to consider the level of personalization, diversity in the recommendations, the persistence with which you recommend, different modalities for different modes, privacy, and trust issues (what are your motivations for recommending something to me).
2089 |
2090 | We have three general kinds of data: data on the target person, data on the target item, and user-item interactions such as ratings. Different algorithms have different demands for the type of data we need.
2091 |
2092 | One question is how we interpret actions, like to view something, to like it, or share it. There are two general domains of action: explicit actions where a user took a clear action like a rating or implicit actions such as viewing a product.
2093 |
2094 | The common approaches vary in how they frame the challenge and in their data requirements:
2095 |
2096 | * `Classification/Regression`: this approach uses past interactions as your training data using familiar tools and algorithms. The downside is that you need training data so you have to have a user rate your items. You have to do feature engineering too. Tuning a classifier for every user is difficult and you can compress modalities (like a user's mood). Finally, this isn't a social model since it's built on a single person.
2097 | * `Collaborative filtering`: every single rating from any user allows you to improve a recommendation for every other user. This is often considered the de facto recommender system. It doesn't use any information from the users or items, rather it uses similarity to other users/products to predict interactions. The past interactions are your training data. Your prediction of a given person for a given item relates to their similarity to other people. With r as your rating, you want to subtract the user's average rating from their rating of a given item to normalize it (in case they rate every item highly or poorly). For larger data sets, you can look for neighborhoods and cache the results. **Content-boosted collaborative filtering** takes into account other content like Facebook connections to boost the similarity metric. The crippling weakness of this model is the **cold start problem** where we don't know what to predict for a new user.
2098 | * `Similarity`: this approach tries to do a great job at figuring out which things are similar to one another. It then gives you more ways to compare items when you collapse features, which means a lower time until first utility. You also have good side effects like topics and topic distributions. The limitations are that your item information requirement can be high and that it is solipsistic.
2099 | * `Matrix Factorization`: see below.
2100 |
2101 | You also want to consider user state: when a user already owns something, do they need another one? **Basket analysis** analyzes products that go well together. You also want to consider long-term preference shift, and you can do A/B testing to evaluate long-term success.
2102 |
2103 | A challenge with PCA, SVD, and NMF is that they all must be computed on a dense data matrix. You can impute missing values with something like the mean (which is what sklearn does when it says it factorizes sparse matrices). **Matrix factorization** turns your matrix X into U*V, returning predictions where we once had missing values. Collaborative filtering is a memory-based model where we store data and query it; factorization is more model-based as it creates predictions from which subtler outputs can be deduced. Like in NMF, we define a cost function. However, since the matrix is sparse we can't update on the entire matrix. *The key is updating only on known data.* Like most of data science, we have a cost function and we try to minimize it. Similar to NMF, we can use ALS. A more popular option is **Funk SVD**, which is not technically SVD and was popularized in the Netflix challenge.
2104 |
2105 | Gradient descent rules apply to matrix factorization but only with regards to individual cells in the matrix where we have data. You're doing updates on one cell in X by looking at the row of U and column of V and skipping unrated items. Root mean squared error is the most common validation metric. If a row or column is entirely sparse, it's a cold start problem. Prediction is fast and you can find latent features for topic meaning. The downside is that you need to re-factorize with new data, there are not good open source tools, and it's difficult to tune directly to the type of recommendation you want to make. Side information can be added.
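Here is a minimal numpy sketch of this Funk-SVD-style update on a toy ratings matrix, treating 0 as "missing" and stepping only on observed cells; the hyperparameters (k, learning rate, regularization) are illustrative.

    import numpy as np

    R = np.array([[5, 3, 0, 1],          # 0 marks a missing rating
                  [4, 0, 0, 1],
                  [1, 1, 0, 5],
                  [0, 1, 5, 4]], dtype=float)
    k, lr, reg = 2, 0.01, 0.02
    U = np.random.normal(scale=0.1, size=(R.shape[0], k))   # user factors
    V = np.random.normal(scale=0.1, size=(R.shape[1], k))   # item factors

    for epoch in range(500):
        for u, i in zip(*R.nonzero()):                       # update only on known data
            err = R[u, i] - U[u] @ V[i]
            U[u] += lr * (err * V[i] - reg * U[u])           # gradient step for this one cell
            V[i] += lr * (err * U[u] - reg * V[i])
    print(np.round(U @ V.T, 1))                              # missing cells now hold predictions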
2106 |
2107 | Simon Funk's original write-up of the method (from the Netflix Prize): http://sifter.org/~simon/journal/20061211.html
2108 |
2109 | ### Graph Theory ###
2110 |
2111 | **Graph theory** is the study of graphs, which are mathematical structures used to model pairwise relations between objects. Graph theory itself dates back to Euler in the 18th century, but much of the network analysis used here (for example, the modularity-based community detection work published around 2004) is still quite new.
2112 |
2113 | A graph G is composed of V and E, where V is a finite set of elements (vertices) and E is a set of two-element subsets of V (your connections). In other words, G = (V, E). Main vocabulary:
2114 |
2115 | * `Vertices/nodes/points`: the entities involved in the relationship
2116 | * `Edges/arcs/lines/dyad`: the basic unit of a social network analysis denoting a single relationship.
2117 | * `Mode`: a **1-mode** graph includes only one type of node, whereas a bimodal or multi-modal graph involves multiple types such as organizations, postings, etc.
2118 | * `Neighbors`: the nodes connected to a given node
2119 | * `Degree`: the total number of neighbors
2120 | * `Path`: a series of vertices and the edges that connect them without repeating an edge, denoted e.g. ABD
2121 | * `Walk`: a sequence of vertices and edges that may trace back over itself (repeating vertices or edges)
2122 | * `Complete`: there is an edge from every node to every other node on the graph
2123 | * `Connected`: Every node is connected in some path to other nodes (no freestanding nodes)
2124 | * `Subgraph`: A subgraph is any subset of nodes and the edges that connect them. A single node is a subgraph
2125 | * `Connected Components`: the maximal connected subgraphs of a graph; every node belongs to exactly one component
2126 |
2127 | Here are three types of graphs:
2128 |
2129 | 1. `Directed`: edges have a direction from one node to another (e.g. Twitter followers, LinkedIn endorsements, or phone calls)
2130 | 2. `Undirected`: edges are mutual, with no direction (e.g. Facebook friendships or LinkedIn connections)
2131 | 3. `Weighted`: edges carry weights (e.g. traveling costs such as fuel or time, or social pressure)
2132 |
2133 | There are three ways of representing a graph in a computer:
2134 |
2135 | 1. `Edge list`: a list of edges [(A,B,3), (A,C,5), (B,C,7), ...]. In general, you don't use this.
2136 | 2. `Adjacency list`: keeps track of which nodes are connected to a given node (much faster to search): {A: [(B,3), (C,5)], B: [(A,3), (C,7)], ...}
2137 | ** Space: O(|V| + |E|)
2138 | ** Edge lookup cost: O(|V|) in the worst case (proportional to the node's degree)
2139 | ** Neighbor lookup: O(# neighbors)
2140 | 3. `Adjacency matrix`: the same information as the list but as a full matrix (space-inefficient). You can use a bit matrix of 0's and 1's or use the weighted values. For unconnected nodes in a weighted matrix, use a value orders of magnitude larger than the other values to denote that they're not directly connected.
2141 | ** Space: O(|V|^2)
2142 | ** Edge lookup cost: O(1) (constant time)
2143 | ** Neighbor lookup: O(|V|)
2144 |
2145 | A B C D
2146 | A 0 3 5 10e6
2147 | B 3 0 7 9
2148 |
2149 | There are two main search options. **Breadth first search (BFS)** searches breadth before depth; it is good for finding the shortest path between nodes because it exhausts the first tier, then the second, etc. **Depth first search (DFS)** follows each node to its greatest depth, then backs up one tier, exhausts that, etc. BFS is exhaustive, so it can be very slow if your node happens to be the last one seen. DFS is not guaranteed to find the shortest path.
2150 |
2151 | from collections import deque               # runnable version of the BFS pseudo-code
2152 | def bfs(graph, start, end):                  # graph: dict of node -> iterable of neighbors
2153 |     Q = deque([(start, 0)])                  # queue of (node, distance) pairs
2154 |     V = set()                                # visited nodes
2155 |     while Q:
2156 |         N, d = Q.popleft()                   # take node off Q, call it N with distance d
2157 |         if N not in V:
2158 |             V.add(N)
2159 |             if N == end:
2160 |                 return d                     # done: shortest distance (in edges)
2161 |             Q.extend((nbr, d + 1) for nbr in graph[N])  # add every neighbor with distance d+1
2162 |     return None                              # no path from start to end
2163 |
2164 | **Centrality** is how we measure the importance of a node. **Degree centrality** says that the more connections a node has, the more important it is; we can normalize by dividing by the number of other nodes. **Betweenness centrality** shows the importance of a node for sitting between other nodes: a node has 0 betweenness if no shortest path needs to pass through it. **Eigenvector centrality** comes from the Perron-Frobenius theorem; we get it by solving λv = Av, i.e. by repeatedly multiplying the adjacency matrix by the centrality vector. **PageRank** is very similar to eigenvector centrality. See notes for the equations for this.
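A quick sketch using the NetworkX package listed in the resources below; the five-node graph is made up for illustration:

    import networkx as nx

    G = nx.Graph([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")])
    print(nx.degree_centrality(G))        # degree normalized by the number of other nodes
    print(nx.betweenness_centrality(G))   # C and D sit on the only paths to E, so they score high
    print(nx.eigenvector_centrality(G))   # importance weighted by neighbors' importance
    print(nx.pagerank(G))                 # the random-surfer cousin of eigenvector centrality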
2165 |
2166 | You can find communities in your network using the following:
2167 |
2168 | * `Mutual ties`: nodes within the group know each other
2169 | * `Compactness`: reach other members in a few steps
2170 | * `Dense edges`: high frequency of edges within groups
2171 | * `Separation from other groups`: frequency within groups higher than out of group
2172 |
2173 | **Modularity** is a measure of how unlikely the observed group structure is to have arisen by random chance. To calculate it, you compare the fraction of edges that fall within a given group to the fraction expected if edges were placed at random (looking at each node's edges over the sum of all edges in the network). You need to relax the assumptions to get the math to work by allowing a node to connect with itself. Instead of modularity, you could also use hierarchical clustering and just cut off at different points, though modularity with its heuristic is much faster.
2174 |
2175 | Resources:
2176 | * NetworkX for a python package
2177 | * Gephi for graphing
2178 |
2179 |
2180 | ---
2181 |
2182 | ## Probabilitic Data Structures ##
2183 |
2184 | At a high level, **probabilistic data structures** are algorithms that use randomness to improve their efficiency. There are two classes of these. The **Las Vegas type** is guaranteed to find the right answer but has a random element in its runtime. The **Monte Carlo type** may not find the right answer but has a fixed runtime. For instance, the former would look at random indices until it finds a result, while the latter would look at random indices for a fixed number of iterations before stopping. Bloom filters, HyperLogLog, locality-sensitive hashing, and count-min sketch are common implementations of these data structures. Think of using hashing with a large n instead of sampling.
2185 |
2186 | A **bloom filter** is a super-fast and space-efficient probabilistic data structure used to check for *set membership*. It can tell you that an item is definitely not in a set or that it probably is, but it can't say for sure that it is in the set. Items also can't be removed without a more advanced variant. You use a number of hash functions to set bits that record whether a given item has been seen. This is a lean way of filtering data that allows you to call a more expensive operation only on what is not filtered out, such as Google Chrome filtering out malicious websites. With bloom filters, you define two variables: `m` (your number of bits) and `k` (the number of different hash functions used). Here's an example:
2187 |
2188 | import numpy as np
2189 |
2190 | m = 18 # number of bits
2191 | k = 3 # constant, smaller than m and determined by intended false positive rate
2192 |
2193 | x, y, z = np.zeros(m), np.zeros(m), np.zeros(m) # suppose these are IP address of website visitors
2194 | x[1], x[5], x[13], y[4], y[11], y[16], z[3], z[5], z[11] = np.ones(9)
2195 | print("\x1b[34;1m"); print('x', x); print('y', y); print('z', z)
2196 |
2197 | w = np.zeros(m) # suppose this is somebody who hasn't visited a website yet
2198 | w[4], w[13], w[15] = np.ones(3)
2199 | print("\x1b[31;1mw", w, "\x1b[0m")
2200 |
2201 | A bloom filter will look for the membership of `w` within the set {x, y, z}. *HBase and Cassandra use them to avoid expensive lookups.*
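The example above only illustrates the bit arrays; the sketch below adds the hashing step so membership can actually be queried. The salted-SHA-256 scheme and the IP addresses are illustrative choices, not a canonical implementation.

    import hashlib

    m, k = 64, 3                                 # number of bits and hash functions, as above

    def _positions(item):
        # derive k bit positions from k salted hashes of the item
        return [int(hashlib.sha256(f"{i}{item}".encode()).hexdigest(), 16) % m for i in range(k)]

    bits = [0] * m
    for ip in ["1.2.3.4", "5.6.7.8", "9.9.9.9"]:     # the "set" of seen visitors
        for pos in _positions(ip):
            bits[pos] = 1

    def maybe_contains(item):
        # all bits set => probably seen; any bit unset => definitely not seen
        return all(bits[pos] for pos in _positions(item))

    print(maybe_contains("1.2.3.4"), maybe_contains("8.8.8.8"))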
2202 |
2203 | **HyperLogLog** is an algorithm for *counting distinct values*, approximating the number of distinct elements in a multiset (a set that allows duplicate values). It evolved from the observation that the cardinality (number of distinct elements) of a multiset of uniformly distributed random numbers can be estimated from the maximum number of leading zeros in the binary representation of each number in the set: if the maximum number of leading zeros is `n`, then an estimate for the number of distinct elements is `2**n`. This approach has high variance, so HyperLogLog reduces it by splitting the multiset into many subsets, calculating the max number of leading zeros in each, and combining the estimates with a harmonic mean. In Spark, `approxCountDistinct` implements this approach, and functions like `approxQuantile` belong to the same family of approximative algorithms.
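A small PySpark sketch of the approximative functions mentioned above; the column names and sizes are made up for illustration:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000_000).withColumn("user", (F.col("id") % 5000).cast("string"))
    df.agg(F.approx_count_distinct("user").alias("approx_users")).show()  # HyperLogLog-based count
    print(df.approxQuantile("id", [0.5], 0.01))                           # approximate median, 1% relative error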
2204 |
2205 | **Locality-sensitive hashing (LSH)** addresses the curse of dimensionality by hashing input items so that similar items map to the same buckets with high probability. Its aim is to maximize the probability of collisions for similar items. This approach has great similarities to clustering and KNN.
2206 |
2207 | Finally, **count-min sketch** serves as a frequency table of events in a stream of data. It uses hash functions to map events to frequencies and is a slight variation on counting bloom filters: instead of bits, we have counters. We use rows of counters where each row has a different hash function. Each time you see a value, you increment one counter in each row (the rows hash a given value to different columns). When recalling a count, you take the minimum of the counters that value hashes to. You can set the width and the depth of your counters (basically how many columns and how many rows) based upon your accepted epsilon, or error rate. This is great for any kind of frequency tracking, NLP, etc.
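A toy count-min sketch in Python/numpy; the depth, width, and seed values are illustrative:

    import numpy as np

    depth, width = 4, 50                        # one row of counters per hash function
    table = np.zeros((depth, width), dtype=int)
    seeds = [7, 11, 13, 17]                     # illustrative salts, one per row

    def _cols(item):
        # each row hashes the item to a (usually different) column
        return [hash((seed, item)) % width for seed in seeds]

    def add(item):
        for row, col in enumerate(_cols(item)):
            table[row, col] += 1

    def estimate(item):
        # take the minimum across rows; collisions can over-count but never under-count
        return min(table[row, col] for row, col in enumerate(_cols(item)))

    for word in ["spark", "spark", "hadoop", "spark"]:
        add(word)
    print(estimate("spark"), estimate("hadoop"), estimate("flink"))   # 3 2 0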
2208 |
2209 | Resources:
2210 | * [It Probably Works](https://www.youtube.com/watch?v=FSlPU5Nrvds)
2211 | * [Bloom filters](http://www.lsi.upc.es/~diaz/p422-bloom.pdf)
2212 | * [Original Bloom Filter Paper](http://dl.acm.org/citation.cfm?id=362692)
2213 | * [Approximative Algorithms in Spark](https://databricks.com/blog/2016/05/19/approximate-algorithms-in-apache-spark-hyperloglog-and-quantiles.html)
2214 | * [Original count-min paper](http://www.sciencedirect.com/science/article/pii/S0196677403001913)
2215 | * [Later Count-Min Paper (easy to read)](https://www.computer.org/csdl/mags/so/2012/01/mso2012010064.pdf)
2216 | * [Probabilistic Data Structures - bloom, count-min and hyperloglog](https://www.youtube.com/watch?v=F7EhDBfsTA8)
2217 | * [Locality-Sensitive Hashing at Uber on Spark](https://www.youtube.com/watch?v=Ha7_Vf2eZvQ&feature=youtu.be)
2218 |
2219 | ---
2220 |
2221 | ## Helpful Visualizations ##
2222 |
2223 | * Feature importance: plot your features in order of importance (this tells you importance but not whether they correlate positively or negatively)
2224 | * Partial dependence: make predictions having frozen a given feature and incremented it up. For instance, you can plot two features against the 'partial dependence', which is your outcome.
2225 | * ROC curves
2226 | * Residual plots
2227 | * QQ norm plots (quantile-quantile against a normal distribution)
2228 |
2229 |
2230 | ---
2231 |
2232 | ## Note on Style and Other Tools ##
2233 |
2234 | IPython offers features like tab completion and auto-reload over the plain Python interpreter. You can also type `%debug` to debug code.
2235 | * Jupyter/IPython notebook or Apache Zeppelin
2236 | * Markdown: https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet
2237 | * Atom: command-/ comments out (hashes out) text
2238 | * Anaconda
2239 | * Homebrew
2240 | * AWS: https://gist.github.com/iamatypeofwalrus/5183133
2241 |
2242 | Sniffer wraps nose tests so that every time you save a file it automatically runs the tests: `pip install sniffer nose-timer`
2243 |
2244 | Visualization:
2245 | * Tableau
2246 | * D3.js - based in javascript
2247 | * Shiny
2248 |
2249 | ---
2250 |
2251 | ## Career Pointers ##
2252 |
2253 | Common themes:
2254 |
2255 | * SQL
2256 | * SELECT statements
2257 | * Linear models
2258 | * How to interpret coefficients
2259 | * How to find the values of the coefficients
2260 | * Probability
2261 | * May involve counting, distributions, Bayes, and other rules
2262 | * Expectations, variance
2263 | * Open ended questions on business problems
2264 | * Case studies/take homes
2265 | * Computational complexity
2266 | * Coding questions
2267 | * mergesort
2268 | * Fizzbuzz
2269 | * Palindromes
2270 | * Anagrams
2271 |
2272 | Common Interview Questions:
2273 |
2274 | * Narrative: why do you want to be a data scientist?
2275 | * Runtime analysis:
2276 | * do you understand the consequences of what you've written? Use generators and dicts when possible.
2277 | * Indexes can speed up queries (btree is common, allowing you to subdivide values.)
2278 | * Vectorized options are more effective than iterative ones
2279 | * You need to know the basics of OOP in interviews. A common interview question is how to design a game: start with designing classes.
2280 | * SQL will be addressed on all interviews
2281 | * What is the difference between WHERE and HAVING? (HAVING is like WHERE but can be applied after an aggregation)
2282 | * Types of joins
2283 | * Confounding factors in experimental design
2284 | * Basic probability
2285 | * Basic combinatorics
2286 | * Linear regression basics, especially that LR is more complex than y = mx + b
2287 | * Interpreting coefficients for linear and logistic regression
2288 | * How do you know if you overfit a model and how do you adjust for it?
2289 | * Compare lasso regression and logistic regression using PCA
2290 | * Why would you choose breadth first search over depth first search in graph theory?
2291 | * What's NMF and how does the math work?
2292 |
2293 | Cracking the Coding Interview
2294 | O'Reilly (Including salary averages): https://www.oreilly.com
2295 |
2296 | ## Notes for future expansion
2297 |
2298 | * Summarize ML approaches
2299 | * Week 8 notes
2300 | * Week 9 - Data Products
2301 | * Week 10 - Runtime Complexity
2302 | * Add more info on regularization
2303 | * Derive linear regression and PCA
2304 | * LDA
2305 |
2306 | List of data engineering tools:
2307 | * https://github.com/igorbarinov/awesome-data-engineering
2308 |
--------------------------------------------------------------------------------
/ToBreakDown/ToBreakDown.md:
--------------------------------------------------------------------------------
1 | Every resource needs to be broken down and put into separate sections (and cite where they came from).
2 |
3 | 1. https://towardsdatascience.com/how-to-ace-data-science-interviews-statistics-f3d363ad47b
4 |
5 | 2. Algorithms used on daily basis by data scientist: https://www.kdnuggets.com/2018/04/key-algorithms-statistical-models-aspiring-data-scientists.html
6 |
7 | 3. http://houseofbots.com/news-detail/2851-4-this-is-what-i-really-do-as-a-data-scientist
8 |
9 | 4. https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-7-unsupervised-learning-pca-and-clustering-db7879568417
10 |
11 | 5. https://www.udemy.com/python-for-data-structures-algorithms-and-interviews/learn/v4/overview
12 |
13 | 6. http://nbviewer.jupyter.org/github/jmportilla/Python-for-Algorithms--Data-Structures--and-Interviews/tree/master/
14 |
15 | 7. https://towardsdatascience.com/data-science-and-machine-learning-interview-questions-3f6207cf040b
16 |
17 | 8. **useful**: http://houseofbots.com/news-detail/2248-4-109-commonly-asked-data-science-interview-questions
18 |
19 | 9. https://medium.com/acing-ai/google-ai-interview-questions-acing-the-ai-interview-1791ad7dc3ae
20 |
21 | 10. https://medium.com/acing-ai
22 |
--------------------------------------------------------------------------------
/ToBreakDown/UCSDCSEIQ interview prep doc.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mGalarnyk/DataScienceInterview/d45a972390716e3f5f2a61e44c8cef92d70912bc/ToBreakDown/UCSDCSEIQ interview prep doc.docx
--------------------------------------------------------------------------------
/ToBreakDown/deepLearningInterview.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mGalarnyk/DataScienceInterview/d45a972390716e3f5f2a61e44c8cef92d70912bc/ToBreakDown/deepLearningInterview.pdf
--------------------------------------------------------------------------------
/ToBreakDown/images/CRISP.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mGalarnyk/DataScienceInterview/d45a972390716e3f5f2a61e44c8cef92d70912bc/ToBreakDown/images/CRISP.png
--------------------------------------------------------------------------------
/ToBreakDown/images/Profit curve.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mGalarnyk/DataScienceInterview/d45a972390716e3f5f2a61e44c8cef92d70912bc/ToBreakDown/images/Profit curve.png
--------------------------------------------------------------------------------
/ToBreakDown/images/anscombesquartet.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mGalarnyk/DataScienceInterview/d45a972390716e3f5f2a61e44c8cef92d70912bc/ToBreakDown/images/anscombesquartet.png
--------------------------------------------------------------------------------
/ToBreakDown/images/cnns.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mGalarnyk/DataScienceInterview/d45a972390716e3f5f2a61e44c8cef92d70912bc/ToBreakDown/images/cnns.png
--------------------------------------------------------------------------------
/ToBreakDown/images/conics.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mGalarnyk/DataScienceInterview/d45a972390716e3f5f2a61e44c8cef92d70912bc/ToBreakDown/images/conics.png
--------------------------------------------------------------------------------
/ToBreakDown/images/dbscan.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mGalarnyk/DataScienceInterview/d45a972390716e3f5f2a61e44c8cef92d70912bc/ToBreakDown/images/dbscan.png
--------------------------------------------------------------------------------
/ToBreakDown/images/hyndman_modeling_process.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mGalarnyk/DataScienceInterview/d45a972390716e3f5f2a61e44c8cef92d70912bc/ToBreakDown/images/hyndman_modeling_process.png
--------------------------------------------------------------------------------
/ToBreakDown/images/ses.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mGalarnyk/DataScienceInterview/d45a972390716e3f5f2a61e44c8cef92d70912bc/ToBreakDown/images/ses.png
--------------------------------------------------------------------------------
/ToBreakDown/images/timeseries.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mGalarnyk/DataScienceInterview/d45a972390716e3f5f2a61e44c8cef92d70912bc/ToBreakDown/images/timeseries.png
--------------------------------------------------------------------------------
/ToBreakDown/images/transformations.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mGalarnyk/DataScienceInterview/d45a972390716e3f5f2a61e44c8cef92d70912bc/ToBreakDown/images/transformations.png
--------------------------------------------------------------------------------
/ToBreakDown/interviewQuestionsLinkedIn.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mGalarnyk/DataScienceInterview/d45a972390716e3f5f2a61e44c8cef92d70912bc/ToBreakDown/interviewQuestionsLinkedIn.pdf
--------------------------------------------------------------------------------
/ToBreakDown/interviewing.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |