├── .gitignore
├── README.md
├── applied.md
├── img
    └── pareto.png
├── statistics.md
└── tradeoffs.md


/.gitignore:
--------------------------------------------------------------------------------
  1 | # Byte-compiled / optimized / DLL files
  2 | __pycache__/
  3 | *.py[cod]
  4 | *$py.class
  5 | 
  6 | # C extensions
  7 | *.so
  8 | 
  9 | # Distribution / packaging
 10 | .Python
 11 | build/
 12 | develop-eggs/
 13 | dist/
 14 | downloads/
 15 | eggs/
 16 | .eggs/
 17 | lib/
 18 | lib64/
 19 | parts/
 20 | sdist/
 21 | var/
 22 | wheels/
 23 | pip-wheel-metadata/
 24 | share/python-wheels/
 25 | *.egg-info/
 26 | .installed.cfg
 27 | *.egg
 28 | MANIFEST
 29 | 
 30 | # PyInstaller
 31 | #  Usually these files are written by a python script from a template
 32 | #  before PyInstaller builds the exe, so as to inject date/other infos into it.
 33 | *.manifest
 34 | *.spec
 35 | 
 36 | # Installer logs
 37 | pip-log.txt
 38 | pip-delete-this-directory.txt
 39 | 
 40 | # Unit test / coverage reports
 41 | htmlcov/
 42 | .tox/
 43 | .nox/
 44 | .coverage
 45 | .coverage.*
 46 | .cache
 47 | nosetests.xml
 48 | coverage.xml
 49 | *.cover
 50 | *.py,cover
 51 | .hypothesis/
 52 | .pytest_cache/
 53 | 
 54 | # Translations
 55 | *.mo
 56 | *.pot
 57 | 
 58 | # Django stuff:
 59 | *.log
 60 | local_settings.py
 61 | db.sqlite3
 62 | db.sqlite3-journal
 63 | 
 64 | # Flask stuff:
 65 | instance/
 66 | .webassets-cache
 67 | 
 68 | # Scrapy stuff:
 69 | .scrapy
 70 | 
 71 | # Sphinx documentation
 72 | docs/_build/
 73 | 
 74 | # PyBuilder
 75 | target/
 76 | 
 77 | # Jupyter Notebook
 78 | .ipynb_checkpoints
 79 | 
 80 | # IPython
 81 | profile_default/
 82 | ipython_config.py
 83 | 
 84 | # pyenv
 85 | .python-version
 86 | 
 87 | # pipenv
 88 | #   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
 89 | #   However, in case of collaboration, if having platform-specific dependencies or dependencies
 90 | #   having no cross-platform support, pipenv may install dependencies that don't work, or not
 91 | #   install all needed dependencies.
 92 | #Pipfile.lock
 93 | 
 94 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow
 95 | __pypackages__/
 96 | 
 97 | # Celery stuff
 98 | celerybeat-schedule
 99 | celerybeat.pid
100 | 
101 | # SageMath parsed files
102 | *.sage.py
103 | 
104 | # Environments
105 | .env
106 | .venv
107 | env/
108 | venv/
109 | ENV/
110 | env.bak/
111 | venv.bak/
112 | 
113 | # Spyder project settings
114 | .spyderproject
115 | .spyproject
116 | 
117 | # Rope project settings
118 | .ropeproject
119 | 
120 | # mkdocs documentation
121 | /site
122 | 
123 | # mypy
124 | .mypy_cache/
125 | .dmypy.json
126 | dmypy.json
127 | 
128 | # Pyre type checker
129 | .pyre/
130 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # ml-interviews
 2 | 
 3 | ## Background
 4 | This repo contains links and instructional materials to help prepare for industry machine learning (ML) interviews (data/applied/research scientist, ML engineer, etc). It is primarily aimed at Master's and Ph.D. students. While there are innumerable resources online for learning ML, this amalgamation of courses, papers, blog articles, and Twitter threads forms a daunting list of concepts and algorithms. The goal of this repo is to distill a much smaller list of topics that are useful for the recruiting process for graduate-level internships and full-time roles. Roughly, I see this as five topics.
 5 | 
 6 | - **Statistics Basics:** Theoretical and applied statistics concepts such as deriving the distribution of a random variable, testing for equality of predictive performance, etc. See `statistics.md`.
 7 | - **Implementation Basics:** Simple algorithms which you may be asked to implement on the spot. Expect, for example, to be able to implement $k$-nearest neighbors or $k$-means clustering from memory.
 8 | - **Trade-offs:** Enumeration of various trade-offs between desirable properties that come up in discussions, such as bias-variance, precision-recall, and accuracy-fairness. See `tradeoffs.md`.
 9 | - **Advanced Topics:** Modern methods and problems in machine learning. See [CSE 599i](https://courses.cs.washington.edu/courses/cse599i/20au/) taught during Autumn 2020 at the University of Washington for a survey of generative models, which will in turn cover many relevant topics in deep learning. Pay particular attention to the transformer architecture, and try to implement it yourself.
10 | - **Applied ML:** Questions in which the interviewer asks you to design an ML solution for a business product. See `applied.md`.
11 | 
12 | ## Feedback
13 | This material is a work in progress, and I am happy to receive corrections and feedback! Please see my [website](https://ronakdm.github.io/) for contact information.
14 | 


--------------------------------------------------------------------------------
/applied.md:
--------------------------------------------------------------------------------
 1 | # Applied ML
 2 | 
 3 | A common interview type is the "applied ML" interview, in which the interviewer presents a product feature and asks you to design an ML solution that implements it. These problems are usually given with minimal information or structure, and it is up to you to design a comprehensive solution. Consider the following two examples, which are taken from real interviews.
 4 | 
 5 | **Example 1:** How can we automatically identify posts on our feed that sell firearms?
 6 | 
 7 | **Example 2:** How can we predict the click-through rate (CTR) of an advertisement?
 8 | 
 9 | These are "solved" at the end of this note.
10 | 
11 | **What they are looking for:** The ability to 1) translate ML methods into real-world products/outcomes, 2) quickly adjust to a domain you have probably never seen before, and 3) create structure (through exhaustive questioning) from a vague outline.
12 | 
13 | ## Approach
14 | 
15 | This is my usual approach to structuring an answer.
16 | 
17 | ### Problem Design
18 | 
19 | 1. Repeat the general (real-world) problem description back to the interviewer.
20 | 2. Identify who the key stakeholders are (users, clients, investors, citizens).
21 | 3. Identify domain-specific performance metrics.
22 | 4. Choose an ML problem (regression, classification, clustering, etc) that most closely resembles the applied problem, and confirm with your interviewer.
23 | 
24 | ### Data Design
25 | 
26 | 1. Identify the input and label spaces explicitly.
27 | 2. Determine how the training data is sampled/collected.
28 | 3. Ask if there is a feature engineering step, in which case suggest features relevant for the problem.
29 | 4. Ask if there are any privacy considerations that need to be taken into account.
30 | 
31 | ### Modeling
32 | 
33 | 1. Present a set of models that range from simple to complex. If there are images or text, use pretrained networks (ResNet50, BERT, more modern solutions, etc) to represent them as vectors, and concatenate these with other numerical features for a combined representation. Then, apply linear or logistic regression to the output. In the simplest form, leave the encoders as frozen features and only solve the (convex) problem of learning the weights for the final layer. Then, add complexity by adding one more layer to the head, or by unfreezing the encoder weights and backpropagating through them. Note that you could also use a tree-based approach (see `tradeoffs.md`).
34 | 2. Identify all hyperparameters on the table (architecture, optimizers, encoders, etc).
35 | 3. Then, train the model and check whether you successfully optimized the training loss. If not, options are to decrease the learning rate, or in some cases increase $\ell^2$-regularization to upweight the strongly convex portion of the objective function. If you cannot optimize, there is not much use in discussing downstream performance. In a convex problem, the norm of the gradient can be used to verify optimality. In a non-convex (e.g. neural network) setting, one can usually achieve zero training loss.
36 | 4. Once the model is near-optimized, assess the generalization gap by checking validation loss. If there is a substantial gap, make changes that aid generalization (e.g. increased regularization, model simplification). If there is no gap, but performance is not as good as it could be, then increase model complexity. How good it could be depends on the problem: for example, in a human-like problem such as image classification, the Bayes optimal risk is essentially zero, so any remaining error is room for improvement.
37 | 5. In classification, after training a good model, consider optimizing the threshold to achieve a good balance of precision and recall. Discuss any other trade-offs in general.
38 | 
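The optimization check in step 3 can be sketched for the simplest model in step 1: an $\ell^2$-regularized logistic regression head trained on frozen features. This is a minimal sketch under stated assumptions, not a recommended training setup: the names `loss_and_grad` and `fit_head` are hypothetical, the random features merely stand in for frozen encoder outputs, and plain gradient descent with a fixed step size is an illustrative choice.

```python
import math
import random


def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def loss_and_grad(w, X, y, lam):
    """Mean logistic loss plus (lam / 2) * ||w||^2, together with its gradient."""
    n, d = len(X), len(w)
    loss = 0.5 * lam * sum(wj * wj for wj in w)
    grad = [lam * wj for wj in w]
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        p_safe = min(max(p, 1e-12), 1.0 - 1e-12)  # guard the logarithms
        loss -= (yi * math.log(p_safe) + (1 - yi) * math.log(1.0 - p_safe)) / n
        for j in range(d):
            grad[j] += (p - yi) * xi[j] / n
    return loss, grad


def grad_norm_at(w, X, y, lam):
    _, grad = loss_and_grad(w, X, y, lam)
    return math.sqrt(sum(g * g for g in grad))


def fit_head(X, y, lam=0.1, lr=0.5, tol=1e-6, max_iter=5000):
    """Gradient descent that stops once the gradient norm falls below tol."""
    w = [0.0] * len(X[0])
    for _ in range(max_iter):
        _, grad = loss_and_grad(w, X, y, lam)
        if math.sqrt(sum(g * g for g in grad)) < tol:
            break
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w, grad_norm_at(w, X, y, lam)


# Synthetic stand-in for frozen encoder outputs (an assumption for illustration).
rng = random.Random(0)
X, y = [], []
for _ in range(200):
    e = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    X.append(e + [1.0])  # constant feature for the bias term
    y.append(1 if e[0] - e[1] + rng.gauss(0.0, 0.5) > 0 else 0)

w, grad_norm = fit_head(X, y)
print(f"weights={w}, final gradient norm={grad_norm:.2e}")
```

Because the regularized objective is strongly convex, a small final gradient norm certifies that the weights are close to the unique minimizer. If the norm stays large, the step size or regularization strength is the first thing to revisit.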
39 | ### Deployment and Pitfalls
40 | 
41 | 1. Consider deployment details, such as frequency of retraining, online learning, continual learning, or centralized vs federated setup.
42 | 2. Mention issues of fairness, both in the societal sense and in the technical sense (see [Agarwal (2018)](https://icml.cc/Conferences/2018/Schedule?showEvent=2361) for examples).
43 | 3. If not mentioned before, mention issues of class imbalance, which could cause a model selected based on accuracy to perform badly in terms of precision, recall, fairness, and other metrics.
44 | 4. If not mentioned before, this would also be a good place to discuss privacy concerns.
45 | 5. Most importantly, try to mention domain-specific pitfalls that can occur and need to be accounted for. For example, in Example 1 above, a Veteran's Day post might contain a firearm but not be selling it. Could this be a pitfall of the model, and can we correct for it?


--------------------------------------------------------------------------------
/img/pareto.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ronakdm/ml-interviews/4cc888c44694222dbc5cfd989a6d3cca2a045138/img/pareto.png


--------------------------------------------------------------------------------
/statistics.md:
--------------------------------------------------------------------------------
 1 | # Statistics Basics
 2 | 
 3 | One interview format that appears in both technology and quantitative finance roles is a pure statistics interview, in which you are asked [Casella and Berger](https://books.google.com/books/about/Statistical_Inference.html?id=0x_vAAAAMAAJ) style textbook problems.
 4 | 
 5 | **What they are looking for:** Knowledge of advanced undergraduate to Master's level mathematical statistics.
 6 | 
 7 | ## Deriving the Distribution of a Random Variable
 8 | 
 9 | A general question is describing the distribution of a random variable $X$.
10 | If it is real-valued, this can be done by specifying the cumulative distribution function (CDF) $F_X$, 
11 | probability density function (PDF) $f_X$, or probability mass function (PMF) $p_X$. This can be done in the following ways.
12 | - Use the "first principles" method, i.e. using reasoning from probability to compute $F_X(x) := P(X \leq x)$.
13 | - If $X = g(Y)$ for some random variable $Y$ with density $f_Y$, where $g: \mathbb{R} \rightarrow \mathbb{R}$ is invertible with differentiable inverse, then
14 | compute $f_X(x) = f_Y(g^{-1}(x)) \cdot |[g^{-1}]'(x)|$.
15 | - If $X = (X_1, X_2)$ is a pair of real-valued random variables, compute $F_{X_1 | X_2 = x_2}$ and $F_{X_2}$ (or the corresponding densities), which together determine the joint distribution.
16 | - Compose the first principles method with the law of total probability. Let $Z$ be a discrete random variable observed jointly with $X$, taking values in $\mathcal{Z}$, and compute
17 | 
18 | $$
19 | P(X \leq x) = \sum_{z \in \mathcal{Z}} P(X \leq x, Z = z).
20 | $$
21 | 
22 | - If $(X, Z)$ are continuous, real-valued random variables with joint density $f_{X, Z}$, then compute
23 | 
24 | $$
25 | f_X(x) = \int_{z \in \mathbb{R}} f_{X, Z}(x, z)dz.
26 | $$
27 | 
28 | - If $X = g(U, V)$, where $g: \mathbb{R}^2 \rightarrow \mathbb{R}$ and $U$ and $V$ are real-valued random variables, compose the first
29 | principles method with the law of iterated expectations:
30 | 
31 | $$
32 | P(X \leq x) = P(g(U, V) \leq x) = E[P(g(U, V) \leq x | V)].
33 | $$
34 | 
35 | - When deriving a conditional density given $Y = y$, write $f_{X | Y = y}(x) \propto f_{X, Y}(x, y)$, and recognize the 
36 | *family* of $f_{X | Y = y}$ by looking at the $x$ terms and the *parameters* by looking at the $y$ terms of $f_{X, Y}(x, y)$.
37 | 
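As a worked instance of the change-of-variables rule in the list above (an illustrative choice, not taken from a particular interview), let $Y \sim \mathcal{N}(0, 1)$ and $X = e^Y$. Here $g(y) = e^y$ is invertible onto $(0, \infty)$, with $g^{-1}(x) = \log x$ and $[g^{-1}]'(x) = 1/x$, so for $x > 0$:

$$
f_X(x) = f_Y(\log x) \cdot \left|\frac{1}{x}\right| = \frac{1}{x\sqrt{2\pi}} \exp\left(-\frac{(\log x)^2}{2}\right).
$$

This is the standard log-normal density, and $f_X(x) = 0$ for $x \leq 0$.
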
38 | **Example:** Let $Z_1, \ldots, Z_n$ be $\mathbb{R}^d$-valued random variables distributed uniformly on the $d$-dimensional Euclidean unit ball $\lbrace z: ||z||_2 \leq 1\rbrace$. Find the distribution of the minimum distance between the origin and any of the $Z_i$'s (as a function of $n$ and $d$).
39 | 
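One way to approach the example above is the first principles method. Assuming in addition that the $Z_i$ are independent (the problem statement only says uniform), a volume ratio gives $P(||Z_i||_2 \leq r) = r^d$ for $r \in [0, 1]$, so the minimum distance $M$ satisfies $P(M \leq r) = 1 - (1 - r^d)^n$. The sketch below checks this candidate answer by simulation. The function names are hypothetical, and rejection sampling from the cube is only practical for small $d$.

```python
import math
import random


def min_distance_cdf(r, n, d):
    """P(min_i ||Z_i|| <= r) for n independent uniform points in the unit d-ball."""
    if r <= 0.0:
        return 0.0
    if r >= 1.0:
        return 1.0
    return 1.0 - (1.0 - r ** d) ** n


def sample_unit_ball(d, rng):
    """Rejection sampling: draw from the cube [-1, 1]^d until the point lies in the ball."""
    while True:
        z = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        if sum(c * c for c in z) <= 1.0:
            return z


def simulate_cdf(r, n, d, trials=20000, seed=0):
    """Monte Carlo estimate of P(min_i ||Z_i|| <= r)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        norms = (math.sqrt(sum(c * c for c in sample_unit_ball(d, rng))) for _ in range(n))
        hits += min(norms) <= r
    return hits / trials


exact = min_distance_cdf(0.5, n=5, d=3)
approx = simulate_cdf(0.5, n=5, d=3)
print(f"exact={exact:.4f} simulated={approx:.4f}")
```

The simulation never uses the formula $r^d$, so agreement between the two numbers is a genuine check of the derivation rather than a restatement of it.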
40 | ## Computing Functionals of the Distribution of a Random Variable
41 | 
42 | Similarly, one may ask you to compute the expectation $\mathbb{E}[X]$ or variance $\operatorname{Var}[X]$ of a random variable $X$. Some options are listed below.
43 | - While it is usually simpler to compute the expectation or variance directly, one can always compute the entire distribution of $X$ and then proceed by definition:
44 | 
45 | $$
46 | \mathbb{E}[X] = \int x f_X(x) dx, \quad \operatorname{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2
47 | $$
48 | 
49 | - For $(X, Z)$ with joint density $f_{X, Z}$, integrate directly against the joint density (with analogous sums for discrete random variables):
50 | 
51 | $$
52 | \mathbb{E}[X] = \int_{x \in \mathbb{R}}\int_{z \in \mathbb{R}} x f_{X, Z}(x, z)dzdx.
53 | $$
54 | 
55 | - Use the law of iterated expectation:
56 | 
57 | $$
58 | \mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X | Z]].
59 | $$
60 | 
61 | - Use the law of total variance:
62 | 
63 | $$
64 | \operatorname{Var}[X] = \mathbb{E}[\operatorname{Var}[X|Z]] + \operatorname{Var}[\mathbb{E}[X|Z]].
65 | $$
66 | 
67 | - If $\textstyle X = \sum_{i=1}^n X_i$ is a sum of random variables, letting $\operatorname{Cov}[X_i, X_j]$ be the covariance of $X_i$ and $X_j$, use:
68 | 
69 | $$
70 | \mathbb{E}[X] = \sum_{i=1}^n \mathbb{E}[X_i], \quad \operatorname{Var}[X] = \sum_{i=1}^n \sum_{j=1}^n \operatorname{Cov}[X_i, X_j].
71 | $$
72 | 
73 | - If $X$ is a count of some quantity, write it as a sum of indicators $\textstyle X = \sum_{i=1}^n 1_{A_i}$ for events $A_1, \ldots, A_n$, so that:
74 | 
75 | $$
76 | \mathbb{E}[X] = \sum_{i=1}^n \mathbb{E}[1_{A_i}] = \sum_{i=1}^n P(A_i). 
77 | $$
78 | 
79 | **Example:** 8 students from JHU and 10 students from UW sit around a circular table with 18 seats, with uniformly randomly assigned seats. Each person will shake the hands of the person next to them if they are from different universities. What is the expected number of handshakes that occur around the table? What is the variance?
80 | 
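The expectation in the example above follows from the indicator technique: interpreting a handshake as one adjacent pair of seats occupied by students from different universities, there are 18 adjacent pairs around the table, and each is mixed with probability $2 \cdot \frac{8}{18} \cdot \frac{10}{17}$. The sketch below computes the exact mean this way and estimates both mean and variance by simulation. The function names are hypothetical, and the variance is only estimated here, not derived.

```python
import random
from fractions import Fraction


def expected_handshakes(a, b):
    """Exact mean via one indicator per adjacent pair of seats (needs a + b >= 3)."""
    n = a + b
    # P(a fixed adjacent pair is mixed) = 2 * (a / n) * (b / (n - 1)).
    p_mixed = Fraction(2 * a * b, n * (n - 1))
    return n * p_mixed


def simulate_handshakes(a, b, trials=20000, seed=0):
    """Monte Carlo estimates of the mean and variance of the handshake count."""
    rng = random.Random(seed)
    seats = [0] * a + [1] * b
    counts = []
    for _ in range(trials):
        rng.shuffle(seats)
        # seats[i - 1] wraps around at i = 0, which closes the circle.
        counts.append(sum(seats[i] != seats[i - 1] for i in range(len(seats))))
    mean = sum(counts) / trials
    var = sum((c - mean) ** 2 for c in counts) / (trials - 1)
    return mean, var


exact_mean = expected_handshakes(8, 10)
sim_mean, sim_var = simulate_handshakes(8, 10)
print(f"exact mean={exact_mean} (~{float(exact_mean):.3f}), "
      f"simulated mean={sim_mean:.3f}, simulated variance={sim_var:.3f}")
```

For the exact variance, the same indicators can be plugged into the double sum of covariances from the list above, treating neighboring pairs (which share a seat) separately from disjoint pairs.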
81 | 
82 | ## Hypothesis Testing for Model Performance
83 | 
84 | When dealing with accuracies, the performance of a classification model $h: \mathcal{X} \rightarrow \{0, 1\}$ on test set realizations $(x_1, y_1), \ldots, (x_n, y_n)$ is written
85 | 
86 | $$
87 |     \operatorname{acc}(h) = \frac{1}{n} \sum_{i=1}^n 1_{\lbrace h(x_i) = y_i \rbrace}
88 | $$
89 | 
90 | for $x_i \in \mathcal{X}$ and $y_i \in \{0, 1\}$. Note that we have conditioned on any training data, and consider only the randomness in the test set. 
91 | As such, under the assumption that each $(x_i, y_i)$ is an independent and identically distributed (i.i.d.) realization of some random pair $(X, Y)$, this is a Binomial proportion: $n \cdot \operatorname{acc}(h)$ follows a Binomial distribution with parameters $n$ and $p := P(h(X) = Y)$. Various [confidence intervals](https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval) have been proposed for this type of statistic. When comparing two classifiers $h_1$ and $h_2$, if using the same test set, we turn to paired tests of differences between means, such as the [Wilcoxon signed-rank test](https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test) or more modern statistical tools.
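
As a concrete sketch of both ideas, the code below computes a Wilson score interval for a test accuracy (one of the intervals on the linked page) and runs a paired comparison of two classifiers. Note the swap: instead of the Wilcoxon signed-rank test, the paired comparison uses an exact sign test on the discordant examples (McNemar-style), chosen here only because it needs nothing beyond the standard library. The function names and the example counts are hypothetical.

```python
import math


def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    p_hat = correct / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def exact_sign_test(only_h1_correct, only_h2_correct):
    """Two-sided exact p-value for equal accuracy, using only discordant examples.

    Under the null hypothesis, each discordant example favors h1 or h2 with
    probability 1/2, so the smaller count is compared to a Binomial(m, 1/2) tail.
    """
    m = only_h1_correct + only_h2_correct
    if m == 0:
        return 1.0
    k = min(only_h1_correct, only_h2_correct)
    tail = sum(math.comb(m, i) for i in range(k + 1)) / 2 ** m
    return min(1.0, 2.0 * tail)


# Hypothetical counts: 90 of 100 test points correct; 10 discordant points all favor h1.
lo, hi = wilson_interval(90, 100)
p_value = exact_sign_test(10, 0)
print(f"accuracy interval=({lo:.3f}, {hi:.3f}), paired p-value={p_value:.5f}")
```

Examples on which both classifiers agree carry no information about which one is better, which is why the paired test looks only at the discordant counts.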


--------------------------------------------------------------------------------
/tradeoffs.md:
--------------------------------------------------------------------------------
 1 | # Trade-offs
 2 | 
 3 | Often, an interviewer expects you to discuss a trade-off between two desirable properties. There can be more than two, but this is a less common setting, and many of these ideas can easily be adapted to it.
 4 | 
 5 | **What they are looking for:** The ability to recognize that, instead of a single solution, there is a space of feasible solutions, as well as ideas on how to navigate this space.
 6 | 
 7 | ## Generalities
 8 | 
 9 | One structured approach for discussing a trade-off is answering the following questions.
10 | 1. **What are the controllable hyperparameters that allow you to navigate this trade-off?** Usually, there is a single dial (e.g. the classification threshold for the precision-recall trade-off) that is tuned to achieve a particular balance between the properties of interest. It is essential to identify what exact setting can be changed in order to mechanistically control the two properties. Needless to say, be sure which direction affects the metrics in which way (up or down).
11 | 2. **What are the evaluation metrics that allow you to select these hyperparameters?** After identifying the knobs to turn, select ways to evaluate your options. One method is to choose a single-number summary that depends on the values of both properties. Another is to require only that a given solution is *Pareto optimal*, in that any change that makes one property more favorable will make the other less favorable. Such a solution is also said to be on the *Pareto frontier* of solutions. In the figure below, $f_0: \mathcal{X} \rightarrow \mathbb{R}^2$ is an objective function to be minimized, $\mathcal{O}$ is the set of achievable values of $f_0$, and $x^{\text{po}}$ is a Pareto optimal point. Among points on the frontier, a traditional ML evaluation metric or a domain-specific metric (number of movies watched, number of lives saved, etc) can select a solution. One might ask: why bother with the original trade-off at all if ultimately we will pick the best solution in accordance with a single metric? The answer is that 1) the Pareto frontier provides a constrained search space of solutions, specifically one in which no solution is dominated by another in terms of the original two properties, and 2) this optimization is usually much simpler to execute (likely in one dimension).
12 | 
13 | <p style="center"> 
14 | 
15 | ![Pareto](img/pareto.png)
16 | 
17 | </p>
18 | 
19 | **Source:** [Convex Optimization – Boyd and Vandenberghe](https://web.stanford.edu/~boyd/cvxbook/)
20 | 
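For a finite set of candidate solutions, the Pareto frontier described above can be computed directly from the definition. The sketch below assumes each candidate is summarized by a pair of values that are both to be minimized, matching the convention for $f_0$. The function name `pareto_frontier` and the example candidates are hypothetical.

```python
def pareto_frontier(points):
    """Return the points that no other point dominates (both coordinates minimized)."""

    def dominates(p, q):
        # p dominates q if it is no worse in both coordinates and differs from q.
        return p[0] <= q[0] and p[1] <= q[1] and p != q

    return [q for q in points if not any(dominates(p, q) for p in points)]


# Hypothetical (error, cost) pairs for five candidate models.
candidates = [(1, 5), (2, 3), (3, 4), (4, 1), (5, 2)]
frontier = pareto_frontier(candidates)
print(frontier)

# Among frontier points, a single-number summary can make the final selection.
best = min(frontier, key=lambda p: 0.5 * p[0] + 0.5 * p[1])
print(best)
```

The quadratic-time scan is fine for a handful of hyperparameter settings. Sorting by one coordinate first would be the natural refinement for large candidate sets.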
21 | 
22 | ## Examples
23 | 
24 | - **Bias-Variance:** The control for this setting is *model complexity*, which will vary depending on context. In a traditional regression setting, it will be the regularization constant or the dimensionality of the feature representation. In neural networks, this could include architecture choice (number of layers/units), dropout probability, early stopping, and data augmentation. A natural single-number summary for this trade-off is *mean squared error*, which decomposes into squared bias plus variance (plus irreducible noise). In more traditional settings, it can be relevant to also connect this trade-off to ensemble methods. Specifically, if we have access to many high-variance, low-bias learners, *bagging* can reduce the variance by averaging them. Similarly, if we have access to many high-bias, low-variance learners, *boosting* can reduce the bias by combining them sequentially.
25 | - **Precision-Recall (PR):** In binary classification, the PR-curve is formed using all values of the *classification threshold*. One way to choose this is *precision-at-recall-0.9*, which simply constrains the recall to be at least 90% (or any number) and picks the threshold which maximizes precision. Note that the question of picking the threshold is different from picking between various methods by summarizing the entire PR trade-off (such as by *F1-score* or *AUC*).
26 | - **Computational-Statistical:** An example in which this trade-off appears is *dimension reduction*. By the *data-processing inequality*, reducing dimension can only destroy information, which hampers statistical performance in the interest of computational tractability. The hyperparameter is the *dimension* of the reduced representation. Another example is in *Markov Chain Monte Carlo (MCMC)* estimation, in which sampling more iterates can increase statistical performance but spends more computational resources. In most cases, evaluation can be done by maximizing statistical performance (validation accuracy, say) given a fixed computational budget (wall time, money spent, energy consumed, etc).
27 | - **Accuracy-Fairness:** Considering there are many definitions of algorithmic fairness, the tunable hyperparameter and evaluation metrics would vary widely for this trade-off. Nonetheless, it is always good to mention, especially if you use a particular technical definition of fairness (*equalized odds*, *demographic parity*, etc).
28 | - **Forest-Network:** This one is really many trade-offs combined into one. Decision forest methods (random forest, gradient-boosted decision trees (GBDTs)) are very popular in practice, and it is relevant to compare them to neural networks (NNs). 
29 |     - There is a belief that GBDTs "just work", in that they can be spun up with little effort and virtually *no hyperparameter tuning*, as opposed to the many possible hyperparameters that might come up for a neural net (learning rate, implicit regularization, architecture).
30 |     - When using random forests, another benefit is that many trees can be trained *in parallel*, potentially making it a much faster method than NNs if given access to a wide CPU cluster.
31 |     - NNs are the undisputed choice when working with data that has spatial or sequential structure, such as images and text. That being said, it is common practice to use pre-trained networks (often called *encoders* in this context) to represent images/text/audio as fixed-length, frozen vectors and then apply GBDTs to the resulting representations. However, if one wants to update the encoders via backpropagation, it is much easier to use a small linear/logistic regression or NN *head* on top of the pretrained network.
32 |     - There is also a belief that forest-based methods are more *interpretable*, in that they come with a natural measure of [feature importance](https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html). Be mindful of the criticisms of this and other feature importance measures.
33 | 
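The precision-at-recall rule from the Precision-Recall example above can be sketched as a direct search over candidate thresholds. This is a minimal sketch under stated assumptions: the function name `precision_at_recall` is hypothetical, a point is predicted positive when its score is at least the threshold, candidate thresholds are the observed scores, and the labels must contain at least one positive.

```python
def precision_at_recall(scores, labels, min_recall=0.9):
    """Return (threshold, precision, recall) maximizing precision with recall >= min_recall.

    Returns None if no candidate threshold meets the recall constraint.
    """
    total_pos = sum(labels)
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        recall = tp / total_pos
        precision = tp / (tp + fp)  # tp + fp >= 1 because t is an observed score
        if recall >= min_recall and (best is None or precision > best[1]):
            best = (t, precision, recall)
    return best


# Hypothetical validation scores and binary labels.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
labels = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
strict = precision_at_recall(scores, labels, min_recall=0.9)
relaxed = precision_at_recall(scores, labels, min_recall=0.8)
print(strict, relaxed)
```

Relaxing the recall constraint can only enlarge the set of feasible thresholds, so the achievable precision is non-decreasing as `min_recall` is lowered. This is the single dial discussed under Generalities.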


--------------------------------------------------------------------------------