├── machine-learning
    ├── __init__.py
    ├── requirements.txt
    ├── learn.py
    ├── resources
    │   └── naivebayes.py
    └── README.md
└── README.md


/machine-learning/__init__.py:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/machine-learning/requirements.txt:
--------------------------------------------------------------------------------
1 | numpy==1.10.4
2 | scikit-learn==0.17.1
3 | scipy==0.17.0
4 | wheel==0.24.0
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | NICAR 2016
2 | ==========
3 |
4 | Various documents and code examples for NICAR 2016 presentations.
5 |
6 | Feel free to reach out if you have any questions: chase.davis@nytimes.com or [@chasedavis](https://www.twitter.com/chasedavis/)
--------------------------------------------------------------------------------
/machine-learning/learn.py:
--------------------------------------------------------------------------------
1 | from sklearn import preprocessing
2 | from sklearn import cross_validation
3 | from sklearn.naive_bayes import MultinomialNB
4 | from sklearn.ensemble import RandomForestClassifier
5 | from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
6 |
7 | if __name__ == '__main__':
8 |
9 |     ########## STEP 1: DATA IMPORT AND PREPROCESSING ##########
10 |
11 |     # Here we're taking in the training data and splitting it into two lists: One with the text of
12 |     # each bill title, and the second with each bill title's corresponding category. Order is important.
13 |     # The first bill in list 1 should also be the first category in list 2.
14 |     training = [line.strip().split('|') for line in open('data/training.txt', 'r').readlines()]
15 |     text = [t[0] for t in training if len(t) > 1]
16 |     labels = [t[1] for t in training if len(t) > 1]
17 |
18 |     # A little bit of cleanup for scikit-learn's benefit. Scikit-learn models want our categories to
19 |     # be numbers, not strings. The LabelEncoder performs this transformation.
20 |     encoder = preprocessing.LabelEncoder()
21 |     correct_labels = encoder.fit_transform(labels)
22 |
23 |     ########## STEP 2: FEATURE EXTRACTION ##########
24 |     print 'Extracting features ...'
25 |
26 |     # vectorizer = CountVectorizer(stop_words='english', min_df=2, lowercase=True, analyzer='word')
27 |
28 |     # These two lines use scikit-learn helpers to transform our training data into a document/term matrix.
29 |     vectorizer = CountVectorizer()
30 |     data = vectorizer.fit_transform(text).todense()
31 |
32 |     # print data
33 |     # print data.shape
34 |
35 |     ########## STEP 3: MODEL BUILDING ##########
36 |     print 'Training ...'
37 |
38 |     # model = RandomForestClassifier(n_estimators=10, random_state=0)
39 |
40 |     model = MultinomialNB()
41 |     fit_model = model.fit(data, correct_labels)
42 |
43 |     ########## STEP 4: EVALUATION ##########
44 |     print 'Evaluating ...'
45 |
46 |     # Evaluate our model with 10-fold cross-validation
47 |     scores = cross_validation.cross_val_score(model, data, correct_labels, cv=10)
48 |     print "Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2)
49 |
50 |     ########## STEP 5: APPLYING THE MODEL ##########
51 |     print 'Classifying ...'
52 |
53 |     docs_new = ["Public postsecondary education: executive officer compensation.",
54 |         "An act to add Section 236.3 to the Education code, related to the pricing of college textbooks.",
55 |         "Political Reform Act of 1974: campaign disclosures.",
56 |         "An act to add Section 236.3 to the Penal Code, relating to human trafficking."
57 |     ]
58 |
59 |     test_data = vectorizer.transform(docs_new)
60 |
61 |     for i in xrange(len(docs_new)):
62 |         print '%s -> %s' % (docs_new[i], encoder.classes_[model.predict(test_data[i])])
--------------------------------------------------------------------------------
/machine-learning/resources/naivebayes.py:
--------------------------------------------------------------------------------
1 | '''
2 | naivebayes.py
3 |
4 | An implementation of a simple Naive Bayes classifier, adapted from the excellent
5 | Programmer's Guide to Data Mining, which can be found here:
6 |
7 | http://guidetodatamining.com/home/toc/chapter-5/
8 |
9 | Naive Bayes is considered one of the simplest but most effective classifiers in
10 | machine learning. It is considered a supervised classification method because it
11 | relies on a set of training data in which items and their proper classifications
12 | are already known. The algorithm uses probabilities to "learn" about the characteristics
13 | of those items and then uses that knowledge to classify new, unknown observations.
14 |
15 | For example, the algorithm might learn that an apple is a red fruit with a stem
16 | and an orange is an orange fruit with a rind. If the classifier then receives an
17 | input of a green fruit with a stem, it will classify the item as an apple.
18 |
19 | This implementation is set up to deal only with discrete categorical data, which
20 | is typical behavior for a simple Naive Bayes classifier. However, Naive Bayes can
21 | deal with continuous numerical data as well using one of two approaches: making
22 | the continuous data discrete by fitting it into bins, or by assuming the data fits
23 | a Gaussian (normal) distribution and altering the algorithm accordingly.
24 |
25 | A good explanation of these approaches can be found here under the "Parameter
26 | Estimation" section: http://en.wikipedia.org/wiki/Naive_Bayes_classifier
27 |
28 | Other useful information about Naive Bayes can be found here:
29 |
30 | Naive Bayes for text classification in NLTK:
31 | http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html
32 | https://gist.github.com/1266498
33 |
34 | Toby Segaran's Programming Collective Intelligence:
35 | http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325
36 | http://blog.kiwitobes.com/?p=44
37 | '''
38 |
39 | class NaiveBayes(object):
40 |     def __init__(self, data):
41 |         """
42 |         Bayesian classifiers require two types of probabilities to be created in
43 |         training in order to properly classify input. They are known as "prior"
44 |         and "conditional" probabilities.
45 |
46 |         Priors are sort of a starting point for Bayesian probability, based on
47 |         the classes but not their attributes. For example, if there are 10 items
48 |         in the dataset -- 8 of type 'A' and 2 of type 'B' -- the model understands
49 |         that a new item is far more likely to be of type 'A' (.8 probability)
50 |         than type 'B' (.2 probability), even before its individual attributes
51 |         are taken into account.
52 |
53 |         The classifier also needs access to a set of conditional probabilities
54 |         based on the features of each class. For instance, the probability that
55 |         a certain attribute is associated with type 'A' vs. type 'B'.
56 |         """
57 |         self.data = data # Input. Assumes first item in each row is class name.
58 |         self.prior = {}
59 |         self.conditional = {}
60 |
61 |     def _calculate_prior(self, total, classes):
62 |         """
63 |         Calculate prior probabilities, as described above, using the training set.
64 |
65 |         This particular approach calculates class probabilities based on the frequency
66 |         of each class in the training set. Another approach would be to assume all
67 |         classes are equally likely to occur.
68 |         """
69 |         for (category, count) in classes.items():
70 |             self.prior[category] = float(count) / float(total)
71 |         return
72 |
73 |     def _calculate_conditional(self, counts, classes):
74 |         """
75 |         Calculate conditional probabilities, as described above, using the training
76 |         set.
77 |         """
78 |         # For each class and its set of feature counts, which are represented in
79 |         # a dictionary as numbered columns.
80 |         for (category, columns) in counts.items():
81 |             tmp = {}
82 |             # For each set of feature counts in a column
83 |             for (col, valueCounts) in columns.items():
84 |                 tmp2 = {}
85 |                 # For each feature and count in a row
86 |                 for (value, count) in valueCounts.items():
87 |                     # Calculate the probability that each feature is associated
88 |                     # with a given class
89 |                     tmp2[value] = float(count) / float(classes[category])
90 |                 tmp[col] = tmp2
91 |
92 |             tmp3 = []
93 |             # Store the per-column probabilities as a list ordered by column index
94 |             for i in range(1, len(self.data[0])):
95 |                 tmp3.append(tmp[i])
96 |             self.conditional[category] = tmp3
97 |         return
98 |
99 |     def train(self):
100 |         """
101 |         Train the classifier by calculating prior and conditional probabilities
102 |         from data provided in a training set. The larger and more varied the
103 |         training set, the better luck you will have classifying new observations.
104 |         """
105 |         total = len(self.data) # Total number of items in the dataset
106 |         classes = {} # Each distinct class in the data, with counts
107 |         counts = {} # Counts of features, grouped under each class
108 |
109 |         # For each row of data in the training set
110 |         for instance in self.data:
111 |             category = instance[0]
112 |             classes.setdefault(category, 0)
113 |             counts.setdefault(category, {})
114 |             classes[category] += 1
115 |
116 |             col = 0
117 |             # For each column in the data row, total the raw counts of each
118 |             # feature, grouped by the class. (Start with index 1 in the list because
119 |             # index 0 is the category.)
120 |             for columnValue in instance[1:]:
121 |                 col += 1
122 |                 tmp = {}
123 |                 if col in counts[category]:
124 |                     tmp = counts[category][col]
125 |                 if columnValue in tmp:
126 |                     tmp[columnValue] += 1
127 |                 else:
128 |                     tmp[columnValue] = 1
129 |                 counts[category][col] = tmp
130 |
131 |         # Feed those counts to the probability functions above in order to calculate
132 |         # prior and conditional probabilities.
133 |         self._calculate_prior(total, classes)
134 |         self._calculate_conditional(counts, classes)
135 |         return
136 |
137 |     def classify(self, instance):
138 |         """
139 |         Classifies a new observation based on the probabilities calculated in
140 |         training.
141 |         """
142 |         categories = {}
143 |         # Loop through every set of conditional probabilities in the training set
144 |         for (category, vector) in self.conditional.items():
145 |             prob = 1
146 |             # For every feature in each class of conditional probabilities
147 |             for i in range(len(vector)):
148 |                 # No probability should ever be set to exactly zero, as it will
149 |                 # wipe out all other probabilities when they are multiplied.
150 |                 colProbability = .0000001
151 |                 # If a feature from the input class matches one in the training vector
152 |                 if instance[i] in vector[i]:
153 |                     # Get the probability for that feature
154 |                     colProbability = vector[i][instance[i]]
155 |                 # Apply each conditional probability to the total probability
156 |                 prob = prob * colProbability
157 |             # Now apply the prior probability
158 |             prob = prob * self.prior[category]
159 |             categories[category] = prob
160 |         # Collect and sort the probabilities computed for the classifier input
161 |         cat = list(categories.items())
162 |         cat.sort(key=lambda catTuple: catTuple[1], reverse = True)
163 |         # Return the class with the highest probability
164 |         return(cat[0])
165 |
166 | if __name__ == '__main__':
167 |     data = [['i100', 'both', 'sedentary', 'moderate', 'yes'],
168 |             ['i100', 'both', 'sedentary', 'moderate', 'no'],
169 |             ['i500', 'health', 'sedentary', 'moderate', 'yes'],
170 |             ['i500', 'appearance', 'active', 'moderate', 'yes'],
171 |             ['i500', 'appearance', 'moderate', 'aggressive', 'yes'],
172 |             ['i100', 'appearance', 'moderate', 'aggressive', 'no'],
173 |             ['i500', 'health', 'moderate', 'aggressive', 'no'],
174 |             ['i100', 'both', 'active', 'moderate', 'yes'],
175 |             ['i500', 'both', 'moderate', 'aggressive', 'yes'],
176 |             ['i500', 'appearance', 'active', 'aggressive', 'yes'],
177 |             ['i500', 'both', 'active', 'aggressive', 'no'],
178 |             ['i500', 'health', 'active', 'moderate', 'no'],
179 |             ['i500', 'health', 'sedentary', 'aggressive', 'yes'],
180 |             ['i100', 'appearance', 'active', 'moderate', 'no'],
181 |             ['i100', 'health', 'sedentary', 'moderate', 'no']]
182 |
183 |     b = NaiveBayes(data)
184 |     b.train()
185 |     print b.classify(['health', 'moderate', 'moderate', 'yes'])
186 |     print b.classify(['appearance', 'moderate', 'moderate', 'no'])
--------------------------------------------------------------------------------
/machine-learning/README.md:
--------------------------------------------------------------------------------
1 | Hands-on with machine learning
2 | ==============================
3 |
4 | First of all, let me be clear about one thing: You're not going to "learn" machine learning in 60 minutes.
5 |
6 | Instead, the goal of this session is to give you some sense of how to approach one type of machine learning in practice, specifically [supervised learning](http://en.wikipedia.org/wiki/Supervised_learning).
7 |
8 | For this exercise, we'll be training a simple classifier that learns how to categorize bills from the California Legislature based only on their titles. Along the way, we'll focus on three steps critical to any supervised learning application: feature engineering, model building and evaluation.
9 |
10 | To help us out, we'll be using a Python library called [scikit-learn](http://scikit-learn.org/), which is the easiest-to-understand machine learning library I've found in any language.
11 |
12 | That's a lot to pack in, so this session is going to move fast, and I'm going to assume you have a strong working knowledge of Python. Don't get caught up in the syntax. It's more important to understand the process.
13 |
14 | Since we only have time to hit the very basics, I've also included some additional points you might find useful under the "What we're not covering" heading of each section below. There are also some resources at the bottom of this document that I hope will be helpful if you decide to learn more about this on your own.
15 |
16 | Installation
17 | ============
18 |
19 | If you would like to install and run this on your own machine, do this (requires Python and pip):
20 |
21 | ```
22 | git clone git@github.com:cjdd3b/nicar2016.git
23 | cd nicar2016/machine-learning
24 | pip install -r requirements.txt
25 | ```
26 |
27 | To run, simply type ```python learn.py```.
28 |
29 | Feature engineering
30 | ===================
31 |
32 | Feature engineering can at once be one of the easiest and most difficult things to master in machine learning.
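To make the idea of features concrete, here is a minimal sketch that builds a small document/term matrix by hand from two of the bill titles used in `learn.py`. It is a pure-Python approximation for illustration, not scikit-learn's `CountVectorizer`. The helper names `tokenize` and `document_term_matrix` are hypothetical, and the tokenizing rule (lowercase the text, keep runs of two or more letters or digits) is an assumption chosen for this example.

```python
import re
from collections import Counter


def tokenize(title):
    """Lowercase a title and keep runs of two or more letters or digits."""
    return re.findall(r"[a-z0-9]{2,}", title.lower())


def document_term_matrix(titles):
    """Return (vocabulary, matrix) with one row per title and one column per word."""
    tokenized = [tokenize(title) for title in titles]
    # The vocabulary is every distinct word seen across all titles, in sorted order.
    vocabulary = sorted(set(word for tokens in tokenized for word in tokens))
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Each cell is the number of times a vocabulary word appears in this title.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


titles = [
    "Political Reform Act of 1974: campaign disclosures.",
    "An act to add Section 236.3 to the Penal Code, relating to human trafficking.",
]
vocabulary, matrix = document_term_matrix(titles)

for title, row in zip(titles, matrix):
    print(title)
    print(row)
```

Each printed row is one title expressed as word counts over the shared vocabulary, which is the same shape of data the classifier in `learn.py` is trained on.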
33 |
34 | Features are how you represent data to your model. Most of the time, this involves two things: 1.) Selecting (or creating) elements of data that will be useful for the task at hand; and 2.) Representing them as numerical values in the form of a matrix.
35 |
36 | The first part is an art. Creating features that your model finds useful can be a tedious exercise in trial and error, but fortunately it doesn't (necessarily) require any special knowledge of machine learning to get started. The second part requires some basic understanding of linear algebra and geometry, but nothing too intimidating.
37 |
38 | For this example, we'll see how to extract features from just the words in our bill titles.
39 |
40 | What we're covering
41 | -------------------
42 |
43 | - Features are the information we give our models. They are the only way models understand the world around them.
44 |
45 | - Features are typically represented as a matrix composed of vectors. In CAR terms, you can think of this as being like a spreadsheet. In our example, each row is a document, each column represents a word contained in the whole set of documents, and each value is the number of times a given word appeared in a given document. This is also known as a document/term matrix.
46 |
47 | - The number of columns is also the dimensionality of our dataset: each row can be thought of as a point (or vector) in a geometric space with one axis per column. A row with two columns, for latitude and longitude, represents a point in two-dimensional space (like a map). Our data follows a similar intuition but is represented in hundreds of dimensions.
48 |
49 | What we're not covering
50 | -----------------------
51 |
52 | - In our example, features are simply counts of the words contained in our bill titles. But often it's helpful to define them more explicitly, depending on the task at hand.
For example, if you're building a classifier to [identify quotes in text](https://github.com/cjdd3b/quotex-simple/), two useful features might be "paragraph contains quote marks" and "paragraph ends with quote marks". Often I like to define each of these as [functions](https://github.com/cjdd3b/quotex-simple/blob/master/features/features.py), which are combined later into a feature vector.
53 |
54 | - Features can be either discrete or continuous, but different models handle those in different ways.
55 |
56 | - Often you'll want to scale your features so their values fall within a defined range (-1 to 1, for example). A number of simple [normalization techniques](http://en.wikipedia.org/wiki/Feature_scaling) can be applied during preprocessing to solve this.
57 |
58 | - As a rule of thumb, keep the dimensionality of your feature space as low as possible. Your intuition of how points relate in two or three dimensions [doesn't apply in higher dimensional spaces](http://en.wikipedia.org/wiki/Curse_of_dimensionality). Dimensionality reduction techniques such as [principal component analysis](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) can be useful here.
59 |
60 | - In some models, such as [Random Forests](http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation), it's useful to understand how much each feature contributes to the model's predictive power. You can also do this to pare out features that add very little to the model. Likewise, it's also good to keep an eye out for features that correlate with each other.
61 |
62 | Model building
63 | ==============
64 |
65 | There are dozens of commonly used models for classification, all of which have certain advantages and disadvantages. But no matter which model we use, the process for working with it is largely the same.
66 |
67 | First, we have to "train" it on a set of pre-labeled data, represented by the set of features we created in the previous step.
Once that's done, we'll evaluate its performance, tune it, and, once we're satisfied, use it to classify new data.
68 |
69 | Scikit-learn makes it easy to substitute one model for another, so we'll try two: Multinomial Naive Bayes and Random Forests.
70 |
71 | What we're covering
72 | -------------------
73 |
74 | - Classification models essentially apply weights to features. They do this by looking at a pre-labeled set of data. After the model has been "trained," it can evaluate unseen data using the information it learned from the training set.
75 |
76 | - A model is only as good as its features.
77 |
78 | - Journalists have long used models -- particularly linear and logistic regression -- to explain variation within datasets. We're doing something similar but taking it one step further and using the model to predict something about unseen data.
79 |
80 | - In the newsroom, it's often useful to choose models that are relatively transparent and easy to explain: linear and logistic regression, Naive Bayes, k-nearest neighbors, decision trees and Random Forests are good examples. Things like Support Vector Machines and neural networks are more black-box, and so I only use those when there's a good reason.
81 |
82 | What we're not covering
83 | ------------------------
84 |
85 | - Overfitting happens when you create a classifier that performs well on training data but doesn't generalize well to unseen data. It's a major problem to beware of. Here's a great [visual explanation](http://www.quora.com/What-is-an-intuitive-explanation-of-overfitting) of what it looks like.
86 |
87 | - Regularization is one good way to prevent overfitting, and it's worth [learning about](https://www.youtube.com/watch?v=Ms7QkS-uKiI).
88 |
89 | - Models have different types of parameters that can be tuned to optimize their performance. Fine-tuning parameters can get you a bit more accuracy, but tuning them poorly can blow up your model. So be careful.
Scikit-learn provides some useful tools to auto-tune parameters via methods like [Grid Search](http://scikit-learn.org/stable/modules/grid_search.html).
90 |
91 | - Once a model is trained, you can persist it using either Python's pickle module or the slightly fancier [joblib](http://scikit-learn.org/stable/modules/model_persistence.html), which is packaged with scikit-learn. This is helpful when you have a model that takes a long time to train, and you don't want to retrain it every time you run your program.
92 |
93 | - It's generally a good idea to keep the code that creates your model separate from the code that runs it. Use persistence to dump and load the model as needed.
94 |
95 | Evaluation
96 | ==============
97 |
98 | Knowing how to properly evaluate your models is perhaps the single most important thing in practical machine learning.
99 |
100 | Having some intuition into how your model performs -- and specifically how it fails -- in some ways obviates the need to fully understand much of the complex mathematical theory that underpins machine learning in general, at least when you're first starting out.
101 |
102 | Here we'll go over some techniques for seeing how our model performs, and how we can use that information to improve it.
103 |
104 | What we're covering
105 | -------------------
106 |
107 | - Distill your model's performance into a metric. Then try to make that number go up.
108 |
109 | - [K-fold cross-validation](http://scikit-learn.org/stable/modules/cross_validation.html) is generally a good way to evaluate classifier performance.
110 |
111 | What we're not covering
112 | -----------------------
113 |
114 | - In classification, be mindful of the tradeoff between [precision and recall](http://en.wikipedia.org/wiki/Precision_and_recall). Which one you optimize for depends on what you're trying to accomplish.
The composite metric [F1](http://en.wikipedia.org/wiki/F1_score) captures some of this balance, but there are plenty of times it won't be the best metric for your task.
115 |
116 | - Beware any model that performs too well.
117 |
118 | - Take time to learn exactly what your model is getting right and what it's getting wrong. Looking at specific mistakes can guide you toward creating new features, tuning parameters and trying different types of models.
119 |
120 | - Some models can also produce confidence scores when they are applied to unseen data. In the case of scikit-learn's Random Forests, for example, we can access these scores using the [predict_proba method](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict_proba).
121 |
122 | Other resources
123 | ===============
124 |
125 | If you're interested in learning more about machine learning, I'd highly recommend Andrew Ng's [Machine Learning](https://www.coursera.org/course/ml) course on Coursera, and the book and online course [Mining of Massive Datasets](http://www.mmds.org/). Both provide a good theoretical grounding in many common techniques, as well as helpful practical advice.
126 |
127 | O'Reilly also produces several books worth checking out: [Machine Learning for Hackers](http://shop.oreilly.com/product/0636920018483.do), [Programming Collective Intelligence](http://shop.oreilly.com/product/9780596529321.do) and [Data Analysis With Open Source Tools](http://shop.oreilly.com/product/9780596802363.do).
128 |
129 | [Learning scikit-learn](http://www.amazon.com/Learning-scikit-learn-Machine-Python/dp/1783281936) and [Mastering Machine Learning with scikit-learn](https://www.packtpub.com/big-data-and-business-intelligence/mastering-machine-learning-scikit-learn) are also good for learning more about the scikit-learn library.
130 |
131 | And you can always contact me if you have any questions: chase.davis@nytimes.com
--------------------------------------------------------------------------------