├── .gitignore ├── README.md ├── data visualization └── index.html ├── intro ├── LateNight_thru_7June2014.csv └── intro.ipynb └── scraping ├── gawker_titles.txt ├── scraping.ipynb └── wsj_titles.txt /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | .ipynb_checkpoints 3 | *.pyc -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Boston Machine Learning 2 | 3 | **Table of Contents** 4 | 5 | * [Intro to Data Science](#intro-to-data-science) 6 | * [Web Scraping](#web-scraping) 7 | * [Theano](#theano) 8 | * [Data Visualization](#data-visualization) 9 | * [Semi-supervised Learning](#Semi-supervised-Learning) 10 | * [Dealing with Temporal Clinical Data](#dealing-with-temporal-clinical-data) 11 | * [RNNs and Hyperparameters](#rnns-and-hyperparameters) 12 | * [Bayesian Methods](#bayesian-methods) 13 | * [Distributed Learning](#distributed-learning) 14 | * [Techniques for Dimensionality Reduction](#techniques-for-dimensionality-reduction) 15 | * [Modeling Sensor Data](#modeling-sensor-data) 16 | * [Introduction to Markov Decision Processes](#introduction-to-markov-decision-processes) 17 | * [Perception as Analysis by Synthesis](#perception-as-analysis-by-synthesis) 18 | * [Operationalizing Data Science Output](#operationalizing-data-science-output) 19 | * [GPU Accelerated Learning](#gpu-accelerated-learning) 20 | * [High Dimensional Function Learning](#high-dimensional-function-learning) 21 | * [Basketball Analytics Using Player Tracking Data](#basketball-analytics-using-player-tracking-data) 22 | * [TensorFlow in Practice](#tensorflow-in-practice) 23 | * [Virtual Currency Trading](#virtual-currency-trading) 24 | * [NSFW Modeling with ConvNets](#nsfw-modeling-with-convnets) 25 | * [Structured Attention Networks](#structured-attention-networks) 26 | * [Automated Machine Learning](#automated-machine-learning) 27 | * [Grounding Natural Language with Autonomous Interaction](#grounding-natural-language-with-autonomous-interaction) 28 | * [Neural Network Design Using RL](#neural-network-design-using-rl) 29 | * [AI for Enterprise](#ai-for-enterprise) 30 | 31 | 32 | 33 | 34 | 35 |
36 | 37 | 38 | #### Intro to Data Science 39 | 40 | * [Imran Malek](https://twitter.com/imran_malek) is a Solutions Architect at DataXu. His workshop introduced pandas and matplotlib. [Slides](https://docs.google.com/presentation/d/1Qb6bzYBcAoKVyBI9Q0IOg7lon77JqeSXB3tKgoaCHkI/present?slide=id.p) | [Notebook](https://github.com/gwulfs/bostonml/blob/master/intro/intro.ipynb) 41 | 42 |
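The workshop's stack is easy to try on your own data; a minimal pandas/matplotlib sketch in the same spirit (the file name and column names below are made up for illustration, not taken from the workshop notebook):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load a CSV into a DataFrame (hypothetical file and columns).
df = pd.read_csv("example.csv", parse_dates=["date"])

# Quick inspection of shape, types, and the first few rows.
print(df.shape)
print(df.dtypes)
print(df.head())

# Aggregate and plot: mean of a numeric column per day.
daily = df.groupby(df["date"].dt.date)["value"].mean()
daily.plot(kind="line", title="Daily mean of 'value'")
plt.xlabel("date")
plt.ylabel("mean value")
plt.tight_layout()
plt.show()
```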

43 | 44 |
45 | 46 | 47 | #### Web Scraping 48 | 49 | * [Marcus Way](https://twitter.com/marcus_way) is an SDE at Amazon and was previously a Software Engineer at Wanderu, a company that helps people find the lowest bus fares. This workshop took us through the process of acquiring data from the web before building a model to predict whether an article's title originated from Gawker or the Wall Street Journal. [Notebook](https://github.com/gwulfs/bostonml/blob/master/scraping/scraping.ipynb) 50 | 51 |
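The heart of the workshop is a few lines of lxml plus an XPath query; a condensed sketch using the same Gawker URL and XPath as the notebook (both sites have changed their markup since, so treat this as illustrative):

```python
from lxml import html

# Parse the landing page and pull headline text out with an XPath query.
dom = html.parse("http://gawker.com")
titles = dom.xpath("//header/h1/a/text()")
print("Got {} Gawker titles".format(len(titles)))

# The same pattern works for the WSJ archive pages, one URL per date.
wsj_url = "http://online.wsj.com/public/page/archive-2014-{}-1.html"
wsj_titles = []
for month in range(1, 11):
    dom = html.parse(wsj_url.format(month))
    wsj_titles += dom.xpath("//h2/a/text()")
print("Got {} WSJ titles".format(len(wsj_titles)))
```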

52 | 53 |
54 | 55 | 56 | #### Theano 57 | 58 | * [Alec Radford](https://twitter.com/alecrad) is the Head of Research at indico. His talk introduced Theano and convolutional networks. [Video](https://www.youtube.com/watch?v=S75EdAcXHKk) | [Code](https://github.com/Newmu/Theano-Tutorials) 59 | 60 |
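For context, Theano predates TensorFlow and PyTorch: you describe a symbolic graph, ask for gradients, and compile the result into a callable. A minimal logistic-regression sketch in that style (a generic example, not code from the talk's tutorial repo):

```python
import numpy as np
import theano
import theano.tensor as T

# Symbolic inputs and shared (trainable) parameters.
X = T.matrix("X")
y = T.vector("y")
w = theano.shared(np.zeros(2), name="w")
b = theano.shared(0.0, name="b")

# Model, loss, and gradients are all symbolic expressions.
p = T.nnet.sigmoid(T.dot(X, w) + b)
loss = T.nnet.binary_crossentropy(p, y).mean()
gw, gb = T.grad(loss, [w, b])

# Compile a training step that also applies SGD updates.
train = theano.function([X, y], loss,
                        updates=[(w, w - 0.1 * gw), (b, b - 0.1 * gb)])

data = np.random.randn(100, 2)
labels = (data[:, 0] + data[:, 1] > 0).astype("float64")
for _ in range(100):
    train(data, labels)
```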

61 | 62 |
63 | 64 | 65 | #### Data Visualization 66 | 67 | * [Lane Harrison](https://twitter.com/laneharrison) is an Assistant Professor of Computer Science at WPI and was previously a Postdoc in the [Visual Analytics Lab](http://valt.cs.tufts.edu/) at Tufts. His workshop introduced data visualization with d3.js. [Slides](http://goo.gl/ro4wqB) | [Code](https://github.com/gwulfs/bostonml/blob/master/data%20visualization/index.html) 68 | 69 |

70 | 71 |
72 | 73 | 74 | #### Semi-supervised Learning 75 | 76 | * [Eli Brown](https://www.cdm.depaul.edu/Faculty-and-Staff/Pages/faculty-info.aspx?fid=1311) is an Assistant Professor of Computer Science at DePaul. His talk focused on using interactive visualizations to help users leverage learning algorithms. [Slides](http://goo.gl/jIwMzc) | [Paper](http://ieeexplore.ieee.org/xpl/abstractAuthors.jsp?arnumber=6400486) 77 | 78 |

79 | 80 |
81 | 82 | 83 | #### Dealing with Temporal Clinical Data 84 | 85 | * [Marzyeh Ghassemi](https://www.linkedin.com/in/marzyehghassemi) is a PhD student at MIT CSAIL in the [Clinical Decision Making Group](http://groups.csail.mit.edu/medg/). Her session introduced both Latent Dirichlet Allocation and Gaussian Processes before walking us through her recent paper entitled "A Multivariate Timeseries Modeling Approach to Severity of Illness Assessment and Forecasting in ICU with Sparse, Heterogeneous Clinical Data." [Paper](http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/view/9393) | [Slides](https://drive.google.com/file/d/0BwC1eSaTX5cZSkhSQVpjVWpOQWc/view) 86 | 87 |
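Both building blocks are available off the shelf; a toy scikit-learn sketch of each (synthetic snippets, not the clinical data from the paper):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Topic modeling: LDA on a tiny toy corpus of note-like strings.
docs = ["heart rate elevated overnight", "blood pressure stable",
        "heart rate and blood pressure trending down"]
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
print(lda.transform(counts))  # per-document topic proportions

# Gaussian process regression: smooth an irregularly sampled series.
t = np.array([[0.0], [1.0], [2.5], [4.0], [7.0]])  # irregular sample times
v = np.sin(t).ravel()                              # observed values
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel()).fit(t, v)
mean, std = gp.predict(np.linspace(0, 7, 20).reshape(-1, 1), return_std=True)
```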

88 | 89 |
90 | 91 | 92 | #### RNNs and Hyperparameters 93 | 94 | * [Alec Radford](https://twitter.com/alecrad) - Using Passage to Train RNNs. [Slides](https://docs.google.com/presentation/d/1HYfUZLRZRJovQpv5mYxox9bz9erxj7Ak_ZovENMvM90/present?slide=id.p) | [Code](https://github.com/indicodatasolutions/passage) | [Video](https://www.youtube.com/watch?v=VINCQghQRuM) 95 | 96 | * [David Duvenaud](https://twitter.com/davidduvenaud) - Gradient-Based Learning of Hyperparameters. [Paper](http://arxiv.org/abs/1502.03492) | [Slides](https://drive.google.com/file/d/0BwC1eSaTX5cZWjZBM3V2Umw4RnM/view?usp=sharing) | [Code](https://github.com/HIPS/hypergrad) 97 | 98 |

99 | 100 |
101 | 102 | 103 | #### Bayesian Methods 104 | 105 | * [Allen Downey](https://twitter.com/allendowney) is a Professor of Computer Science at Olin College. His talk focused on an application of Bayesian statistics from World War II. [Slides](https://docs.google.com/presentation/d/1Ec2KkdOSk1DVUrN9i-Q8gkxO8FXEVjEFGaQxy5T5KOk/pub?slide=id.p) | [Video](https://www.youtube.com/watch?v=iQLozVBZ_VI) 106 | 107 | * [José Miguel Hernández Lobato](http://jmhl.org) is a Postdoc at Harvard's [Intelligent Probabilistic Systems Lab](http://hips.seas.harvard.edu) and presented on Bayesian optimization and information-based approaches. [Slides](https://drive.google.com/file/d/0BwC1eSaTX5cZZFptMExPN2dVbDQ/view?usp=sharing) 108 | 109 |
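Downey's WWII example is presumably the German tank problem (an assumption here; the slides have the specifics). A grid-approximation sketch of that style of Bayesian update:

```python
import numpy as np

# German tank problem: estimate the total number of tanks N after
# observing serial numbers, assuming serials are uniform on 1..N.
observed = [37, 64, 89]
N_grid = np.arange(max(observed), 1001)

# Likelihood of the data for each candidate N, times a uniform prior.
likelihood = np.array([np.prod([1.0 / N if s <= N else 0.0 for s in observed])
                       for N in N_grid])
posterior = likelihood / likelihood.sum()

print("Posterior mean of N: {:.1f}".format((N_grid * posterior).sum()))
print("MAP estimate of N: {}".format(N_grid[posterior.argmax()]))
```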

110 | 111 |
112 | 113 | 114 | #### Distributed Learning 115 | 116 | * [Arno Candel](https://twitter.com/ArnoCandel) is the Chief Architect at H2O. His talk focused on the implementation and application of distributed machine learning algorithms such as Elastic Net, Random Forest, Gradient Boosting, and Deep Neural Networks. [Slides](https://drive.google.com/file/d/0BwC1eSaTX5cZaTA4V0ZqbnRBVjQ/view?usp=sharing) 117 | 118 |
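H2O exposes those algorithms through a scikit-learn-like Python API; a hedged sketch of the usual flow (the CSV path, column names, and parameters below are placeholders):

```python
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

# Start (or connect to) an H2O cluster and load data into an H2OFrame.
h2o.init()
frame = h2o.import_file("your_data.csv")  # placeholder path

# Train a distributed gradient boosting model on a categorical target.
target = "label"                          # placeholder column name
features = [c for c in frame.columns if c != target]
frame[target] = frame[target].asfactor()

model = H2OGradientBoostingEstimator(ntrees=100, max_depth=5)
model.train(x=features, y=target, training_frame=frame)
print(model.auc(train=True))
```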

119 | 120 | 121 | #### Techniques for Dimensionality Reduction 122 | 123 | * [Dan Steinburg](https://www.dannyadam.com) is a PhD student in intelligent systems at the University of Pittsburgh. His talk introduced various techniques for dimensionality reduction, including PCA, multidimensional scaling, Isomap, locally linear embedding, and Laplacian eigenmaps. [Slides](https://drive.google.com/file/d/0BwC1eSaTX5cZa0RjejlEeEpGTlE/view?usp=sharing) 124 | 125 |
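All five techniques have scikit-learn implementations behind the same fit_transform interface, which makes them easy to compare; a quick sketch on the digits dataset:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, Isomap, LocallyLinearEmbedding, SpectralEmbedding

X, _ = load_digits(return_X_y=True)  # 1797 samples, 64 dimensions

methods = {
    "PCA": PCA(n_components=2),
    "MDS": MDS(n_components=2),
    "Isomap": Isomap(n_components=2, n_neighbors=10),
    "LLE": LocallyLinearEmbedding(n_components=2, n_neighbors=10),
    "Laplacian eigenmaps": SpectralEmbedding(n_components=2, n_neighbors=10),
}

for name, method in methods.items():
    embedding = method.fit_transform(X)  # (1797, 2) low-dimensional coordinates
    print("{}: {}".format(name, embedding.shape))
```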

126 | 127 | 128 | #### Modeling Sensor Data 129 | 130 | * [Hank Roark](https://twitter.com/hankroark) is a Data Scientist at H2O, where he works on building data products within the domains of machine prognostics, health management, and agriculture. His workshop focused on the challenges faced when modeling streaming sensor data. [Slides](http://www.slideshare.net/hankroark/machine-learning-for-the-sensored-iot) | [Notebook](http://nbviewer.ipython.org/github/h2oai/h2o-meetups/blob/master/2015_09_19_ModelingSensorData/ML_for_Sensored_IoT.ipynb) 131 | 132 | 133 |
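A common first step with streaming sensor data is windowed feature extraction; a small pandas sketch of that idea on synthetic readings (not the data from the talk):

```python
import numpy as np
import pandas as pd

# Synthetic per-minute sensor readings: a slow drift plus noise.
idx = pd.date_range("2015-09-19", periods=1440, freq="min")
readings = pd.Series(np.sin(np.arange(1440) / 200.0) +
                     np.random.normal(0, 0.1, 1440), index=idx)

# Rolling-window features over the last 30 minutes of signal.
features = pd.DataFrame({
    "mean_30min": readings.rolling("30min").mean(),
    "std_30min": readings.rolling("30min").std(),
    "min_30min": readings.rolling("30min").min(),
    "max_30min": readings.rolling("30min").max(),
})
print(features.dropna().tail())
```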

134 | 135 |
136 | 137 | #### Introduction to Markov Decision Processes 138 | 139 | * [Alborz Geramifard](https://www.linkedin.com/in/alborzgeramifard) is a Research Scientist at Amazon and led an introductory workshop on MDPs with RLPy. [Paper](http://jmlr.org/papers/v16/geramifard15a.html) | [Code](https://github.com/rlpy/rlpy) 140 | 141 |
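As a reminder of what an MDP solver does under the hood, here is a tiny value-iteration sketch in plain numpy (a generic textbook-style example, not RLPy code):

```python
import numpy as np

# A 2-state, 2-action MDP: P[a][s, s2] is the transition probability from
# state s to s2 under action a; R[a][s] is the expected immediate reward.
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),   # action 0
     np.array([[0.1, 0.9], [0.7, 0.3]])]   # action 1
R = [np.array([1.0, 0.0]),                 # action 0
     np.array([0.0, 2.0])]                 # action 1
gamma = 0.95

# Value iteration: repeatedly apply the Bellman optimality backup.
V = np.zeros(2)
for _ in range(1000):
    Q = np.array([R[a] + gamma * P[a].dot(V) for a in range(2)])  # (action, state)
    V_new = Q.max(axis=0)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=0)
print("optimal values: {}, greedy policy: {}".format(V, policy))
```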

142 | 143 |
144 | 145 | #### Perception as Analysis by Synthesis 146 | 147 | * [Tejas Kulkarni](https://twitter.com/tejasdkulkarni) is a PhD student at MIT in Josh Tenenbaum's lab and spent last summer working at Google DeepMind in London. His talk focused on his recent paper entitled "Picture: A Probabilistic Programming Language for Scene Perception." [Paper](http://www.cv-foundation.org/openaccess/content_cvpr_2015/html/Kulkarni_Picture_A_Probabilistic_2015_CVPR_paper.html) | [Slides](https://drive.google.com/file/d/0BwC1eSaTX5cZeHU2dG9oQW02SFk/view) 148 | 149 |

150 | 151 |
152 | 153 | 154 | #### Operationalizing Data Science Output 155 | 156 | * [Tom LaGatta](https://www.linkedin.com/in/lagatta) is a Senior Data Scientist & Analytics Architect at Splunk. His session focused on aligning data science output with operational workflows. 157 | 158 |

159 | 160 |
161 | 162 | 163 | #### GPU Accelerated Learning 164 | 165 | * [Bob Crovella](https://www.linkedin.com/in/robert-crovella-0ab46330) joined NVIDIA in 1998 and leads a technical team that is responsible for supporting GPU Computing Products. His talk began with an introduction to why GPUs are helpful when training deep neural networks. He then walked through demos of cuDNN and DIGITS from the perspective of how they fit together with frameworks like Caffe, Torch, and Theano. [Slides](https://drive.google.com/file/d/0BwC1eSaTX5cZdEFaMElEMEhVbmc/view?usp=sharing) | [Video](https://www.youtube.com/watch?v=tVTllMpyz8k) 166 | 167 |

168 | 169 |
170 | 171 | #### High Dimensional Function Learning 172 | 173 | * Jason Klusowski is a PhD student at Yale and presented on the computational and theoretical aspects of approximating d-dimensional functions. [Slides](https://drive.google.com/file/d/0BwC1eSaTX5cZQmFPdFVUS285bVk/view?usp=sharing) | [Video](https://www.youtube.com/watch?v=pUR5_XOAMBk&feature=youtu.be) 174 | 175 |

176 | 177 |
178 | 179 | #### Basketball Analytics Using Player Tracking Data 180 | 181 | * [Alexander D'Amour](https://twitter.com/alexdamour) is an Assistant Professor in Statistics at UCB and recently completed his PhD at Harvard. His talk introduced applications of 24-FPS spatial tracking data toward answering fundamental questions about the game of basketball. [Video](https://www.youtube.com/watch?v=4kqTBO5KDr4) 182 | 183 |

184 | 185 |
186 | 187 | #### TensorFlow in Practice 188 | 189 | * [Nathan Lintz](https://twitter.com/nlintz) is a research scientist at indico Data Solutions where he is responsible for developing machine learning systems in the domains of language detection, text summarization, and emotion recognition. His session focused on the first principles of TensorFlow, building all the way up to generative modeling with recurrent networks. [Slides](http://www.slideshare.net/indicods/tensorflow-in-practice) | [Code](https://github.com/nlintz/TensorFlow-Tutorials) | [Video](https://www.youtube.com/watch?v=op1QJbC2g0E) 190 | 191 |
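The session predates TensorFlow 2.x, so the workflow is the graph-and-session style; a minimal linear-regression sketch in that style (a generic example, not code from the session):

```python
import numpy as np
import tensorflow as tf  # assumes the 1.x graph/session API

# Placeholders for inputs, Variables for trainable parameters.
x = tf.placeholder(tf.float32, shape=[None])
y = tf.placeholder(tf.float32, shape=[None])
w = tf.Variable(0.0)
b = tf.Variable(0.0)

# The graph: prediction, loss, and a gradient-descent training op.
y_hat = w * x + b
loss = tf.reduce_mean(tf.square(y_hat - y))
train_op = tf.train.GradientDescentOptimizer(0.05).minimize(loss)

# Running the graph happens inside a session.
data_x = np.random.rand(100).astype(np.float32)
data_y = 3.0 * data_x + 1.0 + np.random.normal(0, 0.1, 100).astype(np.float32)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(500):
        sess.run(train_op, feed_dict={x: data_x, y: data_y})
    print(sess.run([w, b]))  # should approach [3.0, 1.0]
```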

192 | 193 |
194 | 195 | #### Virtual Currency Trading 196 | 197 | * [Anders Brownworth](https://twitter.com/anders94) is a principal engineer at Circle and was previously an instructor at the MIT Media Lab. His talk focused on building the intuition about the blockchain and bitcoin needed to develop successful trading strategies. [Slides](https://drive.google.com/file/d/0BwC1eSaTX5cZOVhJWVVxZWMyXzg/view?usp=sharing) | [Video](http://anders.com/blockchain) | [Hacker News](https://news.ycombinator.com/item?id=13566951) 198 | 199 |
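The core intuition behind a blockchain is a chain of hashes: each block commits to its predecessor, so tampering with any block invalidates every block after it. A toy sketch of just that idea (nothing like the real Bitcoin data structures):

```python
import hashlib
import json

def make_block(data, prev_hash):
    """Build a toy block whose hash commits to its contents and predecessor."""
    body = {"data": data, "prev_hash": prev_hash}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()
    body["hash"] = digest
    return body

# Build a tiny chain of three blocks.
chain = [make_block("genesis", "0" * 64)]
for payload in ["alice pays bob 1", "bob pays carol 2"]:
    chain.append(make_block(payload, chain[-1]["hash"]))

# Tampering with an early block breaks every later link.
chain[1]["data"] = "alice pays bob 100"
for prev, block in zip(chain, chain[1:]):
    recomputed = make_block(prev["data"], prev["prev_hash"])["hash"]
    print("link intact: {}".format(block["prev_hash"] == recomputed))
```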

200 | 201 |
202 | 203 | #### NSFW Modeling with ConvNets 204 | 205 | * [Ryan Compton](https://twitter.com/rycpt) is a data scientist at Clarifai. His talk used the problem of nudity detection to illustrate the workflow involved with training and evaluating convolutional neural networks. He also discussed deconvolution and demonstrated how it can be used to visualize intermediate feature layers. [Video](https://www.youtube.com/watch?v=EhtRDT-3CC4) 206 | 207 |

208 | 209 | #### Structured Attention Networks 210 | 211 | * [Yoon Kim](https://people.csail.mit.edu/yoonkim) is a PhD student in computer science at Harvard. This session gave an overview of attention mechanisms and structured prediction before introducing a method for combining the two ideas by way of graphical models. [Slides](https://drive.google.com/file/d/0BwC1eSaTX5cZMVhqWXJncDlkTHc/view?usp=sharing) | [Code](https://github.com/harvardnlp/struct-attn) 212 | 213 |
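For background: standard attention is a softmax over scores used to take a weighted average, and structured attention swaps that softmax for marginals from a graphical model such as a linear-chain CRF. A numpy sketch of the standard version only:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # for numerical stability
    e = np.exp(z)
    return e / e.sum()

# A query vector attends over a sequence of hidden states (toy sizes).
hidden = np.random.randn(5, 8)    # 5 positions, dimension 8
query = np.random.randn(8)

scores = hidden.dot(query)        # one score per position
weights = softmax(scores)         # attention distribution over positions
context = weights.dot(hidden)     # weighted average passed downstream

print("attention weights: {}".format(np.round(weights, 3)))
```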

214 | 215 | #### Automated Machine Learning 216 | 217 | * [Nicolo Fusi](https://twitter.com/nfusi) is a research scientist at Microsoft Research, working at the intersection of machine learning, computational biology and medicine. He received his PhD in Computer Science from the University of Sheffield under Neil Lawrence. His talk focused on the process of selecting and tuning pipelines consisting of data preprocessing methods and machine learning models. [Slides](https://drive.google.com/file/d/0BwC1eSaTX5cZNUdSeWI0WHBnUzBLOWpsU1VfcFdVeXhxOExj/view?usp=sharing) | [Paper](https://arxiv.org/abs/1705.05355) | [Video](https://www.youtube.com/watch?v=5uNi29NUxms) 218 | 219 |
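The flavor of the problem (searching jointly over preprocessing steps and model hyperparameters) can be approximated on a single machine with a scikit-learn pipeline and randomized search; a stand-in illustration, not the method from the paper:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# One pipeline = one candidate sequence of preprocessing + model.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("reduce", PCA()),
    ("model", RandomForestClassifier()),
])

# Search jointly over preprocessing and model hyperparameters.
search = RandomizedSearchCV(
    pipe,
    param_distributions={
        "reduce__n_components": [5, 10, 20, 30],
        "model__n_estimators": [50, 100, 200],
        "model__max_depth": [None, 5, 10],
    },
    n_iter=10, cv=5, random_state=0)
search.fit(X, y)
print("best CV accuracy: {:.3f}".format(search.best_score_))
print(search.best_params_)
```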

220 | 221 | 222 | #### Grounding Natural Language with Autonomous Interaction 223 | 224 | * [Karthik Narasimhan](https://twitter.com/karthik_r_n) is a PhD candidate at CSAIL working on natural language understanding and deep reinforcement learning. His talk focused on task-optimized representations to reduce dependence on annotation. The session built up to a demonstration of how reinforcement learning can enhance traditional NLP systems in low resource scenarios. In particular, he described an autonomous agent that can learn to acquire and integrate external information to improve information extraction. [Slides](https://drive.google.com/file/d/0BwC1eSaTX5cZdGNKTi1BTk5VUUE/view?usp=sharing) 225 | 226 |

227 | 228 | 229 | #### Neural Network Design Using RL 230 | 231 | * [Bowen Baker](https://bowenbaker.github.io) recently completed his graduate work at the MIT Media Lab. His presentation touched on practical CNN meta-modeling. He is now continuing his work as a member of the research team at OpenAI. [Slides](https://drive.google.com/file/d/0BwC1eSaTX5cZTkIwcUd2c1F4cjA/view?usp=sharing) | [Video](https://www.youtube.com/watch?v=zs80AB3tCUQ&feature=youtu.be) 232 | 233 |

234 | 235 | 236 | #### AI for Enterprise 237 | 238 | * [Sophie Vandebroek](https://www.linkedin.com/in/sophievandebroek) is the COO at IBM Research and discussed applications of her team's work. [Ruchir Puri](https://scholar.google.com/citations?user=HTf7H58AAAAJ&hl=en) is the Chief Architect of Watson and presented on challenges related to deploying machine learning systems. [Video](https://www.youtube.com/watch?v=LnMHdyvc-nE) 239 | 240 |

241 | -------------------------------------------------------------------------------- /data visualization/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 62 | -------------------------------------------------------------------------------- /scraping/gawker_titles.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gwulfs/bostonml/1b6f37c564ca6f2d08f2ed5510e414494932294a/scraping/gawker_titles.txt -------------------------------------------------------------------------------- /scraping/scraping.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Web Scraping in Python" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "The following provides an example of\n", 15 | "\n", 21 | "\n", 22 | "Now lets see if we can teach our computer how to tell the difference between a Wall Street Journal headline and a Gawker headline. \n", 23 | "\n", 24 | "Inspiration for this notebook/presentation was provided by [this](http://nbviewer.ipython.org/github/nealcaren/workshop_2014/blob/master/notebooks/6_Classification.ipynb) and [this](http://nbviewer.ipython.org/github/cs109/content/blob/master/HW3_solutions.ipynb). If you want to run this notebook, or one like it, it'll be helpful for you to check out [Anaconda](https://store.continuum.io/cshop/anaconda/)\n" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "## Get the Data!\n", 32 | "If we want to build a model with a discerning taste for quality news, we're going to have to find him some examples. Lucky for us, Gawker and the Wall Street Journal both have websites. \n", 33 | "### To the internet (with dev tools)!\n", 34 | "#### [Gawker](http://gawker.com)\n", 35 | "After poking around the gawker homepage with the help of command + option + i (mac) and a Chrome plugin, [XPath Helper](https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl?hl=en), we can see that all the article titles can be grabbed with the xpath, `\"//header/h1/a/text()\"`. lxml lets us retrieve all the text that matches our xpath" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 1, 41 | "metadata": { 42 | "collapsed": false 43 | }, 44 | "outputs": [ 45 | { 46 | "name": "stdout", 47 | "output_type": "stream", 48 | "text": [ 49 | "We got 17 titles. Here are the first 5:\n" 50 | ] 51 | }, 52 | { 53 | "data": { 54 | "text/plain": [ 55 | "['College Kids Not So Into the Free Press. Whatever',\n", 56 | " \"Today's Best Deals: Running Shoes, Outdoor Clothes, Anker Jump Starter, and More\",\n", 57 | " \"Who's Named in the Panama Papers?\",\n", 58 | " \"NYPD's Blowhard Union Boss Is Feeling a Little Scared and a Little Confused\",\n", 59 | " 'This Is Bad, Even For Sarah Palin']" 60 | ] 61 | }, 62 | "execution_count": 1, 63 | "metadata": {}, 64 | "output_type": "execute_result" 65 | } 66 | ], 67 | "source": [ 68 | "from lxml import html \n", 69 | "x = html.parse('http://gawker.com')\n", 70 | "titles = x.xpath('//header/h1/a/text()')\n", 71 | "print \"We got {} titles. Here are the first 5:\".format(len(titles))\n", 72 | "titles[:5]" 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": {}, 78 | "source": [ 79 | "This page only has 20 titles on it. 
If we want to properly train our model, we're going to need more examples. There's a \"More Stories\" button at the bottom of the page, which brings us to another, similarly structured page with new titles and another \"More Stories\" button. To get more examples, we'll repeat the process above in a loop with each successive iteration hitting the page pointed to by the \"More Stories\" button. In order to figure out what the link is, go do some more investigating! You can also just look at the code below. " 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 2, 85 | "metadata": { 86 | "collapsed": false 87 | }, 88 | "outputs": [], 89 | "source": [ 90 | "headlines = x.xpath('//header/h1/a/text()')" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": 3, 96 | "metadata": { 97 | "collapsed": false 98 | }, 99 | "outputs": [ 100 | { 101 | "name": "stdout", 102 | "output_type": "stream", 103 | "text": [ 104 | "Retrieved 17 titles from url: http://gawker.com/\n", 105 | "Retrieved 18 titles from url: http://gawker.com/?startTime=1459814100852\n", 106 | "Retrieved 19 titles from url: http://gawker.com/?startTime=1459777920161\n", 107 | "Retrieved 18 titles from url: http://gawker.com/?startTime=1459616727760\n", 108 | "Retrieved 16 titles from url: http://gawker.com/?startTime=1459518120384\n", 109 | "Retrieved 17 titles from url: http://gawker.com/?startTime=1459440720292\n", 110 | "Retrieved 21 titles from url: http://gawker.com/?startTime=1459378619355\n", 111 | "Retrieved 18 titles from url: http://gawker.com/?startTime=1459344905458\n", 112 | "Retrieved 19 titles from url: http://gawker.com/?startTime=1459282860784\n", 113 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1459224087623\n", 114 | "Retrieved 19 titles from url: http://gawker.com/?startTime=1459180800411\n", 115 | "Retrieved 19 titles from url: http://gawker.com/?startTime=1459096920497\n", 116 | "Retrieved 19 titles from url: http://gawker.com/?startTime=1458942097476\n", 117 | "Retrieved 18 titles from url: http://gawker.com/?startTime=1458873001132\n", 118 | "Retrieved 23 titles from url: http://gawker.com/?startTime=1458838501396\n", 119 | "Retrieved 19 titles from url: http://gawker.com/?startTime=1458762000470\n", 120 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1458691200399\n", 121 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1458648838619\n", 122 | "Retrieved 17 titles from url: http://gawker.com/?startTime=1458568349668\n", 123 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1458411300607\n", 124 | "Retrieved 18 titles from url: http://gawker.com/?startTime=1458317400101\n", 125 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1458236400438\n", 126 | "Retrieved 21 titles from url: http://gawker.com/?startTime=1458168300774\n", 127 | "Retrieved 21 titles from url: http://gawker.com/?startTime=1458139800781\n", 128 | "Retrieved 19 titles from url: http://gawker.com/?startTime=1458082978370\n", 129 | "Retrieved 19 titles from url: http://gawker.com/?startTime=1458052800282\n", 130 | "Retrieved 19 titles from url: http://gawker.com/?startTime=1457977740670\n", 131 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1457906400579\n", 132 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1457799193651\n", 133 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1457712900315\n", 134 | "Retrieved 19 titles from url: http://gawker.com/?startTime=1457649720778\n", 135 | "Retrieved 20 
titles from url: http://gawker.com/?startTime=1457582709616\n", 136 | "Retrieved 19 titles from url: http://gawker.com/?startTime=1457542700682\n", 137 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1457492894819\n", 138 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1457456760380\n", 139 | "Retrieved 19 titles from url: http://gawker.com/?startTime=1457390396828\n", 140 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1457316282784\n", 141 | "Retrieved 17 titles from url: http://gawker.com/?startTime=1457221920353\n", 142 | "Retrieved 19 titles from url: http://gawker.com/?startTime=1457130679407\n", 143 | "Retrieved 19 titles from url: http://gawker.com/?startTime=1457063685144\n", 144 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1457037300965\n", 145 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1456967315072\n", 146 | "Retrieved 18 titles from url: http://gawker.com/?startTime=1456942260395\n", 147 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1456885757675\n", 148 | "Retrieved 19 titles from url: http://gawker.com/?startTime=1456850488611\n", 149 | "Retrieved 21 titles from url: http://gawker.com/?startTime=1456790401314\n", 150 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1456763744146\n", 151 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1456691039055\n", 152 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1456527300918\n", 153 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1456502460180\n", 154 | "Retrieved 19 titles from url: http://gawker.com/?startTime=1456436100526\n", 155 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1456414440291\n", 156 | "Retrieved 18 titles from url: http://gawker.com/?startTime=1456336700198\n", 157 | "Retrieved 17 titles from url: http://gawker.com/?startTime=1456273083458\n", 158 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1456201802624\n", 159 | "Retrieved 21 titles from url: http://gawker.com/?startTime=1456168500179\n", 160 | "Retrieved 21 titles from url: http://gawker.com/?startTime=1456088148835\n", 161 | "Retrieved 19 titles from url: http://gawker.com/?startTime=1455993374937\n", 162 | "Retrieved 19 titles from url: http://gawker.com/?startTime=1455903180013\n", 163 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1455843616785\n", 164 | "Retrieved 19 titles from url: http://gawker.com/?startTime=1455804120343\n", 165 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1455730221178\n", 166 | "Retrieved 17 titles from url: http://gawker.com/?startTime=1455659880488\n", 167 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1455591918939\n", 168 | "Retrieved 19 titles from url: http://gawker.com/?startTime=1455553200047\n", 169 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1455469971709\n", 170 | "Retrieved 17 titles from url: http://gawker.com/?startTime=1455383318369\n", 171 | "Retrieved 18 titles from url: http://gawker.com/?startTime=1455289920340\n", 172 | "Retrieved 19 titles from url: http://gawker.com/?startTime=1455223200774\n", 173 | "Retrieved 18 titles from url: http://gawker.com/?startTime=1455199761203\n", 174 | "Retrieved 18 titles from url: http://gawker.com/?startTime=1455134100662\n", 175 | "Retrieved 18 titles from url: http://gawker.com/?startTime=1455073539497\n", 176 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1455052500595\n", 177 | "Retrieved 20 titles from url: 
http://gawker.com/?startTime=1454976900789\n", 178 | "Retrieved 19 titles from url: http://gawker.com/?startTime=1454951100525\n", 179 | "Retrieved 19 titles from url: http://gawker.com/?startTime=1454868325898\n", 180 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1454784480648\n", 181 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1454693820811\n", 182 | "Retrieved 19 titles from url: http://gawker.com/?startTime=1454626184034\n", 183 | "Retrieved 19 titles from url: http://gawker.com/?startTime=1454595538930\n", 184 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1454525160578\n", 185 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1454461200308\n", 186 | "Retrieved 18 titles from url: http://gawker.com/?startTime=1454439412740\n", 187 | "Retrieved 19 titles from url: http://gawker.com/?startTime=1454380200472\n", 188 | "Retrieved 21 titles from url: http://gawker.com/?startTime=1454348400739\n", 189 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1454277180763\n", 190 | "Retrieved 19 titles from url: http://gawker.com/?startTime=1454172298806\n", 191 | "Retrieved 19 titles from url: http://gawker.com/?startTime=1454086800199\n", 192 | "Retrieved 19 titles from url: http://gawker.com/?startTime=1454019780001\n", 193 | "Retrieved 18 titles from url: http://gawker.com/?startTime=1453993980250\n", 194 | "Retrieved 21 titles from url: http://gawker.com/?startTime=1453930800340\n", 195 | "Retrieved 18 titles from url: http://gawker.com/?startTime=1453866002019\n", 196 | "Retrieved 17 titles from url: http://gawker.com/?startTime=1453830666472\n", 197 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1453765105238\n", 198 | "Retrieved 21 titles from url: http://gawker.com/?startTime=1453699281227\n", 199 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1453646100280\n", 200 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1453500960040\n", 201 | "Retrieved 19 titles from url: http://gawker.com/?startTime=1453436298952\n", 202 | "Retrieved 19 titles from url: http://gawker.com/?startTime=1453399920758\n", 203 | "Retrieved 19 titles from url: http://gawker.com/?startTime=1453332702220\n", 204 | "Retrieved 19 titles from url: http://gawker.com/?startTime=1453263812584\n", 205 | "Retrieved 18 titles from url: http://gawker.com/?startTime=1453236600032\n", 206 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1453165200269\n", 207 | "Retrieved 20 titles from url: http://gawker.com/?startTime=1453091040767\n" 208 | ] 209 | } 210 | ], 211 | "source": [ 212 | "# These are the xpaths we determined from snooping \n", 213 | "next_button_xpath = \"//div[@class='row load-more']//a[contains(@href, 'startTime')]/@href\"\n", 214 | "headline_xpath = '//header/h1/a/text()'\n", 215 | "\n", 216 | "# We'll use sleep to add some time in between requests\n", 217 | "# so that we're not bombarding Gawker's server too hard. 
\n", 218 | "from time import sleep\n", 219 | "\n", 220 | "# Now we'll fill this list of gawker titles by starting\n", 221 | "# at the landing page and following \"More Stories\" links\n", 222 | "gawker_titles = []\n", 223 | "base_url = 'http://gawker.com/{}'\n", 224 | "next_page = \"http://gawker.com/\"\n", 225 | "while len(gawker_titles) < 2000 and next_page:\n", 226 | " dom = html.parse(next_page)\n", 227 | " headlines = dom.xpath(headline_xpath)\n", 228 | " print \"Retrieved {} titles from url: {}\".format(len(headlines), next_page)\n", 229 | " gawker_titles += headlines\n", 230 | " next_pages = dom.xpath(next_button_xpath)\n", 231 | " if next_pages: \n", 232 | " next_page = base_url.format(next_pages[0]) \n", 233 | " else:\n", 234 | " print \"No next button found\"\n", 235 | " next_page = None\n", 236 | " sleep(5)" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": 4, 242 | "metadata": { 243 | "collapsed": false 244 | }, 245 | "outputs": [ 246 | { 247 | "name": "stdout", 248 | "output_type": "stream", 249 | "text": [ 250 | "Holy smokes, we got 2001 Gawker headlines!\n" 251 | ] 252 | } 253 | ], 254 | "source": [ 255 | "with open('gawker_titles.txt', 'wb') as out:\n", 256 | " out.writelines(gawker_titles)\n", 257 | "# with open('gawker_titles.txt') as f:\n", 258 | "# gawker_titles = f.readlines()\n", 259 | " \n", 260 | "print \"Holy smokes, we got {} Gawker headlines!\".format(len(gawker_titles))" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "#### [Wall Street Journal](http://online.wsj.com/public/page/archive-2014-1-1.html)\n", 268 | "Now we'll do a similar thing with WSJ now. Here we notice that they have a section of the site where they have lists of articles for each day in the past year. There are links to the different archive dates all over the page, and we can see that the links all have the same structure, with different dates in the URL. Lets iterate over a bunch of dates. 
I grabbed the articles from the first day of each month this year" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": 5, 274 | "metadata": { 275 | "collapsed": false 276 | }, 277 | "outputs": [ 278 | { 279 | "name": "stdout", 280 | "output_type": "stream", 281 | "text": [ 282 | "Retrieved 106 WSJ headlines from url: http://online.wsj.com/public/page/archive-2014-1-1.html\n", 283 | "Retrieved 21 WSJ headlines from url: http://online.wsj.com/public/page/archive-2014-2-1.html\n", 284 | "Retrieved 31 WSJ headlines from url: http://online.wsj.com/public/page/archive-2014-3-1.html\n", 285 | "Retrieved 284 WSJ headlines from url: http://online.wsj.com/public/page/archive-2014-4-1.html\n", 286 | "Retrieved 386 WSJ headlines from url: http://online.wsj.com/public/page/archive-2014-5-1.html\n", 287 | "Retrieved 120 WSJ headlines from url: http://online.wsj.com/public/page/archive-2014-6-1.html\n", 288 | "Retrieved 310 WSJ headlines from url: http://online.wsj.com/public/page/archive-2014-7-1.html\n", 289 | "Retrieved 300 WSJ headlines from url: http://online.wsj.com/public/page/archive-2014-8-1.html\n", 290 | "Retrieved 162 WSJ headlines from url: http://online.wsj.com/public/page/archive-2014-9-1.html\n", 291 | "Retrieved 388 WSJ headlines from url: http://online.wsj.com/public/page/archive-2014-10-1.html\n" 292 | ] 293 | } 294 | ], 295 | "source": [ 296 | "wsj_url = \"http://online.wsj.com/public/page/archive-2014-{}-1.html\"\n", 297 | "wsj_headline_xpath = \"//h2/a/text()\"\n", 298 | "wsj_headlines = []\n", 299 | "for i in range(1, 11): \n", 300 | " dom = html.parse(wsj_url.format(i))\n", 301 | " titles = dom.xpath(wsj_headline_xpath)\n", 302 | " wsj_headlines += titles\n", 303 | " print \"Retrieved {} WSJ headlines from url: {}\".format(len(titles), wsj_url.format(i)) " 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": 6, 309 | "metadata": { 310 | "collapsed": false 311 | }, 312 | "outputs": [ 313 | { 314 | "name": "stdout", 315 | "output_type": "stream", 316 | "text": [ 317 | "Jeez, Louise! We got 2108 WSJ headlines!\n" 318 | ] 319 | } 320 | ], 321 | "source": [ 322 | "with open('wsj_titles.txt', 'wb') as out:\n", 323 | " out.writelines(wsj_headlines)\n", 324 | "# with open('wsj_titles.txt') as f:\n", 325 | "# wsj_headlines = f.readlines()\n", 326 | " \n", 327 | "print \"Jeez, Louise! We got {} WSJ headlines!\".format(len(wsj_headlines))" 328 | ] 329 | }, 330 | { 331 | "cell_type": "markdown", 332 | "metadata": {}, 333 | "source": [ 334 | "Now we'll use pandas to build a data frame with two columns: \"gawker\", which contains a boolean value indicating whether the value in the \"title\" column came from Gawker's website. " 335 | ] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "execution_count": 7, 340 | "metadata": { 341 | "collapsed": false 342 | }, 343 | "outputs": [ 344 | { 345 | "data": { 346 | "text/html": [ 347 | "
\n", 348 | "\n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | "
gawkertitle
4104FalseAustralia to Fly Support Missions in Iraq
4105FalseZalando Stock Debut Disappoints
4106FalseChina September Official Manufacturing PMI Hol...
4107FalseAsian Shares Mixed in Quiet Trading
4108FalseBOJ Tankan: Japan Corporate Sentiment Improves
\n", 384 | "
" 385 | ], 386 | "text/plain": [ 387 | " gawker title\n", 388 | "4104 False Australia to Fly Support Missions in Iraq\n", 389 | "4105 False Zalando Stock Debut Disappoints\n", 390 | "4106 False China September Official Manufacturing PMI Hol...\n", 391 | "4107 False Asian Shares Mixed in Quiet Trading\n", 392 | "4108 False BOJ Tankan: Japan Corporate Sentiment Improves" 393 | ] 394 | }, 395 | "execution_count": 7, 396 | "metadata": {}, 397 | "output_type": "execute_result" 398 | } 399 | ], 400 | "source": [ 401 | "import pandas as pd\n", 402 | "gawk_records = [{'gawker': True, 'title': title} for title in gawker_titles]\n", 403 | "wsj_records = [{'gawker': False, 'title': title} for title in wsj_headlines]\n", 404 | "df = pd.DataFrame.from_records(gawk_records + wsj_records)\n", 405 | "df.tail()" 406 | ] 407 | }, 408 | { 409 | "cell_type": "markdown", 410 | "metadata": {}, 411 | "source": [ 412 | "## Teach the Machine!" 413 | ] 414 | }, 415 | { 416 | "cell_type": "markdown", 417 | "metadata": {}, 418 | "source": [ 419 | "The basic goal of machine learning is to learn a model or function that maps our inputs/observations to outputs/predictions. The model that we'll be building to make our predictions is known as Naive Bayes. We can compute by counting the probability that any given word shows up in a Gawker title, or P(Word | Gawker). We want the probability of a that a given body of text is a Gawker title, or P(Gawker | Words).\n", 420 | "\n", 421 | "$$P(Gawker | Words) = P(Word_1|Gawker)*P(Word_2|Gawker)*..*P(Word_n | Gawker)*P(Gawker)$$\n", 422 | "\n", 423 | "All of the probabilities on the right side of the equation are empirically determined from the text data. Luckily, instead of having to write all of the code to determine those probabilities ourselves, sklearn's CountVectorizer does all of the dirty work for us." 424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "execution_count": 10, 429 | "metadata": { 430 | "collapsed": false 431 | }, 432 | "outputs": [], 433 | "source": [ 434 | "vectorizer = CountVectorizer(min_df=5, max_df=.3, ngram_range=(1,2))\n", 435 | "# X, Y = make_xy(df)\n", 436 | "X = vectorizer.fit_transform(df.title)\n", 437 | "X = X.tocsc() # some versions of sklearn return COO format\n", 438 | "Y = df.gawker.values.astype(np.int) # Need numbers instead of bools for " 439 | ] 440 | }, 441 | { 442 | "cell_type": "markdown", 443 | "metadata": {}, 444 | "source": [ 445 | "Now we have a data array, X, whose rows correspond to titles and whose columns correspond to words. So the value X[i, j] is the number of times word j shows up in article title i. Each row has a corresponding member in vector, Y. If the headline associated with X[i] came from Gawker, Y[i] == 1. Otherwise, Y[i] == 0. Now our data are in a format we want it to be in. Time to seperate our data into training and testing sets before building and evaluating our model." 
446 | ] 447 | }, 448 | { 449 | "cell_type": "code", 450 | "execution_count": 11, 451 | "metadata": { 452 | "collapsed": false 453 | }, 454 | "outputs": [], 455 | "source": [ 456 | "X_train, X_test, Y_train, Y_test = cross_validation.train_test_split(X,Y)" 457 | ] 458 | }, 459 | { 460 | "cell_type": "code", 461 | "execution_count": 12, 462 | "metadata": { 463 | "collapsed": false 464 | }, 465 | "outputs": [ 466 | { 467 | "data": { 468 | "text/plain": [ 469 | "MultinomialNB(alpha=0.5, class_prior=None, fit_prior=False)" 470 | ] 471 | }, 472 | "execution_count": 12, 473 | "metadata": {}, 474 | "output_type": "execute_result" 475 | } 476 | ], 477 | "source": [ 478 | "clf = naive_bayes.MultinomialNB(fit_prior=False, alpha=0.5)\n", 479 | "clf.fit(X_train, Y_train)" 480 | ] 481 | }, 482 | { 483 | "cell_type": "markdown", 484 | "metadata": {}, 485 | "source": [ 486 | "## Test the machine!" 487 | ] 488 | }, 489 | { 490 | "cell_type": "markdown", 491 | "metadata": {}, 492 | "source": [ 493 | "We'll see how good our model is by testing the accuracy of its predictions on articles it hasn't seen before. The accuracy metric reported below is simply the percentage of titles it classifies correctly." 494 | ] 495 | }, 496 | { 497 | "cell_type": "code", 498 | "execution_count": 13, 499 | "metadata": { 500 | "collapsed": false 501 | }, 502 | "outputs": [ 503 | { 504 | "name": "stdout", 505 | "output_type": "stream", 506 | "text": [ 507 | "Accuracy: 84.91%\n" 508 | ] 509 | } 510 | ], 511 | "source": [ 512 | "print \"Accuracy: %0.2f%%\" % (100 * clf.score(X_test, Y_test))" 513 | ] 514 | }, 515 | { 516 | "cell_type": "markdown", 517 | "metadata": {}, 518 | "source": [ 519 | "Not bad for a stupid machine! We can take a closer look at how the model is making predictions. Let's look at which words are the Gawkeriest and which ones are the Wall Street Journaliest." 
520 | ] 521 | }, 522 | { 523 | "cell_type": "code", 524 | "execution_count": 14, 525 | "metadata": { 526 | "collapsed": false 527 | }, 528 | "outputs": [ 529 | { 530 | "name": "stdout", 531 | "output_type": "stream", 532 | "text": [ 533 | "Gawker words\t P(gawker| word)\n", 534 | " one week 0.99\n", 535 | " titles 0.99\n", 536 | " get free 0.99\n", 537 | " choose 0.99\n", 538 | " ingredient recipe 0.99\n", 539 | " week of 0.99\n", 540 | " apron fresh 0.99\n", 541 | " month of 0.99\n", 542 | " choose from 0.99\n", 543 | " of oyster 0.99\n", 544 | "WSJ words\t P(gawker | word)\n", 545 | " market 0.02\n", 546 | " what news 0.02\n", 547 | " manufacturing 0.01\n", 548 | " investors 0.01\n", 549 | " growth 0.01\n", 550 | " ipo 0.01\n", 551 | " ceo 0.01\n", 552 | " sales 0.01\n", 553 | " china 0.01\n", 554 | " profit 0.01\n" 555 | ] 556 | } 557 | ], 558 | "source": [ 559 | "import numpy as np\n", 560 | "words = np.array(vectorizer.get_feature_names())\n", 561 | "\n", 562 | "x = np.eye(X_test.shape[1])\n", 563 | "probs = clf.predict_log_proba(x)[:, 0]\n", 564 | "ind = np.argsort(probs)\n", 565 | "\n", 566 | "good_words = words[ind[:10]]\n", 567 | "bad_words = words[ind[-10:]]\n", 568 | "\n", 569 | "good_prob = probs[ind[:10]]\n", 570 | "bad_prob = probs[ind[-10:]]\n", 571 | "\n", 572 | "print \"Gawker words\\t P(gawker| word)\"\n", 573 | "for w, p in zip(good_words, good_prob):\n", 574 | " print \"%20s\" % w, \"%0.2f\" % (1 - np.exp(p))\n", 575 | " \n", 576 | "print \"WSJ words\\t P(gawker | word)\"\n", 577 | "for w, p in zip(bad_words, bad_prob):\n", 578 | " print \"%20s\" % w, \"%0.2f\" % (1 - np.exp(p))" 579 | ] 580 | }, 581 | { 582 | "cell_type": "markdown", 583 | "metadata": {}, 584 | "source": [ 585 | "You might be complaining that our model should do better. But some titles can be tough. Would you have classified these mis-predicted article titles correctly? 
" 586 | ] 587 | }, 588 | { 589 | "cell_type": "code", 590 | "execution_count": 15, 591 | "metadata": { 592 | "collapsed": false 593 | }, 594 | "outputs": [ 595 | { 596 | "name": "stdout", 597 | "output_type": "stream", 598 | "text": [ 599 | "Mis-predicted WSJ quotes\n", 600 | "---------------------------\n", 601 | "Study Finds Over One Million Caring for Iraq, Afghan War Veterans\n", 602 | "\n", 603 | "Photo of the Week\n", 604 | "\n", 605 | "Body of Ferry Victim Found by Fishermen\n", 606 | "\n", 607 | "From Florida Boy to Alleged Suicide Bomber in Syria\n", 608 | "\n", 609 | "How the 'Jesus' Wife' Hoax Fell Apart\n", 610 | "\n", 611 | "\n", 612 | "Mis-predicted Gawker quotes\n", 613 | "--------------------------\n", 614 | "Benetton Will Contribute to Fund For Rana Plaza Victims in Bangladesh\n", 615 | "\n", 616 | "NYU Urges Staffers to Help Pay Students' Outrageously Expensive Tuition\n", 617 | "\n", 618 | "Nobody In China Wants Tibetan Mastiffs Anymore\n", 619 | "\n", 620 | "Ukraine Cease-Fire Under Threat in Key Town, Holding Elsewhere\n", 621 | "\n", 622 | "Leaders Reach Shaky Cease-fire Deal to End War in Ukraine \n", 623 | "\n" 624 | ] 625 | } 626 | ], 627 | "source": [ 628 | "prob = clf.predict_proba(X)[:, 0]\n", 629 | "predict = clf.predict(X)\n", 630 | "\n", 631 | "bad_wsj = np.argsort(prob[Y == 0])[:5]\n", 632 | "bad_gawker = np.argsort(prob[Y == 1])[-5:]\n", 633 | "\n", 634 | "print \"Mis-predicted WSJ quotes\"\n", 635 | "print '---------------------------'\n", 636 | "for row in bad_wsj:\n", 637 | " print df[Y == 0].title.irow(row)\n", 638 | " print\n", 639 | "\n", 640 | "print\n", 641 | "print \"Mis-predicted Gawker quotes\"\n", 642 | "print '--------------------------'\n", 643 | "for row in bad_gawker:\n", 644 | " print df[Y == 1].title.irow(row)\n", 645 | " print" 646 | ] 647 | }, 648 | { 649 | "cell_type": "markdown", 650 | "metadata": {}, 651 | "source": [ 652 | "Make your own headline and see if it belongs in Gawker or WSJ!" 653 | ] 654 | }, 655 | { 656 | "cell_type": "code", 657 | "execution_count": 16, 658 | "metadata": { 659 | "collapsed": false 660 | }, 661 | "outputs": [ 662 | { 663 | "name": "stdout", 664 | "output_type": "stream", 665 | "text": [ 666 | "P(gawker) = 0.971367221347\n" 667 | ] 668 | } 669 | ], 670 | "source": [ 671 | "probs = clf.predict_proba(vectorizer.transform([\"Your title here\"]))\n", 672 | "print \"P(gawker) = {}\".format(probs[0][1])" 673 | ] 674 | } 675 | ], 676 | "metadata": { 677 | "kernelspec": { 678 | "display_name": "Python 2", 679 | "language": "python", 680 | "name": "python2" 681 | }, 682 | "language_info": { 683 | "codemirror_mode": { 684 | "name": "ipython", 685 | "version": 2 686 | }, 687 | "file_extension": ".py", 688 | "mimetype": "text/x-python", 689 | "name": "python", 690 | "nbconvert_exporter": "python", 691 | "pygments_lexer": "ipython2", 692 | "version": "2.7.11" 693 | } 694 | }, 695 | "nbformat": 4, 696 | "nbformat_minor": 0 697 | } 698 | -------------------------------------------------------------------------------- /scraping/wsj_titles.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gwulfs/bostonml/1b6f37c564ca6f2d08f2ed5510e414494932294a/scraping/wsj_titles.txt --------------------------------------------------------------------------------