├── GettingStarted.ipynb
├── GettingTheMost.ipynb
├── MulticlassClassification.ipynb
├── MulticlassLDF.ipynb
├── README.md
├── RareCategory.ipynb
└── UnsupervisedNLP.ipynb
/GettingStarted.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "",
4 | "signature": "sha256:1804322a06cfe9fc0859446d550f1c0b0d843fff0f575723b6fa88872b7c0c1c"
5 | },
6 | "nbformat": 3,
7 | "nbformat_minor": 0,
8 | "worksheets": [
9 | {
10 | "cells": [
11 | {
12 | "cell_type": "markdown",
13 | "metadata": {},
14 | "source": [
15 | "# Welcome to \"NLP in VW\"!\n",
16 | "\n",
17 | "The goal here is to get you comfortable with using `vw` for basic NLP tasks, like binary classification. We will explore some of the `vw` options that are particularly useful for language problems and then future notebooks will go beyond binary classification. The topics we'll cover are:\n",
18 | "\n",
19 | "* [What is binary classification?](#binary)\n",
20 | "* [Constructing a data set](#data)\n",
21 | "* [Running vw](#run)\n",
22 | "* [Multiple passes over the data](#passes)\n",
23 | "* [Saving the model](#save)\n",
24 | "* [Making predictions on test data](#test)\n",
25 | "* [Cheat sheet and next steps](#cheat)\n",
26 | "\n",
27 | "# What is Binary Classification?\n",
28 | "\n",
29 | "The job of a binary classifier is to learn to map inputs (usually called $\\mathbf x$) to binary labels (usually called $+1$ and $-1$). A simple example is sentiment classification. Given some text (perhaps a movie review), determine whether the overall sentiment expressed by that review is positive or negative toward the movie.\n",
30 | "\n",
31 | "Because this is a machine learning application, this mapping is **induced** from training data. The training data consists of a (hopefully large!) set of labeled examples: movie reviews **paired** with the correct label (positive or negative). The classic data set for this is from [Pang and Lee](http://www.cs.cornell.edu/people/pabo/movie-review-data/); this is the data we will work with later. For comparison purposes, it's worth keeping in mind that the best performance Pang and Lee achieve on this data in their [2004 paper](http://www.cs.cornell.edu/home/llee/papers/cutsent.pdf) that introduced it is about 13% error. In this tutorial we'll get 15% error, and in the [subsequent tutorial](GettingTheMost.ipynb) we'll get 10%.\n",
32 | "\n",
33 | "Once the classifier (mapping from review to sentiment) has been learned, we can apply it to new reviews that are missing ratings to predict what the rating probably would have been. We usually care about the **accuracy** of this classifier: what percentage of predictions it gets right (or, equivalently, its error rate: the percentage it gets wrong). Of course, we want to be able to measure this accuracy, so we hold out some test data on which to evaluate the classifier.\n",
34 | "\n",
35 | "## Sounds Great, Let's Do It!\n",
36 | "\n",
37 | "There are two prerequisites: we need to make sure `vw` is installed and we need some data. If `vw` is installed correctly, and is in your path, the following should work:"
38 | ]
39 | },
40 | {
41 | "cell_type": "code",
42 | "collapsed": false,
43 | "input": [
44 | "!vw --version"
45 | ],
46 | "language": "python",
47 | "metadata": {},
48 | "outputs": [
49 | {
50 | "output_type": "stream",
51 | "stream": "stdout",
52 | "text": [
53 | "8.1.1\r\n"
54 | ]
55 | }
56 | ],
57 | "prompt_number": 1
58 | },
59 | {
60 | "cell_type": "markdown",
61 | "metadata": {},
62 | "source": [
63 | "If you get some error like \"vw: not found\" then `vw` is either not installed correctly or is not in your path.\n",
64 | "\n",
65 | "# Constructing Your Data Set\n",
66 | "\n",
67 | "We also need some data on which to train and test a classifier. We'll download the Pang and Lee data referenced above."
68 | ]
69 | },
70 | {
71 | "cell_type": "code",
72 | "collapsed": false,
73 | "input": [
74 | "!mkdir data\n",
75 | "!curl -o data/review_polarity.tar.gz http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz\n",
76 | "!tar zxC data -f data/review_polarity.tar.gz"
77 | ],
78 | "language": "python",
79 | "metadata": {},
80 | "outputs": [
81 | {
82 | "output_type": "stream",
83 | "stream": "stdout",
84 | "text": [
85 | " % Total % Received % Xferd Average Speed Time Time Time Current\r\n",
86 | " Dload Upload Total Spent Left Speed\r\n",
87 | "\r",
88 | " 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0"
89 | ]
90 | },
91 | {
92 | "output_type": "stream",
93 | "stream": "stdout",
94 | "text": [
95 | "\r",
96 | " 49 3053k 49 1522k 0 0 2656k 0 0:00:01 --:--:-- 0:00:01 2653k"
97 | ]
98 | },
99 | {
100 | "output_type": "stream",
101 | "stream": "stdout",
102 | "text": [
103 | "\r",
104 | "100 3053k 100 3053k 0 0 3214k 0 --:--:-- --:--:-- --:--:-- 3211k\r\n"
105 | ]
106 | }
107 | ],
108 | "prompt_number": 2
109 | },
110 | {
111 | "cell_type": "markdown",
112 | "metadata": {},
113 | "source": [
114 | "We can take a look at the beginning of one of the positive reviews and one of the negative reviews:"
115 | ]
116 | },
117 | {
118 | "cell_type": "code",
119 | "collapsed": false,
120 | "input": [
121 | "!head -n3 data/txt_sentoken/pos/cv000_29590.txt"
122 | ],
123 | "language": "python",
124 | "metadata": {},
125 | "outputs": [
126 | {
127 | "output_type": "stream",
128 | "stream": "stdout",
129 | "text": [
130 | "films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . \r\n",
131 | "for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . \r\n",
132 | "to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . \r\n"
133 | ]
134 | }
135 | ],
136 | "prompt_number": 3
137 | },
138 | {
139 | "cell_type": "code",
140 | "collapsed": false,
141 | "input": [
142 | "!head -n3 data/txt_sentoken/neg/cv000_29416.txt"
143 | ],
144 | "language": "python",
145 | "metadata": {},
146 | "outputs": [
147 | {
148 | "output_type": "stream",
149 | "stream": "stdout",
150 | "text": [
151 | "plot : two teen couples go to a church party , drink and then drive . \r\n",
152 | "they get into an accident . \r\n",
153 | "one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . \r\n"
154 | ]
155 | }
156 | ],
157 | "prompt_number": 4
158 | },
159 | {
160 | "cell_type": "markdown",
161 | "metadata": {},
162 | "source": [
163 | "Okay, so our first job is to put this data into `vw` format. Luckily this data is already lowercased and tokenized (words are separated from punctuation by extra spaces), so we don't have to deal with that issue.\n",
164 | "\n",
165 | "This format is quite flexible, and we'll see additional fun things you can do later, but for now, the basic file format is one example per line, with the label first, then a vertical bar (pipe), and then all of the features. If we're using a bag-of-words representation (a good starting point for text data), the features are just the individual words in the text. For example, for the two files above, we'd want to create two `vw` examples like:\n",
166 | "\n",
167 | " +1 | films adapted from comic books have had plenty of success , whether they're ...\n",
168 | " -1 | plot : two teen couples go to a church party , drink and then drive . they get into ...\n",
169 | " \n",
170 | "However, there's an issue here. There are two **reserved characters** in the `vw` example: colon (`:`) and pipe (`|`). This means we need to convert these characters to something else.\n",
171 | "\n",
172 | "Let's write a little Python to do this conversion. You could do it just with `sed` and friends, but this is an IPython notebook, so why not do it in Python?"
173 | ]
174 | },
175 | {
176 | "cell_type": "code",
177 | "collapsed": false,
178 | "input": [
179 | "def textToVW(lines):\n",
180 | " return ' '.join([l.strip() for l in lines]).replace(':','COLON').replace('|','PIPE')\n",
181 | "\n",
182 | "def fileToVW(inputFile):\n",
183 | " return textToVW(open(inputFile,'r').readlines())\n",
184 | "\n",
185 | "print fileToVW('data/txt_sentoken/neg/cv000_29416.txt')[:50]"
186 | ],
187 | "language": "python",
188 | "metadata": {},
189 | "outputs": [
190 | {
191 | "output_type": "stream",
192 | "stream": "stdout",
193 | "text": [
194 | "plot COLON two teen couples go to a church party ,\n"
195 | ]
196 | }
197 | ],
198 | "prompt_number": 5
199 | },
200 | {
201 | "cell_type": "markdown",
202 | "metadata": {},
203 | "source": [
204 | "Here, we see the first few words of the negative review, with ':' replaced by COLON (this is safe because all the other text is lowercased) and '|' replaced by PIPE.\n",
205 | "\n",
206 | "Now we just need to read in all the positive examples and all the negative examples:"
207 | ]
208 | },
209 | {
210 | "cell_type": "code",
211 | "collapsed": false,
212 | "input": [
213 | "import os\n",
214 | "\n",
215 | "def readTextFilesInDirectory(directory):\n",
216 | " return [fileToVW(directory + os.sep + f) \n",
217 | " for f in os.listdir(directory)\n",
218 | " if f.endswith('.txt')]\n",
219 | "\n",
220 | "examples = ['+1 | ' + s for s in readTextFilesInDirectory('data/txt_sentoken/pos')] + \\\n",
221 | " ['-1 | ' + s for s in readTextFilesInDirectory('data/txt_sentoken/neg')]\n",
222 | "\n",
223 | "print '%d total examples read' % len(examples)"
224 | ],
225 | "language": "python",
226 | "metadata": {},
227 | "outputs": [
228 | {
229 | "output_type": "stream",
230 | "stream": "stdout",
231 | "text": [
232 | "2000 total examples read\n"
233 | ]
234 | }
235 | ],
236 | "prompt_number": 6
237 | },
238 | {
239 | "cell_type": "markdown",
240 | "metadata": {},
241 | "source": [
242 | "Now that we've got all the files, we put \"`+1 | `\" at the beginning of the positive ones and \"`-1 | `\" at the beginning of the negative ones. *Voila*, we have our `vw` data.\n",
243 | "\n",
244 | "We'll now generate some training data and some test data. To achieve this, we're going to permute the examples (after setting a random seed for reproducibility, [hopefully okay cross-platform](http://stackoverflow.com/questions/9023660/how-to-generate-a-repeatable-random-number-sequence)) and then take the first 80% as train and the last 20% as test.\n",
245 | "\n",
246 | "The fact that we're permuting the data is **very important**. By default, `vw` uses an online learning strategy, and if we did something silly like putting all the positive examples before the negative examples, learning would take a LONG time. More on this later."
247 | ]
248 | },
249 | {
250 | "cell_type": "code",
251 | "collapsed": false,
252 | "input": [
253 | "import random\n",
254 | "random.seed(1234)\n",
255 | "random.shuffle(examples) # this does in-place shuffling\n",
256 | "# print out the labels of the first 50 examples to be sure they're sane:\n",
257 | "print ''.join([s[0] for s in examples[:50]])"
258 | ],
259 | "language": "python",
260 | "metadata": {},
261 | "outputs": [
262 | {
263 | "output_type": "stream",
264 | "stream": "stdout",
265 | "text": [
266 | "++-++---++-+--+-+--+--++-+++---+-+--++----+++--+-+\n"
267 | ]
268 | }
269 | ],
270 | "prompt_number": 7
271 | },
272 | {
273 | "cell_type": "markdown",
274 | "metadata": {},
275 | "source": [
276 | "Now, we can write the first 1600 to a training file and the last 400 to a test file."
277 | ]
278 | },
279 | {
280 | "cell_type": "code",
281 | "collapsed": false,
282 | "input": [
283 | "def writeToVWFile(filename, examples):\n",
284 | " with open(filename, 'w') as h:\n",
285 | " for ex in examples:\n",
286 | " print >>h, ex\n",
287 | " \n",
288 | "writeToVWFile('data/sentiment.tr', examples[:1600])\n",
289 | "writeToVWFile('data/sentiment.te', examples[1600:])\n",
290 | "\n",
291 | "!wc -l data/sentiment.tr data/sentiment.te"
292 | ],
293 | "language": "python",
294 | "metadata": {},
295 | "outputs": [
296 | {
297 | "output_type": "stream",
298 | "stream": "stdout",
299 | "text": [
300 | " 1600 data/sentiment.tr\r\n",
301 | " 400 data/sentiment.te\r\n",
302 | " 2000 total\r\n"
303 | ]
304 | }
305 | ],
306 | "prompt_number": 8
307 | },
308 | {
309 | "cell_type": "markdown",
310 | "metadata": {},
311 | "source": [
312 | "At this point, everything is properly set up and we can run `vw`!\n",
313 | "\n",
314 | "# Running VW for the First Time"
315 | ]
316 | },
317 | {
318 | "cell_type": "code",
319 | "collapsed": false,
320 | "input": [
321 | "!vw --binary data/sentiment.tr"
322 | ],
323 | "language": "python",
324 | "metadata": {},
325 | "outputs": [
326 | {
327 | "output_type": "stream",
328 | "stream": "stdout",
329 | "text": [
330 | "Num weight bits = 18\r\n",
331 | "learning rate = 0.5\r\n",
332 | "initial_t = 0\r\n",
333 | "power_t = 0.5\r\n",
334 | "using no cache\r\n",
335 | "Reading datafile = data/sentiment.tr\r\n",
336 | "num sources = 1\r\n",
337 | "average since example example current current current\r\n",
338 | "loss last counter weight label predict features\r\n",
339 | "1.000000 1.000000 1 1.0 1.0000 -1.0000 740\r\n",
340 | "0.500000 0.000000 2 2.0 1.0000 1.0000 630\r\n",
341 | "0.750000 1.000000 4 4.0 1.0000 -1.0000 870\r\n",
342 | "0.500000 0.250000 8 8.0 -1.0000 -1.0000 526\r\n",
343 | "0.625000 0.750000 16 16.0 -1.0000 1.0000 529\r\n",
344 | "0.531250 0.437500 32 32.0 1.0000 -1.0000 1188\r\n",
345 | "0.468750 0.406250 64 64.0 -1.0000 1.0000 931\r\n",
346 | "0.375000 0.281250 128 128.0 1.0000 -1.0000 662\r\n",
347 | "0.406250 0.437500 256 256.0 1.0000 1.0000 922\r\n",
348 | "0.343750 0.281250 512 512.0 1.0000 -1.0000 297\r\n"
349 | ]
350 | },
351 | {
352 | "output_type": "stream",
353 | "stream": "stdout",
354 | "text": [
355 | "0.307617 0.271484 1024 1024.0 1.0000 1.0000 991\r\n",
356 | "\r\n",
357 | "finished run\r\n",
358 | "number of examples per pass = 1600\r\n",
359 | "passes used = 1\r\n",
360 | "weighted example sum = 1600.000000\r\n",
361 | "weighted label sum = 6.000000\r\n",
362 | "average loss = 0.280625\r\n",
363 | "best constant = 0.003750\r\n",
364 | "best constant's loss = 0.999986\r\n",
365 | "total feature number = 1204816\r\n"
366 | ]
367 | }
368 | ],
369 | "prompt_number": 9
370 | },
371 | {
372 | "cell_type": "markdown",
373 | "metadata": {},
374 | "source": [
375 | "This output consists of three parts:\n",
376 | "\n",
377 | "1. The header, which displays some information about the parameters `vw` is using to do the learning (number of bits, learning rate, ..., number of sources). We'll discuss (some) of these later.\n",
378 | "2. The progress list (the lines with lots of numbers); much more on this below.\n",
379 | "3. The footer, which displays some statistics about the success (or failure) of learning. In this case, it says, among other things, that it made one pass over the data, encountered 1600 training examples (yay!) and found a model with an average loss of 28.06%. It also says that it processed 1.2m features (summed over all training examples), which gives some sense of the data size.\n",
380 | "\n",
381 | "One important note is that when we ran `vw`, we added the flag `--binary`, which instructs `vw` to report all losses as zero-one loss.\n",
382 | "\n",
383 | "Let's look first at the first four lines of the progress list:\n",
384 | "\n",
385 | " average since example example current current current\n",
386 | " loss last counter weight label predict features\n",
387 | " 1.000000 1.000000 1 1.0 1.0000 -1.0000 740\n",
388 | " 0.500000 0.000000 2 2.0 1.0000 1.0000 630\n",
389 | " 0.750000 1.000000 4 4.0 1.0000 -1.0000 870\n",
390 | " 0.500000 0.250000 8 8.0 -1.0000 -1.0000 526\n",
391 | " \n",
392 | "The columns are labeled, which gives some clue as to what's being printed out. The way `vw` works internally is that it processes one example at a time. At every $2^k$th example (examples 1, 2, 4, 8, 16, ...), it prints out a status update. This way you get lots of updates early (as a sanity check) and fewer as time goes on. The third column gives you the example number. The fourth column tells you the total \"weight\" of examples so far; right now all examples have a weight of 1.0, but for some problems (e.g., imbalanced data), you might want to give different weights to different examples. The fifth column tells you the true current label (+1 or -1) and the sixth column tells you the model's current prediction. Lastly, it tells you how many features there are in this example.\n",
393 | "\n",
394 | "The first two columns deserve some explanation. In \"default\" mode, `vw` reports \"progressive validation loss.\" This means that when `vw` sees a training example, it *first* makes a prediction. It then computes a loss on that single prediction. Only after that does it \"learn\". The average loss computed in this way is the **progressive validation loss.** It has the nice property that it's a good estimate of test loss, *provided you only make one pass over the data*, **and** it's efficient to compute. The first column tells you the average progressive loss over the *entire* run of `vw`; the second column tells you the average progressive loss *since the last time `vw` printed something*.\n",
395 | "\n",
396 | "In practice, this second column is what you want to look at to tell how well your model is doing.\n",
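"\n",
"To make this concrete, here's a tiny, purely illustrative sketch (not `vw`'s actual code) of progressive validation with 0/1 loss: predict first, score that prediction, and only then update. The \"model\" below is a toy that just predicts the sign of the running label sum:\n",
"\n",
"    examples = [(+1, 'fun film'), (-1, 'boring mess'), (+1, 'great cast'), (-1, 'awful plot')]\n",
"    running_sum, total_loss = 0, 0\n",
"    for n, (label, text) in enumerate(examples, 1):\n",
"        pred = 1 if running_sum >= 0 else -1       # 1. predict FIRST\n",
"        total_loss += 1 if pred != label else 0    # 2. score that prediction (0/1 loss)\n",
"        running_sum += label                       # 3. only THEN \"learn\"\n",
"    print 'progressive validation loss = %g' % (float(total_loss) / n)\n",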
397 | "\n",
398 | "# Your Second Run of VW\n",
399 | "\n",
400 | "There are a couple of things we need to do to get a useful system. The first is that for most data sets, a single online pass over the data is insufficient -- we need to run more than one. The second is that we actually need to store the model somewhere so that we can make predictions on test data! We'll go through these in order.\n",
401 | "\n",
402 | "## Running More than One Pass\n",
403 | "\n",
404 | "On the surface, running more than one pass seems like an easy thing to ask `vw` to do. It's a bit more complicated than it might appear.\n",
405 | "\n",
406 | "The first issue is that one of the main speed bottlenecks for `vw` is file IO. Reading, and parsing, your input data is incredibly time consuming. In order to get around this, when multiple passes over the data are requested, `vw` will create and use a **cache file**, which is basically a second copy of your data stored in a `vw`-friendly, efficient, binary format. So if you want to run more than one pass, you have to tell `vw` to create a cache file.\n",
407 | "\n",
408 | "Here's an example running 5 passes:"
409 | ]
410 | },
411 | {
412 | "cell_type": "code",
413 | "collapsed": false,
414 | "input": [
415 | "!vw --binary data/sentiment.tr --passes 5 -c -k"
416 | ],
417 | "language": "python",
418 | "metadata": {},
419 | "outputs": [
420 | {
421 | "output_type": "stream",
422 | "stream": "stdout",
423 | "text": [
424 | "Num weight bits = 18\r\n",
425 | "learning rate = 0.5\r\n",
426 | "initial_t = 0\r\n",
427 | "power_t = 0.5\r\n",
428 | "decay_learning_rate = 1\r\n",
429 | "creating cache_file = data/sentiment.tr.cache\r\n",
430 | "Reading datafile = data/sentiment.tr\r\n",
431 | "num sources = 1\r\n",
432 | "average since example example current current current\r\n",
433 | "loss last counter weight label predict features\r\n",
434 | "1.000000 1.000000 1 1.0 1.0000 -1.0000 740\r\n",
435 | "0.500000 0.000000 2 2.0 1.0000 1.0000 630\r\n",
436 | "0.750000 1.000000 4 4.0 1.0000 -1.0000 870\r\n",
437 | "0.500000 0.250000 8 8.0 -1.0000 -1.0000 526\r\n",
438 | "0.687500 0.875000 16 16.0 1.0000 -1.0000 490\r\n",
439 | "0.562500 0.437500 32 32.0 -1.0000 1.0000 454\r\n",
440 | "0.515625 0.468750 64 64.0 -1.0000 1.0000 520\r\n",
441 | "0.398438 0.281250 128 128.0 1.0000 1.0000 563\r\n",
442 | "0.382812 0.367188 256 256.0 1.0000 1.0000 1311\r\n"
443 | ]
444 | },
445 | {
446 | "output_type": "stream",
447 | "stream": "stdout",
448 | "text": [
449 | "0.357422 0.332031 512 512.0 1.0000 1.0000 387\r\n",
450 | "0.312500 0.267578 1024 1024.0 1.0000 -1.0000 466\r\n"
451 | ]
452 | },
453 | {
454 | "output_type": "stream",
455 | "stream": "stdout",
456 | "text": [
457 | "0.202643 0.202643 2048 2048.0 -1.0000 -1.0000 578 h\r\n"
458 | ]
459 | },
460 | {
461 | "output_type": "stream",
462 | "stream": "stdout",
463 | "text": [
464 | "0.182418 0.162281 4096 4096.0 1.0000 1.0000 331 h\r\n",
465 | "\r\n",
466 | "finished run\r\n",
467 | "number of examples per pass = 1440\r\n",
468 | "passes used = 5\r\n",
469 | "weighted example sum = 7200.000000\r\n",
470 | "weighted label sum = -30.000000\r\n",
471 | "average loss = 0.150000 h\r\n",
472 | "best constant = -0.004167\r\n",
473 | "best constant's loss = 0.999983\r\n",
474 | "total feature number = 5418985\r\n"
475 | ]
476 | }
477 | ],
478 | "prompt_number": 10
479 | },
480 | {
481 | "cell_type": "markdown",
482 | "metadata": {},
483 | "source": [
484 | "In this command, we added three new command-line options:\n",
485 | "\n",
486 | "* `--passes 5`: this is the most obvious one: it tells `vw` to run five passes over the data.\n",
487 | "* `-c`: this tells `vw` to automatically create and use a cache file; `vw` constructs this cache file in `foo.cache` where `foo` is the name of your input data (in the `vw` header it informs you that it's creating a file called `data/sentiment.tr.cache` for caching)\n",
488 | "* `-k`: by default, if `vw` uses a cache file, it *first* checks to see if the file exists. If the cache file already exists, it completely ignores the data file (`sentiment.tr`) and *just* uses the cache file. This is great if your data never changes because it makes the first pass slightly faster. However, I often change my data between `vw` runs and it's *really* annoying to spend two hours debugging only to find out that `vw` is ignoring the new data in favor of its old cache file. `-k` tells `vw` to \"kill\" the old cache file: even if it exists, it should be recreated from scratch.\n",
489 | "\n",
490 | "(Warning: if you're running multiple jobs on the same file in parallel, you will get clashes on the cache file. You should either create a single cache file ahead of time and use it for all jobs [remove `-k` in that case], *or* you should explicitly give your own file names to the cache by saying `--cache_file myfilename0.cache` instead of `-c`.)\n",
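"\n",
"For example, here's a hedged sketch (double-check flag spellings against `vw --help` for your version) of building one cache up front and then pointing every job at it:\n",
"\n",
"    vw --binary data/sentiment.tr --passes 1 --cache_file data/shared.cache\n",
"    vw --binary data/sentiment.tr --passes 5 --cache_file data/shared.cache -f data/model_a\n",
"    vw --binary data/sentiment.tr --passes 5 --cache_file data/shared.cache -l 0.1 -f data/model_b\n",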
491 | "\n",
492 | "If you're particularly attentive, you might have noticed that there are a few \"`h`\"s in the progress list (and in the printing of the average loss at the end).\n",
493 | "\n",
494 | "This is **holdout** loss. Remember all that discussion of progressive validation loss? Well, it's useless when you're making more than one pass. That's because on the second pass, you'll already have trained on all the training data, so your model is going to be exceptionally good at making predictions.\n",
495 | "\n",
496 | "`vw`'s default solution to this is to hold out a fraction of the training data as validation data. By default, it will hold out **every 10th example** as test. The holdout loss (signaled by the `h`) is then the average loss, *limited to these 10% of the training examples*. (Note that on the first pass, it still prints progressive validation loss because this is a safe thing to do.)\n",
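"\n",
"If you'd rather keep every example for training and do your own validation, `vw` has a `--holdout_off` flag (we'll lean on it in the [MulticlassLDF](MulticlassLDF.ipynb) notebook); a hedged sketch:\n",
"\n",
"    vw --binary data/sentiment.tr --passes 5 -c -k --holdout_off\n",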
497 | "\n",
498 | "## Saving the Model and Making Test Predictions\n",
499 | "\n",
500 | "Now that we know how to do several passes and get heldout losses, we might want to actually save the learned model to a file so we can make predictions on test data! This is easy: we just tell `vw` where to save the final model using `-f file` (`-f` means \"final\"). Let's do this, and crank up the number of passes to 20:"
501 | ]
502 | },
503 | {
504 | "cell_type": "code",
505 | "collapsed": false,
506 | "input": [
507 | "!vw --binary data/sentiment.tr --passes 20 -c -k -f data/sentiment.model"
508 | ],
509 | "language": "python",
510 | "metadata": {},
511 | "outputs": [
512 | {
513 | "output_type": "stream",
514 | "stream": "stdout",
515 | "text": [
516 | "final_regressor = data/sentiment.model\r\n",
517 | "Num weight bits = 18\r\n",
518 | "learning rate = 0.5\r\n",
519 | "initial_t = 0\r\n",
520 | "power_t = 0.5\r\n",
521 | "decay_learning_rate = 1\r\n",
522 | "creating cache_file = data/sentiment.tr.cache\r\n",
523 | "Reading datafile = data/sentiment.tr\r\n",
524 | "num sources = 1\r\n",
525 | "average since example example current current current\r\n",
526 | "loss last counter weight label predict features\r\n",
527 | "1.000000 1.000000 1 1.0 1.0000 -1.0000 740\r\n",
528 | "0.500000 0.000000 2 2.0 1.0000 1.0000 630\r\n",
529 | "0.750000 1.000000 4 4.0 1.0000 -1.0000 870\r\n",
530 | "0.500000 0.250000 8 8.0 -1.0000 -1.0000 526\r\n",
531 | "0.687500 0.875000 16 16.0 1.0000 -1.0000 490\r\n",
532 | "0.562500 0.437500 32 32.0 -1.0000 1.0000 454\r\n",
533 | "0.515625 0.468750 64 64.0 -1.0000 1.0000 520\r\n",
534 | "0.398438 0.281250 128 128.0 1.0000 1.0000 563\r\n",
535 | "0.382812 0.367188 256 256.0 1.0000 1.0000 1311\r\n",
536 | "0.357422 0.332031 512 512.0 1.0000 1.0000 387\r\n"
537 | ]
538 | },
539 | {
540 | "output_type": "stream",
541 | "stream": "stdout",
542 | "text": [
543 | "0.312500 0.267578 1024 1024.0 1.0000 -1.0000 466\r\n"
544 | ]
545 | },
546 | {
547 | "output_type": "stream",
548 | "stream": "stdout",
549 | "text": [
550 | "0.202643 0.202643 2048 2048.0 -1.0000 -1.0000 578 h\r\n"
551 | ]
552 | },
553 | {
554 | "output_type": "stream",
555 | "stream": "stdout",
556 | "text": [
557 | "0.182418 0.162281 4096 4096.0 1.0000 1.0000 331 h\r\n"
558 | ]
559 | },
560 | {
561 | "output_type": "stream",
562 | "stream": "stdout",
563 | "text": [
564 | "0.169231 0.156044 8192 8192.0 1.0000 1.0000 955 h\r\n"
565 | ]
566 | },
567 | {
568 | "output_type": "stream",
569 | "stream": "stdout",
570 | "text": [
571 | "\r\n",
572 | "finished run\r\n",
573 | "number of examples per pass = 1440\r\n",
574 | "passes used = 9\r\n",
575 | "weighted example sum = 12960.000000\r\n",
576 | "weighted label sum = -54.000000\r\n",
577 | "average loss = 0.143750 h\r\n",
578 | "best constant = -0.004167\r\n",
579 | "best constant's loss = 0.999983\r\n",
580 | "total feature number = 9754173\r\n"
581 | ]
582 | }
583 | ],
584 | "prompt_number": 11
585 | },
586 | {
587 | "cell_type": "markdown",
588 | "metadata": {},
589 | "source": [
590 | "And now, we have a model:"
591 | ]
592 | },
593 | {
594 | "cell_type": "code",
595 | "collapsed": false,
596 | "input": [
597 | "!ls -l data/sentiment.model"
598 | ],
599 | "language": "python",
600 | "metadata": {},
601 | "outputs": [
602 | {
603 | "output_type": "stream",
604 | "stream": "stdout",
605 | "text": [
606 | "-rw-r--r-- 1 hal hal 283246 Jan 8 15:13 data/sentiment.model\r\n"
607 | ]
608 | }
609 | ],
610 | "prompt_number": 12
611 | },
612 | {
613 | "cell_type": "markdown",
614 | "metadata": {},
615 | "source": [
616 | "One thing you might have noticed is that even though we asked `vw` for 20 passes, it actually only did 9! (It tells you this in the footer.) This happens because by default `vw` does early stopping: if the holdout loss ceases to improve for three passes over the data, it stops optimizing and stores the *best* model found so far. We will later see how to adjust these defaults.\n",
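"\n",
"If you're impatient, the knob for this is (I believe; treat the exact name as an assumption and check `vw --help` for your version) `--early_terminate N`, which sets how many passes without holdout improvement `vw` tolerates before stopping, e.g.:\n",
"\n",
"    vw --binary data/sentiment.tr --passes 20 -c -k --early_terminate 10 -f data/sentiment.model\n",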
617 | "\n",
618 | "## Making Predictions\n",
619 | "\n",
620 | "Now we want to make predictions. In order to do this, we have to (a) tell `vw` to load a model, (b) tell it only to make predictions (and not to learn), and (c) tell it where to store the predictions. (Ok, technically we don't need to store the predictions anywhere if all we want to know is our error rate, but I'll assume we actually care about the output of our system.)"
621 | ]
622 | },
623 | {
624 | "cell_type": "code",
625 | "collapsed": false,
626 | "input": [
627 | "!vw --binary -t -i data/sentiment.model -p data/sentiment.te.pred data/sentiment.te"
628 | ],
629 | "language": "python",
630 | "metadata": {},
631 | "outputs": [
632 | {
633 | "output_type": "stream",
634 | "stream": "stdout",
635 | "text": [
636 | "only testing\r\n",
637 | "predictions = data/sentiment.te.pred\r\n",
638 | "Num weight bits = 18\r\n",
639 | "learning rate = 0.5\r\n",
640 | "initial_t = 0\r\n",
641 | "power_t = 0.5\r\n",
642 | "using no cache\r\n",
643 | "Reading datafile = data/sentiment.te\r\n",
644 | "num sources = 1\r\n",
645 | "average since example example current current current\r\n",
646 | "loss last counter weight label predict features\r\n",
647 | "0.000000 0.000000 1 1.0 1.0000 1.0000 967\r\n",
648 | "0.000000 0.000000 2 2.0 1.0000 1.0000 1043\r\n",
649 | "0.000000 0.000000 4 4.0 1.0000 1.0000 757\r\n",
650 | "0.000000 0.000000 8 8.0 -1.0000 -1.0000 243\r\n",
651 | "0.062500 0.125000 16 16.0 -1.0000 1.0000 345\r\n",
652 | "0.156250 0.250000 32 32.0 1.0000 -1.0000 572\r\n",
653 | "0.125000 0.093750 64 64.0 1.0000 1.0000 1517\r\n",
654 | "0.140625 0.156250 128 128.0 -1.0000 -1.0000 575\r\n",
655 | "0.160156 0.179688 256 256.0 -1.0000 -1.0000 599\r\n",
656 | "\r\n",
657 | "finished run\r\n",
658 | "number of examples per pass = 400\r\n",
659 | "passes used = 1\r\n",
660 | "weighted example sum = 400.000000\r\n",
661 | "weighted label sum = -6.000000\r\n",
662 | "average loss = 0.150000\r\n",
663 | "best constant = -0.015000\r\n",
664 | "best constant's loss = 0.999775\r\n",
665 | "total feature number = 289865\r\n"
666 | ]
667 | }
668 | ],
669 | "prompt_number": 13
670 | },
671 | {
672 | "cell_type": "markdown",
673 | "metadata": {},
674 | "source": [
675 | "Let's go through these options in turn:\n",
676 | "\n",
677 | "* `--binary`: as before, tell `vw` that this is a binary classification problem and to report loss as a zero-one value\n",
678 | "* `-t`: put `vw` in test mode. You might assume that because we're loading a model to start with, `vw` would be in test mode by default. You would be wrong. Sometimes it's useful to start from a pre-trained model and continue training later.\n",
679 | "* `-i data/sentiment.model`: tell `vw` to load an **i**nitial model from the specified file\n",
680 | "* `-p data/sentiment.te.pred`: store the predictions in the specified file\n",
681 | "* `data/sentiment.te`: the data on which to make predictions\n",
682 | "\n",
683 | "One of the most important bits of information in the output is the `average loss` which tells us our test error rate: in this case, 15% error.\n",
684 | "\n",
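"If you want to double-check that 15% yourself, a small (hedged) snippet like this should reproduce it by comparing the prediction file against the labels at the start of each test line:\n",
"\n",
"    trueLabels = [int(l.split()[0]) for l in open('data/sentiment.te')]\n",
"    predLabels = [int(l) for l in open('data/sentiment.te.pred')]\n",
"    errors = sum(1 for t,p in zip(trueLabels, predLabels) if t != p)\n",
"    print '%d/%d = %g error rate' % (errors, len(trueLabels), float(errors)/len(trueLabels))\n",
"\n",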
685 | "We can now take a look at the predictions:"
686 | ]
687 | },
688 | {
689 | "cell_type": "code",
690 | "collapsed": false,
691 | "input": [
692 | "!head data/sentiment.te.pred"
693 | ],
694 | "language": "python",
695 | "metadata": {},
696 | "outputs": [
697 | {
698 | "output_type": "stream",
699 | "stream": "stdout",
700 | "text": [
701 | "1\r\n",
702 | "1\r\n",
703 | "-1\r\n",
704 | "1\r\n",
705 | "-1\r\n",
706 | "-1\r\n",
707 | "-1\r\n",
708 | "-1\r\n",
709 | "-1\r\n",
710 | "1\r\n"
711 | ]
712 | }
713 | ],
714 | "prompt_number": 14
715 | },
716 | {
717 | "cell_type": "markdown",
718 | "metadata": {},
719 | "source": [
720 | "And yay, we've successfully made predictions!\n",
721 | "\n",
722 | "Because `vw` knows this is a binary classification problem, it's just giving you +1/-1 outputs. In many cases, we want the scalar value before thresholding occurs. We can get this by asking `vw` for **raw** predictions, using `-r` in lieu of (or in addition to) `-p`:"
723 | ]
724 | },
725 | {
726 | "cell_type": "code",
727 | "collapsed": false,
728 | "input": [
729 | "!vw --binary -t -i data/sentiment.model -r data/sentiment.te.raw data/sentiment.te"
730 | ],
731 | "language": "python",
732 | "metadata": {},
733 | "outputs": [
734 | {
735 | "output_type": "stream",
736 | "stream": "stdout",
737 | "text": [
738 | "only testing\r\n",
739 | "raw predictions = data/sentiment.te.raw\r\n",
740 | "Num weight bits = 18\r\n",
741 | "learning rate = 0.5\r\n",
742 | "initial_t = 0\r\n",
743 | "power_t = 0.5\r\n",
744 | "using no cache\r\n",
745 | "Reading datafile = data/sentiment.te\r\n",
746 | "num sources = 1\r\n",
747 | "average since example example current current current\r\n",
748 | "loss last counter weight label predict features\r\n",
749 | "0.000000 0.000000 1 1.0 1.0000 1.0000 967\r\n",
750 | "0.000000 0.000000 2 2.0 1.0000 1.0000 1043\r\n",
751 | "0.000000 0.000000 4 4.0 1.0000 1.0000 757\r\n",
752 | "0.000000 0.000000 8 8.0 -1.0000 -1.0000 243\r\n",
753 | "0.062500 0.125000 16 16.0 -1.0000 1.0000 345\r\n",
754 | "0.156250 0.250000 32 32.0 1.0000 -1.0000 572\r\n",
755 | "0.125000 0.093750 64 64.0 1.0000 1.0000 1517\r\n",
756 | "0.140625 0.156250 128 128.0 -1.0000 -1.0000 575\r\n",
757 | "0.160156 0.179688 256 256.0 -1.0000 -1.0000 599\r\n",
758 | "\r\n",
759 | "finished run\r\n",
760 | "number of examples per pass = 400\r\n",
761 | "passes used = 1\r\n",
762 | "weighted example sum = 400.000000\r\n",
763 | "weighted label sum = -6.000000\r\n",
764 | "average loss = 0.150000\r\n",
765 | "best constant = -0.015000\r\n",
766 | "best constant's loss = 0.999775\r\n",
767 | "total feature number = 289865\r\n"
768 | ]
769 | }
770 | ],
771 | "prompt_number": 15
772 | },
773 | {
774 | "cell_type": "code",
775 | "collapsed": false,
776 | "input": [
777 | "!head data/sentiment.te.raw"
778 | ],
779 | "language": "python",
780 | "metadata": {},
781 | "outputs": [
782 | {
783 | "output_type": "stream",
784 | "stream": "stdout",
785 | "text": [
786 | "0.786418\r\n",
787 | "1.720858\r\n",
788 | "-0.315573\r\n",
789 | "0.386969\r\n",
790 | "-1.752520\r\n",
791 | "-1.432538\r\n",
792 | "-0.474776\r\n",
793 | "-0.189435\r\n",
794 | "-2.108955\r\n",
795 | "1.990319\r\n"
796 | ]
797 | }
798 | ],
799 | "prompt_number": 16
800 | },
801 | {
802 | "cell_type": "markdown",
803 | "metadata": {},
804 | "source": [
805 | "The `.raw` file now contains the un-thresholded predictions. Anything greater than 0 gets mapped to +1 and anything less than zero gets mapped to -1.\n",
806 | "\n",
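"If you ever need to recover the binary labels from the raw scores yourself, a one-liner (hedged, simply mirroring the rule above) does it:\n",
"\n",
"    preds = [+1 if float(l) > 0 else -1 for l in open('data/sentiment.te.raw')]\n",
"\n",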
807 | "For fun, we can also compute our error rate on the training data. This should be lower:"
808 | ]
809 | },
810 | {
811 | "cell_type": "code",
812 | "collapsed": false,
813 | "input": [
814 | "!vw --binary -t -i data/sentiment.model data/sentiment.tr"
815 | ],
816 | "language": "python",
817 | "metadata": {},
818 | "outputs": [
819 | {
820 | "output_type": "stream",
821 | "stream": "stdout",
822 | "text": [
823 | "only testing\r\n",
824 | "Num weight bits = 18\r\n",
825 | "learning rate = 0.5\r\n",
826 | "initial_t = 0\r\n",
827 | "power_t = 0.5\r\n",
828 | "using no cache\r\n",
829 | "Reading datafile = data/sentiment.tr\r\n",
830 | "num sources = 1\r\n",
831 | "average since example example current current current\r\n",
832 | "loss last counter weight label predict features\r\n",
833 | "0.000000 0.000000 1 1.0 1.0000 1.0000 740\r\n",
834 | "0.000000 0.000000 2 2.0 1.0000 1.0000 630\r\n",
835 | "0.000000 0.000000 4 4.0 1.0000 1.0000 870\r\n",
836 | "0.000000 0.000000 8 8.0 -1.0000 -1.0000 526\r\n",
837 | "0.000000 0.000000 16 16.0 -1.0000 -1.0000 529\r\n",
838 | "0.000000 0.000000 32 32.0 1.0000 1.0000 1188\r\n",
839 | "0.000000 0.000000 64 64.0 -1.0000 -1.0000 931\r\n",
840 | "0.015625 0.031250 128 128.0 1.0000 1.0000 662\r\n",
841 | "0.011719 0.007812 256 256.0 1.0000 1.0000 922\r\n"
842 | ]
843 | },
844 | {
845 | "output_type": "stream",
846 | "stream": "stdout",
847 | "text": [
848 | "0.009766 0.007812 512 512.0 1.0000 1.0000 297\r\n",
849 | "0.011719 0.013672 1024 1024.0 1.0000 1.0000 991\r\n"
850 | ]
851 | },
852 | {
853 | "output_type": "stream",
854 | "stream": "stdout",
855 | "text": [
856 | "\r\n",
857 | "finished run\r\n",
858 | "number of examples per pass = 1600\r\n",
859 | "passes used = 1\r\n",
860 | "weighted example sum = 1600.000000\r\n",
861 | "weighted label sum = 6.000000\r\n",
862 | "average loss = 0.015625\r\n",
863 | "best constant = 0.003750\r\n",
864 | "best constant's loss = 0.999986\r\n",
865 | "total feature number = 1204816\r\n"
866 | ]
867 | }
868 | ],
869 | "prompt_number": 17
870 | },
871 | {
872 | "cell_type": "markdown",
873 | "metadata": {},
874 | "source": [
875 | "This is, indeed, quite a bit lower: a 1.56% error rate! Of course, this is cheating.\n",
876 | "\n",
877 | "Sometimes, especially at test time, you don't want `vw` to produce output while running. You can tell it to be quiet with `--quiet`.\n",
878 | "\n",
879 | "Finally, when we're making real predictions on real test data, we often don't have labels. That's fine. If you give `vw` an example without a label, it won't learn on it, but it can still make predictions. We can simulate this on the beginning of the test data, for instance by looking at:"
880 | ]
881 | },
882 | {
883 | "cell_type": "code",
884 | "collapsed": false,
885 | "input": [
886 | "!head data/sentiment.te | cut -d' ' -f2-20"
887 | ],
888 | "language": "python",
889 | "metadata": {},
890 | "outputs": [
891 | {
892 | "output_type": "stream",
893 | "stream": "stdout",
894 | "text": [
895 | "| after watching \" rat race \" last week , i noticed my cheeks were sore and realized that\r\n",
896 | "| when andy leaves for cowboy camp , his mother holds a yard sale and scrounges in his room\r\n",
897 | "| of course i knew this going in . why is it that whenever a tv-star makes a movie\r\n",
898 | "| the film \" magnolia \" can be compared to a simple flower as its title and movie poster\r\n",
899 | "| some movies ask you to leave your brain at the door , some movies ask you to believe\r\n",
900 | "| the high school comedy seems to be a hot genre of the moment . with she's all that\r\n",
901 | "| in double jeopardy , the stakes are high . think of the plot as a rehash of sleeping\r\n",
902 | "| its a stupid little movie that trys to be clever and sophisticated , yet trys a bit too\r\n",
903 | "| \" goodbye , lover \" sat on the shelf for almost a year since its lukewarm reception at\r\n",
904 | "| those of you who frequently read my reviews are not likely to be surprised by the fact that\r\n"
905 | ]
906 | }
907 | ],
908 | "prompt_number": 18
909 | },
910 | {
911 | "cell_type": "markdown",
912 | "metadata": {},
913 | "source": [
914 | "These are the first ten test examples with their labels (but not the pipe) removed, and only the first 19 words kept. When making real predictions we'll use all the words, but for printing on the screen this keeps the output small.\n",
915 | "\n",
916 | "We can pipe this directly into `vw`:"
917 | ]
918 | },
919 | {
920 | "cell_type": "code",
921 | "collapsed": false,
922 | "input": [
923 | "!head data/sentiment.te | cut -d' ' -f2- | vw --binary -t -i data/sentiment.model -r /dev/stdout --quiet"
924 | ],
925 | "language": "python",
926 | "metadata": {},
927 | "outputs": [
928 | {
929 | "output_type": "stream",
930 | "stream": "stdout",
931 | "text": [
932 | "0.786418\r\n",
933 | "1.720858\r\n",
934 | "-0.315573\r\n",
935 | "0.386969\r\n",
936 | "-1.752520\r\n",
937 | "-1.432538\r\n",
938 | "-0.474776\r\n",
939 | "-0.189435\r\n",
940 | "-2.108955\r\n",
941 | "1.990319\r\n"
942 | ]
943 | }
944 | ],
945 | "prompt_number": 19
946 | },
947 | {
948 | "cell_type": "markdown",
949 | "metadata": {},
950 | "source": [
951 | "Here, you can see that (a) `vw` can read data from standard input (in this case, the `head` of the test data), and (b) it can write its predictions to `/dev/stdout`. Because we ran in `--quiet` mode, all we got were the predictions. And note that these are the same predictions as before: `vw` isn't cheating by looking at the correct label when it's in `-t` (test) mode.\n",
952 | "\n",
953 | "# Cheat Sheet and Next Steps\n",
954 | "\n",
955 | "Train with:\n",
956 | "\n",
957 | " vw --binary --passes 20 -c -k -f MODEL DATA\n",
958 | "\n",
959 | "Predict with:\n",
960 | "\n",
961 | " vw --binary -t -i MODEL -r RAWOUTPUT DATA\n",
962 | "\n",
963 | "You're now in a position where you can successfully: download data, process it into `vw` format, train a predictor on it, and use that predictor to make test predictions.\n",
964 | "\n",
965 | "From here, you can:\n",
966 | "\n",
967 | "* Learn how to [adjust some of the default arguments to try to get better performance](GettingTheMost.ipynb)\n",
968 | "* Learn how to [adjust example weights for rare category detection and related problems](RareCategory.ipynb)\n",
969 | "* Learn how to [do more complicated classification like multiclass classification](MulticlassClassification.ipynb)\n",
970 | "* Learn how to [do multiclass classification with label-dependent features / solve ranking problems](MulticlassLDF.ipynb)\n",
971 | "* Learn how to [do unsupervised learning, like topic modeling and autoencoding](UnsupervisedNLP.ipynb)\n",
972 | "* Learn how to [do structured prediction, like part of speech tagging or dependency parsing](StructuredPrediction.ipynb)\n"
973 | ]
974 | }
975 | ],
976 | "metadata": {}
977 | }
978 | ]
979 | }
--------------------------------------------------------------------------------
/MulticlassLDF.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "",
4 | "signature": "sha256:1b3492af97916660f0d45ff003f162b1c66af4cf6e7cebce626206dd49da489d"
5 | },
6 | "nbformat": 3,
7 | "nbformat_minor": 0,
8 | "worksheets": [
9 | {
10 | "cells": [
11 | {
12 | "cell_type": "markdown",
13 | "metadata": {},
14 | "source": [
15 | "# Classification with Label Dependent Features\n",
16 | "\n",
17 | "In the [MulticlassClassification](MulticlassClassification.ipynb) notebook, we saw how to build classifiers whose output is one of K classes. In those examples, each class was treated as an atomic unit: class 1 had no more to do with class 2 than it did with class 65. In some cases, you want to associate features with the different classes (or labels): this gives rise to **label dependent feature** models. (These are also useful for [zero-shot learning](#zeroshot), which we'll see later in this notebook.)\n",
18 | "\n",
19 | "## Formalism and Data Format\n",
20 | "\n",
21 | "A natural problem where LDF arises is in search ranking-like problems. Here, the input to the system is a query (like \"vowpal wabbit ldf howto\") and the output is one of the billion-or-so documents on the web. It doesn't make a lot of sense to give each document on the web a unique id and then try to learn a mapping from inputs (queries) to outputs (documents), treating all documents independently. Instead, what you *want* to do is associate a feature vector with a *pair* $(x,y)$, where $x$ is the query and $y$ is a hypothetical returned document. This feature vector might do things like count how many of the query terms are present in the corresponding document.\n",
22 | "\n",
23 | "The associated LDF learning problem is: given some $x$ (query) and a set of possible retrieved documents $(y_1, y_2, \\dots, y_K)$, we want to compute $\\arg\\min_k f( \\varphi(x,y_k) )$, where $\\varphi(x,y_k)$ is the feature vector you extract for that pair, and $f$ is the learned \"cost\" function. *[Note: sometimes people talk about a \"score\" function that we try to maximize rather than a \"cost\" function that we try to minimize. Here, we'll talk only about costs, so lower is always better.]*\n",
24 | "\n",
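"In code, the prediction rule is nothing more than an argmin over the candidate set. Here's a hedged sketch (the names `candidates` and `cost` are placeholders for illustration, not part of `vw`):\n",
"\n",
"    def predict_ldf(x, candidates, cost):\n",
"        # cost(x, y) plays the role of f(phi(x, y)); lower is better\n",
"        return min(candidates, key=lambda y: cost(x, y))\n",
"\n",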
25 | "The basic data format (we'll see briefer alternatives later) for LDF is something like the following:\n",
26 | "\n",
27 | " 1:1 | features for pair (x1,y1) ...\n",
28 | " 2:0 | features for pair (x1,y2) ... <-- this is the correct \"label\"\n",
29 | " 3:1 | features for pair (x1,y3) ...\n",
30 | " 4:1 | features for pair (x1,y4) ...\n",
31 | " \n",
32 | " 1:0 | features for pair (x2,y1) ... <-- this is the correct \"label\"\n",
33 | " 2:1 | features for pair (x2,y2) ...\n",
34 | " 5:1 | features for pair (x2,y5) ...\n",
35 | " \n",
36 | " \n",
37 | "And so on. Here, each **block** of data is separated by a blank line. A single block corresponds to a single training example, and each line in that block corresponds to a valid prediction for that example. The \"label\" portion consists of a class identifier (1, 2, 3, 4, 5, ...) and a cost (:1 means \"cost=1\" and :0 means \"cost=0\"). After that, any features that make sense can be written out.\n",
38 | "\n",
39 | "A few notes:\n",
40 | "\n",
41 | "1. Costs can be any non-negative values, and there can be multiple labels with a cost of zero (see the small example after this list).\n",
42 | "1. The actual labels (1, 2, 3, 4, 5, ...) are completely ignored by the learning algorithm. Everything you want to encode *must* be encoded in the features. If each line in a single block has exactly the same features, it will be impossible to differentiate them. This also means that label ids needn't be consistent across blocks.\n",
43 | "1. You don't have to use consecutive label ids (as in the second block above).\n",
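"\n",
"To illustrate the first note, here is a (made-up) block in which labels 1 and 3 are both acceptable answers (cost zero) and label 2 is a mildly bad one:\n",
"\n",
"    1:0 | features for pair (x3,y1) ...\n",
"    2:0.5 | features for pair (x3,y2) ...\n",
"    3:0 | features for pair (x3,y3) ...\n",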
44 | "\n",
45 | "## LDF for Quizbowl, Step 1\n",
46 | "\n",
47 | "I'm going to assume you've already gone through the [MulticlassClassification](MulticlassClassification.ipynb) notebook and have downloaded the QuizBowl data. If not, do so now. Here is some code copied from before for loading the data."
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "collapsed": false,
53 | "input": [
54 | "import csv, sys, re, random, os\n",
55 | "\n",
56 | "def sanify(s): return s.replace(':',';').replace('|','/') # replace : with ; and | with /\n",
57 | "def tokenize(s): return re.sub('([^A-Za-z0-9 ]+)', ' \\\\1 ', s).split() # add space around anything not alphanum\n",
58 | "\n",
59 | "def readQuizBowlData(filename):\n",
60 | " train,dev,test = [],[],[]\n",
61 | " data = csv.reader( open(filename, 'r').readlines() )\n",
62 | " header = data.next()\n",
63 | " if header != ['Question ID', 'Fold', 'Category', 'Answer', 'Text']:\n",
64 | " raise Exception('data improperly formatted')\n",
65 | " for item in iter(data):\n",
66 | " y = item[3] \n",
67 | " x = sanify(' '.join(tokenize(item[4].replace('|||',''))))\n",
68 | " if item[1] == 'train': train.append( (x,y) )\n",
69 | " elif item[1] == 'dev' : dev.append( (x,y) )\n",
70 | " elif item[1] == 'test' : test.append( (x,y) )\n",
71 | " return train,dev,test\n",
72 | "\n",
73 | "def makeLabelIDs(train, outputFile):\n",
74 | " labelIds = { label: k+1 for k,label in enumerate(set([y for x,y in train])) }\n",
75 | " labelIds['***UNKNOWN***'] = len(labelIds)+1\n",
76 | " with open(outputFile, 'w') as h:\n",
77 | " for label,k in labelIds.iteritems():\n",
78 | " print >>h, '%d\\t%s' % (k,label)\n",
79 | " return labelIds\n",
80 | "\n",
81 | "train,dev,test = readQuizBowlData('data/questions/questions.csv')"
82 | ],
83 | "language": "python",
84 | "metadata": {},
85 | "outputs": [],
86 | "prompt_number": 12
87 | },
88 | {
89 | "cell_type": "markdown",
90 | "metadata": {},
91 | "source": [
92 | "\n",
93 | "We're going to start off by basically replicating the experiment there, but using a very wasteful LDF format. In particular, for each quizbowl example, we'll generate a single block. This block will have one line *for every class*. For class 10, for instance, we'll replace every feature \"f\" with \"f_10\". This will be horribly bloated, but we'll fix that later.\n",
94 | "\n",
95 | "In order to actually make this not take huge amounts of time and disk space, we'll cut down the train/dev/test significantly (to the point that they're toy data)."
96 | ]
97 | },
98 | {
99 | "cell_type": "code",
100 | "collapsed": false,
101 | "input": [
102 | "random.seed(9876)\n",
103 | "random.shuffle(train)\n",
104 | "train_small = train[:10]\n",
105 | "labelIds_small = makeLabelIDs(train_small, 'data/quizbowl.ldf.labels')\n",
106 | "print 'total of %d labels, including one for unknown' % len(labelIds_small)\n",
107 | "\n",
108 | "def writeVWFile(filename, data, labelIds):\n",
109 | " unknownId = labelIds['***UNKNOWN***']\n",
110 | " with open(filename,'w') as h:\n",
111 | " for x,y in data:\n",
112 | " trueLabel = labelIds.get(y, unknownId)\n",
113 | " feats = x.split()\n",
114 | " # generate a single block by going over EVERY POSSIBLE LABEL\n",
115 | " for k in range(1, unknownId+1): # unknownId is the last label\n",
116 | " cost = 0. if trueLabel == k else 1. # how bad is it to predict k?\n",
117 | " x_k = ' '.join(['%s_%d' % (f,k) for f in feats]) # example for label k\n",
118 | " print >>h, '%d:%g |q %s' % (k, cost, x_k)\n",
119 | " print >>h, '' # blank line to separate blocks\n",
120 | " \n",
121 | "writeVWFile('data/quizbowl.ldf.tr', train_small, labelIds_small)\n",
122 | "!wc data/quizbowl.ldf.tr\n",
123 | "!head -n12 data/quizbowl.ldf.tr"
124 | ],
125 | "language": "python",
126 | "metadata": {},
127 | "outputs": [
128 | {
129 | "output_type": "stream",
130 | "stream": "stdout",
131 | "text": [
132 | "total of 11 labels, including one for unknown\n",
133 | " 120 14212 105193 data/quizbowl.ldf.tr\r\n"
134 | ]
135 | },
136 | {
137 | "output_type": "stream",
138 | "stream": "stdout",
139 | "text": [
140 | "1:1 |q Within_1 a_1 constant_1 it_1 '_1 s_1 the_1 Laplacian_1 of_1 the_1 function_1 \"_1 one_1 over_1 distance_1 \"_1 and_1 is_1 also_1 the_1 charge_1 density_1 of_1 a_1 point_1 charge_1 ._1 Actually_1 an_1 operator_1 which_1 projects_1 out_1 a_1 function_1 '_1 s_1 value_1 at_1 the_1 origin_1 ,_1 it_1 '_1 s_1 not_1 really_1 a_1 function_1 at_1 all_1 ._1 For_1 ten_1 points_1 ,_1 name_1 this_1 mathematical_1 entity_1 ,_1 which_1 is_1 zero_1 everywhere_1 except_1 the_1 origin_1 and_1 is_1 designated_1 by_1 a_1 Greek_1 letter_1 ._1\r\n",
141 | "2:1 |q Within_2 a_2 constant_2 it_2 '_2 s_2 the_2 Laplacian_2 of_2 the_2 function_2 \"_2 one_2 over_2 distance_2 \"_2 and_2 is_2 also_2 the_2 charge_2 density_2 of_2 a_2 point_2 charge_2 ._2 Actually_2 an_2 operator_2 which_2 projects_2 out_2 a_2 function_2 '_2 s_2 value_2 at_2 the_2 origin_2 ,_2 it_2 '_2 s_2 not_2 really_2 a_2 function_2 at_2 all_2 ._2 For_2 ten_2 points_2 ,_2 name_2 this_2 mathematical_2 entity_2 ,_2 which_2 is_2 zero_2 everywhere_2 except_2 the_2 origin_2 and_2 is_2 designated_2 by_2 a_2 Greek_2 letter_2 ._2\r\n",
142 | "3:1 |q Within_3 a_3 constant_3 it_3 '_3 s_3 the_3 Laplacian_3 of_3 the_3 function_3 \"_3 one_3 over_3 distance_3 \"_3 and_3 is_3 also_3 the_3 charge_3 density_3 of_3 a_3 point_3 charge_3 ._3 Actually_3 an_3 operator_3 which_3 projects_3 out_3 a_3 function_3 '_3 s_3 value_3 at_3 the_3 origin_3 ,_3 it_3 '_3 s_3 not_3 really_3 a_3 function_3 at_3 all_3 ._3 For_3 ten_3 points_3 ,_3 name_3 this_3 mathematical_3 entity_3 ,_3 which_3 is_3 zero_3 everywhere_3 except_3 the_3 origin_3 and_3 is_3 designated_3 by_3 a_3 Greek_3 letter_3 ._3\r\n",
143 | "4:1 |q Within_4 a_4 constant_4 it_4 '_4 s_4 the_4 Laplacian_4 of_4 the_4 function_4 \"_4 one_4 over_4 distance_4 \"_4 and_4 is_4 also_4 the_4 charge_4 density_4 of_4 a_4 point_4 charge_4 ._4 Actually_4 an_4 operator_4 which_4 projects_4 out_4 a_4 function_4 '_4 s_4 value_4 at_4 the_4 origin_4 ,_4 it_4 '_4 s_4 not_4 really_4 a_4 function_4 at_4 all_4 ._4 For_4 ten_4 points_4 ,_4 name_4 this_4 mathematical_4 entity_4 ,_4 which_4 is_4 zero_4 everywhere_4 except_4 the_4 origin_4 and_4 is_4 designated_4 by_4 a_4 Greek_4 letter_4 ._4\r\n",
144 | "5:1 |q Within_5 a_5 constant_5 it_5 '_5 s_5 the_5 Laplacian_5 of_5 the_5 function_5 \"_5 one_5 over_5 distance_5 \"_5 and_5 is_5 also_5 the_5 charge_5 density_5 of_5 a_5 point_5 charge_5 ._5 Actually_5 an_5 operator_5 which_5 projects_5 out_5 a_5 function_5 '_5 s_5 value_5 at_5 the_5 origin_5 ,_5 it_5 '_5 s_5 not_5 really_5 a_5 function_5 at_5 all_5 ._5 For_5 ten_5 points_5 ,_5 name_5 this_5 mathematical_5 entity_5 ,_5 which_5 is_5 zero_5 everywhere_5 except_5 the_5 origin_5 and_5 is_5 designated_5 by_5 a_5 Greek_5 letter_5 ._5\r\n",
145 | "6:1 |q Within_6 a_6 constant_6 it_6 '_6 s_6 the_6 Laplacian_6 of_6 the_6 function_6 \"_6 one_6 over_6 distance_6 \"_6 and_6 is_6 also_6 the_6 charge_6 density_6 of_6 a_6 point_6 charge_6 ._6 Actually_6 an_6 operator_6 which_6 projects_6 out_6 a_6 function_6 '_6 s_6 value_6 at_6 the_6 origin_6 ,_6 it_6 '_6 s_6 not_6 really_6 a_6 function_6 at_6 all_6 ._6 For_6 ten_6 points_6 ,_6 name_6 this_6 mathematical_6 entity_6 ,_6 which_6 is_6 zero_6 everywhere_6 except_6 the_6 origin_6 and_6 is_6 designated_6 by_6 a_6 Greek_6 letter_6 ._6\r\n",
146 | "7:1 |q Within_7 a_7 constant_7 it_7 '_7 s_7 the_7 Laplacian_7 of_7 the_7 function_7 \"_7 one_7 over_7 distance_7 \"_7 and_7 is_7 also_7 the_7 charge_7 density_7 of_7 a_7 point_7 charge_7 ._7 Actually_7 an_7 operator_7 which_7 projects_7 out_7 a_7 function_7 '_7 s_7 value_7 at_7 the_7 origin_7 ,_7 it_7 '_7 s_7 not_7 really_7 a_7 function_7 at_7 all_7 ._7 For_7 ten_7 points_7 ,_7 name_7 this_7 mathematical_7 entity_7 ,_7 which_7 is_7 zero_7 everywhere_7 except_7 the_7 origin_7 and_7 is_7 designated_7 by_7 a_7 Greek_7 letter_7 ._7\r\n",
147 | "8:1 |q Within_8 a_8 constant_8 it_8 '_8 s_8 the_8 Laplacian_8 of_8 the_8 function_8 \"_8 one_8 over_8 distance_8 \"_8 and_8 is_8 also_8 the_8 charge_8 density_8 of_8 a_8 point_8 charge_8 ._8 Actually_8 an_8 operator_8 which_8 projects_8 out_8 a_8 function_8 '_8 s_8 value_8 at_8 the_8 origin_8 ,_8 it_8 '_8 s_8 not_8 really_8 a_8 function_8 at_8 all_8 ._8 For_8 ten_8 points_8 ,_8 name_8 this_8 mathematical_8 entity_8 ,_8 which_8 is_8 zero_8 everywhere_8 except_8 the_8 origin_8 and_8 is_8 designated_8 by_8 a_8 Greek_8 letter_8 ._8\r\n",
148 | "9:1 |q Within_9 a_9 constant_9 it_9 '_9 s_9 the_9 Laplacian_9 of_9 the_9 function_9 \"_9 one_9 over_9 distance_9 \"_9 and_9 is_9 also_9 the_9 charge_9 density_9 of_9 a_9 point_9 charge_9 ._9 Actually_9 an_9 operator_9 which_9 projects_9 out_9 a_9 function_9 '_9 s_9 value_9 at_9 the_9 origin_9 ,_9 it_9 '_9 s_9 not_9 really_9 a_9 function_9 at_9 all_9 ._9 For_9 ten_9 points_9 ,_9 name_9 this_9 mathematical_9 entity_9 ,_9 which_9 is_9 zero_9 everywhere_9 except_9 the_9 origin_9 and_9 is_9 designated_9 by_9 a_9 Greek_9 letter_9 ._9\r\n",
149 | "10:0 |q Within_10 a_10 constant_10 it_10 '_10 s_10 the_10 Laplacian_10 of_10 the_10 function_10 \"_10 one_10 over_10 distance_10 \"_10 and_10 is_10 also_10 the_10 charge_10 density_10 of_10 a_10 point_10 charge_10 ._10 Actually_10 an_10 operator_10 which_10 projects_10 out_10 a_10 function_10 '_10 s_10 value_10 at_10 the_10 origin_10 ,_10 it_10 '_10 s_10 not_10 really_10 a_10 function_10 at_10 all_10 ._10 For_10 ten_10 points_10 ,_10 name_10 this_10 mathematical_10 entity_10 ,_10 which_10 is_10 zero_10 everywhere_10 except_10 the_10 origin_10 and_10 is_10 designated_10 by_10 a_10 Greek_10 letter_10 ._10\r\n",
150 | "11:1 |q Within_11 a_11 constant_11 it_11 '_11 s_11 the_11 Laplacian_11 of_11 the_11 function_11 \"_11 one_11 over_11 distance_11 \"_11 and_11 is_11 also_11 the_11 charge_11 density_11 of_11 a_11 point_11 charge_11 ._11 Actually_11 an_11 operator_11 which_11 projects_11 out_11 a_11 function_11 '_11 s_11 value_11 at_11 the_11 origin_11 ,_11 it_11 '_11 s_11 not_11 really_11 a_11 function_11 at_11 all_11 ._11 For_11 ten_11 points_11 ,_11 name_11 this_11 mathematical_11 entity_11 ,_11 which_11 is_11 zero_11 everywhere_11 except_11 the_11 origin_11 and_11 is_11 designated_11 by_11 a_11 Greek_11 letter_11 ._11\r\n",
151 | "\r\n"
152 | ]
153 | }
154 | ],
155 | "prompt_number": 13
156 | },
157 | {
158 | "cell_type": "markdown",
159 | "metadata": {},
160 | "source": [
161 | "So in this silly data, we have 11 labels, so every block has length 11. One of these (in this case, label 10) is the correct label and has a cost of zero.\n",
162 | "\n",
163 | "Before we worry about shrinking this file, let's train and test a model on it. The main thing is the `--csoaa_ldf multiline` argument, which says we want to run cost-sensitive one-against-all training, with data provided in multiline format."
164 | ]
165 | },
166 | {
167 | "cell_type": "code",
168 | "collapsed": false,
169 | "input": [
170 | "!vw -k -c --csoaa_ldf multiline -d data/quizbowl.ldf.tr --passes 10 --holdout_off -f data/quizbowl.ldf.model\n",
171 | "!vw -i data/quizbowl.ldf.model -t -d data/quizbowl.ldf.tr -r data/quizbowl.ldf.tr.raw\n",
172 | "!head -n24 data/quizbowl.ldf.tr.raw"
173 | ],
174 | "language": "python",
175 | "metadata": {},
176 | "outputs": [
177 | {
178 | "output_type": "stream",
179 | "stream": "stdout",
180 | "text": [
181 | "final_regressor = data/quizbowl.ldf.model\r\n",
182 | "Num weight bits = 18\r\n",
183 | "learning rate = 0.5\r\n",
184 | "initial_t = 0\r\n",
185 | "power_t = 0.5\r\n",
186 | "decay_learning_rate = 1\r\n",
187 | "creating cache_file = data/quizbowl.ldf.tr.cache\r\n",
188 | "Reading datafile = data/quizbowl.ldf.tr\r\n",
189 | "num sources = 1\r\n",
190 | "average since example example current current current\r\n",
191 | "loss last counter weight label predict features\r\n",
192 | "1.000000 1.000000 1 1.0 known 1 847\r\n",
193 | "0.500000 0.000000 2 2.0 known 0 1408\r\n",
194 | "0.750000 1.000000 4 4.0 known 0 1243\r\n",
195 | "0.875000 1.000000 8 8.0 known 0 1166\r\n"
196 | ]
197 | },
198 | {
199 | "output_type": "stream",
200 | "stream": "stdout",
201 | "text": [
202 | "0.625000 0.375000 16 16.0 known 0 1386\r\n",
203 | "0.312500 0.000000 32 32.0 known 1 1408\r\n"
204 | ]
205 | },
206 | {
207 | "output_type": "stream",
208 | "stream": "stdout",
209 | "text": [
210 | "0.156250 0.000000 64 64.0 known 0 1243\r\n",
211 | "\r\n",
212 | "finished run\r\n",
213 | "number of examples per pass = 10\r\n",
214 | "passes used = 10\r\n",
215 | "weighted example sum = 100.000000\r\n",
216 | "weighted label sum = 0.000000\r\n",
217 | "average loss = 0.100000\r\n",
218 | "total feature number = 141020\r\n"
219 | ]
220 | },
221 | {
222 | "output_type": "stream",
223 | "stream": "stdout",
224 | "text": [
225 | "only testing\r\n",
226 | "raw predictions = data/quizbowl.ldf.tr.raw\r\n",
227 | "Num weight bits = 18\r\n",
228 | "learning rate = 0.5\r\n",
229 | "initial_t = 0\r\n",
230 | "power_t = 0.5\r\n",
231 | "using no cache\r\n",
232 | "Reading datafile = data/quizbowl.ldf.tr\r\n",
233 | "num sources = 1\r\n",
234 | "average since example example current current current\r\n",
235 | "loss last counter weight label predict features\r\n",
236 | "0.000000 0.000000 1 1.0 known 0 847\r\n",
237 | "0.000000 0.000000 2 2.0 known 1 1408\r\n",
238 | "0.000000 0.000000 4 4.0 known 0 1243\r\n",
239 | "0.000000 0.000000 8 8.0 known 0 1166\r\n",
240 | "\r\n",
241 | "finished run\r\n",
242 | "number of examples per pass = 10\r\n",
243 | "passes used = 1\r\n",
244 | "weighted example sum = 10.000000\r\n",
245 | "weighted label sum = 0.000000\r\n",
246 | "average loss = 0.000000\r\n",
247 | "total feature number = 14102\r\n"
248 | ]
249 | },
250 | {
251 | "output_type": "stream",
252 | "stream": "stdout",
253 | "text": [
254 | "1:1.59138\r\n",
255 | "2:1.53712\r\n",
256 | "3:1.5944\r\n",
257 | "4:1.63682\r\n",
258 | "5:1.55016\r\n",
259 | "6:1.48501\r\n",
260 | "7:1.33151\r\n",
261 | "8:1.3858\r\n",
262 | "9:1.44881\r\n",
263 | "10:-0.202838\r\n",
264 | "11:1.6381\r\n",
265 | "\r\n",
266 | "1:-0.0452115\r\n",
267 | "2:1.05496\r\n",
268 | "3:1.07181\r\n",
269 | "4:1.01842\r\n",
270 | "5:1.14079\r\n",
271 | "6:1.2917\r\n",
272 | "7:1.00388\r\n",
273 | "8:1.12249\r\n",
274 | "9:1.32976\r\n",
275 | "10:1.89733\r\n",
276 | "11:1.30963\r\n",
277 | "\r\n"
278 | ]
279 | }
280 | ],
281 | "prompt_number": 14
282 | },
283 | {
284 | "cell_type": "markdown",
285 | "metadata": {},
286 | "source": [
287 | "Here, we've trained a model and made predictions on the training data. The numbers printed in the raw file are the expected costs.\n",
288 | "\n",
289 | "Now, there's a lot of stuff going on here that's silly. We'll address these in order:\n",
290 | "\n",
291 | "1. The features in the above example are really a cross product between the input $x$ and the label $y_k$. But `vw` has built in support for cross-produces with `-q`, so let's just use that. After fixing that problem, the same input $x$ gets repeated over and over again. That's really redundant.\n",
292 | "1. Now, the same features (or in this case, single feature) gets repeated over and over again for the label. That's also redundant.\n",
293 | "1. Finally, we'd like more interesting features on the label.\n",
294 | "\n",
295 | "## Separating out the input and output using quadratic features: shared features\n",
296 | "\n",
297 | "Hopefully it's clear that a training example like:\n",
298 | "\n",
299 | " 1:1 |q If_1 it_1 were_1 always_1 true_1 ,_1 fractional_1 distillation_1 would_1 ...\n",
300 | " \n",
301 | "is exactly the same as:\n",
302 | "\n",
303 | " 1:1 |l 1 |q If it were always true , fractional distillation would ...\n",
304 | "\n",
305 | "if `-q lq` is added to the command line. This simplifies the data, but then the blocks will look like:\n",
306 | "\n",
307 | " 1:1 |l 1 |q If it were always true , fractional distillation would ...\n",
308 | " 2:1 |l 2 |q If it were always true , fractional distillation would ...\n",
309 | " 3:1 |l 3 |q If it were always true , fractional distillation would ...\n",
310 | " 4:1 |l 4 |q If it were always true , fractional distillation would ...\n",
311 | " ...\n",
312 | " \n",
313 | "where the \"`q`\" part will be the same in every example. We can fix this problem by having \"`shared`\" features for blocks. This looks like the following:\n",
314 | "\n",
315 | " ***shared*** |q If it were always true , fractional distillation would ...\n",
316 | " 1:1 |l 1\n",
317 | " 2:1 |l 2\n",
318 | " 3:1 |l 3\n",
319 | " 4:1 |l 4\n",
320 | " ...\n",
321 | "\n",
322 | "What this says is: pretend that the `shared` line appears on every example within this block. We can fix our code to output data that looks like this, which will be much smaller."
323 | ]
324 | },
325 | {
326 | "cell_type": "code",
327 | "collapsed": false,
328 | "input": [
329 | "def writeVWFile(filename, data, labelIds):\n",
330 | " unknownId = labelIds['***UNKNOWN***']\n",
331 | " with open(filename,'w') as h:\n",
332 | " for x,y in data:\n",
333 | " print >>h, '***shared*** |q %s' % x\n",
334 | " trueLabel = labelIds.get(y, unknownId)\n",
335 | " # generate a single block by going over EVERY POSSIBLE LABEL\n",
336 | " for k in range(1, unknownId+1): # unknownId is the last label\n",
337 | " cost = 0. if trueLabel == k else 1. # how bad is it to predict k?\n",
338 | " print >>h, '%d:%g |l %d' % (k, cost, k)\n",
339 | " print >>h, '' # blank line to separate blocks\n",
340 | " \n",
341 | "writeVWFile('data/quizbowl.ldf.tr', train_small, labelIds_small)\n",
342 | "!head -n13 data/quizbowl.ldf.tr\n",
343 | "!vw -k -c --csoaa_ldf multiline -d data/quizbowl.ldf.tr --passes 10 --holdout_off -f data/quizbowl.ldf.model -q lq\n",
344 | "!vw -i data/quizbowl.ldf.model -t -d data/quizbowl.ldf.tr -r data/quizbowl.ldf.tr.raw\n",
345 | "!head -n24 data/quizbowl.ldf.tr.raw"
346 | ],
347 | "language": "python",
348 | "metadata": {},
349 | "outputs": [
350 | {
351 | "output_type": "stream",
352 | "stream": "stdout",
353 | "text": [
354 | "***shared*** |q Within a constant it ' s the Laplacian of the function \" one over distance \" and is also the charge density of a point charge . Actually an operator which projects out a function ' s value at the origin , it ' s not really a function at all . For ten points , name this mathematical entity , which is zero everywhere except the origin and is designated by a Greek letter .\r\n",
355 | "1:1 |l 1\r\n",
356 | "2:1 |l 2\r\n",
357 | "3:1 |l 3\r\n",
358 | "4:1 |l 4\r\n",
359 | "5:1 |l 5\r\n",
360 | "6:1 |l 6\r\n",
361 | "7:1 |l 7\r\n",
362 | "8:1 |l 8\r\n",
363 | "9:1 |l 9\r\n",
364 | "10:0 |l 10\r\n",
365 | "11:1 |l 11\r\n",
366 | "\r\n"
367 | ]
368 | },
369 | {
370 | "output_type": "stream",
371 | "stream": "stdout",
372 | "text": [
373 | "creating quadratic features for pairs: lq \r\n",
374 | "final_regressor = data/quizbowl.ldf.model\r\n",
375 | "Num weight bits = 18\r\n",
376 | "learning rate = 0.5\r\n",
377 | "initial_t = 0\r\n",
378 | "power_t = 0.5\r\n",
379 | "decay_learning_rate = 1\r\n",
380 | "creating cache_file = data/quizbowl.ldf.tr.cache\r\n",
381 | "Reading datafile = data/quizbowl.ldf.tr\r\n",
382 | "num sources = 1\r\n",
383 | "average since example example current current current\r\n",
384 | "loss last counter weight label predict features\r\n",
385 | "1.000000 1.000000 1 1.0 known 1 858\r\n",
386 | "0.500000 0.000000 2 2.0 known 0 1419\r\n",
387 | "0.750000 1.000000 4 4.0 known 0 1254\r\n",
388 | "0.875000 1.000000 8 8.0 known 0 1177\r\n",
389 | "0.625000 0.375000 16 16.0 known 0 1397\r\n",
390 | "0.312500 0.000000 32 32.0 known 1 1419\r\n",
391 | "0.156250 0.000000 64 64.0 known 0 1254\r\n",
392 | "\r\n",
393 | "finished run\r\n",
394 | "number of examples per pass = 10\r\n",
395 | "passes used = 10\r\n",
396 | "weighted example sum = 100.000000\r\n",
397 | "weighted label sum = 0.000000\r\n",
398 | "average loss = 0.100000\r\n",
399 | "total feature number = 142120\r\n"
400 | ]
401 | },
402 | {
403 | "output_type": "stream",
404 | "stream": "stdout",
405 | "text": [
406 | "creating quadratic features for pairs: lq \r\n",
407 | "only testing\r\n",
408 | "raw predictions = data/quizbowl.ldf.tr.raw\r\n",
409 | "Num weight bits = 18\r\n",
410 | "learning rate = 0.5\r\n",
411 | "initial_t = 0\r\n",
412 | "power_t = 0.5\r\n",
413 | "using no cache\r\n",
414 | "Reading datafile = data/quizbowl.ldf.tr\r\n",
415 | "num sources = 1\r\n",
416 | "average since example example current current current\r\n",
417 | "loss last counter weight label predict features\r\n",
418 | "0.000000 0.000000 1 1.0 known 0 858\r\n",
419 | "0.000000 0.000000 2 2.0 known 1 1419\r\n",
420 | "0.000000 0.000000 4 4.0 known 0 1254\r\n",
421 | "0.000000 0.000000 8 8.0 known 0 1177\r\n",
422 | "\r\n",
423 | "finished run\r\n",
424 | "number of examples per pass = 10\r\n",
425 | "passes used = 1\r\n",
426 | "weighted example sum = 10.000000\r\n",
427 | "weighted label sum = 0.000000\r\n",
428 | "average loss = 0.000000\r\n",
429 | "total feature number = 14212\r\n"
430 | ]
431 | },
432 | {
433 | "output_type": "stream",
434 | "stream": "stdout",
435 | "text": [
436 | "1:1.41836\r\n",
437 | "2:1.0625\r\n",
438 | "3:1.06238\r\n",
439 | "4:1.07689\r\n",
440 | "5:1.1484\r\n",
441 | "6:1.14225\r\n",
442 | "7:1.07524\r\n",
443 | "8:1.08765\r\n",
444 | "9:1.01596\r\n",
445 | "10:-0.0424166\r\n",
446 | "11:1.48031\r\n",
447 | "\r\n",
448 | "1:-0.178392\r\n",
449 | "2:1.71998\r\n",
450 | "3:1.3529\r\n",
451 | "4:1.09386\r\n",
452 | "5:1.41592\r\n",
453 | "6:1.11247\r\n",
454 | "7:1.11531\r\n",
455 | "8:1.17376\r\n",
456 | "9:1.12266\r\n",
457 | "10:1.56086\r\n",
458 | "11:1.25449\r\n",
459 | "\r\n"
460 | ]
461 | }
462 | ],
463 | "prompt_number": 15
464 | },
465 | {
466 | "cell_type": "markdown",
467 | "metadata": {},
468 | "source": [
469 | "Note that the output is similar but not identical. It's not identical because there are extra features (the `-q` continues to include the linear terms, which are not in the original formulation).\n",
470 | "\n",
471 | "## Headers with label-specific features\n",
472 | "\n",
473 | "In some cases, you really want to be able to define whatever arbitrary features you want for $\\varphi(x,y_k)$. But in many cases, $\\varphi(x,y_k)$ is just the outer product of some features of $x$ (that might be written as `shared`) features and some features of $y_k$. In the case above, the features of $y_k$ is just the singleton feature `k`, but (as we'll see below) you often want more. We can specify all of these in a header as follows:\n",
474 | "\n",
475 | " label:1 |l 1\n",
476 | " label:2 |l 2\n",
477 | " ...\n",
478 | " label:11 |l 11\n",
479 | " \n",
480 | " \n",
481 | "This batch of label definitions basically says \"any time you see an example with, eg, label 2, add \"`|l 2`\" to it. We can then generate the equivalent data much more simply as:"
482 | ]
483 | },
484 | {
485 | "cell_type": "code",
486 | "collapsed": false,
487 | "input": [
488 | "def writeVWFile(filename, data, labelIds):\n",
489 | " unknownId = labelIds['***UNKNOWN***']\n",
490 | " with open(filename,'w') as h:\n",
491 | " for k in range(1, unknownId+1):\n",
492 | " print >>h, '***label***:%d |l %d' % (k,k)\n",
493 | " print >>h, ''\n",
494 | " \n",
495 | " for x,y in data:\n",
496 | " print >>h, '***shared*** |q %s' % x\n",
497 | " trueLabel = labelIds.get(y, unknownId)\n",
498 | " # generate a single block by going over EVERY POSSIBLE LABEL\n",
499 | " for k in range(1, unknownId+1): # unknownId is the last label\n",
500 | " cost = 0. if trueLabel == k else 1. # how bad is it to predict k?\n",
501 | " print >>h, '%d:%g | bias' % (k, cost)\n",
502 | " print >>h, '' # blank line to separate blocks\n",
503 | " \n",
504 | "writeVWFile('data/quizbowl.ldf.tr', train_small, labelIds_small)\n",
505 | "!echo -n \"\\nTRAINING DATA:\\n\\n\"\n",
506 | "!head -n20 data/quizbowl.ldf.tr\n",
507 | "!echo -n \"\\nTRAINING:\\n\\n\"\n",
508 | "!vw -k -c --csoaa_ldf multiline -d data/quizbowl.ldf.tr --passes 10 --holdout_off -f data/quizbowl.ldf.model -q lq\n",
509 | "!echo -n \"\\nTESTING:\\n\\n\"\n",
510 | "!vw -i data/quizbowl.ldf.model -t -d data/quizbowl.ldf.tr -r data/quizbowl.ldf.tr.raw\n",
511 | "!echo -n \"\\nOUTPUT (HEAD):\\n\\n\"\n",
512 | "!head -n24 data/quizbowl.ldf.tr.raw"
513 | ],
514 | "language": "python",
515 | "metadata": {},
516 | "outputs": [
517 | {
518 | "output_type": "stream",
519 | "stream": "stdout",
520 | "text": [
521 | "\r\n",
522 | "TRAINING DATA:\r\n",
523 | "\r\n"
524 | ]
525 | },
526 | {
527 | "output_type": "stream",
528 | "stream": "stdout",
529 | "text": [
530 | "***label***:1 |l 1\r\n",
531 | "***label***:2 |l 2\r\n",
532 | "***label***:3 |l 3\r\n",
533 | "***label***:4 |l 4\r\n",
534 | "***label***:5 |l 5\r\n",
535 | "***label***:6 |l 6\r\n",
536 | "***label***:7 |l 7\r\n",
537 | "***label***:8 |l 8\r\n",
538 | "***label***:9 |l 9\r\n",
539 | "***label***:10 |l 10\r\n",
540 | "***label***:11 |l 11\r\n",
541 | "\r\n",
542 | "***shared*** |q Within a constant it ' s the Laplacian of the function \" one over distance \" and is also the charge density of a point charge . Actually an operator which projects out a function ' s value at the origin , it ' s not really a function at all . For ten points , name this mathematical entity , which is zero everywhere except the origin and is designated by a Greek letter .\r\n",
543 | "1:1 | bias\r\n",
544 | "2:1 | bias\r\n",
545 | "3:1 | bias\r\n",
546 | "4:1 | bias\r\n",
547 | "5:1 | bias\r\n",
548 | "6:1 | bias\r\n",
549 | "7:1 | bias\r\n"
550 | ]
551 | },
552 | {
553 | "output_type": "stream",
554 | "stream": "stdout",
555 | "text": [
556 | "\r\n",
557 | "TRAINING:\r\n",
558 | "\r\n"
559 | ]
560 | },
561 | {
562 | "output_type": "stream",
563 | "stream": "stdout",
564 | "text": [
565 | "creating quadratic features for pairs: lq \r\n",
566 | "final_regressor = data/quizbowl.ldf.model\r\n",
567 | "Num weight bits = 18\r\n",
568 | "learning rate = 0.5\r\n",
569 | "initial_t = 0\r\n",
570 | "power_t = 0.5\r\n",
571 | "decay_learning_rate = 1\r\n",
572 | "creating cache_file = data/quizbowl.ldf.tr.cache\r\n",
573 | "Reading datafile = data/quizbowl.ldf.tr\r\n",
574 | "num sources = 1\r\n",
575 | "average since example example current current current\r\n",
576 | "loss last counter weight label predict features\r\n",
577 | "1.000000 1.000000 1 1.0 known 1 880\r\n",
578 | "0.500000 0.000000 2 2.0 known 0 1441\r\n",
579 | "0.750000 1.000000 4 4.0 known 0 1276\r\n",
580 | "0.875000 1.000000 8 8.0 known 0 1199\r\n",
581 | "0.625000 0.375000 16 16.0 known 0 1419\r\n",
582 | "0.312500 0.000000 32 32.0 known 1 1441\r\n",
583 | "0.156250 0.000000 64 64.0 known 0 1276\r\n",
584 | "\r\n",
585 | "finished run\r\n",
586 | "number of examples per pass = 10\r\n",
587 | "passes used = 10\r\n",
588 | "weighted example sum = 100.000000\r\n",
589 | "weighted label sum = 0.000000\r\n",
590 | "average loss = 0.100000\r\n",
591 | "total feature number = 144320\r\n"
592 | ]
593 | },
594 | {
595 | "output_type": "stream",
596 | "stream": "stdout",
597 | "text": [
598 | "\r\n",
599 | "TESTING:\r\n",
600 | "\r\n"
601 | ]
602 | },
603 | {
604 | "output_type": "stream",
605 | "stream": "stdout",
606 | "text": [
607 | "creating quadratic features for pairs: lq \r\n",
608 | "only testing\r\n",
609 | "raw predictions = data/quizbowl.ldf.tr.raw\r\n",
610 | "Num weight bits = 18\r\n",
611 | "learning rate = 0.5\r\n",
612 | "initial_t = 0\r\n",
613 | "power_t = 0.5\r\n",
614 | "using no cache\r\n",
615 | "Reading datafile = data/quizbowl.ldf.tr\r\n",
616 | "num sources = 1\r\n",
617 | "average since example example current current current\r\n",
618 | "loss last counter weight label predict features\r\n",
619 | "0.000000 0.000000 1 1.0 known 0 869\r\n",
620 | "0.000000 0.000000 2 2.0 known 1 1430\r\n",
621 | "0.000000 0.000000 4 4.0 known 0 1265\r\n",
622 | "0.000000 0.000000 8 8.0 known 0 1188\r\n",
623 | "\r\n",
624 | "finished run\r\n",
625 | "number of examples per pass = 10\r\n",
626 | "passes used = 1\r\n",
627 | "weighted example sum = 10.000000\r\n",
628 | "weighted label sum = 0.000000\r\n",
629 | "average loss = 0.000000\r\n",
630 | "total feature number = 14322\r\n"
631 | ]
632 | },
633 | {
634 | "output_type": "stream",
635 | "stream": "stdout",
636 | "text": [
637 | "\r\n",
638 | "OUTPUT (HEAD):\r\n",
639 | "\r\n"
640 | ]
641 | },
642 | {
643 | "output_type": "stream",
644 | "stream": "stdout",
645 | "text": [
646 | "1:1.41235\r\n",
647 | "2:1.05971\r\n",
648 | "3:1.06102\r\n",
649 | "4:1.07412\r\n",
650 | "5:1.14742\r\n",
651 | "6:1.14116\r\n",
652 | "7:1.07265\r\n",
653 | "8:1.08606\r\n",
654 | "9:1.01476\r\n",
655 | "10:-0.0439742\r\n",
656 | "11:1.47786\r\n",
657 | "\r\n",
658 | "1:-0.178582\r\n",
659 | "2:1.71463\r\n",
660 | "3:1.34784\r\n",
661 | "4:1.08935\r\n",
662 | "5:1.41376\r\n",
663 | "6:1.11073\r\n",
664 | "7:1.11116\r\n",
665 | "8:1.1721\r\n",
666 | "9:1.12157\r\n",
667 | "10:1.55313\r\n",
668 | "11:1.25278\r\n",
669 | "\r\n"
670 | ]
671 | }
672 | ],
673 | "prompt_number": 16
674 | },
675 | {
676 | "cell_type": "markdown",
677 | "metadata": {},
678 | "source": [
679 | "Note that in the above we had to add a \"bias\" feature to each example. Technically this should be unnecessary, but for some reason there's an arguable bug in vw where it ignores these examples with no features; until that bug is fixed, you have to have at least *something* written as a feature for each example.\n",
680 | "\n",
681 | "## Actually putting useful features in the label-dependent features\n",
682 | "\n",
683 | "Right now we're not actually sharing any useful information in the label dependent features, *but we can*! For the QuizBowl data, each label corresponds to a Wikipedia page, so we'll make the labels depend on the words that appear in that page."
684 | ]
685 | },
686 | {
687 | "cell_type": "code",
688 | "collapsed": false,
689 | "input": [
690 | "from collections import Counter\n",
691 | "\n",
692 | "def readWikipedia(labelIds):\n",
693 | " wiki = {}\n",
694 | " directory = 'data' + os.sep + 'questions' + os.sep + 'wiki' + os.sep\n",
695 | " for labelName in labelIds.iterkeys():\n",
696 | " try:\n",
697 | " with open(directory + labelName + '.txt', 'r') as h:\n",
698 | " bow = Counter()\n",
699 | " for l in h.readlines():\n",
700 | " if not l.startswith('== '): continue\n",
701 | " for w in tokenize(l.lower())[1:-1]:\n",
702 | " bow[w] += 1\n",
703 | " wiki[labelName] = ' '.join(['%s:%d' % (sanify(w), c) for (w,c) in bow.iteritems()])\n",
704 | " except IOError:\n",
705 | " print 'warning: could not read wikipedia page for %s' % labelName\n",
706 | " wiki[labelName] = 'unknown'\n",
707 | " return wiki\n",
708 | "\n",
709 | "def writeVWFile(filename, data, labelIds, wikipedia):\n",
710 | " unknownId = labelIds['***UNKNOWN***']\n",
711 | " with open(filename,'w') as h:\n",
712 | " for labelName, k in labelIds.iteritems():\n",
713 | " print >>h, '***label***:%d |l label%d %s' % (k, k, wikipedia[labelName])\n",
714 | " print >>h, ''\n",
715 | " \n",
716 | " for x,y in data:\n",
717 | " print >>h, '***shared*** |q %s' % x\n",
718 | " trueLabel = labelIds.get(y, unknownId)\n",
719 | " # generate a single block by going over EVERY POSSIBLE LABEL\n",
720 | " for k in range(1, unknownId+1): # unknownId is the last label\n",
721 | " cost = 0. if trueLabel == k else 1. # how bad is it to predict k?\n",
722 | " print >>h, '%d:%g | bias' % (k, cost)\n",
723 | " print >>h, '' # blank line to separate blocks\n",
724 | " \n",
725 | "labelIds = makeLabelIDs(train, 'data/quizbowl.ldf.labels')\n",
726 | "print 'total of %d labels, including one for unknown' % len(labelIds)\n",
727 | "\n",
728 | "wikipedia = readWikipedia(labelIds_small)\n",
729 | "\n",
730 | "writeVWFile('data/quizbowl.ldf.tr', train_small, labelIds_small, wikipedia)\n",
731 | "!echo -n \"\\nTRAINING DATA:\\n\\n\"\n",
732 | "!head -n20 data/quizbowl.ldf.tr\n",
733 | "!echo -n \"\\nTRAINING:\\n\\n\"\n",
734 | "!vw -k -c --csoaa_ldf multiline -d data/quizbowl.ldf.tr --passes 10 --holdout_off -f data/quizbowl.ldf.model -q lq\n",
735 | "!echo -n \"\\nTESTING:\\n\\n\"\n",
736 | "!vw -i data/quizbowl.ldf.model -t -d data/quizbowl.ldf.tr -r data/quizbowl.ldf.tr.raw\n",
737 | "!echo -n \"\\nOUTPUT (HEAD):\\n\\n\"\n",
738 | "!head -n24 data/quizbowl.ldf.tr.raw"
739 | ],
740 | "language": "python",
741 | "metadata": {},
742 | "outputs": [
743 | {
744 | "output_type": "stream",
745 | "stream": "stdout",
746 | "text": [
747 | "total of 2321 labels, including one for unknown\n",
748 | "warning: could not read wikipedia page for ***UNKNOWN***\n",
749 | "\r\n",
750 | "TRAINING DATA:\r\n",
751 | "\r\n"
752 | ]
753 | },
754 | {
755 | "output_type": "stream",
756 | "stream": "stdout",
757 | "text": [
758 | "***label***:2 |l label2 opium:2 see:1 also:1 second:1 references:1 war:2 first:1\r\n",
759 | "***label***:4 |l label4 theory:1 paintings:1 color:1 external:1 influence:1 links:1 also:1 see:1 references:1 drawings:1 biography:1 exhibitions:1\r\n",
760 | "***label***:5 |l label5 and:1 also:1 industrial:1 safety:1 links:1 compounds:1 characteristics:1 notes:1 storage:1 citations:1 applications:1 see:1 production:1 role:1 external:1 references:1 of:1 biological:1 precautions:1 o2:1 history:1\r\n",
761 | "***label***:3 |l label3 and:1 activities:1 life:2 writings:1 bibliography:1 family:1 service:1 notes:1 later:1 links:1 early:1 legacy:1 references:1 reading:1 in:1 navy:1 further:1 the:1 works:1 religious:1 external:1\r\n",
762 | "***label***:6 |l label6 links:1 also:1 rankings:1 demographics:1 see:1 culture:1 etymology:1 history:1 references:1 external:1 international:1 politics:1 economy:1 notes:1 geography:1\r\n",
763 | "***label***:11 |l label11 unknown\r\n",
764 | "***label***:7 |l label7 and:1 colleges:1 links:1 people:1 culture:1 references:1 politics:1 research:1 around:1 universities:1 relations:1 sports:1 international:1 geography:1 economy:1 transportation:1 scientific:1 demographics:1 munich:1 famous:1 external:1 subdivisions:1 architecture:1 institutions:1 history:1\r\n",
765 | "***label***:1 |l label1 and:3 life:1 modern:1 letters:1 links:1 of:1 abelard:1 philosophy:1 translations:1 theology:1 cultural:1 citations:1 music:1 external:1 editions:1 further:1 references:2 the:1 reading:1 heloise:1 notes:1\r\n",
766 | "***label***:8 |l label8 and:1 death:1 links:1 notes:1 work:1 also:1 see:1 legacy:1 references:1 external:1 further:1 reading:1 biography:1 reputation:1\r\n",
767 | "***label***:9 |l label9 literary:1 life:1 links:1 career:1 private:1 years:1 early:1 reputation:1 references:1 external:1 works:1\r\n",
768 | "***label***:10 |l label10 links:1 sokhotski:1 overview:1 see:1 references:1 derivatives:1 also:1 distributional:1 -:1 transform:1 to:1 kronecker:1 function:1 relationship:1 theorem:1 applications:1 external:1 delta:2 representations:1 comb:1 properties:1 plemelj:1 dirac:1 of:1 notes:1 definitions:1 the:2 fourier:1 history:1\r\n",
769 | "\r\n",
770 | "***shared*** |q Within a constant it ' s the Laplacian of the function \" one over distance \" and is also the charge density of a point charge . Actually an operator which projects out a function ' s value at the origin , it ' s not really a function at all . For ten points , name this mathematical entity , which is zero everywhere except the origin and is designated by a Greek letter .\r\n",
771 | "1:1 | bias\r\n",
772 | "2:1 | bias\r\n",
773 | "3:1 | bias\r\n",
774 | "4:1 | bias\r\n",
775 | "5:1 | bias\r\n",
776 | "6:1 | bias\r\n",
777 | "7:1 | bias\r\n"
778 | ]
779 | },
780 | {
781 | "output_type": "stream",
782 | "stream": "stdout",
783 | "text": [
784 | "\r\n",
785 | "TRAINING:\r\n",
786 | "\r\n"
787 | ]
788 | },
789 | {
790 | "output_type": "stream",
791 | "stream": "stdout",
792 | "text": [
793 | "creating quadratic features for pairs: lq \r\n",
794 | "final_regressor = data/quizbowl.ldf.model\r\n",
795 | "Num weight bits = 18\r\n",
796 | "learning rate = 0.5\r\n",
797 | "initial_t = 0\r\n",
798 | "power_t = 0.5\r\n",
799 | "decay_learning_rate = 1\r\n",
800 | "creating cache_file = data/quizbowl.ldf.tr.cache\r\n",
801 | "Reading datafile = data/quizbowl.ldf.tr\r\n",
802 | "num sources = 1\r\n",
803 | "average since example example current current current\r\n",
804 | "loss last counter weight label predict features\r\n",
805 | "1.000000 1.000000 1 1.0 known 1 1234\r\n",
806 | "0.500000 0.000000 2 2.0 known 0 1795\r\n",
807 | "0.750000 1.000000 4 4.0 known 0 1630\r\n",
808 | "0.875000 1.000000 8 8.0 known 0 1553\r\n",
809 | "0.625000 0.375000 16 16.0 known 0 1773\r\n",
810 | "0.312500 0.000000 32 32.0 known 1 1795\r\n",
811 | "0.156250 0.000000 64 64.0 known 0 1630\r\n",
812 | "\r\n",
813 | "finished run\r\n",
814 | "number of examples per pass = 10\r\n",
815 | "passes used = 10\r\n",
816 | "weighted example sum = 100.000000\r\n",
817 | "weighted label sum = 0.000000\r\n",
818 | "average loss = 0.100000\r\n",
819 | "total feature number = 179720\r\n"
820 | ]
821 | },
822 | {
823 | "output_type": "stream",
824 | "stream": "stdout",
825 | "text": [
826 | "\r\n",
827 | "TESTING:\r\n",
828 | "\r\n"
829 | ]
830 | },
831 | {
832 | "output_type": "stream",
833 | "stream": "stdout",
834 | "text": [
835 | "creating quadratic features for pairs: lq \r\n",
836 | "only testing\r\n",
837 | "raw predictions = data/quizbowl.ldf.tr.raw\r\n",
838 | "Num weight bits = 18\r\n",
839 | "learning rate = 0.5\r\n",
840 | "initial_t = 0\r\n",
841 | "power_t = 0.5\r\n",
842 | "using no cache\r\n",
843 | "Reading datafile = data/quizbowl.ldf.tr\r\n",
844 | "num sources = 1\r\n",
845 | "average since example example current current current\r\n",
846 | "loss last counter weight label predict features\r\n",
847 | "0.000000 0.000000 1 1.0 known 0 1046\r\n",
848 | "0.000000 0.000000 2 2.0 known 1 1607\r\n",
849 | "0.000000 0.000000 4 4.0 known 0 1442\r\n",
850 | "0.000000 0.000000 8 8.0 known 0 1365\r\n",
851 | "\r\n",
852 | "finished run\r\n",
853 | "number of examples per pass = 10\r\n",
854 | "passes used = 1\r\n",
855 | "weighted example sum = 10.000000\r\n",
856 | "weighted label sum = 0.000000\r\n",
857 | "average loss = 0.000000\r\n",
858 | "total feature number = 16092\r\n"
859 | ]
860 | },
861 | {
862 | "output_type": "stream",
863 | "stream": "stdout",
864 | "text": [
865 | "\r\n",
866 | "OUTPUT (HEAD):\r\n",
867 | "\r\n"
868 | ]
869 | },
870 | {
871 | "output_type": "stream",
872 | "stream": "stdout",
873 | "text": [
874 | "1:1.60651\r\n",
875 | "2:1.55012\r\n",
876 | "3:1.27702\r\n",
877 | "4:1.30667\r\n",
878 | "5:1.10201\r\n",
879 | "6:1.17644\r\n",
880 | "7:1.12054\r\n",
881 | "8:1.09035\r\n",
882 | "9:1.24678\r\n",
883 | "10:-0.164604\r\n",
884 | "11:1.52328\r\n",
885 | "\r\n",
886 | "1:-0.112427\r\n",
887 | "2:1.21939\r\n",
888 | "3:1.41709\r\n",
889 | "4:1.15276\r\n",
890 | "5:1.51793\r\n",
891 | "6:1.23505\r\n",
892 | "7:1.12692\r\n",
893 | "8:1.1945\r\n",
894 | "9:1.19273\r\n",
895 | "10:1.9022\r\n",
896 | "11:1.27931\r\n",
897 | "\r\n"
898 | ]
899 | }
900 | ],
901 | "prompt_number": 17
902 | }
903 | ],
904 | "metadata": {}
905 | }
906 | ]
907 | }
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # vwnlp
2 | Solving NLP problems with Vowpal Wabbit: Tutorial and more
3 |
4 | You should start by reading GettingStarted.ipynb and then use the links at the bottom there to learn more complicated things!
5 |
6 | Oh, you should also install [vowpal wabbit](https://github.com/JohnLangford/vowpal_wabbit)
7 |
8 |
--------------------------------------------------------------------------------
/RareCategory.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "",
4 | "signature": "sha256:91769ee8fccda583b214132ef11e1e99e61df0bd952f499a034720471f78cfa1"
5 | },
6 | "nbformat": 3,
7 | "nbformat_minor": 0,
8 | "worksheets": [
9 | {
10 | "cells": [
11 | {
12 | "cell_type": "markdown",
13 | "metadata": {},
14 | "source": [
15 | "# Rare Category Detection and Related Tasks\n",
16 | "\n",
17 | "By default, `vw` optimizes (a surrogate to) zero-one loss for binary classification. This is great for \"balanced\" or approximately balanced problems, in which there are (roughly) an equal number of positive and negative examples.\n",
18 | "\n",
19 | "Some tasks, however, are more like \"needle in a haystack\" type problems, where there are very few positives (needles) to find amongst a large set of negatives (hay). For such problems, zero-one binary loss is just not a good measure to optimize. If 99% of the data is negative, then you can simply label everything as negative and get 1% error. This is not what we want.\n",
20 | "\n",
21 | "Often in NLP-land, we evaluate the performance of rare category problems using precision, recall and F. Precision is: of all the needles I detected, how many were correct? Recall is: of all the needles that I should have found, how many did I find? and F is the harmonic mean of P and R: $F = \\left[\\frac 1 P + \\frac 1 R\\right]^{-1} = \\frac {2PR} {P+R}$.\n",
22 | "\n",
23 | "The question is: how to optimize something like $F$ rather than binary loss. The current solution is example weighting.\n",
24 | "\n",
25 | "# Example Weighting\n",
26 | "\n",
27 | "In example weighting, we simply say that some examples (the needles) are much more important than other examples (the hay).\n",
28 | "\n",
29 | "As a prototypical example, suppose that the positive class accounts for about 1% of the data, so that for every one positive example there are 99 negative examples. A standard heuristic is to give the negative examples a smaller weight than the positive examples to balance this out. In this case, if we give each negative example a weight of $1/99 \\approx 0.0101$, then the total **weight** of positive examples in the training data will match the total weight of the negative examples.\n",
30 | "\n",
31 | "When `vw` is given example weights, it optimizes a weighted zero-one loss. Now, predicting always negative will incur a 50% weighted error because we've downweighted the negatives to account for only half the mass of the training data.\n",
32 | "\n",
33 | "# A Running Example\n",
34 | "\n",
35 | "The sentiment data set from the previous tutorials is not a good place to start because it was constructed to be balanced. We'll use a slightly different problem: identifying **word-level sentiment** in reviews from Rotten Tomatoes. The data we'll use is the [Stanford Sentiment Dataset](http://nlp.stanford.edu/sentiment/).\n",
36 | "\n",
37 | "First, we need to download the data. We can get it directly (it's about 771k):"
38 | ]
39 | },
40 | {
41 | "cell_type": "code",
42 | "collapsed": false,
43 | "input": [
44 | "!curl -o data/trainDevTestTrees_PTB.zip http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip\n",
45 | "!rm -rf data/trees\n",
46 | "!unzip -d data data/trainDevTestTrees_PTB.zip\n",
47 | "!head -n1 data/trees/train.txt"
48 | ],
49 | "language": "python",
50 | "metadata": {},
51 | "outputs": [
52 | {
53 | "output_type": "stream",
54 | "stream": "stdout",
55 | "text": [
56 | " % Total % Received % Xferd Average Speed Time Time Time Current\r\n",
57 | " Dload Upload Total Spent Left Speed\r\n",
58 | "\r",
59 | " 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0"
60 | ]
61 | },
62 | {
63 | "output_type": "stream",
64 | "stream": "stdout",
65 | "text": [
66 | "\r",
67 | " 55 771k 55 428k 0 0 641k 0 0:00:01 --:--:-- 0:00:01 641k"
68 | ]
69 | },
70 | {
71 | "output_type": "stream",
72 | "stream": "stdout",
73 | "text": [
74 | "\r",
75 | "100 771k 100 771k 0 0 949k 0 --:--:-- --:--:-- --:--:-- 949k\r\n"
76 | ]
77 | },
78 | {
79 | "output_type": "stream",
80 | "stream": "stdout",
81 | "text": [
82 | "Archive: data/trainDevTestTrees_PTB.zip\r\n",
83 | " creating: data/trees/\r\n",
84 | " inflating: data/trees/dev.txt \r\n",
85 | " inflating: data/trees/test.txt \r\n",
86 | " inflating: data/trees/train.txt \r\n"
87 | ]
88 | },
89 | {
90 | "output_type": "stream",
91 | "stream": "stdout",
92 | "text": [
93 | "(3 (2 (2 The) (2 Rock)) (4 (3 (2 is) (4 (2 destined) (2 (2 (2 (2 (2 to) (2 (2 be) (2 (2 the) (2 (2 21st) (2 (2 (2 Century) (2 's)) (2 (3 new) (2 (2 ``) (2 Conan)))))))) (2 '')) (2 and)) (3 (2 that) (3 (2 he) (3 (2 's) (3 (2 going) (3 (2 to) (4 (3 (2 make) (3 (3 (2 a) (3 splash)) (2 (2 even) (3 greater)))) (2 (2 than) (2 (2 (2 (2 (1 (2 Arnold) (2 Schwarzenegger)) (2 ,)) (2 (2 Jean-Claud) (2 (2 Van) (2 Damme)))) (2 or)) (2 (2 Steven) (2 Segal))))))))))))) (2 .)))\r\n"
94 | ]
95 | }
96 | ],
97 | "prompt_number": 1
98 | },
99 | {
100 | "cell_type": "markdown",
101 | "metadata": {},
102 | "source": [
103 | "This data is in Penn Treebank format. Each word has a sentiment label (0=strongly negative, 1=negative, 2=neutral, 3=positive, 4=strongly positive). Each phrase also has a sentiment label. We're going to ignore the phrase-level sentiment labels and just work with the words. To make this a binary problem, we're going to predict: strong-opinion-versus-not, namely \"label=1,2,3\" is going to be the negative class (neutral) and \"label=0,4\" are the positive classes (strongly opinionated). The training data is roughly 97.5% negative examples and 2.5% positive examples, which makes this a somewhat-rare category detection problem.\n",
104 | "\n",
105 | "First, we'll write some small functions to load this data into Python:"
106 | ]
107 | },
108 | {
109 | "cell_type": "code",
110 | "collapsed": false,
111 | "input": [
112 | "import re\n",
113 | "def readSentimentTree(s):\n",
114 | " # return a list of (word, sentimentLabel) for all words in the tree\n",
115 | " return [(w[3:-1].lower(), int(w[1])) for w in re.findall('\\([0-4] [^\\(\\)]+\\)', s)]\n",
116 | "\n",
117 | "def readSentimentFile(filename):\n",
118 | " return map(readSentimentTree, open(filename,'r').readlines())\n",
119 | "\n",
120 | "train = readSentimentFile('data/trees/train.txt')\n",
121 | "dev = readSentimentFile('data/trees/dev.txt')\n",
122 | "test = readSentimentFile('data/trees/test.txt')\n",
123 | "\n",
124 | "print 'example: ', train[0]"
125 | ],
126 | "language": "python",
127 | "metadata": {},
128 | "outputs": [
129 | {
130 | "output_type": "stream",
131 | "stream": "stdout",
132 | "text": [
133 | "example: [('the', 2), ('rock', 2), ('is', 2), ('destined', 2), ('to', 2), ('be', 2), ('the', 2), ('21st', 2), ('century', 2), (\"'s\", 2), ('new', 3), ('``', 2), ('conan', 2), (\"''\", 2), ('and', 2), ('that', 2), ('he', 2), (\"'s\", 2), ('going', 2), ('to', 2), ('make', 2), ('a', 2), ('splash', 3), ('even', 2), ('greater', 3), ('than', 2), ('arnold', 2), ('schwarzenegger', 2), (',', 2), ('jean-claud', 2), ('van', 2), ('damme', 2), ('or', 2), ('steven', 2), ('segal', 2), ('.', 2)]\n"
134 | ]
135 | }
136 | ],
137 | "prompt_number": 2
138 | },
139 | {
140 | "cell_type": "markdown",
141 | "metadata": {},
142 | "source": [
143 | "We now need to generate appropriate `vw` input files for this. The way that you specify example weights for `vw` is to put the weight right next to the label, but before the pipe. For example:\n",
144 | "\n",
145 | " +1 | ...\n",
146 | " +1 | ...\n",
147 | " -1 0.01 | ...\n",
148 | " +1 | ...\n",
149 | " -1 0.01 | ...\n",
150 | "\n",
151 | "And so on. In this case, the two negative examples have a weight of 0.01 and the positive examples have an implicit weight of one. You can put whatever weight you want on whatever examples you want: they don't have to be consistent in any way. Examples with higher weight just \"matter more\" from the perspective of the loss being optimized.\n",
152 | "\n",
153 | "The other question is what features will we use? We'll make significant use of namespaces here. Obviously we'll include the word to be labeled; this will go in it's own namespace (called `w`). We'll also use a context window of 5 words to the left (into a `l` namespace) and 5 words to the right (into an `r`) namespace.\n",
154 | "\n",
155 | "The label itself will be -1 if sentLabel is 1,2 or 3; and +1 otherwise.\n",
156 | "\n",
157 | "Given this, we're ready to construct some data to be directly written to a file:"
158 | ]
159 | },
160 | {
161 | "cell_type": "code",
162 | "collapsed": false,
163 | "input": [
164 | "def sanify(s): return s.replace(':','COLON').replace('|','PIPE')\n",
165 | "\n",
166 | "def sentimentToVW(sentence, negativeWeight, outFile):\n",
167 | " N = len(sentence)\n",
168 | " for n,(word,sentLabel) in enumerate(sentence): # loop over each word, and get it's position n\n",
169 | " label = '-1 %g' % negativeWeight if sentLabel in [1,2,3] else '+1'\n",
170 | " leftBoundary = max(0,n-5)\n",
171 | " leftContext = ' '.join([sanify(x[0]) for x in sentence[leftBoundary:n]])\n",
172 | " rightContext = ' '.join([sanify(x[0]) for x in sentence[n+1:n+6]])\n",
173 | " print >>outFile, '%s |w %s |l %s |r %s' % (label, sanify(word), leftContext, rightContext)\n",
174 | "\n",
175 | "# test it\n",
176 | "import sys\n",
177 | "sentimentToVW(train[0], 2.5/97.5, sys.stdout)"
178 | ],
179 | "language": "python",
180 | "metadata": {},
181 | "outputs": [
182 | {
183 | "output_type": "stream",
184 | "stream": "stdout",
185 | "text": [
186 | "-1 0.025641 |w the |l |r rock is destined to be\n",
187 | "-1 0.025641 |w rock |l the |r is destined to be the\n",
188 | "-1 0.025641 |w is |l the rock |r destined to be the 21st\n",
189 | "-1 0.025641 |w destined |l the rock is |r to be the 21st century\n",
190 | "-1 0.025641 |w to |l the rock is destined |r be the 21st century 's\n",
191 | "-1 0.025641 |w be |l the rock is destined to |r the 21st century 's new\n",
192 | "-1 0.025641 |w the |l rock is destined to be |r 21st century 's new ``\n",
193 | "-1 0.025641 |w 21st |l is destined to be the |r century 's new `` conan\n",
194 | "-1 0.025641 |w century |l destined to be the 21st |r 's new `` conan ''\n",
195 | "-1 0.025641 |w 's |l to be the 21st century |r new `` conan '' and\n",
196 | "-1 0.025641 |w new |l be the 21st century 's |r `` conan '' and that\n",
197 | "-1 0.025641 |w `` |l the 21st century 's new |r conan '' and that he\n",
198 | "-1 0.025641 |w conan |l 21st century 's new `` |r '' and that he 's\n",
199 | "-1 0.025641 |w '' |l century 's new `` conan |r and that he 's going\n",
200 | "-1 0.025641 |w and |l 's new `` conan '' |r that he 's going to\n",
201 | "-1 0.025641 |w that |l new `` conan '' and |r he 's going to make\n",
202 | "-1 0.025641 |w he |l `` conan '' and that |r 's going to make a\n",
203 | "-1 0.025641 |w 's |l conan '' and that he |r going to make a splash\n",
204 | "-1 0.025641 |w going |l '' and that he 's |r to make a splash even\n",
205 | "-1 0.025641 |w to |l and that he 's going |r make a splash even greater\n",
206 | "-1 0.025641 |w make |l that he 's going to |r a splash even greater than\n",
207 | "-1 0.025641 |w a |l he 's going to make |r splash even greater than arnold\n",
208 | "-1 0.025641 |w splash |l 's going to make a |r even greater than arnold schwarzenegger\n",
209 | "-1 0.025641 |w even |l going to make a splash |r greater than arnold schwarzenegger ,\n",
210 | "-1 0.025641 |w greater |l to make a splash even |r than arnold schwarzenegger , jean-claud\n",
211 | "-1 0.025641 |w than |l make a splash even greater |r arnold schwarzenegger , jean-claud van\n",
212 | "-1 0.025641 |w arnold |l a splash even greater than |r schwarzenegger , jean-claud van damme\n",
213 | "-1 0.025641 |w schwarzenegger |l splash even greater than arnold |r , jean-claud van damme or\n",
214 | "-1 0.025641 |w , |l even greater than arnold schwarzenegger |r jean-claud van damme or steven\n",
215 | "-1 0.025641 |w jean-claud |l greater than arnold schwarzenegger , |r van damme or steven segal\n",
216 | "-1 0.025641 |w van |l than arnold schwarzenegger , jean-claud |r damme or steven segal .\n",
217 | "-1 0.025641 |w damme |l arnold schwarzenegger , jean-claud van |r or steven segal .\n",
218 | "-1 0.025641 |w or |l schwarzenegger , jean-claud van damme |r steven segal .\n",
219 | "-1 0.025641 |w steven |l , jean-claud van damme or |r segal .\n",
220 | "-1 0.025641 |w segal |l jean-claud van damme or steven |r .\n",
221 | "-1 0.025641 |w . |l van damme or steven segal |r \n"
222 | ]
223 | }
224 | ],
225 | "prompt_number": 3
226 | },
227 | {
228 | "cell_type": "markdown",
229 | "metadata": {},
230 | "source": [
231 | "Now we just need to construct training, development and test files.\n",
232 | "\n",
233 | "What we're *actually* going to do is put the training and development data together into one file, and use `--holdout_after` to let `vw` handle the development data. Of course, first we also need to shuffle the data. (Note: here, we're just going to shuffle the sentences; in real life, we'd probably want to actually shuffle the VW examples. But we're lazy.)"
234 | ]
235 | },
236 | {
237 | "cell_type": "code",
238 | "collapsed": false,
239 | "input": [
240 | "import random\n",
241 | "random.shuffle(train)\n",
242 | "random.shuffle(dev)\n",
243 | "random.shuffle(test)\n",
244 | "\n",
245 | "def sentimentToVWFile(filename, data, negativeWeight):\n",
246 | " with open(filename, 'w') as h:\n",
247 | " for sentence in data:\n",
248 | " sentimentToVW(sentence, negativeWeight, h)\n",
249 | "\n",
250 | "# remove old data\n",
251 | "!rm -f data/sentiword.*\n",
252 | "\n",
253 | "# generate new data\n",
254 | "negWeight = 2.5/97.5\n",
255 | "sentimentToVWFile('data/sentiword.tr', train, negWeight)\n",
256 | "sentimentToVWFile('data/sentiword.de', dev , negWeight)\n",
257 | "sentimentToVWFile('data/sentiword.te', test , negWeight)\n",
258 | "\n",
259 | "# combine train and dev into one\n",
260 | "!cat data/sentiword.tr data/sentiword.de > data/sentiword.trde\n",
261 | "\n",
262 | "!wc -l data/sentiword.*"
263 | ],
264 | "language": "python",
265 | "metadata": {},
266 | "outputs": [
267 | {
268 | "output_type": "stream",
269 | "stream": "stdout",
270 | "text": [
271 | " 21274 data/sentiword.de\r\n",
272 | " 42405 data/sentiword.te\r\n",
273 | " 163563 data/sentiword.tr\r\n",
274 | " 184837 data/sentiword.trde\r\n",
275 | " 412079 total\r\n"
276 | ]
277 | }
278 | ],
279 | "prompt_number": 4
280 | },
281 | {
282 | "cell_type": "markdown",
283 | "metadata": {},
284 | "source": [
285 | "So, we have about 163563 training examples, 21k dev examples and 42k test examples. The combined train/dev set has about 185k examples.\n",
286 | "\n",
287 | "Now we just need to train `vw`!"
288 | ]
289 | },
290 | {
291 | "cell_type": "code",
292 | "collapsed": false,
293 | "input": [
294 | "!vw -k -c -b 27 --binary data/sentiword.trde --passes 100 -f data/sentiword.model --loss_function logistic --holdout_after 163564"
295 | ],
296 | "language": "python",
297 | "metadata": {},
298 | "outputs": [
299 | {
300 | "output_type": "stream",
301 | "stream": "stdout",
302 | "text": [
303 | "final_regressor = data/sentiword.model\r\n",
304 | "Num weight bits = 27\r\n",
305 | "learning rate = 0.5\r\n",
306 | "initial_t = 0\r\n",
307 | "power_t = 0.5\r\n",
308 | "decay_learning_rate = 1\r\n",
309 | "creating cache_file = data/sentiword.trde.cache\r\n",
310 | "Reading datafile = data/sentiword.trde\r\n",
311 | "num sources = 1\r\n",
312 | "average since example example current current current\r\n",
313 | "loss last counter weight label predict features\r\n",
314 | "0.000000 0.000000 40 1.0 -1.0000 -1.0000 12\r\n",
315 | "0.000000 0.000000 81 2.1 -1.0000 -1.0000 8\r\n"
316 | ]
317 | },
318 | {
319 | "output_type": "stream",
320 | "stream": "stdout",
321 | "text": [
322 | "0.240741 0.481482 124 4.2 -1.0000 -1.0000 12\r\n",
323 | "0.361111 0.481482 210 8.3 -1.0000 -1.0000 7\r\n"
324 | ]
325 | },
326 | {
327 | "output_type": "stream",
328 | "stream": "stdout",
329 | "text": [
330 | "0.283843 0.214876 497 17.6 1.0000 -1.0000 9\r\n"
331 | ]
332 | },
333 | {
334 | "output_type": "stream",
335 | "stream": "stdout",
336 | "text": [
337 | "0.369724 0.455604 880 35.2 -1.0000 -1.0000 12\r\n"
338 | ]
339 | },
340 | {
341 | "output_type": "stream",
342 | "stream": "stdout",
343 | "text": [
344 | "0.404986 0.439742 1552 71.0 1.0000 -1.0000 12\r\n"
345 | ]
346 | },
347 | {
348 | "output_type": "stream",
349 | "stream": "stdout",
350 | "text": [
351 | "0.396135 0.387283 2952 141.9 -1.0000 1.0000 11\r\n"
352 | ]
353 | },
354 | {
355 | "output_type": "stream",
356 | "stream": "stdout",
357 | "text": [
358 | "0.392593 0.389069 5778 284.6 1.0000 1.0000 11\r\n"
359 | ]
360 | },
361 | {
362 | "output_type": "stream",
363 | "stream": "stdout",
364 | "text": [
365 | "0.319142 0.245698 11519 569.2 -1.0000 -1.0000 12\r\n",
366 | "0.286698 0.254297 23485 1139.1 1.0000 -1.0000 12\r\n"
367 | ]
368 | },
369 | {
370 | "output_type": "stream",
371 | "stream": "stdout",
372 | "text": [
373 | "0.242487 0.198276 46818 2278.1 -1.0000 -1.0000 12\r\n"
374 | ]
375 | },
376 | {
377 | "output_type": "stream",
378 | "stream": "stdout",
379 | "text": [
380 | "0.195484 0.148482 94396 4556.2 -1.0000 -1.0000 8\r\n"
381 | ]
382 | },
383 | {
384 | "output_type": "stream",
385 | "stream": "stdout",
386 | "text": [
387 | "0.101881 0.101881 188108 9112.4 -1.0000 -1.0000 12 h\r\n"
388 | ]
389 | },
390 | {
391 | "output_type": "stream",
392 | "stream": "stdout",
393 | "text": [
394 | "0.091317 0.080753 376097 18225.7 1.0000 1.0000 8 h\r\n"
395 | ]
396 | },
397 | {
398 | "output_type": "stream",
399 | "stream": "stdout",
400 | "text": [
401 | "0.082237 0.073156 752536 36451.3 -1.0000 -1.0000 12 h\r\n"
402 | ]
403 | },
404 | {
405 | "output_type": "stream",
406 | "stream": "stdout",
407 | "text": [
408 | "0.074883 0.069000 1504525 72903.2 1.0000 1.0000 8 h\r\n"
409 | ]
410 | },
411 | {
412 | "output_type": "stream",
413 | "stream": "stdout",
414 | "text": [
415 | "\r\n",
416 | "finished run\r\n",
417 | "number of examples per pass = 163563\r\n",
418 | "passes used = 12\r\n",
419 | "weighted example sum = 95096.873733\r\n",
420 | "weighted label sum = -3200.873733\r\n",
421 | "average loss = 0.066373 h\r\n",
422 | "best constant = -0.067344\r\n",
423 | "best constant's loss = 0.692581\r\n",
424 | "total feature number = 20493552\r\n"
425 | ]
426 | }
427 | ],
428 | "prompt_number": 5
429 | },
430 | {
431 | "cell_type": "markdown",
432 | "metadata": {},
433 | "source": [
434 | "# Making Predictions and Evaluating Performance\n",
435 | "\n",
436 | "And now we will make raw predictions (we need raw predictions because of how we will evaluate performance). This is exactly the same as before. However, because we're going to be choosing weights as hyperparameters, we are going to do all evaluations on the development data."
437 | ]
438 | },
439 | {
440 | "cell_type": "code",
441 | "collapsed": false,
442 | "input": [
443 | "!vw --binary -t -r data/sentiword.de.raw -i data/sentiword.model data/sentiword.de --quiet\n",
444 | "!head data/sentiword.de.raw"
445 | ],
446 | "language": "python",
447 | "metadata": {},
448 | "outputs": [
449 | {
450 | "output_type": "stream",
451 | "stream": "stdout",
452 | "text": [
453 | "-4.507044\r\n",
454 | "-0.372609\r\n",
455 | "-1.805572\r\n",
456 | "-3.560715\r\n",
457 | "-2.972163\r\n",
458 | "-1.174249\r\n",
459 | "-0.556056\r\n",
460 | "-0.684190\r\n",
461 | "-2.711704\r\n",
462 | "-1.565568\r\n"
463 | ]
464 | }
465 | ],
466 | "prompt_number": 6
467 | },
468 | {
469 | "cell_type": "markdown",
470 | "metadata": {},
471 | "source": [
472 | "In order to evaluate the model, we'll use use the [\"perf\" evaluation script](http://osmot.cs.cornell.edu/kddcup/software.html) from the KDD 2004 challenge. This script computes basically every measure of performance you could possibly want. In order to follow along here, you'll need to download it, build it, and put it in your path. You can test that it works with:"
473 | ]
474 | },
475 | {
476 | "cell_type": "code",
477 | "collapsed": false,
478 | "input": [
479 | "!perf --help | head -n4"
480 | ],
481 | "language": "python",
482 | "metadata": {},
483 | "outputs": [
484 | {
485 | "output_type": "stream",
486 | "stream": "stdout",
487 | "text": [
488 | "\r\n",
489 | "Error: Unrecognized program option --help\r\n",
490 | "Version 5.12 [KDDCUP-2004 July 12, 2004]\r\n",
491 | "\r\n"
492 | ]
493 | }
494 | ],
495 | "prompt_number": 7
496 | },
497 | {
498 | "cell_type": "markdown",
499 | "metadata": {},
500 | "source": [
501 | "The eval script needs a single input that has two columns: (1) the truth and (2) scored predictions. It needs scores because it needs to think of the predictions as a ranked list. We can get the true labels by extracting the first column from the test data, and then the raw predictions work directly as scored predictions:"
502 | ]
503 | },
504 | {
505 | "cell_type": "code",
506 | "collapsed": false,
507 | "input": [
508 | "!echo \"The first few lines...\"\n",
509 | "!cut -d' ' -f1 data/sentiword.de | paste - data/sentiword.de.raw | head\n",
510 | "!echo \"\"\n",
511 | "!echo \"Running perf...\"\n",
512 | "!cut -d' ' -f1 data/sentiword.de | paste - data/sentiword.de.raw | perf -t 0 -PRE -REC -PRF -PRB"
513 | ],
514 | "language": "python",
515 | "metadata": {},
516 | "outputs": [
517 | {
518 | "output_type": "stream",
519 | "stream": "stdout",
520 | "text": [
521 | "The first few lines...\r\n"
522 | ]
523 | },
524 | {
525 | "output_type": "stream",
526 | "stream": "stdout",
527 | "text": [
528 | "-1\t-4.507044\r\n",
529 | "-1\t-0.372609\r\n",
530 | "-1\t-1.805572\r\n",
531 | "-1\t-3.560715\r\n",
532 | "-1\t-2.972163\r\n",
533 | "-1\t-1.174249\r\n",
534 | "-1\t-0.556056\r\n",
535 | "-1\t-0.684190\r\n",
536 | "-1\t-2.711704\r\n",
537 | "-1\t-1.565568\r\n",
538 | "paste: write error: Broken pipe\r\n",
539 | "paste: write error\r\n",
540 | "cut: write error: Broken pipe\r\n"
541 | ]
542 | },
543 | {
544 | "output_type": "stream",
545 | "stream": "stdout",
546 | "text": [
547 | "\r\n"
548 | ]
549 | },
550 | {
551 | "output_type": "stream",
552 | "stream": "stdout",
553 | "text": [
554 | "Running perf...\r\n"
555 | ]
556 | },
557 | {
558 | "output_type": "stream",
559 | "stream": "stdout",
560 | "text": [
561 | "PRE 0.66527 pred_thresh 0.000000\r\n",
562 | "REC 0.87963 pred_thresh 0.000000\r\n",
563 | "PRF 0.75758 pred_thresh 0.000000\r\n",
564 | "PRB 0.82593\r\n"
565 | ]
566 | }
567 | ],
568 | "prompt_number": 8
569 | },
570 | {
571 | "cell_type": "markdown",
572 | "metadata": {},
573 | "source": [
574 | "We asked `perf` just to give us precision, recall and F score with a threshold of zero. If you run it without those arguments, you get a *lot* more statistics.\n",
575 | "\n",
576 | "In this case, we get a precision of 76.3% a recall of 86.9% and an F score of 81.2%. We've also asked for the precision-recall break-even point (that's the point where P=R=F); here it is 84.6%. (PRB is often an upper-bound, optimistic estimate of how good your precision/recall could be if you magically chose the best threshold.)\n",
577 | "\n",
578 | "For comparison, let's see what happens if we don't do example weighting:"
579 | ]
580 | },
581 | {
582 | "cell_type": "code",
583 | "collapsed": false,
584 | "input": [
585 | "negWeight = 1.0\n",
586 | "sentimentToVWFile('data/sentiword-unw.tr', train, negWeight)\n",
587 | "sentimentToVWFile('data/sentiword-unw.de', dev , negWeight)\n",
588 | "sentimentToVWFile('data/sentiword-unw.te', test , negWeight)\n",
589 | "\n",
590 | "# combine train and dev into one\n",
591 | "!cat data/sentiword-unw.tr data/sentiword-unw.de > data/sentiword-unw.trde\n",
592 | "!vw -k -c -b 27 --binary data/sentiword-unw.trde --passes 100 -f data/sentiword-unw.model --loss_function logistic --holdout_after 163564\n",
593 | "!vw --binary -t -r data/sentiword-unw.de.raw -i data/sentiword-unw.model data/sentiword-unw.de --quiet\n",
594 | "!echo \"\"\n",
595 | "!echo \"Running perf...\"\n",
596 | "!cut -d' ' -f1 data/sentiword-unw.de | paste - data/sentiword-unw.de.raw | perf -t 0.0 -PRE -REC -PRF -PRB"
597 | ],
598 | "language": "python",
599 | "metadata": {},
600 | "outputs": [
601 | {
602 | "output_type": "stream",
603 | "stream": "stdout",
604 | "text": [
605 | "final_regressor = data/sentiword-unw.model\r\n",
606 | "Num weight bits = 27\r\n",
607 | "learning rate = 0.5\r\n",
608 | "initial_t = 0\r\n",
609 | "power_t = 0.5\r\n",
610 | "decay_learning_rate = 1\r\n",
611 | "creating cache_file = data/sentiword-unw.trde.cache\r\n",
612 | "Reading datafile = data/sentiword-unw.trde\r\n",
613 | "num sources = 1\r\n",
614 | "average since example example current current current\r\n",
615 | "loss last counter weight label predict features\r\n",
616 | "0.000000 0.000000 1 1.0 -1.0000 -1.0000 7\r\n",
617 | "0.000000 0.000000 2 2.0 -1.0000 -1.0000 8\r\n",
618 | "0.000000 0.000000 4 4.0 -1.0000 -1.0000 10\r\n",
619 | "0.000000 0.000000 8 8.0 -1.0000 -1.0000 12\r\n",
620 | "0.000000 0.000000 16 16.0 -1.0000 -1.0000 12\r\n",
621 | "0.000000 0.000000 32 32.0 -1.0000 -1.0000 11\r\n",
622 | "0.000000 0.000000 64 64.0 -1.0000 -1.0000 7\r\n"
623 | ]
624 | },
625 | {
626 | "output_type": "stream",
627 | "stream": "stdout",
628 | "text": [
629 | "0.007812 0.015625 128 128.0 -1.0000 -1.0000 12\r\n",
630 | "0.011719 0.015625 256 256.0 -1.0000 -1.0000 11\r\n"
631 | ]
632 | },
633 | {
634 | "output_type": "stream",
635 | "stream": "stdout",
636 | "text": [
637 | "0.009766 0.007812 512 512.0 -1.0000 -1.0000 10\r\n"
638 | ]
639 | },
640 | {
641 | "output_type": "stream",
642 | "stream": "stdout",
643 | "text": [
644 | "0.015625 0.021484 1024 1024.0 -1.0000 -1.0000 12\r\n"
645 | ]
646 | },
647 | {
648 | "output_type": "stream",
649 | "stream": "stdout",
650 | "text": [
651 | "0.020996 0.026367 2048 2048.0 -1.0000 -1.0000 8\r\n"
652 | ]
653 | },
654 | {
655 | "output_type": "stream",
656 | "stream": "stdout",
657 | "text": [
658 | "0.022705 0.024414 4096 4096.0 1.0000 -1.0000 7\r\n",
659 | "0.025146 0.027588 8192 8192.0 -1.0000 -1.0000 12\r\n",
660 | "0.024170 0.023193 16384 16384.0 -1.0000 -1.0000 12\r\n"
661 | ]
662 | },
663 | {
664 | "output_type": "stream",
665 | "stream": "stdout",
666 | "text": [
667 | "0.023590 0.023010 32768 32768.0 -1.0000 -1.0000 12\r\n"
668 | ]
669 | },
670 | {
671 | "output_type": "stream",
672 | "stream": "stdout",
673 | "text": [
674 | "0.022614 0.021637 65536 65536.0 -1.0000 -1.0000 12\r\n"
675 | ]
676 | },
677 | {
678 | "output_type": "stream",
679 | "stream": "stdout",
680 | "text": [
681 | "0.021667 0.020721 131072 131072.0 -1.0000 -1.0000 12\r\n"
682 | ]
683 | },
684 | {
685 | "output_type": "stream",
686 | "stream": "stdout",
687 | "text": [
688 | "0.020542 0.020542 262144 262144.0 -1.0000 -1.0000 12 h\r\n"
689 | ]
690 | },
691 | {
692 | "output_type": "stream",
693 | "stream": "stdout",
694 | "text": [
695 | "0.017611 0.016146 524288 524288.0 -1.0000 -1.0000 12 h\r\n"
696 | ]
697 | },
698 | {
699 | "output_type": "stream",
700 | "stream": "stdout",
701 | "text": [
702 | "0.015473 0.013334 1048576 1048576.0 -1.0000 -1.0000 7 h\r\n"
703 | ]
704 | },
705 | {
706 | "output_type": "stream",
707 | "stream": "stdout",
708 | "text": [
709 | "0.013506 0.011540 2097152 2097152.0 -1.0000 -1.0000 12 h\r\n"
710 | ]
711 | },
712 | {
713 | "output_type": "stream",
714 | "stream": "stdout",
715 | "text": [
716 | "0.011819 0.010262 4194304 4194304.0 -1.0000 -1.0000 9 h\r\n"
717 | ]
718 | },
719 | {
720 | "output_type": "stream",
721 | "stream": "stdout",
722 | "text": [
723 | "\r\n",
724 | "finished run\r\n",
725 | "number of examples per pass = 163563\r\n",
726 | "passes used = 36\r\n",
727 | "weighted example sum = 5888268.000000\r\n",
728 | "weighted label sum = -5612580.000000\r\n",
729 | "average loss = 0.009166 h\r\n",
730 | "best constant = -3.730906\r\n",
731 | "best constant's loss = 0.111029\r\n",
732 | "total feature number = 61480656\r\n"
733 | ]
734 | },
735 | {
736 | "output_type": "stream",
737 | "stream": "stdout",
738 | "text": [
739 | "\r\n"
740 | ]
741 | },
742 | {
743 | "output_type": "stream",
744 | "stream": "stdout",
745 | "text": [
746 | "Running perf...\r\n"
747 | ]
748 | },
749 | {
750 | "output_type": "stream",
751 | "stream": "stdout",
752 | "text": [
753 | "PRE 0.97784 pred_thresh 0.000000\r\n",
754 | "REC 0.65370 pred_thresh 0.000000\r\n",
755 | "PRF 0.78357 pred_thresh 0.000000\r\n",
756 | "PRB 0.81667\r\n"
757 | ]
758 | }
759 | ],
760 | "prompt_number": 9
761 | },
762 | {
763 | "cell_type": "markdown",
764 | "metadata": {},
765 | "source": [
766 | "As expected, this is worse than before. The F score is 75.7% (< 81.2%) and the break-even point is 79.8% (< 84.6%). The gap is definitely enough to care about.\n",
767 | "\n",
768 | "# Optimal Example Weights?\n",
769 | "\n",
770 | "A question that immediately comes up is: what is the *best* weight to give the positive and negative examples? We used a heuristic setting above, but is this optimal?\n",
771 | "\n",
772 | "Unfortunately, we don't know. What we [do know](http://papers.nips.cc/paper/5454-consistent-binary-classification-with-generalized-performance-metrics) is that if the end performance measure is F score, then there *exists* some weighting that is guaranteed to optimize this, but we don't *constructively* know what it is. The weighting then becomes a new hyperparameter that we can tune to get the best possible F.\n",
773 | "\n",
774 | "This is somewhat cumbersome, but we can try a few different values as follows. We'll switch to hinge loss too."
775 | ]
776 | },
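{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick reminder of what the weighting means at the data level: `vw`'s input format allows an optional importance weight right after the label, e.g. `-1 0.025 |w some words here`, which is presumably how `sentimentToVWFile` applies `negWeight`. The helper below is only an illustrative sketch (the real `sentimentToVWFile` was defined earlier in this notebook and may differ in its details):\n",
"\n",
"```python\n",
"# Illustrative sketch only -- the real sentimentToVWFile is defined earlier in the notebook.\n",
"# vw's input format: \"<label> <importance weight> |<namespace> <features...>\"\n",
"def toySentimentToVWFile(filename, examples, negWeight):\n",
"    # examples is assumed to be a list of (label, tokens) pairs with label in {+1, -1}\n",
"    with open(filename, 'w') as h:\n",
"        for label, tokens in examples:\n",
"            weight = 1.0 if label > 0 else negWeight\n",
"            h.write('%d %g |w %s\\n' % (label, weight, ' '.join(tokens)))\n",
"\n",
"# e.g. toySentimentToVWFile('data/toy.vw', [(1, ['loved','it']), (-1, ['boring'])], 2.5/97.5)\n",
"```"
]
},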
777 | {
778 | "cell_type": "code",
779 | "collapsed": false,
780 | "input": [
781 | "baseRatio = 2.5/97.5\n",
782 | "for multiplier in [2.0**k for k in range(-5,6)]:\n",
783 | " negWeight = baseRatio * multiplier\n",
784 | " sentimentToVWFile('data/sentiword.tr', train, negWeight)\n",
785 | " sentimentToVWFile('data/sentiword.de', dev , negWeight)\n",
786 | " sentimentToVWFile('data/sentiword.te', test , negWeight)\n",
787 | "\n",
788 | " # combine train and dev into one\n",
789 | " print 'weight = %g * %g = %g' % (baseRatio, multiplier, negWeight)\n",
790 | " !cat data/sentiword.tr data/sentiword.de > data/sentiword.trde\n",
791 | " !vw -k -c -b 27 --binary data/sentiword.trde --passes 100 -f data/sentiword.model --loss_function logistic --quiet --holdout_after 163564\n",
792 | " !vw --binary -t -r data/sentiword.de.raw -i data/sentiword.model data/sentiword.de --quiet\n",
793 | " !cut -d' ' -f1 data/sentiword.de | paste - data/sentiword.de.raw | perf -PRF -t 0\n",
794 | " print ''"
795 | ],
796 | "language": "python",
797 | "metadata": {},
798 | "outputs": [
799 | {
800 | "output_type": "stream",
801 | "stream": "stdout",
802 | "text": [
803 | "weight = 0.025641 * 0.03125 = 0.000801282\n"
804 | ]
805 | },
806 | {
807 | "output_type": "stream",
808 | "stream": "stdout",
809 | "text": [
810 | "PRF 0.09226 pred_thresh 0.000000\r\n"
811 | ]
812 | },
813 | {
814 | "output_type": "stream",
815 | "stream": "stdout",
816 | "text": [
817 | "\n",
818 | "weight = 0.025641 * 0.0625 = 0.00160256"
819 | ]
820 | },
821 | {
822 | "output_type": "stream",
823 | "stream": "stdout",
824 | "text": [
825 | "\n"
826 | ]
827 | },
828 | {
829 | "output_type": "stream",
830 | "stream": "stdout",
831 | "text": [
832 | "PRF 0.14089 pred_thresh 0.000000\r\n"
833 | ]
834 | },
835 | {
836 | "output_type": "stream",
837 | "stream": "stdout",
838 | "text": [
839 | "\n",
840 | "weight = 0.025641 * 0.125 = 0.00320513"
841 | ]
842 | },
843 | {
844 | "output_type": "stream",
845 | "stream": "stdout",
846 | "text": [
847 | "\n"
848 | ]
849 | },
850 | {
851 | "output_type": "stream",
852 | "stream": "stdout",
853 | "text": [
854 | "PRF 0.16000 pred_thresh 0.000000\r\n"
855 | ]
856 | },
857 | {
858 | "output_type": "stream",
859 | "stream": "stdout",
860 | "text": [
861 | "\n",
862 | "weight = 0.025641 * 0.25 = 0.00641026"
863 | ]
864 | },
865 | {
866 | "output_type": "stream",
867 | "stream": "stdout",
868 | "text": [
869 | "\n"
870 | ]
871 | },
872 | {
873 | "output_type": "stream",
874 | "stream": "stdout",
875 | "text": [
876 | "PRF 0.33624 pred_thresh 0.000000\r\n"
877 | ]
878 | },
879 | {
880 | "output_type": "stream",
881 | "stream": "stdout",
882 | "text": [
883 | "\n",
884 | "weight = 0.025641 * 0.5 = 0.0128205"
885 | ]
886 | },
887 | {
888 | "output_type": "stream",
889 | "stream": "stdout",
890 | "text": [
891 | "\n"
892 | ]
893 | },
894 | {
895 | "output_type": "stream",
896 | "stream": "stdout",
897 | "text": [
898 | "PRF 0.72214 pred_thresh 0.000000\r\n"
899 | ]
900 | },
901 | {
902 | "output_type": "stream",
903 | "stream": "stdout",
904 | "text": [
905 | "\n",
906 | "weight = 0.025641 * 1 = 0.025641"
907 | ]
908 | },
909 | {
910 | "output_type": "stream",
911 | "stream": "stdout",
912 | "text": [
913 | "\n"
914 | ]
915 | },
916 | {
917 | "output_type": "stream",
918 | "stream": "stdout",
919 | "text": [
920 | "PRF 0.75758 pred_thresh 0.000000\r\n"
921 | ]
922 | },
923 | {
924 | "output_type": "stream",
925 | "stream": "stdout",
926 | "text": [
927 | "\n",
928 | "weight = 0.025641 * 2 = 0.0512821"
929 | ]
930 | },
931 | {
932 | "output_type": "stream",
933 | "stream": "stdout",
934 | "text": [
935 | "\n"
936 | ]
937 | },
938 | {
939 | "output_type": "stream",
940 | "stream": "stdout",
941 | "text": [
942 | "PRF 0.86267 pred_thresh 0.000000\r\n"
943 | ]
944 | },
945 | {
946 | "output_type": "stream",
947 | "stream": "stdout",
948 | "text": [
949 | "\n",
950 | "weight = 0.025641 * 4 = 0.102564"
951 | ]
952 | },
953 | {
954 | "output_type": "stream",
955 | "stream": "stdout",
956 | "text": [
957 | "\n"
958 | ]
959 | },
960 | {
961 | "output_type": "stream",
962 | "stream": "stdout",
963 | "text": [
964 | "PRF 0.85249 pred_thresh 0.000000\r\n"
965 | ]
966 | },
967 | {
968 | "output_type": "stream",
969 | "stream": "stdout",
970 | "text": [
971 | "\n",
972 | "weight = 0.025641 * 8 = 0.205128"
973 | ]
974 | },
975 | {
976 | "output_type": "stream",
977 | "stream": "stdout",
978 | "text": [
979 | "\n"
980 | ]
981 | },
982 | {
983 | "output_type": "stream",
984 | "stream": "stdout",
985 | "text": [
986 | "PRF 0.83633 pred_thresh 0.000000\r\n"
987 | ]
988 | },
989 | {
990 | "output_type": "stream",
991 | "stream": "stdout",
992 | "text": [
993 | "\n",
994 | "weight = 0.025641 * 16 = 0.410256"
995 | ]
996 | },
997 | {
998 | "output_type": "stream",
999 | "stream": "stdout",
1000 | "text": [
1001 | "\n"
1002 | ]
1003 | },
1004 | {
1005 | "output_type": "stream",
1006 | "stream": "stdout",
1007 | "text": [
1008 | "PRF 0.81553 pred_thresh 0.000000\r\n"
1009 | ]
1010 | },
1011 | {
1012 | "output_type": "stream",
1013 | "stream": "stdout",
1014 | "text": [
1015 | "\n",
1016 | "weight = 0.025641 * 32 = 0.820513"
1017 | ]
1018 | },
1019 | {
1020 | "output_type": "stream",
1021 | "stream": "stdout",
1022 | "text": [
1023 | "\n"
1024 | ]
1025 | },
1026 | {
1027 | "output_type": "stream",
1028 | "stream": "stdout",
1029 | "text": [
1030 | "PRF 0.79473 pred_thresh 0.000000\r\n"
1031 | ]
1032 | },
1033 | {
1034 | "output_type": "stream",
1035 | "stream": "stdout",
1036 | "text": [
1037 | "\n"
1038 | ]
1039 | }
1040 | ],
1041 | "prompt_number": 10
1042 | },
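{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside, rather than reading the best multiplier off the printout above, you could capture the `perf` output and pick it programmatically. The following is only a rough sketch; it assumes (as in the output above) that the `PRF ...` line is the first thing `perf -PRF` prints, and uses IPython's `out = !cmd` capture:\n",
"\n",
"```python\n",
"# Sketch: the same sweep as above, but tracking the best dev F automatically.\n",
"bestF, bestWeight = -1.0, None\n",
"for multiplier in [2.0**k for k in range(-5,6)]:\n",
"    negWeight = baseRatio * multiplier\n",
"    sentimentToVWFile('data/sentiword.tr', train, negWeight)\n",
"    sentimentToVWFile('data/sentiword.de', dev , negWeight)\n",
"    !cat data/sentiword.tr data/sentiword.de > data/sentiword.trde\n",
"    !vw -k -c -b 27 --binary data/sentiword.trde --passes 100 -f data/sentiword.model --loss_function logistic --quiet --holdout_after 163564\n",
"    !vw --binary -t -r data/sentiword.de.raw -i data/sentiword.model data/sentiword.de --quiet\n",
"    perfOut = !cut -d' ' -f1 data/sentiword.de | paste - data/sentiword.de.raw | perf -PRF -t 0\n",
"    f = float(perfOut[0].split()[1])   # \"PRF 0.86267 pred_thresh 0.000000\" -> 0.86267\n",
"    if f > bestF: bestF, bestWeight = f, negWeight\n",
"print 'best F = %g at negWeight = %g' % (bestF, bestWeight)\n",
"```"
]
},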
1043 | {
1044 | "cell_type": "markdown",
1045 | "metadata": {},
1046 | "source": [
1047 | "From this, we can see a fairly wide range of performance. The optimal performance here is with a weight of around 0.05, yielding an F score of 86.3%.\n",
1048 | "\n",
1049 | "# Making Final Predictions\n",
1050 | "\n",
1051 | "Now that we know a good negative example weighting (0.0512821), we're going to train a final model and predict and evaluate on test data:"
1052 | ]
1053 | },
1054 | {
1055 | "cell_type": "code",
1056 | "collapsed": false,
1057 | "input": [
1058 | "negWeight = 0.0512821\n",
1059 | "sentimentToVWFile('data/sentiword.tr', train, negWeight)\n",
1060 | "sentimentToVWFile('data/sentiword.de', dev , negWeight)\n",
1061 | "sentimentToVWFile('data/sentiword.te', test , negWeight)\n",
1062 | "\n",
1063 | "# combine train and dev into one\n",
1064 | "!cat data/sentiword.tr data/sentiword.de > data/sentiword.trde\n",
1065 | "# train\n",
1066 | "!vw -k -c -b 27 --binary data/sentiword.trde --passes 100 -f data/sentiword.model --loss_function logistic --quiet --holdout_after 163564\n",
1067 | "# now, predict on TEST\n",
1068 | "!vw --binary -t -r data/sentiword.te.raw -i data/sentiword.model data/sentiword.te --quiet\n",
1069 | "!cut -d' ' -f1 data/sentiword.te | paste - data/sentiword.te.raw | perf -PRF -t 0"
1070 | ],
1071 | "language": "python",
1072 | "metadata": {},
1073 | "outputs": [
1074 | {
1075 | "output_type": "stream",
1076 | "stream": "stdout",
1077 | "text": [
1078 | "PRF 0.84909 pred_thresh 0.000000\r\n"
1079 | ]
1080 | }
1081 | ],
1082 | "prompt_number": 11
1083 | },
1084 | {
1085 | "cell_type": "markdown",
1086 | "metadata": {},
1087 | "source": [
1088 | "And voila, we have a system that gets an F score of 84.9% on test data.\n",
1089 | "\n",
1090 | "How does this compare to anything else? I have no idea. This was kind of a made-up task :)."
1091 | ]
1092 | }
1093 | ],
1094 | "metadata": {}
1095 | }
1096 | ]
1097 | }
--------------------------------------------------------------------------------
/UnsupervisedNLP.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "",
4 | "signature": "sha256:d487a559a2178f2ec8f9812ca4dde51d724651d8b20cef6bc5858c45359a21b8"
5 | },
6 | "nbformat": 3,
7 | "nbformat_minor": 0,
8 | "worksheets": [
9 | {
10 | "cells": [
11 | {
12 | "cell_type": "markdown",
13 | "metadata": {},
14 | "source": [
15 | "# Unsupervised Learning in VW\n",
16 | "\n",
17 | "# DO NOT READ THIS YET, THERE IS A BUG IN VW LDA THAT WE'RE WORKING ON!\n",
18 | "\n",
19 | "Up to this point, all we've talked about has been supervised learning. For the most part, `vw` *is* a toolkit for supervised learning, but it does have a few ways to do unsupervised learning. The first is built-in support for topic models (in this case, streaming latent Dirichlet allocation). The second is a style of matrix factorization that we can use to construct autoencoders.\n",
20 | "\n",
21 | "# Topics Models in VW\n",
22 | "\n",
23 | "Before we begin, we need to make sure we have some data. We'll re-use the 20 Newsgroups data from the [Multiclass Classification](MulticlassClassification.ipynb) notebook. If you haven't downloaded it yet, please go execute the first two command blocks in that notebook. If that has succeeded, then the following should work:"
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "collapsed": false,
29 | "input": [
30 | "!wc data/20ng.t[re]"
31 | ],
32 | "language": "python",
33 | "metadata": {},
34 | "outputs": [
35 | {
36 | "output_type": "stream",
37 | "stream": "stdout",
38 | "text": [
39 | " 7532 2932098 13603023 data/20ng.te\r\n"
40 | ]
41 | },
42 | {
43 | "output_type": "stream",
44 | "stream": "stdout",
45 | "text": [
46 | " 11314 4877902 22111515 data/20ng.tr\r\n",
47 | " 18846 7810000 35714538 total\r\n"
48 | ]
49 | }
50 | ],
51 | "prompt_number": 1
52 | },
53 | {
54 | "cell_type": "markdown",
55 | "metadata": {},
56 | "source": [
57 | "First, we have to get the data into LDA-friendly format. This means (a) removing the labels (because this is unsupervised learning, after all!); (b) reducing the data to a single, un-named namespace; and (c) reducing everything to words and their counts. LDA also tends to work better after stopwords have been removed; we'll use a list of stopwords from [ranks.nl](http://www.ranks.nl/stopwords). We'll do this all below with a bit of python."
58 | ]
59 | },
60 | {
61 | "cell_type": "code",
62 | "collapsed": false,
63 | "input": [
64 | "import re\n",
65 | "from collections import Counter\n",
66 | "\n",
67 | "def tokenize(s): return re.sub('([^A-Za-z0-9 ]+)', ' \\\\1 ', s).split() # add space around anything not alphanum\n",
68 | "\n",
69 | "stopwords = set(tokenize(\"\"\"a about above after again against all am an and any are aren't as at be because\n",
70 | " been before being below between both but by can't cannot could couldn't did didn't\n",
71 | " do does doesn't doing don't down during each few for from further had hadn't has\n",
72 | " hasn't have haven't having he he'd he'll he's her here here's hers herself him\n",
73 | " himself his how how's i i'd i'll i'm i've if in into is isn't it it's its itself\n",
74 | " let's me more most mustn't my myself no nor not of off on once only or other ought\n",
75 | " our ours ourselves out over own same shan't she she'd she'll she's should shouldn't\n",
76 | " so some such than that that's the their theirs them themselves then there there's\n",
77 | " these they they'd they'll they're they've this those through to too under until up\n",
78 | " very was wasn't we we'd we'll we're we've were weren't what what's when when's\n",
79 | " where where's which while who who's whom why why's with won't would wouldn't you\n",
80 | " you'd you'll you're you've your yours yourself yourselves\"\"\"))\n",
81 | "\n",
82 | "def countWords(txt):\n",
83 | " match = re.match('^.* \\|w ([^\\|]*)', txt)\n",
84 | " if match is None: return Counter()\n",
85 | " words = match.group(1).lower().split()\n",
86 | " return Counter([w for w in words if w not in stopwords and w.isalpha() and len(w) >= 3 and len(w) < 20])\n",
87 | "\n",
88 | "def computeDocumentFrequencies(inputFile):\n",
89 | " df = Counter()\n",
90 | " numDocs = 0\n",
91 | " with open(inputFile, 'r') as h:\n",
92 | " for l in h.readlines():\n",
93 | " count = countWords(l)\n",
94 | " for w in count.iterkeys():\n",
95 | " df[w] += 1\n",
96 | " numDocs += 1\n",
97 | " return df,numDocs\n",
98 | "\n",
99 | "def dataToLDA(inputFile, outputHandle, df, minFreq, maxFreq):\n",
100 | " with open(inputFile, 'r') as h:\n",
101 | " for l in h.readlines():\n",
102 | " count = countWords(l)\n",
103 | " if len(count) > 0: # unfortunately, the current vw segfaults on empty lines\n",
104 | " print >>outputHandle, '|', ' '.join(['%s:%d' % (w,c)\n",
105 | " for w,c in count.iteritems()\n",
106 | " if df[w] >= minFreq and df[w] <= maxFreq])\n",
107 | "\n",
108 | "df,numDocs = computeDocumentFrequencies('data/20ng.tr')\n",
109 | "with open('data/20ng.unlab', 'w') as h:\n",
110 | " dataToLDA('data/20ng.tr', h, df, 10, numDocs/10)\n",
111 | " \n",
112 | "!head -n2 data/20ng.unlab\n",
113 | "!wc -l data/20ng.unlab"
114 | ],
115 | "language": "python",
116 | "metadata": {},
117 | "outputs": [
118 | {
119 | "output_type": "stream",
120 | "stream": "stdout",
121 | "text": [
122 | "| saying:1 money:1 attempts:1 hell:1 beliefs:1 seems:2 symptoms:1 treatment:1 covered:1 happened:1 combination:1 read:1 sarcastic:1 geb:1 every:1 accepted:2 condition:2 areas:1 school:1 countries:3 issue:1 switzerland:1 found:2 homeopathy:3 wrote:1 view:1 absolutely:1 direct:1 depends:1 helpless:1 insurance:2 living:1 lead:1 deeply:1 robert:2 scientists:1 medicine:5 reading:1 conditions:1 method:1 med:1 scientific:1 business:1 however:1 enables:1 country:1 changed:1 pitt:1 experience:1 keep:1 gordon:1 germany:1 mean:1 comment:1 pays:1 austria:2 open:1 least:1 doubt:1 taken:1 waste:1 vienna:1 life:2 offer:1 excuse:1 worked:1 case:3 holland:1 patients:1 personality:1 powerless:1 modern:4 mind:2 helped:1 britain:1 cure:2 sense:1 relatively:1 pay:1 offensive:1 note:1 answer:1 european:1 practitioner:1 intended:1 normal:2 treatments:1 paid:1 swiss:1 charm:1 oracle:3 coming:1 makes:1 severe:1 banks:1 sounded:1\r\n",
123 | "| sometimes:1 washington:1 pitt:2 brain:2 medication:1 throat:2 worry:2 elizabeth:1 given:1 uucp:1 carriers:1 culture:1 boot:1 vessels:1 forms:1 live:1 causing:1 assuming:1 taken:1 becoming:1 common:1 kills:1 swell:1 took:1 geb:1 course:1 blood:1 schools:1 especially:1 covering:1 camp:1 negative:1 gordon:1 carrier:1 banks:1 attacking:1\r\n"
124 | ]
125 | },
126 | {
127 | "output_type": "stream",
128 | "stream": "stdout",
129 | "text": [
130 | "11297 data/20ng.unlab\r\n"
131 | ]
132 | }
133 | ],
134 | "prompt_number": 1
135 | },
136 | {
137 | "cell_type": "markdown",
138 | "metadata": {},
139 | "source": [
140 | "Now that the data is in LDA-friendly mode, we simply need to tell `vw` to run LDA.\n",
141 | "\n",
142 | "Unfortunately, unlike supervised learning, there are a number of hyperparameters that you have to tweak fairly carefully to get reasonable performance out of LDA, more or less because unsupervised learning is hard. The main parameters you have to set are:\n",
143 | "\n",
144 | "* How many topics you want! This you have to use your judgment. We'll do 5 just so that it's easy to look at the output.\n",
145 | "* The Dirichlet prior on p(word|topic). This is called `--lda_rho`. If you want sparsity, this should be less than 1. A starting point of 0.1 is usually reasonable.\n",
146 | "* The Dirichlet prior on p(topic|document). This is called `--lda_alpha` and 0.1 is also a reasonable starting point here.\n",
147 | "* The minibatch size: basically how many documents should LDA look at at a single time. 256 is reasonable.\n",
148 | "\n",
149 | "Here we go:"
150 | ]
151 | },
152 | {
153 | "cell_type": "code",
154 | "collapsed": false,
155 | "input": [
156 | "!vw -k -c -b20 -d data/20ng.unlab --passes 100 --lda 20 --lda_D 11297 --lda_alpha 0.1 --lda_rho 0.1 --minibatch 512 -f data/20ng.unlab.lda -l 0.1 -p data/20ng.unlab.pred --random_seed 1234 --random_weights on\n",
157 | "!tail -n2000 data/20ng.unlab.pred | sort -R | tail"
158 | ],
159 | "language": "python",
160 | "metadata": {},
161 | "outputs": [
162 | {
163 | "output_type": "stream",
164 | "stream": "stdout",
165 | "text": [
166 | "final_regressor = data/20ng.unlab.lda\r\n",
167 | "predictions = data/20ng.unlab.pred\r\n",
168 | "Num weight bits = 20\r\n",
169 | "learning rate = 0.1\r\n",
170 | "initial_t = 0\r\n",
171 | "power_t = 0.5\r\n",
172 | "decay_learning_rate = 1\r\n"
173 | ]
174 | },
175 | {
176 | "output_type": "stream",
177 | "stream": "stdout",
178 | "text": [
179 | "creating cache_file = data/20ng.unlab.cache\r\n",
180 | "Reading datafile = data/20ng.unlab\r\n",
181 | "num sources = 1\r\n",
182 | "average since example example current current current\r\n",
183 | "loss last counter weight label predict features\r\n"
184 | ]
185 | },
186 | {
187 | "output_type": "stream",
188 | "stream": "stdout",
189 | "text": [
190 | "15.983848 15.983848 1 1.0 unknown 0.0000 96\r\n",
191 | "16.033634 16.083420 2 2.0 unknown 0.0000 36\r\n",
192 | "15.997947 15.962260 4 4.0 unknown 0.0000 90\r\n",
193 | "16.024702 16.051457 8 8.0 unknown 0.0000 147\r\n",
194 | "17.079831 18.134961 16 16.0 unknown 0.0000 178\r\n",
195 | "17.608963 18.138095 32 32.0 unknown 0.0000 5\r\n"
196 | ]
197 | },
198 | {
199 | "output_type": "stream",
200 | "stream": "stdout",
201 | "text": [
202 | "17.839076 18.069188 64 64.0 unknown 0.0000 54\r\n",
203 | "17.829170 17.819265 128 128.0 unknown 0.0000 55\r\n"
204 | ]
205 | },
206 | {
207 | "output_type": "stream",
208 | "stream": "stdout",
209 | "text": [
210 | "17.824724 17.820278 256 256.0 unknown 0.0000 80\r\n"
211 | ]
212 | },
213 | {
214 | "output_type": "stream",
215 | "stream": "stdout",
216 | "text": [
217 | "17.827845 17.830965 512 512.0 unknown 0.0000 8\r\n"
218 | ]
219 | },
220 | {
221 | "output_type": "stream",
222 | "stream": "stdout",
223 | "text": [
224 | "17.848065 17.868286 1024 1024.0 unknown 0.0000 69\r\n"
225 | ]
226 | },
227 | {
228 | "output_type": "stream",
229 | "stream": "stdout",
230 | "text": [
231 | "17.846609 17.845153 2048 2048.0 unknown 0.0000 75\r\n"
232 | ]
233 | },
234 | {
235 | "output_type": "stream",
236 | "stream": "stdout",
237 | "text": [
238 | "17.850727 17.854845 4096 4096.0 unknown 0.0000 115\r\n"
239 | ]
240 | },
241 | {
242 | "output_type": "stream",
243 | "stream": "stdout",
244 | "text": [
245 | "17.837450 17.824174 8192 8192.0 unknown 0.0000 16\r\n"
246 | ]
247 | },
248 | {
249 | "output_type": "stream",
250 | "stream": "stdout",
251 | "text": [
252 | "17.839481 17.841512 16384 16384.0 unknown 0.0000 32\r\n"
253 | ]
254 | },
255 | {
256 | "output_type": "stream",
257 | "stream": "stdout",
258 | "text": [
259 | "17.837647 17.835812 32768 32768.0 unknown 0.0000 99\r\n"
260 | ]
261 | },
262 | {
263 | "output_type": "stream",
264 | "stream": "stdout",
265 | "text": [
266 | "17.837805 17.837963 65536 65536.0 unknown 0.0000 17\r\n"
267 | ]
268 | },
269 | {
270 | "output_type": "stream",
271 | "stream": "stdout",
272 | "text": [
273 | "17.836958 17.836111 131072 131072.0 unknown 0.0000 68\r\n"
274 | ]
275 | },
276 | {
277 | "output_type": "stream",
278 | "stream": "stdout",
279 | "text": [
280 | "17.836983 17.837008 262144 262144.0 unknown 0.0000 21\r\n"
281 | ]
282 | },
283 | {
284 | "output_type": "stream",
285 | "stream": "stdout",
286 | "text": [
287 | "17.837052 17.837121 524288 524288.0 unknown 0.0000 23\r\n"
288 | ]
289 | },
290 | {
291 | "output_type": "stream",
292 | "stream": "stdout",
293 | "text": [
294 | "\r\n",
295 | "finished run\r\n",
296 | "number of examples = 1016800\r\n",
297 | "weighted example sum = 1016800.000000\r\n",
298 | "weighted label sum = 0.000000\r\n",
299 | "average loss = 0.000000 h\r\n",
300 | "total feature number = 72343800\r\n"
301 | ]
302 | },
303 | {
304 | "output_type": "stream",
305 | "stream": "stdout",
306 | "text": [
307 | "16.815300 0.100029 7.135149 0.100027 0.100026 0.100025 0.100035 0.100000 6.202374 11.961309 0.100027 0.100000 0.100000 0.100000 30.385586 0.100031 0.100000 0.100024 0.100028 0.100032 \r\n",
308 | "0.100026 0.100020 0.100036 0.100030 0.100028 0.100026 0.100032 0.100000 0.100035 2.770464 4.002769 0.100000 0.100000 0.100000 10.295366 0.100032 0.100000 12.651035 12.780082 0.100020 \r\n",
309 | "0.100035 0.100029 18.554890 0.100037 0.100032 0.100035 5.396076 0.100000 0.100035 0.100026 0.100033 0.100000 0.100000 0.100000 0.100035 11.562382 0.100000 0.100027 26.366322 9.620002 \r\n",
310 | "145.182770 31.085636 50.398434 99.868141 116.868889 0.108624 66.067795 0.100000 4.942534 56.463730 0.100030 0.100000 0.100000 0.100000 25.900484 88.307060 0.100000 0.100031 0.100031 99.005722 \r\n",
311 | "0.100034 14.342416 31.472063 9.750040 21.763355 0.100028 0.100030 0.100000 0.100025 0.100028 0.100034 0.100000 0.100000 0.100000 0.100027 0.100032 0.100000 4.723732 0.100028 11.548128 \r\n",
312 | "7.564924 0.100031 0.100027 0.100024 0.100033 0.100032 4.798840 0.100000 0.100024 13.499141 0.100027 0.100000 0.100000 0.100000 4.536791 0.100025 0.100000 0.100030 0.100025 0.100028 \r\n",
313 | "25.012398 11.817080 0.100028 0.100027 0.100032 57.715881 30.296272 0.100000 0.101200 0.100031 0.100031 0.100000 0.100000 0.100000 30.209276 42.206295 0.100000 24.220533 17.320871 0.100029 \r\n",
314 | "0.100036 3.238522 0.100041 0.100045 10.685641 0.100036 0.100028 0.100000 0.100040 6.746550 0.100022 0.100000 0.100000 0.100000 0.100042 0.100039 0.100000 0.100040 11.728886 0.100032 \r\n",
315 | "6.544247 0.100038 9.555730 0.100036 0.100022 0.100023 0.100035 0.100000 0.100021 0.100028 4.012021 0.100000 0.100000 0.100000 14.180709 10.207019 0.100000 0.100021 0.100023 0.100030 \r\n",
316 | "0.100037 0.100036 0.100029 0.100037 0.100032 0.100031 6.457939 0.100000 0.100036 0.100033 0.100028 0.100000 0.100000 0.100000 0.100024 24.260477 0.100000 0.100038 20.440647 20.240576 \r\n"
317 | ]
318 | }
319 | ],
320 | "prompt_number": 38
321 | },
322 | {
323 | "cell_type": "code",
324 | "collapsed": false,
325 | "input": [
326 | "!vw -t -d data/20ng.unlab -i data/20ng.unlab.lda -p data/20ng.unlab.pred --quiet\n",
327 | "!head -n20 data/20ng.unlab.pred"
328 | ],
329 | "language": "python",
330 | "metadata": {},
331 | "outputs": [
332 | {
333 | "output_type": "stream",
334 | "stream": "stdout",
335 | "text": [
336 | "21.715076 23.202663 7.844841 0.100031 0.100033 19.938475 0.100034 0.100000 0.100027 0.100030 0.100030 0.100000 0.100000 0.100000 25.687893 0.100032 0.100000 12.902250 11.408561 0.100025 \r\n",
337 | "0.100030 14.869518 0.100027 14.881866 0.100025 10.548245 0.100031 0.100000 0.100036 0.100031 0.100035 0.100000 0.100000 0.100000 0.100031 0.100032 0.100000 0.100031 0.100032 0.100029 \r\n",
338 | "0.100028 0.100033 13.696851 23.532206 0.100026 33.708664 0.100032 0.100000 10.899662 0.100038 0.100031 0.100000 0.100000 0.100000 0.100023 0.100027 0.100000 11.662322 0.100036 0.100026 \r\n",
339 | "0.100025 32.965546 0.100029 0.100032 14.288982 17.240261 0.100029 0.100000 0.100029 17.485605 0.100033 0.100000 0.100000 0.100000 11.590836 0.100025 0.100000 0.100031 17.028515 0.100030 \r\n",
340 | "0.100022 0.100033 26.766905 0.100033 0.100030 6.657275 0.100024 0.100000 0.100023 7.312836 15.085741 0.100000 0.100000 0.100000 21.328510 18.448481 0.100000 0.100025 0.100024 0.100035 \r\n",
341 | "7.674386 11.294849 15.190828 10.945913 10.477317 0.100027 0.100026 0.100000 0.100025 8.016451 0.100035 0.100000 0.100000 0.100000 0.100024 0.100025 0.100000 0.100034 0.100038 0.100027 \r\n",
342 | "0.100027 0.100020 0.100012 0.100031 0.100030 0.100033 0.100038 0.100000 0.100021 3.301056 6.424484 0.100000 0.100000 0.100000 0.100016 0.100015 0.100000 3.574184 0.100017 0.100016 \r\n",
343 | "0.100029 0.100029 16.585390 0.100027 0.100030 34.432915 0.100029 0.100000 18.102285 7.212412 79.950989 0.100000 0.100000 0.100000 24.416714 0.100029 0.100000 0.100032 12.999031 0.100033 \r\n",
344 | "0.100040 15.319128 0.100035 0.100024 0.100036 0.100030 0.100034 0.100000 0.100022 0.100031 0.100029 0.100000 0.100000 0.100000 7.586847 0.100033 0.100000 0.100020 0.100032 14.393661 \r\n",
345 | "0.101697 12.081765 42.313911 0.100031 55.201290 15.307070 16.451059 0.100000 33.695316 42.657902 0.100030 0.100000 0.100000 0.100000 46.229485 95.495476 0.100000 22.323343 36.293098 27.148571 \r\n",
346 | "6.316563 0.100049 0.100020 0.100021 0.100014 0.100032 0.100013 0.100000 0.100015 2.477398 0.100028 0.100000 0.100000 0.100000 8.505736 0.100044 0.100000 0.100020 0.100022 0.100023 \r\n",
347 | "17.756857 0.100026 0.100026 11.236880 0.100032 0.100033 21.933317 0.100000 0.100035 9.471942 0.100027 0.100000 0.100000 0.100000 0.100032 0.100028 0.100000 0.100035 0.100033 10.100698 \r\n",
348 | "0.100034 0.100037 0.100033 0.100040 0.100046 0.100051 9.074137 0.100000 0.100037 0.100030 19.091578 0.100000 0.100000 0.100000 0.100033 0.100023 0.100000 6.133863 0.100018 0.100046 \r\n",
349 | "0.100031 0.100032 0.100037 5.835958 0.100023 0.100024 0.100018 0.100000 0.100034 0.100049 0.100027 0.100000 0.100000 0.100000 3.929354 0.100032 0.100000 0.100037 18.534317 0.100030 \r\n",
350 | "0.100029 0.100034 0.100021 0.100036 0.100027 9.389404 0.100032 0.100000 0.100037 19.837572 0.100031 0.100000 0.100000 0.100000 20.347139 20.825548 0.100000 0.100034 0.100037 0.100024 \r\n",
351 | "0.100017 2.735978 0.100018 0.100047 0.100021 0.100049 4.274239 0.100000 0.100026 0.100020 0.100032 0.100000 0.100000 0.100000 0.100015 0.100016 0.100000 0.100023 0.100020 2.289477 \r\n",
352 | "35.242641 9.909006 22.206738 31.911646 0.100343 0.100033 8.737320 0.100000 12.071489 9.889423 56.425446 0.100000 0.100000 0.100000 0.100030 38.502533 0.100000 29.103296 0.100032 0.100030 \r\n",
353 | "17.962646 12.813928 17.916031 25.007462 14.369626 47.378246 0.100040 0.100000 43.498878 9.396447 7.466072 0.100000 0.100000 0.100000 0.100029 19.701447 0.100000 34.633305 0.100031 50.055798 \r\n",
354 | "7.346448 0.100030 0.100022 2.790637 0.100033 0.100042 4.827148 0.100000 0.100038 0.100031 0.100022 0.100000 0.100000 0.100000 3.435438 0.100024 0.100000 0.100027 0.100031 0.100027 \r\n",
355 | "0.100034 0.100022 0.100024 0.100029 0.100035 7.024383 4.535814 0.100000 0.100028 0.100035 0.100022 0.100000 0.100000 0.100000 6.481256 0.100026 0.100000 4.769609 7.688655 0.100027 \r\n"
356 | ]
357 | }
358 | ],
359 | "prompt_number": 39
360 | },
361 | {
362 | "cell_type": "markdown",
363 | "metadata": {},
364 | "source": [
365 | "Overall this command line basically looks like a normal `vw` command line. The only annoying thing is that you have to tell LDA how many documents there are total (via `--lda_D`).\n",
366 | "\n",
367 | "At the end, we want to get some sort of predictions out. Either we want to know, for some documents, what the topic distribution is for each document (this will be achieved with `-p` predictions, like for supervised learning) or we want to know the topic-to-word distribution (this will be achieved by bending over backwards a bit because unfortunately the current [LDA implementation doesn't support `--invert_hash`](https://groups.yahoo.com/neo/groups/vowpal_wabbit/conversations/topics/3622))."
368 | ]
369 | },
370 | {
371 | "cell_type": "markdown",
372 | "metadata": {},
373 | "source": [
374 | "Here, each line correponds to a document and each column corresponds to a topic. If you want probabilities, you should normalize these to sum to one. The small ones (the ones very close to 0.1) are the ones that are basically only getting their mass from the prior. So the first document is mostly about topic 3 and somewhat about topic 1. The fourth document is mostly about topic 1 and somewhat about topic 3.\n",
375 | "\n",
376 | "We can also look at the topic-to-word distribution. Unfortunately, LDA doesn't currently support `--invert_hash` so we have to go with a backdoor solution. What we'll do is create a \"vocabulary\" data set, where each example corresponds to a single word, and we remove duplicates. We can then run this data through `vw` in \"audit\" mode. In \"audit\" mode, `vw` will output some running statistics about what it is computing on each example. We can extract from this output the word-to-topic distribution. It sounds complicated, but it's not soo bad."
377 | ]
378 | },
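{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is the normalization sketch promised above. It assumes the `-p` file now contains one row per document with one unnormalized value per topic (as in the `head` output above):\n",
"\n",
"```python\n",
"# Sketch: normalize each row of the -p output so that it sums to one.\n",
"docTopics = []\n",
"with open('data/20ng.unlab.pred', 'r') as h:\n",
"    for line in h.readlines():\n",
"        vals = [float(x) for x in line.split()]\n",
"        total = sum(vals)\n",
"        docTopics.append([v / total for v in vals])\n",
"\n",
"# e.g. the most probable topic (1-indexed, to match the discussion above) for the first few documents:\n",
"for probs in docTopics[:4]:\n",
"    best = max(range(len(probs)), key=lambda k: probs[k])\n",
"    print 'top topic = %2d   p = %.3f' % (best + 1, probs[best])\n",
"```\n",
"\n",
"Now, on to the vocabulary trick."
]
},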
379 | {
380 | "cell_type": "code",
381 | "collapsed": false,
382 | "input": [
383 | "!cat data/20ng.unlab | cut -d' ' -f2- | sed 's/:[0-9]*//g' | tr ' ' '\\n' | sort -u | sed 's/^/| /' > data/20ng.unlab.vocab\n",
384 | "!echo \"Vocabulary examples:\"\n",
385 | "!head data/20ng.unlab.vocab\n",
386 | "!echo \"\"\n",
387 | "!echo \"Running the first few through vw with audit on:\"\n",
388 | "!head data/20ng.unlab.vocab | vw -t -i data/20ng.unlab.lda --audit --quiet"
389 | ],
390 | "language": "python",
391 | "metadata": {},
392 | "outputs": [
393 | {
394 | "output_type": "stream",
395 | "stream": "stdout",
396 | "text": [
397 | "Vocabulary examples:\r\n"
398 | ]
399 | },
400 | {
401 | "output_type": "stream",
402 | "stream": "stdout",
403 | "text": [
404 | "| \r\n",
405 | "| aaa\r\n",
406 | "| aardvark\r\n",
407 | "| aaron\r\n",
408 | "| aau\r\n",
409 | "| abandon\r\n",
410 | "| abandoned\r\n",
411 | "| abbreviation\r\n",
412 | "| abc\r\n",
413 | "| aber\r\n"
414 | ]
415 | },
416 | {
417 | "output_type": "stream",
418 | "stream": "stdout",
419 | "text": [
420 | "\r\n"
421 | ]
422 | },
423 | {
424 | "output_type": "stream",
425 | "stream": "stdout",
426 | "text": [
427 | "Running the first few through vw with audit on:\r\n"
428 | ]
429 | },
430 | {
431 | "output_type": "stream",
432 | "stream": "stdout",
433 | "text": [
434 | "0.000000\r\n",
435 | "\t ^aaa:24503:1:0.182356:0.329447:0.169864:0.182029:0.108256:0.2706:0.110772:0.1246:0.119324:0.179318:0.131611:0.185727:0.119479:0.129452:0.27818:0.138858:0.127822:0.40772:0.220733:0.199408 total of 1 features.\r\n",
436 | "0.000000\r\n",
437 | "\t ^aardvark:294333:1:0.129039:0.122834:0.119256:0.191947:0.112045:0.233309:0.113939:0.169658:0.121593:0.236195:0.24682:0.427933:0.110589:0.296139:0.141129:0.214868:0.169591:0.195062:0.152851:0.18028 total of 1 features.\r\n",
438 | "0.000000\r\n",
439 | "\t ^aaron:748696:1:0.141309:0.123212:0.146817:0.19391:0.123514:0.264397:0.299233:0.21692:0.255138:0.262913:0.199882:0.183131:0.109128:0.365715:0.180723:0.21685:0.43045:0.283646:0.182521:0.50269 total of 1 features.\r\n",
440 | "0.000000\r\n",
441 | "\t ^aau:109285:1:0.11168:0.260085:0.142467:0.289339:0.171577:0.281379:0.155165:0.16833:0.163008:0.495602:0.19089:0.253979:0.114711:0.11282:0.228666:0.114903:0.117536:0.293731:0.174034:0.294063 total of 1 features.\r\n",
442 | "0.000000\r\n",
443 | "\t ^abandon:631650:1:0.200887:0.128976:0.194837:0.370842:0.291929:0.160005:0.152399:0.306269:0.23724:0.175018:0.39876:0.111871:0.327165:0.470243:0.137895:0.265763:0.113316:0.109048:0.111025:0.356025 total of 1 features.\r\n",
444 | "0.000000\r\n",
445 | "\t ^abandoned:916826:1:0.319634:0.164488:0.626414:0.421522:0.127739:0.261759:0.146873:0.127785:0.524455:0.178173:0.323854:0.130943:0.470056:0.251227:0.166257:0.119897:0.372174:0.144705:0.246781:0.157225 total of 1 features.\r\n",
446 | "0.000000\r\n",
447 | "\t ^abbreviation:139971:1:0.197607:0.160489:0.349651:0.196141:0.266556:0.108351:0.189101:0.231393:0.228699:0.126509:0.357983:0.151137:0.147059:0.207933:0.158753:0.110079:0.486701:0.200756:0.351235:0.154029 total of 1 features.\r\n",
448 | "0.000000\r\n",
449 | "\t ^abc:889850:1:0.339874:0.169692:0.237028:0.154603:0.283697:0.181479:0.128209:0.119643:0.112896:0.436227:0.468355:0.534854:0.179209:0.278722:0.25511:0.261242:0.115069:0.262839:0.10839:0.137095 total of 1 features.\r\n",
450 | "0.000000\r\n",
451 | "\t ^aber:413121:1:0.213243:0.113454:0.172553:0.148069:0.180906:0.259975:0.258192:0.197191:0.276445:0.243005:0.126313:0.195293:0.163656:0.312354:0.185354:0.155635:0.161777:0.13528:0.112745:0.3145 total of 1 features.\r\n"
452 | ]
453 | }
454 | ],
455 | "prompt_number": 40
456 | },
457 | {
458 | "cell_type": "markdown",
459 | "metadata": {},
460 | "source": [
461 | "Here, we can see the basic format for audit. First you get an overall prediction (for LDA this will also be 0.0). Then you get the word, followed by it's hash (eg 24503 for \"aaa\"), then it's count (the count in the vocabular file is always 1) and then it's unnormalized probability for each of the five topics. We can extract the relevant information by some shell scripting:"
462 | ]
463 | },
464 | {
465 | "cell_type": "code",
466 | "collapsed": false,
467 | "input": [
468 | "!vw -t -d data/20ng.unlab.vocab -i data/20ng.unlab.lda --audit --quiet 2>&1 | grep -v '^0' | cut -d' ' -f2 | tr ':' '\\t' | cut -f1,4- > data/20ng.unlab.topics\n",
469 | "!head data/20ng.unlab.topics"
470 | ],
471 | "language": "python",
472 | "metadata": {},
473 | "outputs": [
474 | {
475 | "output_type": "stream",
476 | "stream": "stdout",
477 | "text": [
478 | "^aaa\t0.182356\t0.329447\t0.169864\t0.182029\t0.108256\t0.2706\t0.110772\t0.1246\t0.119324\t0.179318\t0.131611\t0.185727\t0.119479\t0.129452\t0.27818\t0.138858\t0.127822\t0.40772\t0.220733\t0.199408\r\n",
479 | "^aardvark\t0.129039\t0.122834\t0.119256\t0.191947\t0.112045\t0.233309\t0.113939\t0.169658\t0.121593\t0.236195\t0.24682\t0.427933\t0.110589\t0.296139\t0.141129\t0.214868\t0.169591\t0.195062\t0.152851\t0.18028\r\n",
480 | "^aaron\t0.141309\t0.123212\t0.146817\t0.19391\t0.123514\t0.264397\t0.299233\t0.21692\t0.255138\t0.262913\t0.199882\t0.183131\t0.109128\t0.365715\t0.180723\t0.21685\t0.43045\t0.283646\t0.182521\t0.50269\r\n",
481 | "^aau\t0.11168\t0.260085\t0.142467\t0.289339\t0.171577\t0.281379\t0.155165\t0.16833\t0.163008\t0.495602\t0.19089\t0.253979\t0.114711\t0.11282\t0.228666\t0.114903\t0.117536\t0.293731\t0.174034\t0.294063\r\n",
482 | "^abandon\t0.200887\t0.128976\t0.194837\t0.370842\t0.291929\t0.160005\t0.152399\t0.306269\t0.23724\t0.175018\t0.39876\t0.111871\t0.327165\t0.470243\t0.137895\t0.265763\t0.113316\t0.109048\t0.111025\t0.356025\r\n",
483 | "^abandoned\t0.319634\t0.164488\t0.626414\t0.421522\t0.127739\t0.261759\t0.146873\t0.127785\t0.524455\t0.178173\t0.323854\t0.130943\t0.470056\t0.251227\t0.166257\t0.119897\t0.372174\t0.144705\t0.246781\t0.157225\r\n",
484 | "^abbreviation\t0.197607\t0.160489\t0.349651\t0.196141\t0.266556\t0.108351\t0.189101\t0.231393\t0.228699\t0.126509\t0.357983\t0.151137\t0.147059\t0.207933\t0.158753\t0.110079\t0.486701\t0.200756\t0.351235\t0.154029\r\n",
485 | "^abc\t0.339874\t0.169692\t0.237028\t0.154603\t0.283697\t0.181479\t0.128209\t0.119643\t0.112896\t0.436227\t0.468355\t0.534854\t0.179209\t0.278722\t0.25511\t0.261242\t0.115069\t0.262839\t0.10839\t0.137095\r\n",
486 | "^aber\t0.213243\t0.113454\t0.172553\t0.148069\t0.180906\t0.259975\t0.258192\t0.197191\t0.276445\t0.243005\t0.126313\t0.195293\t0.163656\t0.312354\t0.185354\t0.155635\t0.161777\t0.13528\t0.112745\t0.3145\r\n",
487 | "^abiding\t0.19583\t0.126143\t0.13001\t0.117153\t0.372594\t0.174991\t0.201046\t0.210155\t0.141046\t0.154915\t0.112312\t0.277414\t0.167284\t0.183201\t0.122946\t0.123565\t0.20932\t0.277415\t0.2191\t0.143324\r\n"
488 | ]
489 | }
490 | ],
491 | "prompt_number": 41
492 | },
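{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you would rather stay in Python than shell, a rough equivalent of the pipeline above might look like the following sketch. It assumes the audit line format we just saw (`^word:hash:count:` followed by colon-separated topic weights) and writes the same kind of tab-separated table:\n",
"\n",
"```python\n",
"# Rough Python equivalent of the shell pipeline above (same assumptions about the audit format).\n",
"auditOut = !vw -t -d data/20ng.unlab.vocab -i data/20ng.unlab.lda --audit --quiet 2>&1\n",
"with open('data/20ng.unlab.topics', 'w') as h:\n",
"    for line in auditOut:\n",
"        line = line.strip()\n",
"        if not line.startswith('^'): continue      # skip the per-example prediction lines\n",
"        fields = line.split(' ')[0].split(':')     # '^word:hash:count:w1:...:w20'\n",
"        word, weights = fields[0], fields[3:]\n",
"        h.write('%s\\t%s\\n' % (word, '\\t'.join(weights)))\n",
"```"
]
},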
493 | {
494 | "cell_type": "markdown",
495 | "metadata": {},
496 | "source": [
497 | "This now gives us something to work with. Let's look at the top terms for each of the topics."
498 | ]
499 | },
500 | {
501 | "cell_type": "code",
502 | "collapsed": false,
503 | "input": [
504 | "%%%bash --err err\n",
505 | "for n in `seq 1 20` ; do\n",
506 | " let col=$n+1\n",
507 | " echo \"top words from topic $n (from column $col)\"\n",
508 | " cat data/20ng.unlab.topics | cut -f1,$col | sort -k2,2gr | head\n",
509 | " echo \"\"\n",
510 | "done"
511 | ],
512 | "language": "python",
513 | "metadata": {},
514 | "outputs": [
515 | {
516 | "output_type": "stream",
517 | "stream": "stdout",
518 | "text": [
519 | "top words from topic 1 (from column 2)\n",
520 | "^permission\t1.3347\n",
521 | "^corinthians\t1.18478\n",
522 | "^knives\t1.18478\n",
523 | "^worship\t0.997135\n",
524 | "^seize\t0.975493\n",
525 | "^nanosecond\t0.916404\n",
526 | "^voices\t0.897943\n",
527 | "^acker\t0.891526\n",
528 | "^represents\t0.884127\n",
529 | "^democracy\t0.859931\n",
530 | "\n",
531 | "top words from topic 2 (from column 3)\n",
532 | "^invited\t1.11815\n",
533 | "^dubious\t1.00535\n",
534 | "^hudson\t0.987452\n",
535 | "^methodology\t0.962736\n",
536 | "^med\t0.960603\n",
537 | "^full\t0.956764\n",
538 | "^mosaic\t0.917933\n",
539 | "^explicit\t0.9074\n",
540 | "^ibm\t0.904663\n",
541 | "^compact\t0.885661\n",
542 | "\n",
543 | "top words from topic 3 (from column 4)\n",
544 | "^brute\t1.07875\n",
545 | "^hansen\t0.999382\n",
546 | "^lengthy\t0.986911\n",
547 | "^launcher\t0.972936\n",
548 | "^lafibm\t0.93067\n",
549 | "^restart\t0.913159\n",
550 | "^binah\t0.909636\n",
551 | "^girl\t0.881105\n",
552 | "^billboard\t0.878396\n",
553 | "^grenade\t0.851245\n",
554 | "\n",
555 | "top words from topic 4 (from column 5)\n",
556 | "^punishment\t1.32391\n",
557 | "^informatik\t1.31411\n",
558 | "^interference\t1.24872\n",
559 | "^clearer\t1.00772\n",
560 | "^goaltenders\t0.955884\n",
561 | "^bits\t0.937808\n",
562 | "^permits\t0.896536\n",
563 | "^mccall\t0.869107\n",
564 | "^lemon\t0.867107\n",
565 | "^fluke\t0.841388\n",
566 | "\n",
567 | "top words from topic 5 (from column 6)\n",
568 | "^nissan\t1.10143\n",
569 | "^source\t1.0535\n",
570 | "^racism\t1.01605\n",
571 | "^nmsu\t1.01441\n",
572 | "^gif\t1.01389\n",
573 | "^vocal\t0.980711\n",
574 | "^cryptology\t0.948117\n",
575 | "^neglected\t0.948117\n",
576 | "^operating\t0.884318\n",
577 | "^pose\t0.854931\n",
578 | "\n",
579 | "top words from topic 6 (from column 7)\n",
580 | "^mature\t1.15052\n",
581 | "^council\t1.11209\n",
582 | "^haul\t1.09126\n",
583 | "^increases\t1.01843\n",
584 | "^secretary\t1.00604\n",
585 | "^smoke\t0.954511\n",
586 | "^gifs\t0.938437\n",
587 | "^fuer\t0.913386\n",
588 | "^rational\t0.889523\n",
589 | "^optimal\t0.881105\n",
590 | "\n",
591 | "top words from topic 7 (from column 8)\n",
592 | "^clever\t1.12456\n",
593 | "^shell\t1.12456\n",
594 | "^overnight\t1.08892\n",
595 | "^loosing\t1.05242\n",
596 | "^turks\t1.02999\n",
597 | "^authenticity\t0.972306\n",
598 | "^attraction\t0.960076\n",
599 | "^approximation\t0.922112\n",
600 | "^talon\t0.917178\n",
601 | "^ends\t0.911875\n",
602 | "\n",
603 | "top words from topic 8 (from column 9)\n",
604 | "^defining\t1.02342\n",
605 | "^humanity\t0.981563\n",
606 | "^grown\t0.965316\n",
607 | "^expert\t0.902774\n",
608 | "^troy\t0.869741\n",
609 | "^hypothesis\t0.869091\n",
610 | "^belongs\t0.861842\n",
611 | "^newman\t0.861085\n",
612 | "^player\t0.853319\n",
613 | "^iscp\t0.848127\n",
614 | "\n",
615 | "top words from topic 9 (from column 10)\n",
616 | "^bps\t1.11052\n",
617 | "^expects\t1.10091\n",
618 | "^naming\t1.02374\n",
619 | "^blackhawks\t0.994692\n",
620 | "^dunno\t0.961593\n",
621 | "^cryptographically\t0.920888\n",
622 | "^organizers\t0.908545\n",
623 | "^yankee\t0.901851\n",
624 | "^gripe\t0.894153\n",
625 | "^outrageous\t0.845494\n",
626 | "\n",
627 | "top words from topic 10 (from column 11)\n",
628 | "^past\t1.25183\n",
629 | "^weighed\t1.04362\n",
630 | "^texts\t1.02087\n",
631 | "^constellation\t0.979739\n",
632 | "^playing\t0.962413\n",
633 | "^johns\t0.960041\n",
634 | "^illegal\t0.956764\n",
635 | "^bios\t0.951493\n",
636 | "^nth\t0.932608\n",
637 | "^greatly\t0.927932\n",
638 | "\n",
639 | "top words from topic 11 (from column 12)\n",
640 | "^severely\t1.21558\n",
641 | "^indian\t1.00876\n",
642 | "^cancer\t0.988724\n",
643 | "^utkvx\t0.976508\n",
644 | "^radiator\t0.942288\n",
645 | "^noone\t0.940316\n",
646 | "^smb\t0.930803\n",
647 | "^backs\t0.917295\n",
648 | "^shaky\t0.887557\n",
649 | "^clinical\t0.882631\n",
650 | "\n",
651 | "top words from topic 12 (from column 13)\n",
652 | "^jebright\t1.23132\n",
653 | "^gifts\t1.19482\n",
654 | "^genes\t1.16153\n",
655 | "^lucky\t1.08294\n",
656 | "^sprite\t0.983331\n",
657 | "^stone\t0.959971\n",
658 | "^sincerely\t0.956019\n",
659 | "^ida\t0.926768\n",
660 | "^leftover\t0.921376\n",
661 | "^misc\t0.916708\n",
662 | "\n",
663 | "top words from topic 13 (from column 14)\n",
664 | "^einstien\t1.10078\n",
665 | "^nelson\t1.08973\n",
666 | "^description\t1.07261\n",
667 | "^focus\t0.98535\n",
668 | "^unnecessarily\t0.977988\n",
669 | "^handgun\t0.974567\n",
670 | "^evolve\t0.97357\n",
671 | "^needless\t0.953881\n",
672 | "^combining\t0.946622\n",
673 | "^semitism\t0.946622\n",
674 | "\n",
675 | "top words from topic 14 (from column 15)\n",
676 | "^apartment\t1.20185\n",
677 | "^brooks\t1.04593\n",
678 | "^transmission\t1.03086\n",
679 | "^digress\t1.02198\n",
680 | "^pressure\t0.984336\n",
681 | "^impress\t0.978943\n",
682 | "^berg\t0.977286\n",
683 | "^consequences\t0.954511\n",
684 | "^sucked\t0.940054\n",
685 | "^expressions\t0.928688\n",
686 | "\n",
687 | "top words from topic 15 (from column 16)\n",
688 | "^louray\t1.27392\n",
689 | "^diverted\t1.0006\n",
690 | "^beginner\t0.949887\n",
691 | "^nosc\t0.908155\n",
692 | "^foods\t0.905631\n",
693 | "^castle\t0.857253\n",
694 | "^attach\t0.838165\n",
695 | "^comprehensive\t0.83462\n",
696 | "^readable\t0.824822\n",
697 | "^bos\t0.820325\n",
698 | "\n",
699 | "top words from topic 16 (from column 17)\n",
700 | "^deepak\t1.10663\n",
701 | "^phils\t1.06101\n",
702 | "^lose\t1.05792\n",
703 | "^mentions\t1.03952\n",
704 | "^camel\t1.02986\n",
705 | "^uga\t1.01355\n",
706 | "^tuba\t1.0054\n",
707 | "^opt\t0.968073\n",
708 | "^amoco\t0.952893\n",
709 | "^hoc\t0.939706\n",
710 | "\n",
711 | "top words from topic 17 (from column 18)\n",
712 | "^domino\t1.01314\n",
713 | "^equipped\t1.00992\n",
714 | "^del\t0.97266\n",
715 | "^recovered\t0.969326\n",
716 | "^win\t0.953452\n",
717 | "^shadows\t0.942616\n",
718 | "^billboards\t0.914505\n",
719 | "^mostly\t0.90597\n",
720 | "^chaos\t0.903746\n",
721 | "^ists\t0.89816\n",
722 | "\n",
723 | "top words from topic 18 (from column 19)\n",
724 | "^wdstarr\t1.15196\n",
725 | "^warner\t1.06726\n",
726 | "^wicked\t1.05827\n",
727 | "^saving\t1.01777\n",
728 | "^crucified\t1.01138\n",
729 | "^hackers\t0.99763\n",
730 | "^udel\t0.994837\n",
731 | "^safety\t0.948242\n",
732 | "^buttons\t0.908328\n",
733 | "^hsh\t0.903704\n",
734 | "\n",
735 | "top words from topic 19 (from column 20)\n",
736 | "^versions\t1.24085\n",
737 | "^planet\t1.11267\n",
738 | "^turk\t1.09499\n",
739 | "^murphy\t1.08549\n",
740 | "^gregg\t1.01623\n",
741 | "^hard\t1.00953\n",
742 | "^mozumder\t1.00178\n",
743 | "^unaware\t0.92718\n",
744 | "^differing\t0.915845\n",
745 | "^vehicle\t0.90695\n",
746 | "\n",
747 | "top words from topic 20 (from column 21)\n",
748 | "^plea\t1.13435\n",
749 | "^chocolate\t1.04523\n",
750 | "^cruiser\t1.01048\n",
751 | "^mirror\t1.00019\n",
752 | "^whitmore\t0.985972\n",
753 | "^reward\t0.979236\n",
754 | "^depend\t0.976917\n",
755 | "^taxation\t0.969785\n",
756 | "^kotb\t0.953386\n",
757 | "^avenue\t0.9235\n",
758 | "\n"
759 | ]
760 | }
761 | ],
762 | "prompt_number": 43
763 | },
764 | {
765 | "cell_type": "markdown",
766 | "metadata": {},
767 | "source": [
768 | "These are not super meaningful, largely because we didn't run enough topics."
769 | ]
770 | }
771 | ],
772 | "metadata": {}
773 | }
774 | ]
775 | }
--------------------------------------------------------------------------------