├── README ├── download.py ├── examples ├── SurprisalInContext │ ├── README │ ├── analyze.R │ ├── run-GoogleBooks.sh │ └── surprisal-2gram.py └── TrigramMatches │ ├── README │ ├── badwords.txt │ ├── compute_trigram_stats.py │ ├── find_matched_items.py │ └── find_matched_items_aXb.py ├── ngrampy ├── LineFile.py ├── LineFile.pyc ├── LineFileInMemory.py ├── __init__.py ├── __init__.pyc ├── debug.py └── helpers.py ├── process-google.py ├── process-initial.sh └── tests ├── ngrampy_tests.py ├── smallcorpus-malformed.txt.bz2 └── smallcorpus.txt.bz2 /README: -------------------------------------------------------------------------------- 1 | 2 | ngrampy is a python library for manipulating google (or similarly formatted) n-gram data. It provides a python class for basic table manipulations in which each operation on a table is mirrored by an operation on the hard drive, so huge n-gram files that cannot be read into RAM can still be processed. This costs a lot of hard drive time, but it handles arbitrary file sizes (5-20 GB is typical). It is *not* optimized for speed, since these jobs take a long time anyway and are typically run once. 3 | 4 | Usually it makes more sense to process the google files once, concatenating them and collapsing dates into one large file with all the ngrams (this may take a few days). For this, the process-google.py script is fastest (much faster than LineFile). In collapsing dates, it produces a much smaller file (~10GB for eng-us 2grams): 5 | gzip -dc /home/piantado/Desktop/GoogleBooks/eng-us-all/2/* | python process-google.py /tmp/G-eng-us-all 6 | Alternatively, unpigz is about 2x as fast as gzip on my computer (it multithreads fetching, file IO, etc.). 7 | 8 | This script does not do any fancy filtering of the ngrams. 9 | 10 | 11 | To download data from google, you can use download.py. 12 | 13 | NOTE: In general, you should use this library with 14 | 15 | export PYTHONIOENCODING=utf-8 16 | 17 | so that you can handle utf-8 characters from google. 18 | 19 | NOTE: Columns in the text files are split on whitespace; if you want something else, you should merge columns with underscores or similar. 20 | 21 | NOTE: pypy tends to run much faster than CPython for this! 22 | 23 | ======================================================== 24 | == LICENSE 25 | ======================================================== 26 | 27 | ngrampy is licensed under GPL 3.0 28 | 29 | ======================================================== 30 | == INSTALLATION: 31 | ======================================================== 32 | 33 | Put this library somewhere--mine lives in /home/piantado/mit/Libraries/ngrampy/ 34 | 35 | Set the PYTHONPATH environment variable to point to ngrampy/: 36 | 37 | export PYTHONPATH=$PYTHONPATH:/home/piantado/Desktop/mit/Libraries/ngrampy 38 | 39 | You can put this into your .bashrc file so that it is loaded automatically when you open a terminal. On Ubuntu and most Linux systems, this is: 40 | 41 | echo 'export PYTHONPATH=$PYTHONPATH:/home/piantado/Desktop/mit/Libraries/ngrampy' >> ~/.bashrc 42 | 43 | You can also do 44 | 45 | echo 'export PYTHONIOENCODING=utf-8' >> ~/.bashrc 46 | 47 | although this will change your default python encoding. 48 | 49 | After that you should be ready to use the library. 50 | 51 | -------------------------------------------------------------------------------- /download.py: -------------------------------------------------------------------------------- 1 | """ 2 | This downloads from google all of the files matching a pattern on the Google Books Ngram Download page.
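It is written for Python 2 (httplib2, BeautifulSoup 3, urllib) and creates per-language directories (e.g. eng-us-all/2/) under the current working directory, so run it from wherever the data should live, e.g.:

    python download.py

Only the languages hard-coded in the set below are downloaded; edit that set to fetch others.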
3 | """ 4 | 5 | import httplib2 6 | from BeautifulSoup import BeautifulSoup, SoupStrainer 7 | import re 8 | import os 9 | import urllib 10 | 11 | # Use httplib2 and BeautifulSoup to scrape the links from the google index page: 12 | http = httplib2.Http() 13 | status, response = http.request('http://storage.googleapis.com/books/ngrams/books/datasetsv2.html') 14 | 15 | for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')): 16 | if link.has_key('href'): 17 | url = link['href'] 18 | 19 | # IF we match what we want: 20 | if re.search("[12]gram.+20120701", url): 21 | # Decode this 22 | m = re.search(r"googlebooks-([\w\-]+)-(\d+)gram.+",url) 23 | language, n = m.groups(None) 24 | 25 | # Only download some language 26 | if language not in set(["eng-us-all","eng-gb-all", "fre-all", "ger-all", "heb-all", "ita-all", "rus-all", "spa-all" ]): continue 27 | 28 | filename = re.split(r"/", url)[-1] # last item on filename split 29 | 30 | # Make the directory if it does not exist 31 | if not os.path.exists(language): os.mkdir(language) 32 | if not os.path.exists(language+"/"+n): os.mkdir(language+"/"+n) 33 | 34 | if not os.path.exists(language+"/"+n+"/"+filename): 35 | print "# Downloading %s to %s" % (url, language+"/"+n+"/"+filename) 36 | urllib.urlretrieve(url, language+"/"+n+"/"+filename ) 37 | 38 | 39 | 40 | -------------------------------------------------------------------------------- /examples/SurprisalInContext/README: -------------------------------------------------------------------------------- 1 | This computes surprisal measures for 11 languages from Piantadosi, Tily, & Gibson's word length paper. It is a complete re-implementation, using the LineFile class to handle the big corpus. This handles unicode and filters 2 | the corpora somewhat differently than the original paper, but the results are largely the same. 3 | 4 | For fastest running, you should do this on a solid state drive. 5 | 6 | The analysis throws out many of the garbarge words on google by using vocabularies from OpenSubtlex, taking the most frequent 25k words. The Extract_Vocabularies directory contains a script to extract these vocabularies. 7 | -------------------------------------------------------------------------------- /examples/SurprisalInContext/analyze.R: -------------------------------------------------------------------------------- 1 | 2 | # A script for analyzing the results of run-all.sh, which will populate the Surprisal directory 3 | # In the original work, we used Opensubtlex to define vocabularies, but now for simplicity 4 | # Let's just use the most frequent strings 5 | 6 | # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 7 | # Some handy functions 8 | # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 9 | 10 | stdize <- function(x, ...) { (x - mean(x,...)) / sd(x, ...) 
} 11 | 12 | sort.by.frequency <- function(d) { d[order(d$Log.Frequency, decreasing=T),] } 13 | 14 | # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 15 | # Define the vocabulary -- take the most frequent words in some year 16 | # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 17 | 18 | VOCAB <- as.character(sort.by.frequency(read.table("Surprisal/eng-us-2.1950.txt", header=T))[1:25000,"Word"]) 19 | 20 | # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 21 | # Now analyze: 22 | # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 23 | 24 | D <- NULL 25 | for(Y in c("1900", "1925", "1950", "1975", "2000")){#"1500", "1525", "1550", "1600", "1625", "1650", "1675", "1700", "1725", "1750", "1775", "1800")) { 26 | for(L in c("eng-gb-2", "eng-us-2")){ 27 | 28 | d <- read.table(paste("Surprisal/", L, ".", Y, ".txt", sep=""), header=T) 29 | d <- d[is.element(d$Word, VOCAB),] 30 | 31 | # Very simple--just nonparametric correlations 32 | # NOTE: Email Steve for fancier scripts and analysis (partials, bootstrapping, etc.) 33 | sc <- cor.test(d$Surprisal, d$Orthographic.Length, method="spearman") 34 | fc <- cor.test(-d$Log.Frequency, d$Orthographic.Length, method="spearman") ## Negative log freq here so that its on the same scale (we didn't normalize freq) 35 | 36 | l <- lm( stdize(Orthographic.Length) ~ stdize(Surprisal), data=d) 37 | 38 | D <- rbind(D, data.frame( Language=L, 39 | Year=Y, 40 | Surprisal.cor=sc$estimate, 41 | Frequency.cor=fc$estimate, 42 | # Surprisal.p.value=sc$p.value, 43 | # Frequency.p.value=fc$p.value, 44 | mean.surprisal=mean(d$Surprisal), 45 | sd.surprisal=sd(d$Surprisal), 46 | lm.icpt=coef(l)[1], 47 | lm.slope=coef(l)[2], 48 | mean.fw.surprisal=weighted.mean(d$Surprisal, exp(d$Log.Frequency)), 49 | total.freq=sum(2.0**d$Log.Frequency) 50 | )) 51 | } 52 | } 53 | 54 | print(D) 55 | 56 | 57 | # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 58 | # Build the monster data frame 59 | # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 60 | 61 | # 62 | # D <- NULL 63 | # for(Y in c("1500", "1525", "1550", "1600", "1625", "1650", "1675", "1700", "1725", "1750", "1775", "1800")) { 64 | # for(L in c("eng-us-2")){ 65 | # 66 | # d <- read.table(paste("Surprisal/", L, ".", Y, ".txt", sep=""), header=T) 67 | # d <- d[is.element(d$Word, VOCAB),] 68 | # d$Total.Log.Frequency <- log(sum(2.0**d$Log.Frequency)) # TODO:Logsumexp 69 | # d$Year <- as.numeric(Y) 70 | # d$Language <- L 71 | # 72 | # D <- rbind(D, d) 73 | # } 74 | # } 75 | -------------------------------------------------------------------------------- /examples/SurprisalInContext/run-GoogleBooks.sh: -------------------------------------------------------------------------------- 1 | 2 | # Pass in the google directory with each file you'd like to process 3 | 4 | # > bash run-GoogleBooks.sh /CorpusA/GoogleBooks/Processed/eng-us-2/* 5 | 6 | # This then computes, stores the google archive to Archive/xxx.7z and surprisal to Surprisal 7 | 8 | for f in $@ 9 | do 10 | x=$(basename $f) # the base file name 11 | python surprisal-2gram.py --in=$f --path=/tmp/$x.google > Surprisal/$x.txt 12 | 7z a -mx=9 Archive/$x.7z /tmp/$x.google && rm /tmp/$x.google & # run in background since its slow 13 | done -------------------------------------------------------------------------------- /examples/SurprisalInContext/surprisal-2gram.py: -------------------------------------------------------------------------------- 1 | """ 2 | This file shows how to use ngrampy to compute the average surprisal 
measures from Piantadosi, Tily & Gibson (2011) 3 | 4 | python surprisal-2gram.py --in=/path/to/google --path=/tmp/temporaryfile > surprisal.txt 5 | 6 | """ 7 | from ngrampy.LineFile import * 8 | import os 9 | import argparse 10 | import glob 11 | 12 | ASSERT_SORTED = True # if you want an extra check on sorting 13 | 14 | parser = argparse.ArgumentParser(description='Compute average surprisal from google style data') 15 | parser.add_argument('--in', dest='in', type=str, default="/home/piantado/Desktop/mit/Corpora/GoogleNGrams/2/*", nargs="?", help='The directory with google files (e.g. Google/3gms/)') 16 | parser.add_argument('--path', dest='path', type=str, default="/tmp/GoogleSurprisal", nargs="?", help='Where the database file lives') 17 | args = vars(parser.parse_args()) 18 | 19 | print "# Loading files" 20 | G = LineFile( glob.glob(args['in']), header=["w1", "w2", "cnt12"], path=args['path']) 21 | print "# Cleaning" 22 | G.clean(columns=3) 23 | 24 | # Since we collapsed case, go through and re-sum the triple counts 25 | print "# Resumming for case collapsing" 26 | G.sort(keys="w1 w2") 27 | G.resum_equal("w1 w2", "cnt12", assert_sorted=ASSERT_SORTED ) # in collapsing case, etc., we need to re-sum 28 | 29 | # Now go through and 30 | print "# Making marginal counts" 31 | G.make_marginal_column("cnt1", "w1", "cnt12") 32 | 33 | # and compute surprisal 34 | print "# Sorting by word" 35 | G.sort("w2") 36 | 37 | print "# Computing surprisal" 38 | G.print_average_surprisal("w2", "cnt12", "cnt1", assert_sorted=ASSERT_SORTED) 39 | 40 | # And remove my temporary file: 41 | print "# Removing my temporary file" 42 | G.delete_tmp() 43 | 44 | # If you have a file that's already sorted, etc: 45 | #G = LineFile(["/ssd/GoogleSurprisal-ALREADYFILTERED"], path="/ssd/Gsurprisal", header=["w1", "w2", "w3", "cnt12", "cnt12"]) # for debugging 46 | #G.print_average_surprisal("w3", "cnt12", "cnt12", assert_sorted=False) 47 | -------------------------------------------------------------------------------- /examples/TrigramMatches/README: -------------------------------------------------------------------------------- 1 | This is for computing trigrams with matched properties (e.g. matched unigram and bigram stats). So you can find trigrams that are controlled on all but the joint probability, for instance. I once tried to do a project involving them. 2 | 3 | First build a "database" with compute_trigram_stats.py. This will take google and build a bigger file with each trigram and other measures such as the unigram and bigram probabilities. 4 | 5 | Then, run find_matched_items.py, which takes the output fo compute_trigram_stats (assumed to live in /ssd/trigram-stats), and then subsamples it, and then sorts to generate items which are matched. It outputs to stdout the number of items in the stack, the item number, and the two lines of /ssd/trigram-stats which are matched. 
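As a rough sketch of the matching criterion (hypothetical numbers here; the real check is check_tolerance() in find_matched_items.py below, with tolerance = 0.001), two candidate lines are paired when their unigram and bigram statistics agree to within a relative tolerance:

def within_tolerance(x, y, tolerance=0.001):
    # relative difference between two (log-scale) statistics
    return abs(x - y) / ((x + y) / 2.) < tolerance

print within_tolerance(12.3401, 12.3408)  # True: close enough to be paired
print within_tolerance(12.3401, 12.4500)  # False: rejected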
6 | -------------------------------------------------------------------------------- /examples/TrigramMatches/badwords.txt: -------------------------------------------------------------------------------- 1 | sexual 2 | pubic 3 | groin 4 | crotch 5 | genital 6 | genitals 7 | sex 8 | blow 9 | blowjob 10 | sexed 11 | sexting 12 | ahole 13 | anus 14 | ash0le 15 | ash0les 16 | asholes 17 | ass 18 | hot 19 | Ass Monkey 20 | Assface 21 | assh0le 22 | assh0lez 23 | asshole 24 | assholes 25 | assholz 26 | asswipe 27 | azzhole 28 | balls 29 | bassterds 30 | bastard 31 | bastards 32 | bastardz 33 | basterds 34 | basterdz 35 | Biatch 36 | bitch 37 | bitches 38 | Blow Job 39 | boffing 40 | butthole 41 | buttwipe 42 | c0ck 43 | c0cks 44 | c0k 45 | Carpet Muncher 46 | cawk 47 | cawks 48 | Clit 49 | cnts 50 | cntz 51 | cock 52 | cockhead 53 | cock-head 54 | cocks 55 | CockSucker 56 | cock-sucker 57 | crap 58 | cum 59 | cunt 60 | cunts 61 | cuntz 62 | dick 63 | dild0 64 | dild0s 65 | dildo 66 | dildos 67 | dilld0 68 | dilld0s 69 | dominatricks 70 | dominatrics 71 | dominatrix 72 | dyke 73 | enema 74 | f u c k 75 | f u c k e r 76 | fag 77 | fag1t 78 | faget 79 | fagg1t 80 | faggit 81 | faggot 82 | fagit 83 | fags 84 | fagz 85 | faig 86 | faigs 87 | fart 88 | flipping the bird 89 | fuck 90 | fucker 91 | fuckin 92 | fucking 93 | fucked 94 | fuckhole 95 | fuckable 96 | fuck-yes 97 | fucks 98 | Fudge Packer 99 | fuk 100 | Fukah 101 | Fuken 102 | fuker 103 | Fukin 104 | Fukk 105 | Fukkah 106 | Fukken 107 | Fukker 108 | Fukkin 109 | g00k 110 | gay 111 | gayboy 112 | gaygirl 113 | gays 114 | gayz 115 | God-damned 116 | h00r 117 | h0ar 118 | h0re 119 | hells 120 | hoar 121 | hoor 122 | hoore 123 | jackoff 124 | jap 125 | japs 126 | jerk-off 127 | jisim 128 | jiss 129 | jizm 130 | jizz 131 | knob 132 | knobs 133 | knobz 134 | kunt 135 | kunts 136 | kuntz 137 | Lesbian 138 | Lezzian 139 | Lipshits 140 | Lipshitz 141 | masochist 142 | masokist 143 | massterbait 144 | masstrbait 145 | masstrbate 146 | masterbaiter 147 | masterbate 148 | masterbates 149 | Motha Fucker 150 | Motha Fuker 151 | Motha Fukkah 152 | Motha Fukker 153 | Mother Fucker 154 | Mother Fukah 155 | Mother Fuker 156 | Mother Fukkah 157 | Mother Fukker 158 | mother-fucker 159 | Mutha Fucker 160 | Mutha Fukah 161 | Mutha Fuker 162 | Mutha Fukkah 163 | Mutha Fukker 164 | n1gr 165 | nastt 166 | nude 167 | nigger 168 | niiger 169 | nigur 170 | orgasim 171 | nigger; 172 | nigur; 173 | niiger; 174 | niigr; 175 | orafis 176 | orgasim; 177 | orgasm 178 | orgasum 179 | oriface 180 | orifice 181 | orifiss 182 | packi 183 | packie 184 | packy 185 | paki 186 | pakie 187 | paky 188 | pecker 189 | peeenus 190 | peeenusss 191 | peenus 192 | peinus 193 | pen1s 194 | penas 195 | penis 196 | penis-breath 197 | penus 198 | penuus 199 | Phuc 200 | Phuck 201 | Phuk 202 | Phuker 203 | Phukker 204 | polac 205 | polack 206 | polak 207 | Poonani 208 | pr1c 209 | pr1ck 210 | pr1k 211 | pusse 212 | pussee 213 | pussy 214 | puuke 215 | puuker 216 | queer 217 | queers 218 | queerz 219 | qweers 220 | qweerz 221 | qweir 222 | recktum 223 | rectum 224 | retard 225 | sadist 226 | scank 227 | schlong 228 | screwing 229 | semen 230 | sex 231 | sexy 232 | Sh!t 233 | sh1t 234 | sh1ter 235 | sh1ts 236 | sh1tter 237 | sh1tz 238 | shit 239 | shits 240 | shitter 241 | Shitty 242 | Shity 243 | shitz 244 | Shyt 245 | Shyte 246 | Shytty 247 | Shyty 248 | skanck 249 | skank 250 | skankee 251 | skankey 252 | skanks 253 | Skanky 254 | slut 255 | sluts 256 | Slutty 257 | slutz 258 | son-of-a-bitch 
259 | tit 260 | turd 261 | va1jina 262 | vag1na 263 | vagiina 264 | vagina 265 | vaj1na 266 | vajina 267 | vullva 268 | vulva 269 | w0p 270 | wh00r 271 | wh0re 272 | whore 273 | xrated 274 | xxx 275 | b!+ch 276 | bitch 277 | blowjob 278 | clit 279 | arschloch 280 | fuck 281 | shit 282 | ass 283 | asshole 284 | b!tch 285 | b17ch 286 | b1tch 287 | bastard 288 | bi+ch 289 | boiolas 290 | buceta 291 | c0ck 292 | cawk 293 | chink 294 | cipa 295 | clits 296 | cock 297 | cum 298 | cunt 299 | dildo 300 | dirsa 301 | ejakulate 302 | fatass 303 | fcuk 304 | fuk 305 | fux0r 306 | hoer 307 | hore 308 | jism 309 | kawk 310 | l3itch 311 | l3i+ch 312 | lesbian 313 | masturbate 314 | masterbat 315 | masterbat3 316 | motherfucker 317 | s.o.b. 318 | mofo 319 | nazi 320 | nigga 321 | nigger 322 | nutsack 323 | phuck 324 | pimpis 325 | pusse 326 | pussy 327 | scrotum 328 | sh!t 329 | shemale 330 | shi+ 331 | sh!+ 332 | slut 333 | smut 334 | teets 335 | tits 336 | boobs 337 | b00bs 338 | teez 339 | testical 340 | testicle 341 | titt 342 | w00se 343 | jackoff 344 | wank 345 | whoar 346 | whore 347 | damn 348 | dyke 349 | fuck 350 | shit 351 | @$$ 352 | amcik 353 | andskota 354 | arse 355 | assrammer 356 | ayir 357 | bi7ch 358 | bitch 359 | bollock 360 | breasts 361 | butt-pirate 362 | cabron 363 | cazzo 364 | chraa 365 | chuj 366 | Cock 367 | cunt 368 | d4mn 369 | daygo 370 | dego 371 | dick 372 | dike 373 | dupa 374 | dziwka 375 | ejackulate 376 | Ekrem 377 | Ekto 378 | enculer 379 | faen 380 | fag 381 | fanculo 382 | fanny 383 | feces 384 | feg 385 | Felcher 386 | ficken 387 | fitt 388 | Flikker 389 | foreskin 390 | Fotze 391 | Fu( 392 | fuk 393 | futkretzn 394 | gay 395 | gook 396 | guiena 397 | h0r 398 | h4x0r 399 | hell 400 | helvete 401 | hoer 402 | honkey 403 | Huevon 404 | hui 405 | injun 406 | jizz 407 | kanker 408 | kike 409 | klootzak 410 | kraut 411 | knulle 412 | kuk 413 | kuksuger 414 | Kurac 415 | kurwa 416 | kusi 417 | kyrpa 418 | lesbo 419 | mamhoon 420 | masturbat 421 | merd 422 | mibun 423 | monkleigh 424 | mouliewop 425 | muie 426 | mulkku 427 | muschi 428 | nazis 429 | nepesaurio 430 | nigger 431 | orospu 432 | paska 433 | perse 434 | picka 435 | pierdol 436 | pillu 437 | pimmel 438 | piss 439 | pizda 440 | poontsee 441 | poop 442 | porn 443 | p0rn 444 | pr0n 445 | preteen 446 | pula 447 | pule 448 | puta 449 | puto 450 | qahbeh 451 | queef 452 | rautenberg 453 | schaffer 454 | scheiss 455 | schlampe 456 | schmuck 457 | screw 458 | sh!t 459 | sharmuta 460 | sharmute 461 | shipal 462 | shiz 463 | skribz 464 | skurwysyn 465 | sphencter 466 | spic 467 | spierdalaj 468 | splooge 469 | suka 470 | b00b 471 | testicle 472 | titt 473 | twat 474 | vittu 475 | wank 476 | wetback 477 | wichser 478 | wop 479 | yed 480 | zabourah -------------------------------------------------------------------------------- /examples/TrigramMatches/compute_trigram_stats.py: -------------------------------------------------------------------------------- 1 | 2 | from ngrampy.LineFile import * 3 | import os 4 | GOOGLE_ENGLISH_DIR = "/home/piantado/Desktop/mit/Corpora/GoogleNGrams/3/" 5 | VOCAB_FILE = "Vocabulary/EnglishVocabulary.txt" 6 | 7 | # Read the vocabulary file 8 | vocabulary = [ l.strip() for l in open(VOCAB_FILE, "r") ] 9 | 10 | #rawG = LineFile(["test3.txt"], header=["w1", "w2", "w3", "cnt123"]) # for debugging 11 | rawG = LineFile([GOOGLE_ENGLISH_DIR+x for x in os.listdir(GOOGLE_ENGLISH_DIR)], header=["w1", "w2", "w3", "cnt123"]) 12 | 13 | rawG.clean() # already done! 
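# Note: clean() (see ngrampy/LineFile.py) lowercases, drops lines containing non-letter
# characters or an unexpected column count, and strips underscore-prefixed tags.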
14 | rawG.restrict_vocabulary("w1 w2 w3", vocabulary) # in fields w1 and w2, restrict our vocabulary 15 | rawG.sort(keys="w1 w2 w3") # Since we collapsed case, etc. This could also be rawG.sort(keys=["w1","w2","w3"]) in the other format. 16 | rawG.resum_equal("w1 w2 w3", "cnt123" ) 17 | 18 | # Where we store all lines 19 | G = rawG.copy() 20 | 21 | # Now go through and compute what we want 22 | G1 = rawG.copy() # start with a copy 23 | G1.delete_columns( "w2 w3" ) # delete the columns we don't want 24 | G1.sort("w1" ) # sort this by the one we do want 25 | G1.resum_equal( "w1", "cnt123" ) # resum equal 26 | G1.rename_column("cnt123", "cnt1") # rename the column since its now a sum of 1 27 | G.sort("w1") # sort our target by w 28 | G.merge(G1, keys1="w1", tocopy="cnt1") # merge in 29 | G1.delete() # and delete this temporary 30 | 31 | G2 = rawG.copy() 32 | G2.delete_columns( "w1 w3" ) 33 | G2.sort("w2" ) 34 | G2.resum_equal( "w2", "cnt123" ) 35 | G2.rename_column("cnt123", "cnt2") 36 | G.sort("w2") 37 | G.merge(G2, keys1="w2", tocopy="cnt2") 38 | G2.delete() 39 | 40 | G3 = rawG.copy() 41 | G3.delete_columns( "w1 w2" ) 42 | G3.sort("w3") 43 | G3.resum_equal( "w3", "cnt123" ) 44 | G3.rename_column("cnt123", "cnt3") 45 | G.sort("w3") 46 | G.merge(G3, keys1="w3", tocopy="cnt3") 47 | G3.delete() 48 | 49 | G12 = rawG.copy() 50 | G12.delete_columns( ["w3"] ) 51 | G12.sort("w1 w2" ) 52 | G12.resum_equal( "w1 w2", "cnt123" ) 53 | G12.rename_column("cnt123", "cnt12") 54 | G.sort("w1 w2") # do this for merging 55 | G.merge(G12, keys1="w1 w2", tocopy=["cnt12"]) 56 | G12.delete() 57 | 58 | G23 = rawG.copy() 59 | G23.delete_columns( ["w1"] ) 60 | G23.sort("w2 w3" ) 61 | G23.resum_equal( "w2 w3", "cnt123" ) 62 | G23.rename_column("cnt123", "cnt23") 63 | G.sort("w2 w3") # do this for merging 64 | G.merge(G23, keys1="w2 w3", tocopy=["cnt23"]) 65 | G23.delete() 66 | 67 | # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 68 | # Now compute all the arithmetic, etc. 69 | # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 70 | 71 | #Make a colum: call it unigram, a function of three arguments, and give it w1,w2,w3 as arguments 72 | from math import log 73 | def log2(x): log(x,2.0) 74 | 75 | def logsum(*x): return str(round(sum(map(log,map(float,x))), 4)) # must take a string and return a string 76 | #def logcol(x) : return logsum([x]) 77 | G.make_column("unigram", logsum, "cnt1 cnt2 cnt3") 78 | G.make_column("bigram", logsum, "cnt12 cnt23") 79 | G.make_column("trigram", logsum, "cnt123") 80 | 81 | G.sort("unigram bigram trigram", dtype=float) 82 | 83 | ##G.cat() 84 | G.head() 85 | -------------------------------------------------------------------------------- /examples/TrigramMatches/find_matched_items.py: -------------------------------------------------------------------------------- 1 | """ 2 | This file is for constructing stimuli which are matched on bigram and unigram surprisal 3 | It requires you to have run compute_trigram_stats and output the result to /ssd/trigram-stats 4 | It also takes a bad word file to filter out bad words from our experimental stimuli 5 | """ 6 | 7 | from ngrampy.LineFile import * 8 | import os 9 | SUBSAMPLE_N = 15000 10 | tolerance = 0.001 11 | BAD_WORD_FILE = "badwords.txt" 12 | 13 | def check_tolerance(x,y): 14 | """ 15 | A handy function to check if some variables are within tolerance percent of each other 16 | """ 17 | return abs(x-y) / ((x+y)/2.) 
< tolerance 18 | 19 | # This will copy the file, make a new one, and then print out possible lines 20 | G = LineFile(files=["/ssd/trigram-stats"], path="/ssd/subsampled-stimuli", header="w1 w2 w3 c123 c1 c2 c3 c12 c23 unigram bigram trigram") 21 | 22 | # Now throw out the porno words 23 | porno_vocabulary = [ l.strip() for l in open(BAD_WORD_FILE, "r") ] 24 | G.restrict_vocabulary("w1 w2 w3", porno_vocabulary, invert=True) 25 | 26 | # and then subsample 27 | G.subsample_lines(N=SUBSAMPLE_N) 28 | 29 | # and make sure we are sorted for the below 30 | G.sort("unigram bigram trigram", dtype=float) 31 | G.head() # just a peek 32 | 33 | item_number = 0 34 | line_stack = [] 35 | for l in G.lines(tmp=False, parts=False): 36 | # extrac the columns from line 37 | unigram, bigram, trigram = G.extract_columns(l, keys="unigram bigram trigram", dtype=float) 38 | 39 | # now remove things which cannot possibly match anymore 40 | while len(line_stack) > 0 and not check_tolerance(unigram, G.extract_columns(line_stack[0], keys="unigram", dtype=float)[0]): 41 | del line_stack[0] 42 | 43 | # now go through the line_stack and try out each 44 | # it must already be within tolerance on unigram, or it would have been removed 45 | for x in line_stack: 46 | #print "Checking ", x 47 | x_unigram, x_bigram, x_trigram = G.extract_columns(x, keys="unigram bigram trigram", dtype=float) 48 | 49 | # it must have already been within tolerance on unigram or it would be removed 50 | assert( check_tolerance(unigram, x_unigram) ) 51 | 52 | # and check the bigrams 53 | if check_tolerance(bigram, x_bigram): 54 | print len(line_stack), item_number, l 55 | print len(line_stack), item_number, x 56 | item_number += 1 57 | 58 | # and add this on 59 | line_stack.append(l) 60 | -------------------------------------------------------------------------------- /examples/TrigramMatches/find_matched_items_aXb.py: -------------------------------------------------------------------------------- 1 | """ 2 | This file is for constructing stimuli pairs 3 | 4 | A X B <-> A Y B 5 | 6 | with the unigram and bigram stats matched 7 | 8 | It requires you to have run compute_trigram_stats and output the result to /ssd/trigram-stats 9 | It also takes a bad word file to filter out bad words from our experimental stimuli. 10 | 11 | This 12 | """ 13 | 14 | from ngrampy.LineFile import * 15 | import os 16 | SUBSAMPLE_N = 50000000 17 | tolerance = 0.01 18 | BAD_WORD_FILE = "badwords.txt" 19 | 20 | def check_tolerance(x,y): 21 | """ 22 | A handy function to check if some variables are within tolerance percent of each other 23 | """ 24 | return abs(x-y) / ((x+y)/2.) 
< tolerance 25 | 26 | # This will copy the file, make a new one, and then print out possible lines 27 | G = LineFile(files=["/ssd/trigram-stats"], path="/ssd/subsampled-stimuli", header="w1 w2 w3 c123 c1 c2 c3 c12 c23 unigram bigram trigram") 28 | 29 | # Now throw out the porno words 30 | #porno_vocabulary = [ l.strip() for l in open(BAD_WORD_FILE, "r") ] 31 | #G.restrict_vocabulary("w1 w2 w3", porno_vocabulary, invert=True) 32 | 33 | # draw a subsample 34 | #if SUBSAMPLE_N is not None: 35 | #G.subsample_lines(N=SUBSAMPLE_N) 36 | 37 | # we need to resort this so that we can have w1 and w3 equal and then all the n-grams matched 38 | G.sort("w1 w3 unigram bigram trigram", lines=1000000) 39 | G.head() 40 | 41 | item_number = 0 42 | line_stack = [] 43 | for l in G.lines(tmp=False, parts=False): 44 | # extract the columns from line 45 | w1, w3, unigram, bigram, trigram = G.extract_columns(l, keys="w1 w3 unigram bigram trigram", dtype=[str, str, float, float, float]) 46 | 47 | # now remove things which cannot possibly match anymore 48 | while len(line_stack) > 0: 49 | w1_, w3_, unigram_, bigram_, trigram = G.extract_columns(line_stack[0], keys="w1 w3 unigram bigram trigram", dtype=[str, str, float, float, float]) 50 | 51 | if not (w1_ == w1 and w3_ == w3 and check_tolerance(unigram, unigram_)): 52 | del line_stack[0] 53 | 54 | # now go through the line_stack and try out each 55 | # it must already be within tolerance on unigram, or it would have been removed 56 | for x in line_stack: 57 | w1_, w3_, unigram_, bigram_, trigram = G.extract_columns(x, keys="w1 w3 unigram bigram trigram", dtype=[str, str, float, float, float]) 58 | 59 | # it must have already been within tolerance on unigram or it would be removed 60 | assert( check_tolerance(unigram, unigram_) ) 61 | assert( w1_ == w1 and w3_ == w3 ) 62 | 63 | # and check the bigrams 64 | if check_tolerance(bigram, bigram_) and (w1==w1_ and w3==w3_): 65 | print len(line_stack), item_number, l 66 | print len(line_stack), item_number, x 67 | item_number += 1 68 | 69 | # and add this on 70 | line_stack.append(l) 71 | -------------------------------------------------------------------------------- /ngrampy/LineFile.py: -------------------------------------------------------------------------------- 1 | """ 2 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 3 | 4 | This class allows manipulation of google ngram data (and similar formatted) data. 5 | When you call functions on LineFiles, the changes are echoed in the file. 6 | 7 | The uses tab (\t) as the column separator. 8 | 9 | When you run this, if you get an encoding error, you may need to set the environment to 10 | 11 | export PYTHONIOENCODING=utf-8 12 | 13 | 14 | TODO: 15 | - Make this so each function call etc. will output what it did 16 | NOTE: 17 | - Column names cannot contain spaces. 18 | 19 | Licensed under GPL 3.0 20 | 21 | This program is free software: you can redistribute it and/or modify 22 | it under the terms of the GNU General Public License as published by 23 | the Free Software Foundation, either version 3 of the License, or 24 | (at your option) any later version. 25 | 26 | This program is distributed in the hope that it will be useful, 27 | but WITHOUT ANY WARRANTY; without even the implied warranty of 28 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 29 | GNU General Public License for more details. 30 | 31 | You should have received a copy of the GNU General Public License 32 | along with this program. If not, see . 
33 | 34 | 35 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 36 | """ 37 | from __future__ import division 38 | import os 39 | import sys 40 | import re 41 | import unicodedata 42 | import heapq 43 | import shutil 44 | import random 45 | import codecs 46 | import itertools 47 | from math import log 48 | from collections import Counter 49 | from copy import deepcopy 50 | 51 | # handly numpy with pypy 52 | try: 53 | import numpy 54 | except ImportError: 55 | try: 56 | import numpypy as numpy 57 | except ImportError: 58 | pass 59 | 60 | from debug import * 61 | from helpers import * 62 | 63 | # A temporary file like /tmp 64 | NGRAMPY_DEFAULT_PATH = "/tmp" #If no path is specified, we go here 65 | 66 | ECHO_SYSTEM = True # show the system calls we make? 67 | SORT_DEFAULT_LINES = 10000000 # how many lines to sorted at a time in RAM when we sort a large file? 68 | ENCODING = 'utf-8' 69 | 70 | # Set this so we can write stderr 71 | sys.stdout = codecs.getwriter(ENCODING)(sys.stdout) 72 | sys.stderr = codecs.getwriter(ENCODING)(sys.stderr) 73 | 74 | IO_BUFFER_SIZE = int(100e6) # approx size of input buffer 75 | 76 | COLUMN_SEPARATOR = u"\t" # must be passed to string.split() or else empty columns are collapsed! 77 | 78 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 79 | # # Class definition 80 | # # # # # # # # # # # # # # # # # s# # # # # # # # # # # # # # # # # # # # # # # # # # # 81 | 82 | class LineFile(object): 83 | 84 | def __init__(self, files, header=None, path=None, force_nocopy=False): 85 | """ 86 | Create a new file object with the specified header. It takes a list of files 87 | and cats them to path (overwriting it). A single file is acceptable. 88 | 89 | header - give each column a name (you can refer by name instead of number) 90 | path - where is this file stored? If None, we make a new temporary files 91 | force_nocopy - Primarily for debugging, this prevents us from copying a file and just uses the path as is 92 | you should pass files a list of length 1 which is the file, and path should be None 93 | as in, LineFile(files=["/ssd/trigram-stats"], header="w1 w2 w3 c123 c1 c2 c3 c12 c23 unigram bigram trigram", force_nocopy=True) 94 | """ 95 | if isinstance(files, str): 96 | files = [files] 97 | 98 | assert len(files) > 0, "*** Must provide non-empty list of files!" 
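        # If force_nocopy is set we operate on files[0] in place; otherwise every input file
        # is decompressed (.gz/.bz2/.xz) or cat'ed into a fresh file at self.path below.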
99 | 100 | if force_nocopy: 101 | assert(len(files) == 1) 102 | self.path = files[0] 103 | 104 | else: 105 | if path is None: 106 | self.path = NGRAMPY_DEFAULT_PATH+"/tmp" # used by get_new_path, since self.path is where we get the dir from 107 | self.path = self.get_new_path() # overwrite iwth a new file 108 | else: 109 | self.path = path 110 | 111 | # if it exists, let's just move it to a backup name 112 | if os.path.exists(self.path): 113 | systemcall("mv "+self.path+" "+self.path+".old") 114 | 115 | # and if we specified a bunch of input files 116 | for f in files: 117 | if f.endswith(".idf"): 118 | continue # skip the index files 119 | if f.endswith(".gz"): 120 | systemcall("gunzip -d -c "+f+" >> "+self.path) 121 | elif f.endswith(".bz2"): 122 | systemcall("bzip2 -d -c "+f+" >> "+self.path) 123 | elif f.endswith(".xz") or f.endswith(".lzma"): 124 | systemcall("xz -d -c "+f+" >> "+self.path) 125 | else: 126 | systemcall("cat "+f+" >> "+self.path) 127 | 128 | # just keep track 129 | self.files = files 130 | 131 | # and store some variables 132 | self.tmppath = self.path+".tmp" 133 | 134 | if isinstance(header, str): 135 | self.header = header.split(COLUMN_SEPARATOR) 136 | else: 137 | self.header = header 138 | 139 | self._lines = None 140 | self.preprocess() 141 | 142 | def preprocess(self): 143 | def fix_separators(line): 144 | return COLUMN_SEPARATOR.join(line.split()) 145 | self.map(fix_separators) 146 | 147 | def write(self, it, lazy=False): 148 | """ Write 149 | 150 | Write the lines in an iterable to the LineFile. 151 | 152 | If lazy, then delay actually evaluating the iterable and 153 | writing it to file. 154 | 155 | WARNING! If you specify lazy=True, then you can only read() 156 | those lines once! If you need to read lines more than once, 157 | you need to do lazy=False and write the lines to the file. 158 | 159 | Lazy iterators can be chained into efficient pipelines. 160 | 161 | """ 162 | if lazy: 163 | self._lines = it 164 | else: 165 | # Write lines to tmppath (note it used to be the other way around!) 166 | with codecs.open(self.tmppath, mode='w', encoding=ENCODING, 167 | errors='strict', buffering=IO_BUFFER_SIZE) as outfile: 168 | for item in it: 169 | print >>outfile, item 170 | 171 | # And move tmppath to path 172 | self.mv_from_tmp() 173 | 174 | def read(self): 175 | """ Read 176 | 177 | Return the current lines of the LineFile, whether from 178 | a file or from a lazy iterator. 179 | 180 | """ 181 | if self._lines is None: 182 | return codecs.open(self.path, mode='r', encoding=ENCODING, 183 | errors='strict', buffering=IO_BUFFER_SIZE) 184 | else: 185 | result = iter(self._lines) 186 | self._lines = None # only allow the lazy iterator to be read once!! 
187 | return result 188 | 189 | 190 | def setheader(self, *x): 191 | self.header = x 192 | 193 | def rename_column(self, x, v): 194 | self.header[self.to_column_number(x)] = v 195 | 196 | def to_column_number(self, x): 197 | """ 198 | Takes either: 199 | a column number - just echoes back 200 | a string - returns the right column number for the string 201 | a whitespace separated string - returns an array of column numbers 202 | an array - maps along and returns 203 | 204 | """ 205 | if isinstance(x, int): 206 | return x 207 | elif isinstance(x, list): 208 | return map(self.to_column_number, x) 209 | elif isinstance(x, str): 210 | if re_SPACE.search(x): # if spaces, treat it as an array and map 211 | return map(self.to_column_number, x.split(" ")) 212 | 213 | # otherwise, a single string so just find the header that equals it 214 | for i, item in enumerate(self.header): 215 | if item == x: 216 | return i 217 | 218 | print >>sys.stderr, "Invalid header name ["+x+"]", self.header 219 | exit(1) 220 | 221 | def delete_columns(self, cols, lazy=False): 222 | 223 | # make sure these are *decreasing* so we can delete in order 224 | cols = sorted(listifnot(self.to_column_number(cols)), reverse=True) 225 | 226 | def generate_deleted(lines): 227 | for parts in lines: 228 | for c in cols: 229 | del parts[c] 230 | yield "\t".join(parts) 231 | # and delete from the header, after deletion is complete 232 | if self.header is not None: 233 | for c in cols: 234 | del self.header[c] 235 | 236 | self.write(generate_deleted(self.lines(parts=True)), lazy=lazy) 237 | 238 | 239 | 240 | def copy(self, path=None): 241 | 242 | if path is None: 243 | path = self.get_new_path() # make a new path if its not specified 244 | 245 | # we can just copy the file by treating it as one of the "files" 246 | # and then use this new path, not the old one! 247 | return LineFile([self.path], header=deepcopy(self.header), path=path) 248 | 249 | def get_new_path(self): 250 | ind = 1 251 | while True: 252 | path = os.path.dirname(self.path)+"/ngrampy-"+str(ind) 253 | if not os.path.isfile(path): 254 | return path 255 | ind += 1 256 | 257 | def mv_tmp(self): 258 | """ 259 | Move myself to my temporary file, so that I can cat to my self.path 260 | """ 261 | #print "# >>", self.path, self.tmppath 262 | shutil.move(self.path, self.tmppath) 263 | 264 | def mv_from_tmp(self): 265 | """ 266 | Move myself from self.tmppath to self.path. 267 | """ 268 | shutil.move(self.tmppath, self.path) 269 | 270 | 271 | def rename(self, n): 272 | shutil.move(self.path, n) 273 | self.path = n 274 | 275 | def rm_tmp(self): 276 | """ 277 | Remove the temporary file 278 | """ 279 | os.remove(self.tmppath) 280 | 281 | def cp(self, f): 282 | shutil.cp(self.path, f) 283 | 284 | def extract_columns(self, line, keys, dtype=unicode): 285 | """ 286 | Extract some columns from a single line. Assumes that keys are numbers (e.g. already mapped through to_column_number) 287 | and will return the columns as the specified dtype 288 | NOTE: This always returns a list, even if one column is specified. This may change in the future 289 | 290 | e.g. 
line="a\tb\tc\td" 291 | keys=[1,4] 292 | gives: ["a", "b", "c", "d"], "b\td" 293 | """ 294 | if isinstance(keys, str): 295 | keys = listifnot(self.to_column_number(keys)) 296 | 297 | parts = line.split(COLUMN_SEPARATOR) 298 | 299 | if isinstance(dtype,list): 300 | return [ dtype[i](parts[x]) for i,x in enumerate(keys)] 301 | else: 302 | return [ dtype(parts[x]) for x in keys ] 303 | 304 | def filter(self, fn, lazy=False, verbose=False): 305 | """ Keep only lines where the function returns True. """ 306 | if verbose: 307 | def echo_wrapper(fn): 308 | def wrapper(x, **kwargs): 309 | result = fn(x, **kwargs) 310 | if not result: 311 | print >>sys.stderr, u"Tossed line due to %s:" % fn.__name__, x 312 | return result 313 | return wrapper 314 | fn = echo_wrapper(fn) 315 | 316 | filtered = itertools.ifilter(fn, self.lines()) 317 | self.write(filtered, lazy=lazy) 318 | 319 | def map(self, fn, lazy=False, verbose=False): 320 | """ Apply function to all lines. """ 321 | if verbose: 322 | def echo_wrapper(fn): 323 | def wrapper(x, **kwargs): 324 | result = fn(x, **kwargs) 325 | print >>sys.stderr, u"%s => %s" % (unicode(x), unicode(result)) 326 | return result 327 | return wrapper 328 | fn = echo_wrapper(fn) 329 | 330 | mapped = itertools.imap(fn, self.lines()) 331 | self.write(mapped, lazy=lazy) 332 | 333 | def clean(self, columns=None, lower=True, alphanumeric=True, count_columns=True, nounderscores=True, echo_toss=False, filter_fn=None, modifier_fn=None, lazy=False): 334 | """ 335 | This does several things: 336 | columns - how many cols should there be? If None, then we use the first line 337 | lower - convert to lowercase 338 | alphanumeric - toss lines with non-letter category characters (in unicode). WARNING: Tosses lines with "_" (e.g. syntactic tags in google) 339 | count_columns - if True, we throw out rows that don't have the same number of columns as the first line 340 | nounderscores - if True, we remove everything matching _[^\s]\s -> " " 341 | echo_toss - tell us who was removed 342 | filter_fn - User-provided boolean filtering function 343 | modifier_fn - User-provided function to modify the line (downcase etc) 344 | 345 | NOTE: filtering by alphanumeric allows underscores at the beginning of columns (as in google tags) 346 | NOTE: nounderscores may remove columns if there is a column for tags (e.g. a column with _adv) 347 | """ 348 | def filter_alphanumeric(line): 349 | collapsed = re_tagstartchar.sub("", line) # remove these so that tags don't cause us to toss lines. Must come before spaces removed 350 | collapsed = re_collapser.sub("", collapsed) 351 | collapsed = re_sentence_boundary.sub("", collapsed) 352 | char_categories = (unicodedata.category(k) for k in collapsed) 353 | return all(n == "Ll" or n == "Lu" for n in char_categories) 354 | 355 | def generate_filtered_columns(lines, columns=columns): 356 | for line in lines: 357 | cols = line.split(COLUMN_SEPARATOR) 358 | cn = len(cols) 359 | if columns is None: 360 | columns = cn # save the first line 361 | 362 | if not (columns != cn or any(not non_whitespace_matcher.search(ci) for ci in cols)): 363 | yield line 364 | elif echo_toss: 365 | print >>sys.stderr, "Tossed line with bad column count: %s" % line 366 | print >>sys.stderr, "Line has %d columns; I expected %d." % (cn, columns) 367 | 368 | # Filters. 
369 | if filter_fn: 370 | self.filter(filter_fn, lazy=True, verbose=echo_toss) 371 | if alphanumeric: 372 | self.filter(filter_alphanumeric, lazy=True, verbose=echo_toss) 373 | if count_columns: 374 | self.write(generate_filtered_columns(self.lines()), lazy=True) 375 | 376 | # Maps. 377 | if nounderscores: 378 | self.map(lambda line: re_underscore.sub("", line), lazy=True) 379 | if lower: 380 | self.map(lambda line: line.lower(), lazy=True) 381 | if modifier_fn: 382 | self.map(modifier_fn, lazy=True) 383 | 384 | if not lazy: 385 | self.write(self.lines()) 386 | 387 | def restrict_vocabulary(self, cols, vocabulary, invert=False, lazy=False): 388 | """ 389 | Make a new version where "cols" contain only words matching the vocabulary 390 | OR if invert=True, throw out anything matching cols 391 | """ 392 | 393 | cols = listifnot(self.to_column_number(cols)) 394 | 395 | vocabulary = set(vocabulary) 396 | 397 | def restrict(line, cols=cols, vocabulary=vocabulary): 398 | parts = line.split(COLUMN_SEPARATOR) 399 | for c in cols: 400 | if invert and parts[c] not in vocabulary: 401 | return l 402 | elif parts[c] in vocabulary: 403 | return l 404 | 405 | self.map(restrict, lazy=lazy) 406 | 407 | def make_marginal_column(self, newname, keys, sumkey, lazy=False): 408 | self.copy_column(newname, sumkey, lazy=True) 409 | self.sort(keys) 410 | self.resum_equal(keys, newname, keep_all=True, assert_sorted=False, lazy=lazy) 411 | 412 | def resum_equal(self, keys, sumkeys, assert_sorted=True, keep_all=False, lazy=False): 413 | """ 414 | Takes all rows which are equal on the keys and sums the sumkeys, overwriting them. 415 | Anything not in keys or sumkeys, there are only guarantees for if keep_all=True. 416 | """ 417 | keys = listifnot(self.to_column_number(keys)) 418 | sumkeys = listifnot(self.to_column_number(sumkeys)) 419 | 420 | if assert_sorted: 421 | self.assert_sorted(keys, allow_equal=True, lazy=True) 422 | 423 | def generate_resummed(groups): 424 | for compkey, lines in groups: 425 | if keep_all: 426 | lines = list(lines) # load into memory; otherwise we can only iterate through once 427 | sums = Counter() 428 | for parts in lines: 429 | for sumkey in sumkeys: 430 | try: 431 | sums[sumkey] += int(parts[sumkey]) 432 | except IndexError: 433 | print >>sys.stderr, "IndexError:", parts, sumkeys 434 | if keep_all: 435 | for parts in lines: 436 | for sumkey in sumkeys: 437 | parts[sumkey] = str(sums[sumkey]) 438 | yield "\t".join(parts) 439 | else: 440 | try: 441 | for sumkey in sumkeys: 442 | parts[sumkey] = str(sums[sumkey]) # "parts" is the last line 443 | except IndexError: 444 | print >>sys.stderr, "IndexError:", parts, sumkeys 445 | 446 | yield "\t".join(parts) 447 | 448 | groups = self.groupby(keys) 449 | self.write(generate_resummed(groups), lazy=lazy) 450 | 451 | def assert_sorted(self, keys, dtype=unicode, allow_equal=False, lazy=False): 452 | """ 453 | Assert that a file is sorted by certain columns 454 | This good for merging, etc., which optionally check requirements 455 | to be sorted 456 | 457 | """ 458 | def gen_assert_sorted(lines, keys=keys): 459 | """ yield lines while asserted their sortedness """ 460 | keys = self.to_column_number(keys) 461 | prev_sortkey = None 462 | for line in lines: 463 | line = line.strip() 464 | yield line # yield all line and check afterwards 465 | sortkey = self.extract_columns(line, keys=keys, dtype=dtype) 466 | 467 | if prev_sortkey is not None: 468 | if allow_equal: 469 | myassert(prev_sortkey <= sortkey, line+";"+unicode(prev_sortkey)+";"+unicode(sortkey)) 
470 | else: 471 | myassert(prev_sortkey < sortkey, line+";"+unicode(prev_sortkey)+";"+unicode(sortkey)) 472 | 473 | prev_sortkey = sortkey 474 | 475 | self.write(gen_assert_sorted(self.lines()), lazy=lazy) 476 | 477 | def cat(self): 478 | systemcall("cat "+self.path) 479 | 480 | def head(self, n=10): 481 | print self.header 482 | lines = self.lines() 483 | for _ in xrange(n): 484 | print next(lines) 485 | 486 | def delete(self): 487 | try: 488 | os.remove(self.path) 489 | except OSError: 490 | pass 491 | try: 492 | os.remove(self.tmppath) 493 | except OSError: 494 | pass # no temporary file exists 495 | 496 | def delete_tmp(self): 497 | print >>sys.stderr, "*** delete_tmp now phased out! Please remove from code!" 498 | #os.remove(self.tmppath) 499 | 500 | def copy_column(self, newname, key, lazy=False): 501 | """ Copy a column. """ 502 | key = self.to_column_number(key) 503 | 504 | def generate_new_col(lines): 505 | for line in lines: 506 | parts = line.split(COLUMN_SEPARATOR) 507 | yield "\t".join([line, parts[key]]) 508 | self.header.extend(listifnot(newname)) 509 | 510 | self.write(generate_new_col(self.lines()), lazy=lazy) 511 | 512 | def make_column(self, newname, function, keys, lazy=False): 513 | """ 514 | Make a new column as some function of the other rows 515 | make_column("unigram", lambda x,y: int(x)+int(y), "cnt1 cnt2") 516 | will make a column called "unigram" that is the sum of cnt1 cnt2 517 | 518 | NOTE: The function MUST take strings and return strings, or else we die 519 | 520 | newname - the name for the new column. You can pass multiple if function returns tab-sep strings 521 | function - a function of other row arguments. Must return strings 522 | args - column names to get the arguments 523 | """ 524 | keys = listifnot( self.to_column_number(keys) ) 525 | 526 | def generate_new_col(lines): 527 | for line in lines: 528 | parts = line.split(COLUMN_SEPARATOR) 529 | yield "\t".join([line, function(*[parts[i] for i in keys])]) 530 | self.header.extend(listifnot(newname)) 531 | 532 | self.write(generate_new_col(self.lines()), lazy=lazy) 533 | 534 | def sort(self, keys, num_lines=SORT_DEFAULT_LINES, dtype=unicode, reverse=False): 535 | """ 536 | Sort me by my keys. this breaks the file up into subfiles of "lines", sorts them in RAM, 537 | and the mergesorts them 538 | 539 | We could use unix "sort" but that gives weirdness sometimes, and doesn't handle our keys 540 | as nicely, since it treats spaces in a counterintuitive way 541 | 542 | dtype - the type of the data to be sorted. Should be a castable python type 543 | e.g. str, int, float 544 | """ 545 | sorted_tmp_files = [] # a list of generators, yielding each line of the file 546 | 547 | keys = listifnot(self.to_column_number(keys)) 548 | 549 | # a generator to hand back lines of a file and keys for sorting 550 | def yield_lines(f): 551 | with codecs.open(f, "r", encoding=ENCODING) as infile: 552 | for l in infile: 553 | yield get_sort_key(l.strip()) 554 | 555 | # Map a line to sort keys (e.g. 
respecting dtype, etc); 556 | # we use the fact that python will sort arrays (yay) 557 | def get_sort_key(l): 558 | sort_key = self.extract_columns(l, keys=keys, dtype=dtype) 559 | sort_key.append(l) # the second element is the line 560 | return sort_key 561 | 562 | temp_id = 0 563 | for chunk in chunks(self.lines(), num_lines): 564 | sorted_tmp_path = self.path+".sorted."+str(temp_id) 565 | with codecs.open(sorted_tmp_path, 'w', encoding=ENCODING) as outfile: 566 | print >>outfile, "\n".join(sorted(chunk, key=get_sort_key)) 567 | sorted_tmp_files.append(sorted_tmp_path) 568 | temp_id += 1 569 | 570 | # okay now go through and merge sort -- use this cool heapq merging trick! 571 | def merge_sort(): 572 | for x in heapq.merge(*map(yield_lines, sorted_tmp_files)): 573 | yield x[-1] # the last item is the line itself, everything else is sort keys 574 | 575 | self.write(merge_sort()) 576 | 577 | # clean up 578 | for f in sorted_tmp_files: 579 | os.remove(f) 580 | 581 | def merge(self, other, keys1, tocopy, keys2=None, newheader=None, assert_sorted=True): 582 | """ 583 | Copy lines of other that match on keys onto self 584 | 585 | other - a LineFile object -- who to merge in 586 | keys1 - the keys of self for merging 587 | keys2 - the keys of other for merging. If not specified, we assume they are the same as keys1 588 | newheader - If specified, gives the names for the *new* columns 589 | assert_sorted - make False if you don't want an extra check on sorting (things can go very bad) 590 | 591 | NOTE: This assumes that every line of self occurs in other, but not vice-versa. It 592 | also allows multiples in self, but *not* other 593 | """ 594 | # fix up the keys 595 | # Note: Keys2 must be processed first here so we can specify by names, 596 | # and not have keys1 overwritten when they are mapped to numbers 597 | keys2 = listifnot(other.to_column_number(keys1 if keys2 is None else keys2)) 598 | tocopy = listifnot(other.to_column_number(tocopy)) 599 | keys1 = listifnot(self.to_column_number(keys1)) 600 | 601 | # this only works if we are sorted -- let's assert 602 | if assert_sorted: 603 | self.assert_sorted(keys1, allow_equal=True, lazy=True) # we can have repeat lines 604 | other.assert_sorted(keys2, allow_equal=False, lazy=True) # we cannot have repeat lines (how would they be mapped?) 605 | 606 | in1 = self.lines() 607 | in2 = other.lines() 608 | 609 | line1, parts1, key1 = read_and_parse(in1, keys=keys1) 610 | line2, parts2, key2 = read_and_parse(in2, keys=keys2) 611 | 612 | def generate_merged(in1, in2): 613 | while True: 614 | if key1 == key2: 615 | yield line1+"\t"+"\t".join(self.extract_columns(line2, keys=tocopy)) 616 | 617 | line1, parts1, key1 = read_and_parse(in1, keys=keys1) 618 | if not line1: 619 | break 620 | else: 621 | line2, parts2, key2 = read_and_parse(in2, keys=keys2) 622 | if not line2: # okay there is no match for line1 anywhere 623 | print >>sys.stderr, "** Error in merge: end of line2 before end of line 1:" 624 | print >>sys.stderr, "\t", line1 625 | print >>sys.stderr, "\t", line2 626 | exit(1) 627 | self.header.extend([other.header[i] for i in tocopy ]) # copy the header names from other 628 | 629 | self.write(generate_merged(in1, in2)) 630 | 631 | def print_conditional_entropy(self, W, cntXgW, downsample=10000, assert_sorted=True, pre="", preh="", header=True): 632 | """ 633 | Print the entropy H[X | W] for each W, assuming sorted by W. 
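        Concretely, for each w this reports H[X|W=w] = -sum_x p(x|w) * log2 p(x|w), where p(x|w)
        comes from normalizing the cntXgW counts within the group for w (see c2H in helpers.py).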
634 | Here, P(X|W) is given by unnormalized cntXgW 635 | Also prints the total frequency 636 | downsample - also prints the downsampled measures, where we only have downsample counts total. An attempt to correct H bias 637 | """ 638 | if assert_sorted: 639 | self.assert_sorted(listifnot(W), allow_equal=True, lazy=True) # allow W to be true 640 | 641 | 642 | W = self.to_column_number(W) 643 | assert not isinstance(W,list) 644 | #Xcol = self.to_column_number(X) 645 | #assert not isinstance(X,list) 646 | 647 | cntXgW = self.to_column_number(cntXgW) 648 | assert not isinstance(cntXgW, list) 649 | 650 | prevW = None 651 | if header: print preh+"Word\tFrequency\tContextCount\tContextEntropy\tContextEntropy2\tContextEntropy5\tContextEntropy10\tContextEntropy%i\tContextCount%i" % (downsample, downsample) 652 | for w, lines in self.groupby(W): 653 | w = w[0] # w comes out as ("hello",) 654 | wcounts = np.array([float(parts[cntXgW]) for parts in lines]) 655 | sumcount = sum(wcounts) 656 | dp = numpy.sort(numpy.random.multinomial(downsample, wcounts / sumcount)) # sort so we can take top on next line 657 | tp2, tp5, tp10 = dp[-2:], dp[-5:], dp[-10:] 658 | print pre, w, "\t", sumcount, "\t", len(wcounts), "\t", c2H(wcounts), "\t", c2H(tp2), "\t", c2H(tp5), "\t", c2H(tp10), "\t", c2H(dp), "\t", numpy.sum(dp>0) 659 | 660 | 661 | def average_surprisal(self, W, CWcnt, Ccnt, transcribe_fn=None, assert_sorted=True): 662 | """ 663 | Compute the average in-context surprisal, as in Piantadosi, Tily Gibson (2011). 664 | Yield output for each word. 665 | 666 | - W - column for the word 667 | - CWcnt - column for the count of context-word 668 | - Ccnt - column for the count of the context 669 | - transcribe_fn (optional) - transcription to do before measuring word length 670 | i.e. convert word to IPA, convert Chinese characters to pinyin, etc. 
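        For each word w, the reported surprisal is the frequency-weighted average over its contexts c,
            -sum_c cnt(c,w) * ( log2 cnt(c,w) - log2 cnt(c) ) / sum_c cnt(c,w),
        along with the word's length, its log2 total frequency, and its number of distinct contexts.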
671 | 672 | """ 673 | 674 | W = self.to_column_number(W) 675 | assert(not isinstance(W,list)) 676 | CWcnt = self.to_column_number(CWcnt) 677 | assert(not isinstance(CWcnt,list)) 678 | Ccnt = self.to_column_number(Ccnt) 679 | assert(not isinstance(Ccnt,list)) 680 | 681 | if assert_sorted: 682 | self.assert_sorted(listifnot(W), allow_equal=True, lazy=True) 683 | 684 | for word, lines in self.groupby(W): 685 | word = word[0] # word comes out as (word,) 686 | if transcribe_fn: 687 | word = transcribe_fn(word) 688 | sum_surprisal = 0 689 | total_word_frequency = 0 690 | total_context_count = 0 691 | for parts in lines: 692 | cwcnt = int(parts[CWcnt]) 693 | ccnt = int(parts[Ccnt]) 694 | sum_surprisal -= (log2(cwcnt) - log2(ccnt)) * cwcnt 695 | total_word_frequency += cwcnt 696 | total_context_count += 1 697 | length = len(word) 698 | yield u'"%s"'%word, length, sum_surprisal/total_word_frequency, log2(total_word_frequency), total_context_count 699 | 700 | def print_average_surprisal(self, W, CWcnt, Ccnt, transcribe_fn=None, assert_sorted=True): 701 | print "Word\tOrthographic.Length\tSurprisal\tLog.Frequency\tTotal.Context.Count" 702 | for line in self.average_surprisal(W, CWcnt, Ccnt, 703 | transcribe_fn=transcribe_fn, assert_sorted=assert_sorted): 704 | print u"\t".join(map(unicode, line)) 705 | 706 | ################################################################################################# 707 | # Iterators 708 | 709 | def lines(self, parts=False): 710 | """ 711 | Yield me a stripped version of each line of tmplines 712 | 713 | - parts - if true, we return an array that is split on tabs 714 | 715 | """ 716 | if parts: 717 | return (line.strip().split(COLUMN_SEPARATOR) for line in self.read()) 718 | else: 719 | return (line.strip() for line in self.read()) 720 | 721 | def groupby(self, keys): 722 | """ 723 | A groupby iterator matching the given keys. 724 | 725 | """ 726 | keys = listifnot(self.to_column_number(keys)) 727 | key_fn = lambda parts: tuple(parts[x] for x in keys) 728 | return itertools.groupby(self.lines(parts=True), key_fn) 729 | 730 | def __len__(self): 731 | """ 732 | How many total lines? 733 | """ 734 | return sum(1 for _ in self.read()) 735 | 736 | def subsample_lines(self, N=1000000): 737 | """ 738 | Make me a smaller copy of myself by randomly subsampling *lines* 739 | not according to counts. This is useful for creating a temporary 740 | file 741 | NOTE: N must fit into memory 742 | """ 743 | 744 | # We'll use a reservoir sampling algorithm 745 | sample = [] 746 | 747 | for idx, line in enumerate(self.lines()): 748 | if idx < N: 749 | sample.append(line) 750 | else: 751 | r = random.randrange(idx+1) 752 | if r < N: sample[r] = line 753 | 754 | # now output the sample 755 | self.write(sample) 756 | 757 | def sum_column(self, col, cast=int): 758 | col = self.to_column_number(col) 759 | return sum(cast(parts[col]) for parts in self.lines(parts=True)) 760 | 761 | def downsample_tokens(self, N, ccol, keep_zero_counts=False): 762 | """ 763 | Subsample myself via counts with the existing probability distribution. 764 | - N - the total sample size we end up with. 765 | - ccol - the column we use to estimate probabilities. Unnormalized, non-log probs (e.g. counts) 766 | 767 | NOTE: this assumes a multinomial on trigrams, which may not be accurate. If you started from a corpus, this will NOT in general keep 768 | counts consistent with a corpus. 769 | 770 | This uses a conditional beta distribution, once for each line for a total of N. 
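        In the code this is realized sequentially: each line's new count is drawn as
        Binomial(N_remaining, cnt / Z_remaining), after which N_remaining and Z_remaining are
        decremented; chained over all lines this yields a multinomial sample of size N over the rows.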
771 | See pg 12 of w3.jouy.inra.fr/unites/miaj/public/nosdoc/rap2012-5.pdf 772 | """ 773 | self.header.extend(ccol) 774 | ccol = self.to_column_number(ccol) 775 | 776 | Z = self.sum_column(ccol) 777 | 778 | def generate_downsampled(lines): 779 | for parts in lines: 780 | 781 | cnt = int(parts[ccol]) 782 | 783 | # Randomly sample 784 | if N > 0: 785 | newcnt = numpy.random.binomial(N,float(cnt)/float(Z)) 786 | else: 787 | newcnt = 0 788 | 789 | # Update the conditional multinomial 790 | N = N-newcnt # samples to draw 791 | Z = Z-cnt # normalizer for everything else 792 | 793 | parts[ccol] = str(newcnt) # update this 794 | 795 | if keep_zero_counts or newcnt > 0: 796 | yield '\t'.join(parts) 797 | 798 | self.write(generate_downsampled(self.lines(parts=True))) 799 | -------------------------------------------------------------------------------- /ngrampy/LineFile.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/piantado/ngrampy/792b25e3293f06ac9561a3c02bfaad22d6149d9a/ngrampy/LineFile.pyc -------------------------------------------------------------------------------- /ngrampy/LineFileInMemory.py: -------------------------------------------------------------------------------- 1 | """ 2 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 3 | 4 | This file contains a drop-in replacement for LineFile for in-memory operations. 5 | It mocks the interface of LineFile, including file-related arguments, 6 | but performs all operations in memory. 7 | 8 | This is not the most efficient way to do this in-memory, but it provides 9 | compability with scripts written for the on-disk version. 10 | 11 | Richard Futrell, 2013 12 | 13 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 14 | """ 15 | from __future__ import division 16 | import os 17 | import sys 18 | import re 19 | import unicodedata 20 | import heapq 21 | import shutil 22 | import random 23 | import codecs # for writing utf-8 24 | import itertools 25 | from math import log 26 | from collections import Counter 27 | from copy import deepcopy 28 | 29 | try: 30 | import numpy 31 | except ImportError: 32 | try: 33 | import numpypy as numpy 34 | except ImportError: 35 | pass 36 | 37 | from debug import * 38 | from helpers import * 39 | import filehandling as fh 40 | from LineFile import LineFile 41 | 42 | ENCODING = 'utf-8' 43 | CLEAN_TMP = False 44 | SORT_DEFAULT_LINES = None 45 | 46 | # Set this so we can write stderr 47 | #sys.stdout = codecs.getwriter(ENCODING)(sys.stdout) 48 | #sys.stderr = codecs.getwriter(ENCODING)(sys.stderr) 49 | 50 | IO_BUFFER_SIZE = int(100e6) # approx size of input buffer 51 | 52 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 53 | # # Class definition 54 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 55 | 56 | class LineFileInMemory(LineFile): 57 | 58 | def __init__(self, files, header=None, path=None, force_nocopy=False): 59 | """ 60 | Create a new file object with the specified header. It takes a list of files 61 | and reads them into memory as a list of lines. 
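        Since the interface mirrors LineFile, an existing script can usually just swap the
        constructor, e.g. G = LineFileInMemory(files, header="w1 w2 cnt12") in place of
        LineFile(files, header="w1 w2 cnt12"), with everything else unchanged.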
62 | 63 | header - give each column a name (you can refer by name instead of number) 64 | path - does nothing, for compatibility with LineFile 65 | force_nocopy - does nothing, for compatibility with LineFile 66 | """ 67 | if isinstance(files, str): 68 | files = [files] 69 | 70 | self.path = None 71 | self.tmppath = None 72 | 73 | # load the files into memory 74 | self._lines = [] 75 | for f in files: 76 | if f.endswith(".idf"): 77 | continue # skip the index files 78 | else: 79 | with fh.open(f, encoding=ENCODING) as infile: 80 | self._lines.extend(infile) 81 | 82 | # just keep track 83 | self.files = files 84 | 85 | if isinstance(header, str): 86 | self.header = header.split() 87 | else: 88 | self.header = header 89 | 90 | self.preprocess() 91 | 92 | def write(self, it, lazy=False): 93 | if lazy: 94 | self._lines = it 95 | else: 96 | self._lines = list(it) 97 | 98 | def read(self): 99 | return iter(self._lines) 100 | 101 | def copy(self, path=None): 102 | return deepcopy(self) 103 | 104 | def sort(self, keys, lines=None, dtype=unicode, reverse=False): 105 | """ 106 | Sort me by my keys. 107 | 108 | dtype - the type of the data to be sorted. Should be a castable python type 109 | e.g. str, int, float 110 | """ 111 | keys = listifnot(self.to_column_number(keys)) 112 | 113 | # Map a line to sort keys (e.g. respecting dtype, etc) ; we use the fact that python will sort arrays (yay) 114 | def get_sort_key(l): 115 | sort_key = self.extract_columns(l, keys=keys, dtype=dtype) # extract_columns gives them back tab-sep, but we need to cast them 116 | sort_key.append(l) # the second element is the line 117 | return sort_key 118 | 119 | self.write(sorted(self.lines(), key=get_sort_key)) 120 | 121 | 122 | def __len__(self): 123 | """ 124 | How many total lines? 125 | """ 126 | return len(list(self._lines)) 127 | 128 | -------------------------------------------------------------------------------- /ngrampy/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/piantado/ngrampy/792b25e3293f06ac9561a3c02bfaad22d6149d9a/ngrampy/__init__.py -------------------------------------------------------------------------------- /ngrampy/__init__.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/piantado/ngrampy/792b25e3293f06ac9561a3c02bfaad22d6149d9a/ngrampy/__init__.pyc -------------------------------------------------------------------------------- /ngrampy/debug.py: -------------------------------------------------------------------------------- 1 | """ debug 2 | 3 | Utilities for debugging. 
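A rough sketch of how the two helpers defined below are meant to be used (illustrative only):

    y = tap(x)      # print x to stdout and pass it through, handy inside expressions
    @log_calls      # decorator: print each call's bound arguments to stderr before running it
    def f(a, b): ...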
4 | Many ideas taken from funcy, http://github.com/Suor/funcy 5 | 6 | """ 7 | from __future__ import print_function 8 | import sys 9 | import inspect 10 | 11 | def tap(x, end="\n", file=sys.stdout): 12 | print(x, end=end, file=file) 13 | return x 14 | 15 | def log_calls(fn): 16 | def _fn(*args, **kwargs): 17 | binding = inspect.getcallargs(fn, *args, **kwargs) 18 | binding_str = ", ".join("%s=%s" % item for item in binding.iteritems()) 19 | signature = fn.__name__ + "(%s)" % binding_str 20 | print(signature, file=sys.stderr) 21 | return fn(*args, **kwargs) 22 | return _fn 23 | 24 | def myassert(tf, s): 25 | if not tf: 26 | print >>sys.stderr, "*** Assertion fail: ",s 27 | assert tf 28 | -------------------------------------------------------------------------------- /ngrampy/helpers.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | import re 4 | from math import log 5 | import subprocess 6 | from itertools import * 7 | 8 | ECHO_SYSTEM = False 9 | 10 | # Some friendly Regexes. May need to change encoding here for other encodings? 11 | re_SPACE = re.compile(r"\s", re.UNICODE) # for splitting on spaces, etc. 12 | re_underscore = re.compile(r"_[A-Za-z\-\_]+", re.UNICODE) # for filtering out numbers and whitespace 13 | re_collapser = re.compile(r"[\d\s]", re.UNICODE) # for filtering out numbers and whitespace 14 | re_sentence_boundary = re.compile(r"", re.UNICODE) 15 | re_tagstartchar = re.compile(r"(\s|^)_", re.UNICODE) # underscores may be okay at the start of words 16 | non_whitespace_matcher = re.compile(r"[^\s]", re.UNICODE) 17 | 18 | PRINT_LOG = False # should we log each action? 19 | 20 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 21 | # # Some helpful functions 22 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 23 | 24 | def printlog(x): 25 | if PRINT_LOG: 26 | print >>sys.stderr, x 27 | 28 | def read_and_parse(inn, keys): 29 | """ 30 | Read a line and parse it by tabs, returning the line, the tab parts, and some columns 31 | """ 32 | line = next(inn).strip() 33 | if not line: 34 | return line, None, None 35 | else: 36 | parts = line.split() 37 | return line, parts, "\t".join([parts[x] for x in keys]) 38 | 39 | def systemcall(cmd, echo=ECHO_SYSTEM): 40 | if echo: 41 | print >>sys.stderr, cmd 42 | p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=None) 43 | output, _ = p.communicate() 44 | return output 45 | 46 | def ifelse(x, y, z): 47 | return y if x else z 48 | 49 | def listifnot(x): 50 | return x if isinstance(x, list) else [x] 51 | 52 | def log2(x): 53 | return log(x,2.) 54 | 55 | def c2H(counts): 56 | """ Normalize counts and compute entropy 57 | 58 | Counts can be a generator. 59 | Doesn't depend on numpy. 60 | 61 | """ 62 | total = 0.0 63 | clogc = 0.0 64 | for c in counts: 65 | total += c 66 | clogc += c * log(c) 67 | return -(clogc/total - log(total)) / log(2) 68 | 69 | def chunks(iterable, size): 70 | """ Chunks 71 | 72 | Break an iterable into chunks of specified size. 73 | 74 | Params: 75 | iterable: An iterable 76 | size: An integer size. 77 | 78 | Yields: 79 | Tuples of size less than or equal to n, chunks of the input iterable. 
80 | 81 | Examples: 82 | >>> lst = ['foo', 'bar', 'baz', 'qux', 'zim', 'cat', 'dog'] 83 | >>> list(chunks(lst, 3)) 84 | [('foo', 'bar', 'baz'), ('qux', 'zim', 'cat'), ('dog',)] 85 | 86 | """ 87 | it = iter(iterable) 88 | while True: 89 | chunk = islice(it, None, size) 90 | probe = next(chunk) # raises StopIteration if nothing's there 91 | yield chain([probe], chunk) 92 | -------------------------------------------------------------------------------- /process-google.py: -------------------------------------------------------------------------------- 1 | """ 2 | 3 | Note: This appears to be quicker than using "import gzip", even if we copy 4 | the gziped file to a SSD before reading. It's also faster than trying to change buffering on stdin 5 | 6 | pypy is MUCH faster 7 | 8 | TODO: 9 | - Clean this up to make play nicer with Tags. We can make it give NAs to nonexistent words! 10 | - Make this handle unicode correctly -- is it even wrong? 11 | Changes: 12 | - Sep 20 2013: this now outputs the tags as their own column, with NA for missing tags 13 | """ 14 | 15 | import os 16 | import re 17 | import sys 18 | import itertools 19 | import codecs 20 | import gzip 21 | import glob 22 | 23 | import argparse 24 | parser = argparse.ArgumentParser(description='Process google ngrams into year bins') 25 | parser.add_argument('--in', dest='in', type=str, default=None, nargs="?", help='The file name for input') 26 | parser.add_argument('--out', dest='out', type=str, default="/tmp/", nargs="?", help='The file name for output (year appended)') 27 | parser.add_argument('--year-bin', dest='year-bin', type=int, default=10, nargs="?", help='How to bin the years') 28 | parser.add_argument('--quiet', dest='quiet', default=False, action="store_true", help='Output tossed lines?') 29 | parser.add_argument('--N', dest='N', default=3, nargs="?", help="Order of the ngram") 30 | args = vars(parser.parse_args()) 31 | 32 | YEAR_BIN = int(args['year-bin']) 33 | BUFSIZE = int(1e6) # We can allow huge buffers if we want... 34 | ENCODING = 'utf-8' 35 | LINE_N = int(args['N'])+3 # three extra columns 36 | 37 | prev_year,prev_ngram = None, None 38 | count = 0 39 | 40 | year2file = dict() 41 | part_count = None 42 | 43 | # python is not much slower than perl if we pre-compile regexes 44 | 45 | #cleanup = re.compile(r"(_[A-Za-z\_\-]+)|(\")") # The old way -- delete tags and quotes 46 | line_splitter = re.compile(r"\n", re.U) 47 | cleanup_quotes = re.compile(r"(\")", re.U) # kill quotes 48 | #column_splitter = re.compile(r"[\s]", re.U) # split on tabs OR spaces, since some of google seems to use one or the other. 
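# A worked sketch of what the main loop below does to one line (the values are
# invented for illustration, not taken from the corpus): with --N=2, an input
# line like
#
#     the_DET man_NOUN    1987    42    7
#
# has LINE_N = 5 whitespace-separated parts. The volume count (7) is dropped,
# the count is 42, the year 1987 is binned to 1980 under the default
# --year-bin=10, tagify() splits each token into word and tag, and consecutive
# identical (ngram, binned-year) rows have their counts summed before being
# written to the per-year output file as "the<TAB>DET<TAB>man<TAB>NOUN<TAB><count>".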
49 | 
50 | tag_match = re.compile(r"^(.+?)(_[A-Z\_\-\.\,\;\:]+)?$", re.U) # match a tag at the end of a word (assumes tags are uppercase and attached with an underscore)
51 | def tagify(x):
52 | """
53 | Take a word with a tag ("man_NOUN") and give back ("man","NOUN") with "NA" if the tag or word is not there
54 | """
55 | m = tag_match.match(x)
56 | if m:
57 | g = m.groups()
58 | 
59 | word = (g[0] if g[0] is not None else "NA")
60 | tag = (g[1] if g[1] is not None else "NA")
61 | return (word,tag)
62 | #if g[1] is None: return (g[0], "NA")
63 | #else: return g
64 | else: return []
65 | 
66 | def chain(args):
67 | a = []
68 | for x in args: a.extend(x)
69 | return a
70 | 
71 | 
72 | for f in glob.glob(args['in']):
73 | 
74 | # Unzip and encode
75 | inputfile = gzip.open(f, 'r')
76 | for l in inputfile:
77 | #l = l.decode(ENCODING)
78 | 
79 | l = l.strip() ## strip the trailing newline and surrounding whitespace
80 | l = cleanup_quotes.sub("", l) # remove quotes
81 | 
82 | #print >>sys.stderr, l
83 | 
84 | #parts = column_splitter.split(l)
85 | parts = l.split() # by default this splits on whitespace, which is much friendlier with unicode
86 | 
87 | # Our check on the number of parts -- we require this to be passed in (otherwise it's hard to parse)
88 | if len(parts) != LINE_N:
89 | if not args['quiet']: print "Wrong number of items on line: skipping ", l, parts, " IN FILE ", f
90 | continue # skip this line if it's garbage NOTE: this may mess up with some unicode chars?
91 | #print parts
92 | # parts[-1] is the number of books -- ignored here
93 | c = int(parts[-2]) # the count
94 | year = int(int(parts[-3]) / YEAR_BIN) * YEAR_BIN # round the year down to its bin
95 | ngram_ar = chain(map(tagify,parts[0:-3]))
96 | #print ngram_ar
97 | #if all([x != "NA" for x in ngram_ar]): # Chuck lines that don't have all things tagged
98 | #else: continue
99 | ngram = "\t".join(chain(map(tagify,parts[0:-3]))) # join everything else, including the tags separated out
100 | 
101 | # output if different
102 | if year != prev_year or ngram != prev_ngram:
103 | 
104 | if prev_year is not None:
105 | if prev_year_s not in year2file:
106 | year2file[prev_year_s] = open(args['out']+".%i"%prev_year, 'w', BUFSIZE)
107 | year2file[prev_year_s].write( "%s\t%i\n" % (prev_ngram,count) ) # write to the year file TODO: this may need proper unicode handling?
108 | 
109 | prev_ngram = ngram
110 | prev_year = year
111 | prev_year_s = str(prev_year)
112 | count = c
113 | else:
114 | count += c
115 | 
116 | # And write the last accumulated line if we didn't already!
117 | if year == prev_year and ngram == prev_ngram:
118 | if prev_year_s not in year2file:
119 | year2file[prev_year_s] = open(args['out']+".%i"%prev_year, 'w', BUFSIZE)
120 | year2file[prev_year_s].write( "%s\t%i\n" % (prev_ngram,count) ) # write to the year file TODO: this may need proper unicode handling?
121 | 
122 | inputfile.close()
123 | 
124 | # And close everything
125 | for year in year2file.keys():
126 | year2file[year].close()
127 | 
128 | 
129 | 
130 | 
--------------------------------------------------------------------------------
/process-initial.sh:
--------------------------------------------------------------------------------
1 | 
2 | 
3 | # A script to initially process google data. This significantly speeds up everything later.
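# Directory layout assumed by the loop below (DIR is the author's path; adjust it
# for your machine):
#   $DIR/$L/$N/*            -- raw gzipped google files for language $L, ngram order $N
#   $DIR/Processed/$L/$N/   -- output of process-google.py: processed-google.<year>, one file per year bin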
4 | 5 | DIR=/home/piantado/Corpus/GoogleBooks/ 6 | #for L in eng-us-all fre-all heb-all ita-all rus-all spa-all ger-all; do 7 | for L in eng-us-all fre-all heb-all; do 8 | for N in 1 2 3; do 9 | myDIR=$DIR/Processed/$L/$N 10 | 11 | mkdir $DIR/Processed/$L 12 | mkdir $myDIR 13 | 14 | pypy process-google.py --in=$DIR/$L/$N/* --out=$myDIR/processed-google --N=$N --quiet & ## TODO: IF YOU CHANGE N, CHANGE THE 15 | done 16 | done 17 | 18 | -------------------------------------------------------------------------------- /tests/ngrampy_tests.py: -------------------------------------------------------------------------------- 1 | import os, os.path 2 | import random 3 | import math 4 | 5 | from nose.tools import * 6 | 7 | from ngrampy.LineFile import LineFile 8 | from ngrampy.LineFileInMemory import LineFileInMemory 9 | 10 | try: 11 | os.mkdir("tests/tmp") 12 | except OSError: 13 | pass 14 | 15 | 16 | def test_basics(): 17 | G = LineFile("tests/smallcorpus.txt.bz2", header="foo bar baz qux".split(), 18 | path="tests/tmp/testcorpus") 19 | assert_equal(G.header, "foo bar baz qux".split()) 20 | assert_equal(G.files, ["tests/smallcorpus.txt.bz2"]) 21 | assert_equal(G.path, "tests/tmp/testcorpus") 22 | assert_equal(G.tmppath, "tests/tmp/testcorpus.tmp") 23 | assert_equal(os.path.isfile("tests/tmp/testcorpus"), True) 24 | 25 | G_copy = G.copy() 26 | copy_path = G_copy.path 27 | assert_not_equal(copy_path, G.path) 28 | 29 | G_copy.mv_tmp() 30 | assert_equal(os.path.isfile(G_copy.path + ".tmp"), True) 31 | #G_copy.delete_tmp() 32 | #assert_equal(os.path.isfile(G_copy.path + ".tmp"), False) 33 | 34 | G.make_column("quux", lambda x, y, z, w: "cat", "foo bar baz qux".split()) 35 | assert_equal(G.header, "foo bar baz qux quux".split()) 36 | for line in G.lines(parts=False): 37 | assert_equal(G.extract_columns(line, "quux"), ["cat"]) 38 | 39 | G.delete_columns("quux") 40 | assert_equal(G.header, "foo bar baz qux".split()) 41 | 42 | G.copy_column("quux", "qux") 43 | assert_equal(G.header, "foo bar baz qux quux".split()) 44 | for line in G.lines(parts=False): 45 | assert_equal(G.extract_columns(line, "qux"), 46 | G.extract_columns(line, "quux") 47 | ) 48 | 49 | G.delete() 50 | assert_equal(os.path.isfile("tests/tmp/testcorpus"), False) 51 | 52 | def test_clean(): 53 | G = LineFile("tests/smallcorpus-malformed.txt.bz2", header="foo bar baz qux".split(), 54 | path="tests/tmp/testcorpus") 55 | len_G = len(G) 56 | G.clean(columns=4, lower=False, alphanumeric=False, count_columns=True, 57 | nounderscores=False, echo_toss=True) 58 | assert_equal(len(G), len_G - 2) 59 | G.delete() 60 | 61 | G = LineFile("tests/smallcorpus-malformed.txt.bz2", header="foo bar baz qux".split(), 62 | path="tests/tmp/testcorpus") 63 | G.clean(lower=True, alphanumeric=True, count_columns=False, echo_toss=True) 64 | assert_equal(len(G), 8562) 65 | G.delete() 66 | 67 | G = LineFile("tests/smallcorpus-malformed.txt.bz2", header="foo bar baz qux".split(), 68 | path="tests/tmp/testcorpus") 69 | G.clean(lower=True, alphanumeric=True, count_columns=False, echo_toss=True, 70 | filter_fn=lambda x: False) 71 | assert_equal(len(G), 0) 72 | G.delete() 73 | 74 | G = LineFile("tests/smallcorpus-malformed.txt.bz2", header="foo bar baz qux".split(), 75 | path="tests/tmp/testcorpus") 76 | G.clean(lower=True, alphanumeric=False, count_columns=False, echo_toss=True, 77 | modifier_fn=lambda x: "hello") 78 | assert_equal(len(G), len_G) 79 | for line in G.lines(): 80 | assert_equal(line, "hello") 81 | G.delete() 82 | 83 | 84 | def test_clean_lazy(): 85 | G = 
LineFile("tests/smallcorpus-malformed.txt.bz2", header="foo bar baz qux".split(), 86 | path="tests/tmp/testcorpus") 87 | len_G = len(G) 88 | G.clean(columns=4, lower=False, alphanumeric=False, count_columns=True, 89 | nounderscores=False, echo_toss=True, lazy=True) 90 | assert_equal(len(G), len_G - 2) 91 | G.delete() 92 | 93 | G = LineFile("tests/smallcorpus-malformed.txt.bz2", header="foo bar baz qux".split(), 94 | path="tests/tmp/testcorpus") 95 | G.clean(lower=True, alphanumeric=True, count_columns=False, echo_toss=True, lazy=True) 96 | assert_equal(len(G), 8562) 97 | G.delete() 98 | 99 | G = LineFile("tests/smallcorpus-malformed.txt.bz2", header="foo bar baz qux".split(), 100 | path="tests/tmp/testcorpus") 101 | G.clean(lower=True, alphanumeric=True, count_columns=False, echo_toss=True, 102 | filter_fn=lambda x: False, lazy=True) 103 | assert_equal(len(G), 0) 104 | G.delete() 105 | 106 | G = LineFile("tests/smallcorpus-malformed.txt.bz2", header="foo bar baz qux".split(), 107 | path="tests/tmp/testcorpus") 108 | G.clean(lower=True, alphanumeric=False, count_columns=False, echo_toss=True, 109 | modifier_fn=lambda x: "hello", lazy=True) 110 | for line in G.lines(parts=False): 111 | assert_equal(line, "hello") 112 | G.delete() 113 | 114 | def test_resum_equal(): 115 | G = LineFile("tests/smallcorpus.txt.bz2", header="foo bar baz qux".split(), 116 | path="tests/tmp/testcorpus") 117 | len_G = len(G) 118 | total = G.sum_column("qux") 119 | G.resum_equal("foo", "qux", assert_sorted=True, keep_all=False) 120 | assert_equal(len(G), 1) 121 | for line in G.lines(): 122 | assert_equal(int(G.extract_columns(line, "qux")[0]), total) 123 | G.delete() 124 | 125 | G = LineFile("tests/smallcorpus.txt.bz2", header="foo bar baz qux".split(), 126 | path="tests/tmp/testcorpus") 127 | G.resum_equal("foo", "qux", assert_sorted=True, keep_all=True) 128 | assert_equal(len(G), len_G) 129 | for line in G.lines(): 130 | assert_equal(int(G.extract_columns(line, "qux")[0]), total) 131 | G.delete() 132 | 133 | def test_resum_equal_lazy(): 134 | G = LineFile("tests/smallcorpus.txt.bz2", header="foo bar baz qux".split(), 135 | path="tests/tmp/testcorpus") 136 | len_G = len(G) 137 | total = G.sum_column("qux") 138 | G.resum_equal("foo", "qux", assert_sorted=True, keep_all=False, lazy=True) 139 | for line in G.lines(): 140 | assert_equal(int(G.extract_columns(line, "qux")[0]), total) 141 | G.delete() 142 | 143 | G = LineFile("tests/smallcorpus.txt.bz2", header="foo bar baz qux".split(), 144 | path="tests/tmp/testcorpus") 145 | G.resum_equal("foo", "qux", assert_sorted=True, keep_all=True, lazy=True) 146 | for line in G.lines(): 147 | assert_equal(int(G.extract_columns(line, "qux")[0]), total) 148 | G.delete() 149 | 150 | 151 | """ 152 | def test_avg_surprisal(): 153 | G = LineFile("tests/smallcorpus.txt.bz2", header="foo bar baz qux", 154 | path="tests/tmp/testcorpus") 155 | G.make_marginal_column("quux", "foo bar", "qux") 156 | G.sort("baz") 157 | for line in G.average_surprisal("baz", "qux", "quux", assert_sorted=True): 158 | # TODO this test 159 | pass 160 | 161 | G.delete() 162 | """ 163 | 164 | def test_unicode(): 165 | """ test unicode 166 | 167 | replace every word in the test corpus with random unicode 168 | and see if we get the same surprisal scores. 
169 | 170 | """ 171 | def generate_random_unicode(): 172 | for _ in xrange(5): 173 | yield unichr(random.choice((0x300, 0x9999)) + random.randint(0, 0xff)) 174 | 175 | scramblemap = {} 176 | 177 | G = LineFile("tests/smallcorpus.txt.bz2", header="foo bar baz qux".split(), 178 | path="tests/tmp/testcorpus") 179 | G.clean(lower=True, alphanumeric=False, count_columns=False, echo_toss=True, lazy=True) 180 | G.make_marginal_column("quux", "foo bar".split(), "qux", lazy=True) 181 | G.sort("baz") 182 | len_G = len(G) 183 | sum_counts = G.sum_column("quux") 184 | sum_surprisal = math.fsum(line[2] for line in G.average_surprisal("baz", "qux", "quux", assert_sorted=True)) 185 | G.delete() 186 | 187 | G = LineFile("tests/smallcorpus.txt.bz2", header="foo bar baz qux".split(), 188 | path="tests/tmp/testcorpus") 189 | 190 | def scramble(line): 191 | words = line.split()[:3] 192 | count = line.split()[-1] 193 | for i, word in enumerate(words): 194 | if word in scramblemap: 195 | words[i] = scramblemap[word] 196 | else: 197 | garbage = u"".join(generate_random_unicode()) 198 | words[i] = garbage 199 | scramblemap[word] = garbage 200 | 201 | return "\t".join(words + [count]) 202 | 203 | G.clean(lower=True, alphanumeric=False, count_columns=False, echo_toss=True, 204 | modifier_fn=scramble) 205 | G.make_marginal_column("quux", "foo bar".split(), "qux") 206 | G.sort("baz") 207 | sum_counts_scrambled = G.sum_column("quux") 208 | assert_equal(sum_counts, sum_counts_scrambled) 209 | assert_equal(len_G, len(G)) 210 | sum_surprisal_scrambled = math.fsum(line[2] for line in G.average_surprisal("baz", "qux", "quux", assert_sorted=True)) 211 | G.delete() 212 | 213 | assert_equal(sum_surprisal, sum_surprisal_scrambled) 214 | 215 | def test_basics_in_memory(): 216 | G = LineFileInMemory("tests/smallcorpus.txt.bz2", header="foo bar baz qux", 217 | path="tests/tmp/testcorpus") 218 | assert_equal(G.header, "foo bar baz qux".split()) 219 | assert_equal(G.files, ["tests/smallcorpus.txt.bz2"]) 220 | 221 | G.make_column("quux", lambda x, y, z, w: "cat", "foo bar baz qux") 222 | assert_equal(G.header, "foo bar baz qux quux".split()) 223 | for line in G.lines(parts=False): 224 | assert_equal(G.extract_columns(line, "quux"), ["cat"]) 225 | 226 | G.delete_columns("quux") 227 | assert_equal(G.header, "foo bar baz qux".split()) 228 | 229 | G.copy_column("quux", "qux") 230 | assert_equal(G.header, "foo bar baz qux quux".split()) 231 | for line in G.lines(parts=False): 232 | assert_equal(G.extract_columns(line, "qux"), 233 | G.extract_columns(line, "quux") 234 | ) 235 | 236 | def test_clean_in_memory(): 237 | G = LineFileInMemory("tests/smallcorpus-malformed.txt.bz2", header="foo bar baz qux".split(), 238 | path="tests/tmp/testcorpus") 239 | len_G = len(G) 240 | G.clean(columns=4, lower=False, alphanumeric=False, count_columns=True, 241 | nounderscores=False, echo_toss=True) 242 | assert_equal(len(G), len_G - 2) 243 | 244 | G = LineFileInMemory("tests/smallcorpus-malformed.txt.bz2", header="foo bar baz qux".split(), 245 | path="tests/tmp/testcorpus") 246 | G.clean(lower=True, alphanumeric=True, count_columns=False, echo_toss=True) 247 | assert_equal(len(G), 8562) 248 | 249 | G = LineFileInMemory("tests/smallcorpus-malformed.txt.bz2", header="foo bar baz qux".split(), 250 | path="tests/tmp/testcorpus") 251 | G.clean(lower=True, alphanumeric=True, count_columns=False, echo_toss=True, 252 | filter_fn=lambda x: False) 253 | assert_equal(len(G), 0) 254 | 255 | G = LineFileInMemory("tests/smallcorpus-malformed.txt.bz2", header="foo bar 
baz qux".split(), 256 | path="tests/tmp/testcorpus") 257 | G.clean(lower=True, alphanumeric=False, count_columns=False, echo_toss=True, 258 | modifier_fn=lambda x: "hello") 259 | assert_equal(len(G), len_G) 260 | for line in G.lines(parts=False): 261 | assert_equal(line, "hello") 262 | 263 | def test_resum_equal_in_memory(): 264 | G = LineFileInMemory("tests/smallcorpus.txt.bz2", header="foo bar baz qux".split(), 265 | path="tests/tmp/testcorpus") 266 | len_G = len(G) 267 | total = G.sum_column("qux") 268 | G.resum_equal("foo", "qux", assert_sorted=True, keep_all=False) 269 | assert_equal(len(G), 1) 270 | for line in G.lines(): 271 | assert_equal(int(G.extract_columns(line, "qux")[0]), total) 272 | 273 | G = LineFileInMemory("tests/smallcorpus.txt.bz2", header="foo bar baz qux".split(), 274 | path="tests/tmp/testcorpus") 275 | G.resum_equal("foo", "qux", assert_sorted=True, keep_all=True) 276 | assert_equal(len(G), len_G) 277 | for line in G.lines(): 278 | assert_equal(int(G.extract_columns(line, "qux")[0]), total) 279 | 280 | def test_unicode_in_memory(): 281 | def generate_random_unicode(): 282 | for _ in xrange(5): 283 | yield unichr(random.choice((0x300, 0x9999)) + random.randint(0, 0xff)) 284 | 285 | scramblemap = {} 286 | 287 | G = LineFileInMemory("tests/smallcorpus.txt.bz2", header="foo bar baz qux".split(), 288 | path="tests/tmp/testcorpus") 289 | G.clean(lower=True, alphanumeric=False, count_columns=False, echo_toss=True, lazy=True) 290 | G.make_marginal_column("quux", "foo bar".split(), "qux", lazy=True) 291 | G.sort("baz") 292 | len_G = len(G) 293 | sum_counts = G.sum_column("quux") 294 | sum_surprisal = math.fsum(line[2] for line in G.average_surprisal("baz", "qux", "quux", assert_sorted=True)) 295 | 296 | 297 | G = LineFileInMemory("tests/smallcorpus.txt.bz2", header="foo bar baz qux".split(), 298 | path="tests/tmp/testcorpus") 299 | 300 | def scramble(line): 301 | words = line.split()[:3] 302 | count = line.split()[-1] 303 | for i, word in enumerate(words): 304 | if word in scramblemap: 305 | words[i] = scramblemap[word] 306 | else: 307 | garbage = u"".join(generate_random_unicode()) 308 | words[i] = garbage 309 | scramblemap[word] = garbage 310 | 311 | return "\t".join(words + [count]) 312 | 313 | G.clean(lower=True, alphanumeric=False, count_columns=False, echo_toss=True, 314 | modifier_fn=scramble, lazy=True) 315 | G.make_marginal_column("quux", "foo bar".split(), "qux", lazy=True) 316 | G.sort("baz") 317 | sum_counts_scrambled = G.sum_column("quux") 318 | assert_equal(sum_counts, sum_counts_scrambled) 319 | assert_equal(len_G, len(G)) 320 | sum_surprisal_scrambled = math.fsum(line[2] for line in G.average_surprisal("baz", "qux", "quux", assert_sorted=True)) 321 | 322 | assert_equal(sum_surprisal, sum_surprisal_scrambled) 323 | 324 | -------------------------------------------------------------------------------- /tests/smallcorpus-malformed.txt.bz2: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/piantado/ngrampy/792b25e3293f06ac9561a3c02bfaad22d6149d9a/tests/smallcorpus-malformed.txt.bz2 -------------------------------------------------------------------------------- /tests/smallcorpus.txt.bz2: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/piantado/ngrampy/792b25e3293f06ac9561a3c02bfaad22d6149d9a/tests/smallcorpus.txt.bz2 --------------------------------------------------------------------------------