├── README ├── download.py ├── examples ├── SurprisalInContext │ ├── README │ ├── analyze.R │ ├── run-GoogleBooks.sh │ └── surprisal-2gram.py └── TrigramMatches │ ├── README │ ├── badwords.txt │ ├── compute_trigram_stats.py │ ├── find_matched_items.py │ └── find_matched_items_aXb.py ├── ngrampy ├── LineFile.py ├── LineFile.pyc ├── LineFileInMemory.py ├── __init__.py ├── __init__.pyc ├── debug.py └── helpers.py ├── process-google.py ├── process-initial.sh └── tests ├── ngrampy_tests.py ├── smallcorpus-malformed.txt.bz2 └── smallcorpus.txt.bz2 /README: -------------------------------------------------------------------------------- 1 | 2 | ngrampy is a python library for manipulating google (or similarly formatted) n-gram data. It provides a python class for basic table manipulations in which each operation on a table is mirrored by an operation on the hard drive, so huge n-gram files that cannot be read into RAM can still be processed. This costs a lot of hard drive time, but it handles arbitrary file sizes (5-20 GB is typical). It is *not* optimized for speed, since these jobs take a long time anyway and are typically run once. 3 | 4 | Usually it makes more sense to process the google files once, concatenating them and collapsing dates into one large file with all the ngrams (this may take a few days). For this, the process-google.py script is fastest (much faster than LineFile). In collapsing dates, it produces a much smaller file (~10GB for eng-us 2grams): 5 | gzip -dc /home/piantado/Desktop/GoogleBooks/eng-us-all/2/* | python process-google.py /tmp/G-eng-us-all 6 | Alternatively, unpigz is about 2x as fast as gzip on my computer (it multithreads fetching, file IO, etc.). 7 | 8 | This script does not do any fancy filtering of the ngrams. 9 | 10 | 11 | To download data from google, you can use download.py. 12 | 13 | NOTE: In general, you should use this library with 14 | 15 | export PYTHONIOENCODING=utf-8 16 | 17 | so that you can handle utf-8 characters from google. 18 | 19 | NOTE: Columns in the text files are split on whitespace; if you want something else, you should merge columns with underscores or similar. 20 | 21 | NOTE: pypy tends to run much faster than CPython for this! 22 | 23 | ======================================================== 24 | == LICENSE 25 | ======================================================== 26 | 27 | ngrampy is licensed under GPL 3.0 28 | 29 | ======================================================== 30 | == INSTALLATION: 31 | ======================================================== 32 | 33 | Put this library somewhere--mine lives in /home/piantado/mit/Libraries/ngrampy/ 34 | 35 | Set the PYTHONPATH environment variable to point to ngrampy/: 36 | 37 | export PYTHONPATH=$PYTHONPATH:/home/piantado/Desktop/mit/Libraries/ngrampy 38 | 39 | You can put this into your .bashrc file so that it is loaded automatically when you open a terminal. On Ubuntu and most Linux systems, this is: 40 | 41 | echo 'export PYTHONPATH=$PYTHONPATH:/home/piantado/Desktop/mit/Libraries/ngrampy' >> ~/.bashrc 42 | 43 | You can also do 44 | 45 | echo 'export PYTHONIOENCODING=utf-8' >> ~/.bashrc 46 | 47 | although this will change your default python encoding. 48 | 49 | After that you should be ready to use the library. 50 | 51 | -------------------------------------------------------------------------------- /download.py: -------------------------------------------------------------------------------- 1 | """ 2 | This downloads from google all of the files matching a pattern on the Google Books Ngram Download page.
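It is written for Python 2 (httplib2, BeautifulSoup 3, urllib) and creates per-language directories (e.g. eng-us-all/2/) under the current working directory, so run it from wherever the data should live, e.g.:

    python download.py

Only the languages hard-coded in the set below are downloaded; edit that set to fetch others.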
3 | """ 4 | 5 | import httplib2 6 | from BeautifulSoup import BeautifulSoup, SoupStrainer 7 | import re 8 | import os 9 | import urllib 10 | 11 | # Use httplib2 and BeautifulSoup to scrape the links from the google index page: 12 | http = httplib2.Http() 13 | status, response = http.request('http://storage.googleapis.com/books/ngrams/books/datasetsv2.html') 14 | 15 | for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')): 16 | if link.has_key('href'): 17 | url = link['href'] 18 | 19 | # IF we match what we want: 20 | if re.search("[12]gram.+20120701", url): 21 | # Decode this 22 | m = re.search(r"googlebooks-([\w\-]+)-(\d+)gram.+",url) 23 | language, n = m.groups(None) 24 | 25 | # Only download some language 26 | if language not in set(["eng-us-all","eng-gb-all", "fre-all", "ger-all", "heb-all", "ita-all", "rus-all", "spa-all" ]): continue 27 | 28 | filename = re.split(r"/", url)[-1] # last item on filename split 29 | 30 | # Make the directory if it does not exist 31 | if not os.path.exists(language): os.mkdir(language) 32 | if not os.path.exists(language+"/"+n): os.mkdir(language+"/"+n) 33 | 34 | if not os.path.exists(language+"/"+n+"/"+filename): 35 | print "# Downloading %s to %s" % (url, language+"/"+n+"/"+filename) 36 | urllib.urlretrieve(url, language+"/"+n+"/"+filename ) 37 | 38 | 39 | 40 | -------------------------------------------------------------------------------- /examples/SurprisalInContext/README: -------------------------------------------------------------------------------- 1 | This computes surprisal measures for 11 languages from Piantadosi, Tily, & Gibson's word length paper. It is a complete re-implementation, using the LineFile class to handle the big corpus. This handles unicode and filters 2 | the corpora somewhat differently than the original paper, but the results are largely the same. 3 | 4 | For fastest running, you should do this on a solid state drive. 5 | 6 | The analysis throws out many of the garbarge words on google by using vocabularies from OpenSubtlex, taking the most frequent 25k words. The Extract_Vocabularies directory contains a script to extract these vocabularies. 7 | -------------------------------------------------------------------------------- /examples/SurprisalInContext/analyze.R: -------------------------------------------------------------------------------- 1 | 2 | # A script for analyzing the results of run-all.sh, which will populate the Surprisal directory 3 | # In the original work, we used Opensubtlex to define vocabularies, but now for simplicity 4 | # Let's just use the most frequent strings 5 | 6 | # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 7 | # Some handy functions 8 | # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 9 | 10 | stdize <- function(x, ...) { (x - mean(x,...)) / sd(x, ...) 
} 11 | 12 | sort.by.frequency <- function(d) { d[order(d$Log.Frequency, decreasing=T),] } 13 | 14 | # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 15 | # Define the vocabulary -- take the most frequent words in some year 16 | # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 17 | 18 | VOCAB <- as.character(sort.by.frequency(read.table("Surprisal/eng-us-2.1950.txt", header=T))[1:25000,"Word"]) 19 | 20 | # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 21 | # Now analyze: 22 | # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 23 | 24 | D <- NULL 25 | for(Y in c("1900", "1925", "1950", "1975", "2000")){#"1500", "1525", "1550", "1600", "1625", "1650", "1675", "1700", "1725", "1750", "1775", "1800")) { 26 | for(L in c("eng-gb-2", "eng-us-2")){ 27 | 28 | d <- read.table(paste("Surprisal/", L, ".", Y, ".txt", sep=""), header=T) 29 | d <- d[is.element(d$Word, VOCAB),] 30 | 31 | # Very simple--just nonparametric correlations 32 | # NOTE: Email Steve for fancier scripts and analysis (partials, bootstrapping, etc.) 33 | sc <- cor.test(d$Surprisal, d$Orthographic.Length, method="spearman") 34 | fc <- cor.test(-d$Log.Frequency, d$Orthographic.Length, method="spearman") ## Negative log freq here so that its on the same scale (we didn't normalize freq) 35 | 36 | l <- lm( stdize(Orthographic.Length) ~ stdize(Surprisal), data=d) 37 | 38 | D <- rbind(D, data.frame( Language=L, 39 | Year=Y, 40 | Surprisal.cor=sc$estimate, 41 | Frequency.cor=fc$estimate, 42 | # Surprisal.p.value=sc$p.value, 43 | # Frequency.p.value=fc$p.value, 44 | mean.surprisal=mean(d$Surprisal), 45 | sd.surprisal=sd(d$Surprisal), 46 | lm.icpt=coef(l)[1], 47 | lm.slope=coef(l)[2], 48 | mean.fw.surprisal=weighted.mean(d$Surprisal, exp(d$Log.Frequency)), 49 | total.freq=sum(2.0**d$Log.Frequency) 50 | )) 51 | } 52 | } 53 | 54 | print(D) 55 | 56 | 57 | # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 58 | # Build the monster data frame 59 | # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 60 | 61 | # 62 | # D <- NULL 63 | # for(Y in c("1500", "1525", "1550", "1600", "1625", "1650", "1675", "1700", "1725", "1750", "1775", "1800")) { 64 | # for(L in c("eng-us-2")){ 65 | # 66 | # d <- read.table(paste("Surprisal/", L, ".", Y, ".txt", sep=""), header=T) 67 | # d <- d[is.element(d$Word, VOCAB),] 68 | # d$Total.Log.Frequency <- log(sum(2.0**d$Log.Frequency)) # TODO:Logsumexp 69 | # d$Year <- as.numeric(Y) 70 | # d$Language <- L 71 | # 72 | # D <- rbind(D, d) 73 | # } 74 | # } 75 | -------------------------------------------------------------------------------- /examples/SurprisalInContext/run-GoogleBooks.sh: -------------------------------------------------------------------------------- 1 | 2 | # Pass in the google directory with each file you'd like to process 3 | 4 | # > bash run-GoogleBooks.sh /CorpusA/GoogleBooks/Processed/eng-us-2/* 5 | 6 | # This then computes, stores the google archive to Archive/xxx.7z and surprisal to Surprisal 7 | 8 | for f in $@ 9 | do 10 | x=$(basename $f) # the base file name 11 | python surprisal-2gram.py --in=$f --path=/tmp/$x.google > Surprisal/$x.txt 12 | 7z a -mx=9 Archive/$x.7z /tmp/$x.google && rm /tmp/$x.google & # run in background since its slow 13 | done -------------------------------------------------------------------------------- /examples/SurprisalInContext/surprisal-2gram.py: -------------------------------------------------------------------------------- 1 | """ 2 | This file shows how to use ngrampy to compute the average surprisal 
measures from Piantadosi, Tily & Gibson (2011) 3 | 4 | python surprisal-2gram.py --in=/path/to/google --path=/tmp/temporaryfile > surprisal.txt 5 | 6 | """ 7 | from ngrampy.LineFile import * 8 | import os 9 | import argparse 10 | import glob 11 | 12 | ASSERT_SORTED = True # if you want an extra check on sorting 13 | 14 | parser = argparse.ArgumentParser(description='Compute average surprisal from google style data') 15 | parser.add_argument('--in', dest='in', type=str, default="/home/piantado/Desktop/mit/Corpora/GoogleNGrams/2/*", nargs="?", help='The directory with google files (e.g. Google/3gms/)') 16 | parser.add_argument('--path', dest='path', type=str, default="/tmp/GoogleSurprisal", nargs="?", help='Where the database file lives') 17 | args = vars(parser.parse_args()) 18 | 19 | print "# Loading files" 20 | G = LineFile( glob.glob(args['in']), header=["w1", "w2", "cnt12"], path=args['path']) 21 | print "# Cleaning" 22 | G.clean(columns=3) 23 | 24 | # Since we collapsed case, go through and re-sum the triple counts 25 | print "# Resumming for case collapsing" 26 | G.sort(keys="w1 w2") 27 | G.resum_equal("w1 w2", "cnt12", assert_sorted=ASSERT_SORTED ) # in collapsing case, etc., we need to re-sum 28 | 29 | # Now go through and 30 | print "# Making marginal counts" 31 | G.make_marginal_column("cnt1", "w1", "cnt12") 32 | 33 | # and compute surprisal 34 | print "# Sorting by word" 35 | G.sort("w2") 36 | 37 | print "# Computing surprisal" 38 | G.print_average_surprisal("w2", "cnt12", "cnt1", assert_sorted=ASSERT_SORTED) 39 | 40 | # And remove my temporary file: 41 | print "# Removing my temporary file" 42 | G.delete_tmp() 43 | 44 | # If you have a file that's already sorted, etc: 45 | #G = LineFile(["/ssd/GoogleSurprisal-ALREADYFILTERED"], path="/ssd/Gsurprisal", header=["w1", "w2", "w3", "cnt12", "cnt12"]) # for debugging 46 | #G.print_average_surprisal("w3", "cnt12", "cnt12", assert_sorted=False) 47 | -------------------------------------------------------------------------------- /examples/TrigramMatches/README: -------------------------------------------------------------------------------- 1 | This is for computing trigrams with matched properties (e.g. matched unigram and bigram stats). So you can find trigrams that are controlled on all but the joint probability, for instance. I once tried to do a project involving them. 2 | 3 | First build a "database" with compute_trigram_stats.py. This will take google and build a bigger file with each trigram and other measures such as the unigram and bigram probabilities. 4 | 5 | Then, run find_matched_items.py, which takes the output fo compute_trigram_stats (assumed to live in /ssd/trigram-stats), and then subsamples it, and then sorts to generate items which are matched. It outputs to stdout the number of items in the stack, the item number, and the two lines of /ssd/trigram-stats which are matched. 
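As a rough sketch of the matching criterion (hypothetical numbers here; the real check is check_tolerance() in find_matched_items.py below, with tolerance = 0.001), two candidate lines are paired when their unigram and bigram statistics agree to within a relative tolerance:

def within_tolerance(x, y, tolerance=0.001):
    # relative difference between two (log-scale) statistics
    return abs(x - y) / ((x + y) / 2.) < tolerance

print within_tolerance(12.3401, 12.3408)  # True: close enough to be paired
print within_tolerance(12.3401, 12.4500)  # False: rejected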
6 | -------------------------------------------------------------------------------- /examples/TrigramMatches/badwords.txt: -------------------------------------------------------------------------------- 1 | sexual 2 | pubic 3 | groin 4 | crotch 5 | genital 6 | genitals 7 | sex 8 | blow 9 | blowjob 10 | sexed 11 | sexting 12 | ahole 13 | anus 14 | ash0le 15 | ash0les 16 | asholes 17 | ass 18 | hot 19 | Ass Monkey 20 | Assface 21 | assh0le 22 | assh0lez 23 | asshole 24 | assholes 25 | assholz 26 | asswipe 27 | azzhole 28 | balls 29 | bassterds 30 | bastard 31 | bastards 32 | bastardz 33 | basterds 34 | basterdz 35 | Biatch 36 | bitch 37 | bitches 38 | Blow Job 39 | boffing 40 | butthole 41 | buttwipe 42 | c0ck 43 | c0cks 44 | c0k 45 | Carpet Muncher 46 | cawk 47 | cawks 48 | Clit 49 | cnts 50 | cntz 51 | cock 52 | cockhead 53 | cock-head 54 | cocks 55 | CockSucker 56 | cock-sucker 57 | crap 58 | cum 59 | cunt 60 | cunts 61 | cuntz 62 | dick 63 | dild0 64 | dild0s 65 | dildo 66 | dildos 67 | dilld0 68 | dilld0s 69 | dominatricks 70 | dominatrics 71 | dominatrix 72 | dyke 73 | enema 74 | f u c k 75 | f u c k e r 76 | fag 77 | fag1t 78 | faget 79 | fagg1t 80 | faggit 81 | faggot 82 | fagit 83 | fags 84 | fagz 85 | faig 86 | faigs 87 | fart 88 | flipping the bird 89 | fuck 90 | fucker 91 | fuckin 92 | fucking 93 | fucked 94 | fuckhole 95 | fuckable 96 | fuck-yes 97 | fucks 98 | Fudge Packer 99 | fuk 100 | Fukah 101 | Fuken 102 | fuker 103 | Fukin 104 | Fukk 105 | Fukkah 106 | Fukken 107 | Fukker 108 | Fukkin 109 | g00k 110 | gay 111 | gayboy 112 | gaygirl 113 | gays 114 | gayz 115 | God-damned 116 | h00r 117 | h0ar 118 | h0re 119 | hells 120 | hoar 121 | hoor 122 | hoore 123 | jackoff 124 | jap 125 | japs 126 | jerk-off 127 | jisim 128 | jiss 129 | jizm 130 | jizz 131 | knob 132 | knobs 133 | knobz 134 | kunt 135 | kunts 136 | kuntz 137 | Lesbian 138 | Lezzian 139 | Lipshits 140 | Lipshitz 141 | masochist 142 | masokist 143 | massterbait 144 | masstrbait 145 | masstrbate 146 | masterbaiter 147 | masterbate 148 | masterbates 149 | Motha Fucker 150 | Motha Fuker 151 | Motha Fukkah 152 | Motha Fukker 153 | Mother Fucker 154 | Mother Fukah 155 | Mother Fuker 156 | Mother Fukkah 157 | Mother Fukker 158 | mother-fucker 159 | Mutha Fucker 160 | Mutha Fukah 161 | Mutha Fuker 162 | Mutha Fukkah 163 | Mutha Fukker 164 | n1gr 165 | nastt 166 | nude 167 | nigger 168 | niiger 169 | nigur 170 | orgasim 171 | nigger; 172 | nigur; 173 | niiger; 174 | niigr; 175 | orafis 176 | orgasim; 177 | orgasm 178 | orgasum 179 | oriface 180 | orifice 181 | orifiss 182 | packi 183 | packie 184 | packy 185 | paki 186 | pakie 187 | paky 188 | pecker 189 | peeenus 190 | peeenusss 191 | peenus 192 | peinus 193 | pen1s 194 | penas 195 | penis 196 | penis-breath 197 | penus 198 | penuus 199 | Phuc 200 | Phuck 201 | Phuk 202 | Phuker 203 | Phukker 204 | polac 205 | polack 206 | polak 207 | Poonani 208 | pr1c 209 | pr1ck 210 | pr1k 211 | pusse 212 | pussee 213 | pussy 214 | puuke 215 | puuker 216 | queer 217 | queers 218 | queerz 219 | qweers 220 | qweerz 221 | qweir 222 | recktum 223 | rectum 224 | retard 225 | sadist 226 | scank 227 | schlong 228 | screwing 229 | semen 230 | sex 231 | sexy 232 | Sh!t 233 | sh1t 234 | sh1ter 235 | sh1ts 236 | sh1tter 237 | sh1tz 238 | shit 239 | shits 240 | shitter 241 | Shitty 242 | Shity 243 | shitz 244 | Shyt 245 | Shyte 246 | Shytty 247 | Shyty 248 | skanck 249 | skank 250 | skankee 251 | skankey 252 | skanks 253 | Skanky 254 | slut 255 | sluts 256 | Slutty 257 | slutz 258 | son-of-a-bitch 
259 | tit 260 | turd 261 | va1jina 262 | vag1na 263 | vagiina 264 | vagina 265 | vaj1na 266 | vajina 267 | vullva 268 | vulva 269 | w0p 270 | wh00r 271 | wh0re 272 | whore 273 | xrated 274 | xxx 275 | b!+ch 276 | bitch 277 | blowjob 278 | clit 279 | arschloch 280 | fuck 281 | shit 282 | ass 283 | asshole 284 | b!tch 285 | b17ch 286 | b1tch 287 | bastard 288 | bi+ch 289 | boiolas 290 | buceta 291 | c0ck 292 | cawk 293 | chink 294 | cipa 295 | clits 296 | cock 297 | cum 298 | cunt 299 | dildo 300 | dirsa 301 | ejakulate 302 | fatass 303 | fcuk 304 | fuk 305 | fux0r 306 | hoer 307 | hore 308 | jism 309 | kawk 310 | l3itch 311 | l3i+ch 312 | lesbian 313 | masturbate 314 | masterbat 315 | masterbat3 316 | motherfucker 317 | s.o.b. 318 | mofo 319 | nazi 320 | nigga 321 | nigger 322 | nutsack 323 | phuck 324 | pimpis 325 | pusse 326 | pussy 327 | scrotum 328 | sh!t 329 | shemale 330 | shi+ 331 | sh!+ 332 | slut 333 | smut 334 | teets 335 | tits 336 | boobs 337 | b00bs 338 | teez 339 | testical 340 | testicle 341 | titt 342 | w00se 343 | jackoff 344 | wank 345 | whoar 346 | whore 347 | damn 348 | dyke 349 | fuck 350 | shit 351 | @$$ 352 | amcik 353 | andskota 354 | arse 355 | assrammer 356 | ayir 357 | bi7ch 358 | bitch 359 | bollock 360 | breasts 361 | butt-pirate 362 | cabron 363 | cazzo 364 | chraa 365 | chuj 366 | Cock 367 | cunt 368 | d4mn 369 | daygo 370 | dego 371 | dick 372 | dike 373 | dupa 374 | dziwka 375 | ejackulate 376 | Ekrem 377 | Ekto 378 | enculer 379 | faen 380 | fag 381 | fanculo 382 | fanny 383 | feces 384 | feg 385 | Felcher 386 | ficken 387 | fitt 388 | Flikker 389 | foreskin 390 | Fotze 391 | Fu( 392 | fuk 393 | futkretzn 394 | gay 395 | gook 396 | guiena 397 | h0r 398 | h4x0r 399 | hell 400 | helvete 401 | hoer 402 | honkey 403 | Huevon 404 | hui 405 | injun 406 | jizz 407 | kanker 408 | kike 409 | klootzak 410 | kraut 411 | knulle 412 | kuk 413 | kuksuger 414 | Kurac 415 | kurwa 416 | kusi 417 | kyrpa 418 | lesbo 419 | mamhoon 420 | masturbat 421 | merd 422 | mibun 423 | monkleigh 424 | mouliewop 425 | muie 426 | mulkku 427 | muschi 428 | nazis 429 | nepesaurio 430 | nigger 431 | orospu 432 | paska 433 | perse 434 | picka 435 | pierdol 436 | pillu 437 | pimmel 438 | piss 439 | pizda 440 | poontsee 441 | poop 442 | porn 443 | p0rn 444 | pr0n 445 | preteen 446 | pula 447 | pule 448 | puta 449 | puto 450 | qahbeh 451 | queef 452 | rautenberg 453 | schaffer 454 | scheiss 455 | schlampe 456 | schmuck 457 | screw 458 | sh!t 459 | sharmuta 460 | sharmute 461 | shipal 462 | shiz 463 | skribz 464 | skurwysyn 465 | sphencter 466 | spic 467 | spierdalaj 468 | splooge 469 | suka 470 | b00b 471 | testicle 472 | titt 473 | twat 474 | vittu 475 | wank 476 | wetback 477 | wichser 478 | wop 479 | yed 480 | zabourah -------------------------------------------------------------------------------- /examples/TrigramMatches/compute_trigram_stats.py: -------------------------------------------------------------------------------- 1 | 2 | from ngrampy.LineFile import * 3 | import os 4 | GOOGLE_ENGLISH_DIR = "/home/piantado/Desktop/mit/Corpora/GoogleNGrams/3/" 5 | VOCAB_FILE = "Vocabulary/EnglishVocabulary.txt" 6 | 7 | # Read the vocabulary file 8 | vocabulary = [ l.strip() for l in open(VOCAB_FILE, "r") ] 9 | 10 | #rawG = LineFile(["test3.txt"], header=["w1", "w2", "w3", "cnt123"]) # for debugging 11 | rawG = LineFile([GOOGLE_ENGLISH_DIR+x for x in os.listdir(GOOGLE_ENGLISH_DIR)], header=["w1", "w2", "w3", "cnt123"]) 12 | 13 | rawG.clean() # already done! 
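# Note: clean() (see ngrampy/LineFile.py) lowercases, drops lines containing non-letter
# characters or an unexpected column count, and strips underscore-prefixed tags.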
14 | rawG.restrict_vocabulary("w1 w2 w3", vocabulary) # in fields w1 and w2, restrict our vocabulary 15 | rawG.sort(keys="w1 w2 w3") # Since we collapsed case, etc. This could also be rawG.sort(keys=["w1","w2","w3"]) in the other format. 16 | rawG.resum_equal("w1 w2 w3", "cnt123" ) 17 | 18 | # Where we store all lines 19 | G = rawG.copy() 20 | 21 | # Now go through and compute what we want 22 | G1 = rawG.copy() # start with a copy 23 | G1.delete_columns( "w2 w3" ) # delete the columns we don't want 24 | G1.sort("w1" ) # sort this by the one we do want 25 | G1.resum_equal( "w1", "cnt123" ) # resum equal 26 | G1.rename_column("cnt123", "cnt1") # rename the column since its now a sum of 1 27 | G.sort("w1") # sort our target by w 28 | G.merge(G1, keys1="w1", tocopy="cnt1") # merge in 29 | G1.delete() # and delete this temporary 30 | 31 | G2 = rawG.copy() 32 | G2.delete_columns( "w1 w3" ) 33 | G2.sort("w2" ) 34 | G2.resum_equal( "w2", "cnt123" ) 35 | G2.rename_column("cnt123", "cnt2") 36 | G.sort("w2") 37 | G.merge(G2, keys1="w2", tocopy="cnt2") 38 | G2.delete() 39 | 40 | G3 = rawG.copy() 41 | G3.delete_columns( "w1 w2" ) 42 | G3.sort("w3") 43 | G3.resum_equal( "w3", "cnt123" ) 44 | G3.rename_column("cnt123", "cnt3") 45 | G.sort("w3") 46 | G.merge(G3, keys1="w3", tocopy="cnt3") 47 | G3.delete() 48 | 49 | G12 = rawG.copy() 50 | G12.delete_columns( ["w3"] ) 51 | G12.sort("w1 w2" ) 52 | G12.resum_equal( "w1 w2", "cnt123" ) 53 | G12.rename_column("cnt123", "cnt12") 54 | G.sort("w1 w2") # do this for merging 55 | G.merge(G12, keys1="w1 w2", tocopy=["cnt12"]) 56 | G12.delete() 57 | 58 | G23 = rawG.copy() 59 | G23.delete_columns( ["w1"] ) 60 | G23.sort("w2 w3" ) 61 | G23.resum_equal( "w2 w3", "cnt123" ) 62 | G23.rename_column("cnt123", "cnt23") 63 | G.sort("w2 w3") # do this for merging 64 | G.merge(G23, keys1="w2 w3", tocopy=["cnt23"]) 65 | G23.delete() 66 | 67 | # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 68 | # Now compute all the arithmetic, etc. 69 | # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 70 | 71 | #Make a colum: call it unigram, a function of three arguments, and give it w1,w2,w3 as arguments 72 | from math import log 73 | def log2(x): log(x,2.0) 74 | 75 | def logsum(*x): return str(round(sum(map(log,map(float,x))), 4)) # must take a string and return a string 76 | #def logcol(x) : return logsum([x]) 77 | G.make_column("unigram", logsum, "cnt1 cnt2 cnt3") 78 | G.make_column("bigram", logsum, "cnt12 cnt23") 79 | G.make_column("trigram", logsum, "cnt123") 80 | 81 | G.sort("unigram bigram trigram", dtype=float) 82 | 83 | ##G.cat() 84 | G.head() 85 | -------------------------------------------------------------------------------- /examples/TrigramMatches/find_matched_items.py: -------------------------------------------------------------------------------- 1 | """ 2 | This file is for constructing stimuli which are matched on bigram and unigram surprisal 3 | It requires you to have run compute_trigram_stats and output the result to /ssd/trigram-stats 4 | It also takes a bad word file to filter out bad words from our experimental stimuli 5 | """ 6 | 7 | from ngrampy.LineFile import * 8 | import os 9 | SUBSAMPLE_N = 15000 10 | tolerance = 0.001 11 | BAD_WORD_FILE = "badwords.txt" 12 | 13 | def check_tolerance(x,y): 14 | """ 15 | A handy function to check if some variables are within tolerance percent of each other 16 | """ 17 | return abs(x-y) / ((x+y)/2.) 
< tolerance 18 | 19 | # This will copy the file, make a new one, and then print out possible lines 20 | G = LineFile(files=["/ssd/trigram-stats"], path="/ssd/subsampled-stimuli", header="w1 w2 w3 c123 c1 c2 c3 c12 c23 unigram bigram trigram") 21 | 22 | # Now throw out the porno words 23 | porno_vocabulary = [ l.strip() for l in open(BAD_WORD_FILE, "r") ] 24 | G.restrict_vocabulary("w1 w2 w3", porno_vocabulary, invert=True) 25 | 26 | # and then subsample 27 | G.subsample_lines(N=SUBSAMPLE_N) 28 | 29 | # and make sure we are sorted for the below 30 | G.sort("unigram bigram trigram", dtype=float) 31 | G.head() # just a peek 32 | 33 | item_number = 0 34 | line_stack = [] 35 | for l in G.lines(tmp=False, parts=False): 36 | # extrac the columns from line 37 | unigram, bigram, trigram = G.extract_columns(l, keys="unigram bigram trigram", dtype=float) 38 | 39 | # now remove things which cannot possibly match anymore 40 | while len(line_stack) > 0 and not check_tolerance(unigram, G.extract_columns(line_stack[0], keys="unigram", dtype=float)[0]): 41 | del line_stack[0] 42 | 43 | # now go through the line_stack and try out each 44 | # it must already be within tolerance on unigram, or it would have been removed 45 | for x in line_stack: 46 | #print "Checking ", x 47 | x_unigram, x_bigram, x_trigram = G.extract_columns(x, keys="unigram bigram trigram", dtype=float) 48 | 49 | # it must have already been within tolerance on unigram or it would be removed 50 | assert( check_tolerance(unigram, x_unigram) ) 51 | 52 | # and check the bigrams 53 | if check_tolerance(bigram, x_bigram): 54 | print len(line_stack), item_number, l 55 | print len(line_stack), item_number, x 56 | item_number += 1 57 | 58 | # and add this on 59 | line_stack.append(l) 60 | -------------------------------------------------------------------------------- /examples/TrigramMatches/find_matched_items_aXb.py: -------------------------------------------------------------------------------- 1 | """ 2 | This file is for constructing stimuli pairs 3 | 4 | A X B <-> A Y B 5 | 6 | with the unigram and bigram stats matched 7 | 8 | It requires you to have run compute_trigram_stats and output the result to /ssd/trigram-stats 9 | It also takes a bad word file to filter out bad words from our experimental stimuli. 10 | 11 | This 12 | """ 13 | 14 | from ngrampy.LineFile import * 15 | import os 16 | SUBSAMPLE_N = 50000000 17 | tolerance = 0.01 18 | BAD_WORD_FILE = "badwords.txt" 19 | 20 | def check_tolerance(x,y): 21 | """ 22 | A handy function to check if some variables are within tolerance percent of each other 23 | """ 24 | return abs(x-y) / ((x+y)/2.) 
< tolerance 25 | 26 | # This will copy the file, make a new one, and then print out possible lines 27 | G = LineFile(files=["/ssd/trigram-stats"], path="/ssd/subsampled-stimuli", header="w1 w2 w3 c123 c1 c2 c3 c12 c23 unigram bigram trigram") 28 | 29 | # Now throw out the porno words 30 | #porno_vocabulary = [ l.strip() for l in open(BAD_WORD_FILE, "r") ] 31 | #G.restrict_vocabulary("w1 w2 w3", porno_vocabulary, invert=True) 32 | 33 | # draw a subsample 34 | #if SUBSAMPLE_N is not None: 35 | #G.subsample_lines(N=SUBSAMPLE_N) 36 | 37 | # we need to resort this so that we can have w1 and w3 equal and then all the n-grams matched 38 | G.sort("w1 w3 unigram bigram trigram", lines=1000000) 39 | G.head() 40 | 41 | item_number = 0 42 | line_stack = [] 43 | for l in G.lines(tmp=False, parts=False): 44 | # extract the columns from line 45 | w1, w3, unigram, bigram, trigram = G.extract_columns(l, keys="w1 w3 unigram bigram trigram", dtype=[str, str, float, float, float]) 46 | 47 | # now remove things which cannot possibly match anymore 48 | while len(line_stack) > 0: 49 | w1_, w3_, unigram_, bigram_, trigram = G.extract_columns(line_stack[0], keys="w1 w3 unigram bigram trigram", dtype=[str, str, float, float, float]) 50 | 51 | if not (w1_ == w1 and w3_ == w3 and check_tolerance(unigram, unigram_)): 52 | del line_stack[0] 53 | 54 | # now go through the line_stack and try out each 55 | # it must already be within tolerance on unigram, or it would have been removed 56 | for x in line_stack: 57 | w1_, w3_, unigram_, bigram_, trigram = G.extract_columns(x, keys="w1 w3 unigram bigram trigram", dtype=[str, str, float, float, float]) 58 | 59 | # it must have already been within tolerance on unigram or it would be removed 60 | assert( check_tolerance(unigram, unigram_) ) 61 | assert( w1_ == w1 and w3_ == w3 ) 62 | 63 | # and check the bigrams 64 | if check_tolerance(bigram, bigram_) and (w1==w1_ and w3==w3_): 65 | print len(line_stack), item_number, l 66 | print len(line_stack), item_number, x 67 | item_number += 1 68 | 69 | # and add this on 70 | line_stack.append(l) 71 | -------------------------------------------------------------------------------- /ngrampy/LineFile.py: -------------------------------------------------------------------------------- 1 | """ 2 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 3 | 4 | This class allows manipulation of google ngram data (and similar formatted) data. 5 | When you call functions on LineFiles, the changes are echoed in the file. 6 | 7 | The uses tab (\t) as the column separator. 8 | 9 | When you run this, if you get an encoding error, you may need to set the environment to 10 | 11 | export PYTHONIOENCODING=utf-8 12 | 13 | 14 | TODO: 15 | - Make this so each function call etc. will output what it did 16 | NOTE: 17 | - Column names cannot contain spaces. 18 | 19 | Licensed under GPL 3.0 20 | 21 | This program is free software: you can redistribute it and/or modify 22 | it under the terms of the GNU General Public License as published by 23 | the Free Software Foundation, either version 3 of the License, or 24 | (at your option) any later version. 25 | 26 | This program is distributed in the hope that it will be useful, 27 | but WITHOUT ANY WARRANTY; without even the implied warranty of 28 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 29 | GNU General Public License for more details. 30 | 31 | You should have received a copy of the GNU General Public License 32 | along with this program. If not, see . 
33 | 34 | 35 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 36 | """ 37 | from __future__ import division 38 | import os 39 | import sys 40 | import re 41 | import unicodedata 42 | import heapq 43 | import shutil 44 | import random 45 | import codecs 46 | import itertools 47 | from math import log 48 | from collections import Counter 49 | from copy import deepcopy 50 | 51 | # handly numpy with pypy 52 | try: 53 | import numpy 54 | except ImportError: 55 | try: 56 | import numpypy as numpy 57 | except ImportError: 58 | pass 59 | 60 | from debug import * 61 | from helpers import * 62 | 63 | # A temporary file like /tmp 64 | NGRAMPY_DEFAULT_PATH = "/tmp" #If no path is specified, we go here 65 | 66 | ECHO_SYSTEM = True # show the system calls we make? 67 | SORT_DEFAULT_LINES = 10000000 # how many lines to sorted at a time in RAM when we sort a large file? 68 | ENCODING = 'utf-8' 69 | 70 | # Set this so we can write stderr 71 | sys.stdout = codecs.getwriter(ENCODING)(sys.stdout) 72 | sys.stderr = codecs.getwriter(ENCODING)(sys.stderr) 73 | 74 | IO_BUFFER_SIZE = int(100e6) # approx size of input buffer 75 | 76 | COLUMN_SEPARATOR = u"\t" # must be passed to string.split() or else empty columns are collapsed! 77 | 78 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 79 | # # Class definition 80 | # # # # # # # # # # # # # # # # # s# # # # # # # # # # # # # # # # # # # # # # # # # # # 81 | 82 | class LineFile(object): 83 | 84 | def __init__(self, files, header=None, path=None, force_nocopy=False): 85 | """ 86 | Create a new file object with the specified header. It takes a list of files 87 | and cats them to path (overwriting it). A single file is acceptable. 88 | 89 | header - give each column a name (you can refer by name instead of number) 90 | path - where is this file stored? If None, we make a new temporary files 91 | force_nocopy - Primarily for debugging, this prevents us from copying a file and just uses the path as is 92 | you should pass files a list of length 1 which is the file, and path should be None 93 | as in, LineFile(files=["/ssd/trigram-stats"], header="w1 w2 w3 c123 c1 c2 c3 c12 c23 unigram bigram trigram", force_nocopy=True) 94 | """ 95 | if isinstance(files, str): 96 | files = [files] 97 | 98 | assert len(files) > 0, "*** Must provide non-empty list of files!" 
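        # If force_nocopy is set we operate on files[0] in place; otherwise every input file
        # is decompressed (.gz/.bz2/.xz) or cat'ed into a fresh file at self.path below.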
99 | 100 | if force_nocopy: 101 | assert(len(files) == 1) 102 | self.path = files[0] 103 | 104 | else: 105 | if path is None: 106 | self.path = NGRAMPY_DEFAULT_PATH+"/tmp" # used by get_new_path, since self.path is where we get the dir from 107 | self.path = self.get_new_path() # overwrite iwth a new file 108 | else: 109 | self.path = path 110 | 111 | # if it exists, let's just move it to a backup name 112 | if os.path.exists(self.path): 113 | systemcall("mv "+self.path+" "+self.path+".old") 114 | 115 | # and if we specified a bunch of input files 116 | for f in files: 117 | if f.endswith(".idf"): 118 | continue # skip the index files 119 | if f.endswith(".gz"): 120 | systemcall("gunzip -d -c "+f+" >> "+self.path) 121 | elif f.endswith(".bz2"): 122 | systemcall("bzip2 -d -c "+f+" >> "+self.path) 123 | elif f.endswith(".xz") or f.endswith(".lzma"): 124 | systemcall("xz -d -c "+f+" >> "+self.path) 125 | else: 126 | systemcall("cat "+f+" >> "+self.path) 127 | 128 | # just keep track 129 | self.files = files 130 | 131 | # and store some variables 132 | self.tmppath = self.path+".tmp" 133 | 134 | if isinstance(header, str): 135 | self.header = header.split(COLUMN_SEPARATOR) 136 | else: 137 | self.header = header 138 | 139 | self._lines = None 140 | self.preprocess() 141 | 142 | def preprocess(self): 143 | def fix_separators(line): 144 | return COLUMN_SEPARATOR.join(line.split()) 145 | self.map(fix_separators) 146 | 147 | def write(self, it, lazy=False): 148 | """ Write 149 | 150 | Write the lines in an iterable to the LineFile. 151 | 152 | If lazy, then delay actually evaluating the iterable and 153 | writing it to file. 154 | 155 | WARNING! If you specify lazy=True, then you can only read() 156 | those lines once! If you need to read lines more than once, 157 | you need to do lazy=False and write the lines to the file. 158 | 159 | Lazy iterators can be chained into efficient pipelines. 160 | 161 | """ 162 | if lazy: 163 | self._lines = it 164 | else: 165 | # Write lines to tmppath (note it used to be the other way around!) 166 | with codecs.open(self.tmppath, mode='w', encoding=ENCODING, 167 | errors='strict', buffering=IO_BUFFER_SIZE) as outfile: 168 | for item in it: 169 | print >>outfile, item 170 | 171 | # And move tmppath to path 172 | self.mv_from_tmp() 173 | 174 | def read(self): 175 | """ Read 176 | 177 | Return the current lines of the LineFile, whether from 178 | a file or from a lazy iterator. 179 | 180 | """ 181 | if self._lines is None: 182 | return codecs.open(self.path, mode='r', encoding=ENCODING, 183 | errors='strict', buffering=IO_BUFFER_SIZE) 184 | else: 185 | result = iter(self._lines) 186 | self._lines = None # only allow the lazy iterator to be read once!! 
187 | return result 188 | 189 | 190 | def setheader(self, *x): 191 | self.header = x 192 | 193 | def rename_column(self, x, v): 194 | self.header[self.to_column_number(x)] = v 195 | 196 | def to_column_number(self, x): 197 | """ 198 | Takes either: 199 | a column number - just echoes back 200 | a string - returns the right column number for the string 201 | a whitespace separated string - returns an array of column numbers 202 | an array - maps along and returns 203 | 204 | """ 205 | if isinstance(x, int): 206 | return x 207 | elif isinstance(x, list): 208 | return map(self.to_column_number, x) 209 | elif isinstance(x, str): 210 | if re_SPACE.search(x): # if spaces, treat it as an array and map 211 | return map(self.to_column_number, x.split(" ")) 212 | 213 | # otherwise, a single string so just find the header that equals it 214 | for i, item in enumerate(self.header): 215 | if item == x: 216 | return i 217 | 218 | print >>sys.stderr, "Invalid header name ["+x+"]", self.header 219 | exit(1) 220 | 221 | def delete_columns(self, cols, lazy=False): 222 | 223 | # make sure these are *decreasing* so we can delete in order 224 | cols = sorted(listifnot(self.to_column_number(cols)), reverse=True) 225 | 226 | def generate_deleted(lines): 227 | for parts in lines: 228 | for c in cols: 229 | del parts[c] 230 | yield "\t".join(parts) 231 | # and delete from the header, after deletion is complete 232 | if self.header is not None: 233 | for c in cols: 234 | del self.header[c] 235 | 236 | self.write(generate_deleted(self.lines(parts=True)), lazy=lazy) 237 | 238 | 239 | 240 | def copy(self, path=None): 241 | 242 | if path is None: 243 | path = self.get_new_path() # make a new path if its not specified 244 | 245 | # we can just copy the file by treating it as one of the "files" 246 | # and then use this new path, not the old one! 247 | return LineFile([self.path], header=deepcopy(self.header), path=path) 248 | 249 | def get_new_path(self): 250 | ind = 1 251 | while True: 252 | path = os.path.dirname(self.path)+"/ngrampy-"+str(ind) 253 | if not os.path.isfile(path): 254 | return path 255 | ind += 1 256 | 257 | def mv_tmp(self): 258 | """ 259 | Move myself to my temporary file, so that I can cat to my self.path 260 | """ 261 | #print "# >>", self.path, self.tmppath 262 | shutil.move(self.path, self.tmppath) 263 | 264 | def mv_from_tmp(self): 265 | """ 266 | Move myself from self.tmppath to self.path. 267 | """ 268 | shutil.move(self.tmppath, self.path) 269 | 270 | 271 | def rename(self, n): 272 | shutil.move(self.path, n) 273 | self.path = n 274 | 275 | def rm_tmp(self): 276 | """ 277 | Remove the temporary file 278 | """ 279 | os.remove(self.tmppath) 280 | 281 | def cp(self, f): 282 | shutil.cp(self.path, f) 283 | 284 | def extract_columns(self, line, keys, dtype=unicode): 285 | """ 286 | Extract some columns from a single line. Assumes that keys are numbers (e.g. already mapped through to_column_number) 287 | and will return the columns as the specified dtype 288 | NOTE: This always returns a list, even if one column is specified. This may change in the future 289 | 290 | e.g. 
line="a\tb\tc\td" 291 | keys=[1,4] 292 | gives: ["a", "b", "c", "d"], "b\td" 293 | """ 294 | if isinstance(keys, str): 295 | keys = listifnot(self.to_column_number(keys)) 296 | 297 | parts = line.split(COLUMN_SEPARATOR) 298 | 299 | if isinstance(dtype,list): 300 | return [ dtype[i](parts[x]) for i,x in enumerate(keys)] 301 | else: 302 | return [ dtype(parts[x]) for x in keys ] 303 | 304 | def filter(self, fn, lazy=False, verbose=False): 305 | """ Keep only lines where the function returns True. """ 306 | if verbose: 307 | def echo_wrapper(fn): 308 | def wrapper(x, **kwargs): 309 | result = fn(x, **kwargs) 310 | if not result: 311 | print >>sys.stderr, u"Tossed line due to %s:" % fn.__name__, x 312 | return result 313 | return wrapper 314 | fn = echo_wrapper(fn) 315 | 316 | filtered = itertools.ifilter(fn, self.lines()) 317 | self.write(filtered, lazy=lazy) 318 | 319 | def map(self, fn, lazy=False, verbose=False): 320 | """ Apply function to all lines. """ 321 | if verbose: 322 | def echo_wrapper(fn): 323 | def wrapper(x, **kwargs): 324 | result = fn(x, **kwargs) 325 | print >>sys.stderr, u"%s => %s" % (unicode(x), unicode(result)) 326 | return result 327 | return wrapper 328 | fn = echo_wrapper(fn) 329 | 330 | mapped = itertools.imap(fn, self.lines()) 331 | self.write(mapped, lazy=lazy) 332 | 333 | def clean(self, columns=None, lower=True, alphanumeric=True, count_columns=True, nounderscores=True, echo_toss=False, filter_fn=None, modifier_fn=None, lazy=False): 334 | """ 335 | This does several things: 336 | columns - how many cols should there be? If None, then we use the first line 337 | lower - convert to lowercase 338 | alphanumeric - toss lines with non-letter category characters (in unicode). WARNING: Tosses lines with "_" (e.g. syntactic tags in google) 339 | count_columns - if True, we throw out rows that don't have the same number of columns as the first line 340 | nounderscores - if True, we remove everything matching _[^\s]\s -> " " 341 | echo_toss - tell us who was removed 342 | filter_fn - User-provided boolean filtering function 343 | modifier_fn - User-provided function to modify the line (downcase etc) 344 | 345 | NOTE: filtering by alphanumeric allows underscores at the beginning of columns (as in google tags) 346 | NOTE: nounderscores may remove columns if there is a column for tags (e.g. a column with _adv) 347 | """ 348 | def filter_alphanumeric(line): 349 | collapsed = re_tagstartchar.sub("", line) # remove these so that tags don't cause us to toss lines. Must come before spaces removed 350 | collapsed = re_collapser.sub("", collapsed) 351 | collapsed = re_sentence_boundary.sub("", collapsed) 352 | char_categories = (unicodedata.category(k) for k in collapsed) 353 | return all(n == "Ll" or n == "Lu" for n in char_categories) 354 | 355 | def generate_filtered_columns(lines, columns=columns): 356 | for line in lines: 357 | cols = line.split(COLUMN_SEPARATOR) 358 | cn = len(cols) 359 | if columns is None: 360 | columns = cn # save the first line 361 | 362 | if not (columns != cn or any(not non_whitespace_matcher.search(ci) for ci in cols)): 363 | yield line 364 | elif echo_toss: 365 | print >>sys.stderr, "Tossed line with bad column count: %s" % line 366 | print >>sys.stderr, "Line has %d columns; I expected %d." % (cn, columns) 367 | 368 | # Filters. 
369 | if filter_fn: 370 | self.filter(filter_fn, lazy=True, verbose=echo_toss) 371 | if alphanumeric: 372 | self.filter(filter_alphanumeric, lazy=True, verbose=echo_toss) 373 | if count_columns: 374 | self.write(generate_filtered_columns(self.lines()), lazy=True) 375 | 376 | # Maps. 377 | if nounderscores: 378 | self.map(lambda line: re_underscore.sub("", line), lazy=True) 379 | if lower: 380 | self.map(lambda line: line.lower(), lazy=True) 381 | if modifier_fn: 382 | self.map(modifier_fn, lazy=True) 383 | 384 | if not lazy: 385 | self.write(self.lines()) 386 | 387 | def restrict_vocabulary(self, cols, vocabulary, invert=False, lazy=False): 388 | """ 389 | Make a new version where "cols" contain only words matching the vocabulary 390 | OR if invert=True, throw out anything matching cols 391 | """ 392 | 393 | cols = listifnot(self.to_column_number(cols)) 394 | 395 | vocabulary = set(vocabulary) 396 | 397 | def restrict(line, cols=cols, vocabulary=vocabulary): 398 | parts = line.split(COLUMN_SEPARATOR) 399 | for c in cols: 400 | if invert and parts[c] not in vocabulary: 401 | return l 402 | elif parts[c] in vocabulary: 403 | return l 404 | 405 | self.map(restrict, lazy=lazy) 406 | 407 | def make_marginal_column(self, newname, keys, sumkey, lazy=False): 408 | self.copy_column(newname, sumkey, lazy=True) 409 | self.sort(keys) 410 | self.resum_equal(keys, newname, keep_all=True, assert_sorted=False, lazy=lazy) 411 | 412 | def resum_equal(self, keys, sumkeys, assert_sorted=True, keep_all=False, lazy=False): 413 | """ 414 | Takes all rows which are equal on the keys and sums the sumkeys, overwriting them. 415 | Anything not in keys or sumkeys, there are only guarantees for if keep_all=True. 416 | """ 417 | keys = listifnot(self.to_column_number(keys)) 418 | sumkeys = listifnot(self.to_column_number(sumkeys)) 419 | 420 | if assert_sorted: 421 | self.assert_sorted(keys, allow_equal=True, lazy=True) 422 | 423 | def generate_resummed(groups): 424 | for compkey, lines in groups: 425 | if keep_all: 426 | lines = list(lines) # load into memory; otherwise we can only iterate through once 427 | sums = Counter() 428 | for parts in lines: 429 | for sumkey in sumkeys: 430 | try: 431 | sums[sumkey] += int(parts[sumkey]) 432 | except IndexError: 433 | print >>sys.stderr, "IndexError:", parts, sumkeys 434 | if keep_all: 435 | for parts in lines: 436 | for sumkey in sumkeys: 437 | parts[sumkey] = str(sums[sumkey]) 438 | yield "\t".join(parts) 439 | else: 440 | try: 441 | for sumkey in sumkeys: 442 | parts[sumkey] = str(sums[sumkey]) # "parts" is the last line 443 | except IndexError: 444 | print >>sys.stderr, "IndexError:", parts, sumkeys 445 | 446 | yield "\t".join(parts) 447 | 448 | groups = self.groupby(keys) 449 | self.write(generate_resummed(groups), lazy=lazy) 450 | 451 | def assert_sorted(self, keys, dtype=unicode, allow_equal=False, lazy=False): 452 | """ 453 | Assert that a file is sorted by certain columns 454 | This good for merging, etc., which optionally check requirements 455 | to be sorted 456 | 457 | """ 458 | def gen_assert_sorted(lines, keys=keys): 459 | """ yield lines while asserted their sortedness """ 460 | keys = self.to_column_number(keys) 461 | prev_sortkey = None 462 | for line in lines: 463 | line = line.strip() 464 | yield line # yield all line and check afterwards 465 | sortkey = self.extract_columns(line, keys=keys, dtype=dtype) 466 | 467 | if prev_sortkey is not None: 468 | if allow_equal: 469 | myassert(prev_sortkey <= sortkey, line+";"+unicode(prev_sortkey)+";"+unicode(sortkey)) 
470 | else: 471 | myassert(prev_sortkey < sortkey, line+";"+unicode(prev_sortkey)+";"+unicode(sortkey)) 472 | 473 | prev_sortkey = sortkey 474 | 475 | self.write(gen_assert_sorted(self.lines()), lazy=lazy) 476 | 477 | def cat(self): 478 | systemcall("cat "+self.path) 479 | 480 | def head(self, n=10): 481 | print self.header 482 | lines = self.lines() 483 | for _ in xrange(n): 484 | print next(lines) 485 | 486 | def delete(self): 487 | try: 488 | os.remove(self.path) 489 | except OSError: 490 | pass 491 | try: 492 | os.remove(self.tmppath) 493 | except OSError: 494 | pass # no temporary file exists 495 | 496 | def delete_tmp(self): 497 | print >>sys.stderr, "*** delete_tmp now phased out! Please remove from code!" 498 | #os.remove(self.tmppath) 499 | 500 | def copy_column(self, newname, key, lazy=False): 501 | """ Copy a column. """ 502 | key = self.to_column_number(key) 503 | 504 | def generate_new_col(lines): 505 | for line in lines: 506 | parts = line.split(COLUMN_SEPARATOR) 507 | yield "\t".join([line, parts[key]]) 508 | self.header.extend(listifnot(newname)) 509 | 510 | self.write(generate_new_col(self.lines()), lazy=lazy) 511 | 512 | def make_column(self, newname, function, keys, lazy=False): 513 | """ 514 | Make a new column as some function of the other rows 515 | make_column("unigram", lambda x,y: int(x)+int(y), "cnt1 cnt2") 516 | will make a column called "unigram" that is the sum of cnt1 cnt2 517 | 518 | NOTE: The function MUST take strings and return strings, or else we die 519 | 520 | newname - the name for the new column. You can pass multiple if function returns tab-sep strings 521 | function - a function of other row arguments. Must return strings 522 | args - column names to get the arguments 523 | """ 524 | keys = listifnot( self.to_column_number(keys) ) 525 | 526 | def generate_new_col(lines): 527 | for line in lines: 528 | parts = line.split(COLUMN_SEPARATOR) 529 | yield "\t".join([line, function(*[parts[i] for i in keys])]) 530 | self.header.extend(listifnot(newname)) 531 | 532 | self.write(generate_new_col(self.lines()), lazy=lazy) 533 | 534 | def sort(self, keys, num_lines=SORT_DEFAULT_LINES, dtype=unicode, reverse=False): 535 | """ 536 | Sort me by my keys. this breaks the file up into subfiles of "lines", sorts them in RAM, 537 | and the mergesorts them 538 | 539 | We could use unix "sort" but that gives weirdness sometimes, and doesn't handle our keys 540 | as nicely, since it treats spaces in a counterintuitive way 541 | 542 | dtype - the type of the data to be sorted. Should be a castable python type 543 | e.g. str, int, float 544 | """ 545 | sorted_tmp_files = [] # a list of generators, yielding each line of the file 546 | 547 | keys = listifnot(self.to_column_number(keys)) 548 | 549 | # a generator to hand back lines of a file and keys for sorting 550 | def yield_lines(f): 551 | with codecs.open(f, "r", encoding=ENCODING) as infile: 552 | for l in infile: 553 | yield get_sort_key(l.strip()) 554 | 555 | # Map a line to sort keys (e.g. 
respecting dtype, etc); 556 | # we use the fact that python will sort arrays (yay) 557 | def get_sort_key(l): 558 | sort_key = self.extract_columns(l, keys=keys, dtype=dtype) 559 | sort_key.append(l) # the second element is the line 560 | return sort_key 561 | 562 | temp_id = 0 563 | for chunk in chunks(self.lines(), num_lines): 564 | sorted_tmp_path = self.path+".sorted."+str(temp_id) 565 | with codecs.open(sorted_tmp_path, 'w', encoding=ENCODING) as outfile: 566 | print >>outfile, "\n".join(sorted(chunk, key=get_sort_key)) 567 | sorted_tmp_files.append(sorted_tmp_path) 568 | temp_id += 1 569 | 570 | # okay now go through and merge sort -- use this cool heapq merging trick! 571 | def merge_sort(): 572 | for x in heapq.merge(*map(yield_lines, sorted_tmp_files)): 573 | yield x[-1] # the last item is the line itself, everything else is sort keys 574 | 575 | self.write(merge_sort()) 576 | 577 | # clean up 578 | for f in sorted_tmp_files: 579 | os.remove(f) 580 | 581 | def merge(self, other, keys1, tocopy, keys2=None, newheader=None, assert_sorted=True): 582 | """ 583 | Copy lines of other that match on keys onto self 584 | 585 | other - a LineFile object -- who to merge in 586 | keys1 - the keys of self for merging 587 | keys2 - the keys of other for merging. If not specified, we assume they are the same as keys1 588 | newheader - If specified, gives the names for the *new* columns 589 | assert_sorted - make False if you don't want an extra check on sorting (things can go very bad) 590 | 591 | NOTE: This assumes that every line of self occurs in other, but not vice-versa. It 592 | also allows multiples in self, but *not* other 593 | """ 594 | # fix up the keys 595 | # Note: Keys2 must be processed first here so we can specify by names, 596 | # and not have keys1 overwritten when they are mapped to numbers 597 | keys2 = listifnot(other.to_column_number(keys1 if keys2 is None else keys2)) 598 | tocopy = listifnot(other.to_column_number(tocopy)) 599 | keys1 = listifnot(self.to_column_number(keys1)) 600 | 601 | # this only works if we are sorted -- let's assert 602 | if assert_sorted: 603 | self.assert_sorted(keys1, allow_equal=True, lazy=True) # we can have repeat lines 604 | other.assert_sorted(keys2, allow_equal=False, lazy=True) # we cannot have repeat lines (how would they be mapped?) 605 | 606 | in1 = self.lines() 607 | in2 = other.lines() 608 | 609 | line1, parts1, key1 = read_and_parse(in1, keys=keys1) 610 | line2, parts2, key2 = read_and_parse(in2, keys=keys2) 611 | 612 | def generate_merged(in1, in2): 613 | while True: 614 | if key1 == key2: 615 | yield line1+"\t"+"\t".join(self.extract_columns(line2, keys=tocopy)) 616 | 617 | line1, parts1, key1 = read_and_parse(in1, keys=keys1) 618 | if not line1: 619 | break 620 | else: 621 | line2, parts2, key2 = read_and_parse(in2, keys=keys2) 622 | if not line2: # okay there is no match for line1 anywhere 623 | print >>sys.stderr, "** Error in merge: end of line2 before end of line 1:" 624 | print >>sys.stderr, "\t", line1 625 | print >>sys.stderr, "\t", line2 626 | exit(1) 627 | self.header.extend([other.header[i] for i in tocopy ]) # copy the header names from other 628 | 629 | self.write(generate_merged(in1, in2)) 630 | 631 | def print_conditional_entropy(self, W, cntXgW, downsample=10000, assert_sorted=True, pre="", preh="", header=True): 632 | """ 633 | Print the entropy H[X | W] for each W, assuming sorted by W. 
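        Concretely, for each w this reports H[X|W=w] = -sum_x p(x|w) * log2 p(x|w), where p(x|w)
        comes from normalizing the cntXgW counts within the group for w (see c2H in helpers.py).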
634 | Here, P(X|W) is given by unnormalized cntXgW 635 | Also prints the total frequency 636 | downsample - also prints the downsampled measures, where we only have downsample counts total. An attempt to correct H bias 637 | """ 638 | if assert_sorted: 639 | self.assert_sorted(listifnot(W), allow_equal=True, lazy=True) # allow W to be true 640 | 641 | 642 | W = self.to_column_number(W) 643 | assert not isinstance(W,list) 644 | #Xcol = self.to_column_number(X) 645 | #assert not isinstance(X,list) 646 | 647 | cntXgW = self.to_column_number(cntXgW) 648 | assert not isinstance(cntXgW, list) 649 | 650 | prevW = None 651 | if header: print preh+"Word\tFrequency\tContextCount\tContextEntropy\tContextEntropy2\tContextEntropy5\tContextEntropy10\tContextEntropy%i\tContextCount%i" % (downsample, downsample) 652 | for w, lines in self.groupby(W): 653 | w = w[0] # w comes out as ("hello",) 654 | wcounts = np.array([float(parts[cntXgW]) for parts in lines]) 655 | sumcount = sum(wcounts) 656 | dp = numpy.sort(numpy.random.multinomial(downsample, wcounts / sumcount)) # sort so we can take top on next line 657 | tp2, tp5, tp10 = dp[-2:], dp[-5:], dp[-10:] 658 | print pre, w, "\t", sumcount, "\t", len(wcounts), "\t", c2H(wcounts), "\t", c2H(tp2), "\t", c2H(tp5), "\t", c2H(tp10), "\t", c2H(dp), "\t", numpy.sum(dp>0) 659 | 660 | 661 | def average_surprisal(self, W, CWcnt, Ccnt, transcribe_fn=None, assert_sorted=True): 662 | """ 663 | Compute the average in-context surprisal, as in Piantadosi, Tily Gibson (2011). 664 | Yield output for each word. 665 | 666 | - W - column for the word 667 | - CWcnt - column for the count of context-word 668 | - Ccnt - column for the count of the context 669 | - transcribe_fn (optional) - transcription to do before measuring word length 670 | i.e. convert word to IPA, convert Chinese characters to pinyin, etc. 
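        For each word w, the reported surprisal is the frequency-weighted average over its contexts c,
            -sum_c cnt(c,w) * ( log2 cnt(c,w) - log2 cnt(c) ) / sum_c cnt(c,w),
        along with the word's length, its log2 total frequency, and its number of distinct contexts.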
671 | 672 | """ 673 | 674 | W = self.to_column_number(W) 675 | assert(not isinstance(W,list)) 676 | CWcnt = self.to_column_number(CWcnt) 677 | assert(not isinstance(CWcnt,list)) 678 | Ccnt = self.to_column_number(Ccnt) 679 | assert(not isinstance(Ccnt,list)) 680 | 681 | if assert_sorted: 682 | self.assert_sorted(listifnot(W), allow_equal=True, lazy=True) 683 | 684 | for word, lines in self.groupby(W): 685 | word = word[0] # word comes out as (word,) 686 | if transcribe_fn: 687 | word = transcribe_fn(word) 688 | sum_surprisal = 0 689 | total_word_frequency = 0 690 | total_context_count = 0 691 | for parts in lines: 692 | cwcnt = int(parts[CWcnt]) 693 | ccnt = int(parts[Ccnt]) 694 | sum_surprisal -= (log2(cwcnt) - log2(ccnt)) * cwcnt 695 | total_word_frequency += cwcnt 696 | total_context_count += 1 697 | length = len(word) 698 | yield u'"%s"'%word, length, sum_surprisal/total_word_frequency, log2(total_word_frequency), total_context_count 699 | 700 | def print_average_surprisal(self, W, CWcnt, Ccnt, transcribe_fn=None, assert_sorted=True): 701 | print "Word\tOrthographic.Length\tSurprisal\tLog.Frequency\tTotal.Context.Count" 702 | for line in self.average_surprisal(W, CWcnt, Ccnt, 703 | transcribe_fn=transcribe_fn, assert_sorted=assert_sorted): 704 | print u"\t".join(map(unicode, line)) 705 | 706 | ################################################################################################# 707 | # Iterators 708 | 709 | def lines(self, parts=False): 710 | """ 711 | Yield me a stripped version of each line of tmplines 712 | 713 | - parts - if true, we return an array that is split on tabs 714 | 715 | """ 716 | if parts: 717 | return (line.strip().split(COLUMN_SEPARATOR) for line in self.read()) 718 | else: 719 | return (line.strip() for line in self.read()) 720 | 721 | def groupby(self, keys): 722 | """ 723 | A groupby iterator matching the given keys. 724 | 725 | """ 726 | keys = listifnot(self.to_column_number(keys)) 727 | key_fn = lambda parts: tuple(parts[x] for x in keys) 728 | return itertools.groupby(self.lines(parts=True), key_fn) 729 | 730 | def __len__(self): 731 | """ 732 | How many total lines? 733 | """ 734 | return sum(1 for _ in self.read()) 735 | 736 | def subsample_lines(self, N=1000000): 737 | """ 738 | Make me a smaller copy of myself by randomly subsampling *lines* 739 | not according to counts. This is useful for creating a temporary 740 | file 741 | NOTE: N must fit into memory 742 | """ 743 | 744 | # We'll use a reservoir sampling algorithm 745 | sample = [] 746 | 747 | for idx, line in enumerate(self.lines()): 748 | if idx < N: 749 | sample.append(line) 750 | else: 751 | r = random.randrange(idx+1) 752 | if r < N: sample[r] = line 753 | 754 | # now output the sample 755 | self.write(sample) 756 | 757 | def sum_column(self, col, cast=int): 758 | col = self.to_column_number(col) 759 | return sum(cast(parts[col]) for parts in self.lines(parts=True)) 760 | 761 | def downsample_tokens(self, N, ccol, keep_zero_counts=False): 762 | """ 763 | Subsample myself via counts with the existing probability distribution. 764 | - N - the total sample size we end up with. 765 | - ccol - the column we use to estimate probabilities. Unnormalized, non-log probs (e.g. counts) 766 | 767 | NOTE: this assumes a multinomial on trigrams, which may not be accurate. If you started from a corpus, this will NOT in general keep 768 | counts consistent with a corpus. 769 | 770 | This uses a conditional beta distribution, once for each line for a total of N. 
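        In the code this is realized sequentially: each line's new count is drawn as
        Binomial(N_remaining, cnt / Z_remaining), after which N_remaining and Z_remaining are
        decremented; chained over all lines this yields a multinomial sample of size N over the rows.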
771 | See pg 12 of w3.jouy.inra.fr/unites/miaj/public/nosdoc/rap2012-5.pdf 772 | """ 773 | self.header.extend(ccol) 774 | ccol = self.to_column_number(ccol) 775 | 776 | Z = self.sum_column(ccol) 777 | 778 | def generate_downsampled(lines): 779 | for parts in lines: 780 | 781 | cnt = int(parts[ccol]) 782 | 783 | # Randomly sample 784 | if N > 0: 785 | newcnt = numpy.random.binomial(N,float(cnt)/float(Z)) 786 | else: 787 | newcnt = 0 788 | 789 | # Update the conditional multinomial 790 | N = N-newcnt # samples to draw 791 | Z = Z-cnt # normalizer for everything else 792 | 793 | parts[ccol] = str(newcnt) # update this 794 | 795 | if keep_zero_counts or newcnt > 0: 796 | yield '\t'.join(parts) 797 | 798 | self.write(generate_downsampled(self.lines(parts=True))) 799 | -------------------------------------------------------------------------------- /ngrampy/LineFile.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/piantado/ngrampy/792b25e3293f06ac9561a3c02bfaad22d6149d9a/ngrampy/LineFile.pyc -------------------------------------------------------------------------------- /ngrampy/LineFileInMemory.py: -------------------------------------------------------------------------------- 1 | """ 2 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 3 | 4 | This file contains a drop-in replacement for LineFile for in-memory operations. 5 | It mocks the interface of LineFile, including file-related arguments, 6 | but performs all operations in memory. 7 | 8 | This is not the most efficient way to do this in-memory, but it provides 9 | compability with scripts written for the on-disk version. 10 | 11 | Richard Futrell, 2013 12 | 13 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 14 | """ 15 | from __future__ import division 16 | import os 17 | import sys 18 | import re 19 | import unicodedata 20 | import heapq 21 | import shutil 22 | import random 23 | import codecs # for writing utf-8 24 | import itertools 25 | from math import log 26 | from collections import Counter 27 | from copy import deepcopy 28 | 29 | try: 30 | import numpy 31 | except ImportError: 32 | try: 33 | import numpypy as numpy 34 | except ImportError: 35 | pass 36 | 37 | from debug import * 38 | from helpers import * 39 | import filehandling as fh 40 | from LineFile import LineFile 41 | 42 | ENCODING = 'utf-8' 43 | CLEAN_TMP = False 44 | SORT_DEFAULT_LINES = None 45 | 46 | # Set this so we can write stderr 47 | #sys.stdout = codecs.getwriter(ENCODING)(sys.stdout) 48 | #sys.stderr = codecs.getwriter(ENCODING)(sys.stderr) 49 | 50 | IO_BUFFER_SIZE = int(100e6) # approx size of input buffer 51 | 52 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 53 | # # Class definition 54 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 55 | 56 | class LineFileInMemory(LineFile): 57 | 58 | def __init__(self, files, header=None, path=None, force_nocopy=False): 59 | """ 60 | Create a new file object with the specified header. It takes a list of files 61 | and reads them into memory as a list of lines. 
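        Since the interface mirrors LineFile, an existing script can usually just swap the
        constructor, e.g. G = LineFileInMemory(files, header="w1 w2 cnt12") in place of
        LineFile(files, header="w1 w2 cnt12"), with everything else unchanged.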
62 | 63 | header - give each column a name (you can refer by name instead of number) 64 | path - does nothing, for compatibility with LineFile 65 | force_nocopy - does nothing, for compatibility with LineFile 66 | """ 67 | if isinstance(files, str): 68 | files = [files] 69 | 70 | self.path = None 71 | self.tmppath = None 72 | 73 | # load the files into memory 74 | self._lines = [] 75 | for f in files: 76 | if f.endswith(".idf"): 77 | continue # skip the index files 78 | else: 79 | with fh.open(f, encoding=ENCODING) as infile: 80 | self._lines.extend(infile) 81 | 82 | # just keep track 83 | self.files = files 84 | 85 | if isinstance(header, str): 86 | self.header = header.split() 87 | else: 88 | self.header = header 89 | 90 | self.preprocess() 91 | 92 | def write(self, it, lazy=False): 93 | if lazy: 94 | self._lines = it 95 | else: 96 | self._lines = list(it) 97 | 98 | def read(self): 99 | return iter(self._lines) 100 | 101 | def copy(self, path=None): 102 | return deepcopy(self) 103 | 104 | def sort(self, keys, lines=None, dtype=unicode, reverse=False): 105 | """ 106 | Sort me by my keys. 107 | 108 | dtype - the type of the data to be sorted. Should be a castable python type 109 | e.g. str, int, float 110 | """ 111 | keys = listifnot(self.to_column_number(keys)) 112 | 113 | # Map a line to sort keys (e.g. respecting dtype, etc) ; we use the fact that python will sort arrays (yay) 114 | def get_sort_key(l): 115 | sort_key = self.extract_columns(l, keys=keys, dtype=dtype) # extract_columns gives them back tab-sep, but we need to cast them 116 | sort_key.append(l) # the second element is the line 117 | return sort_key 118 | 119 | self.write(sorted(self.lines(), key=get_sort_key)) 120 | 121 | 122 | def __len__(self): 123 | """ 124 | How many total lines? 125 | """ 126 | return len(list(self._lines)) 127 | 128 | -------------------------------------------------------------------------------- /ngrampy/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/piantado/ngrampy/792b25e3293f06ac9561a3c02bfaad22d6149d9a/ngrampy/__init__.py -------------------------------------------------------------------------------- /ngrampy/__init__.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/piantado/ngrampy/792b25e3293f06ac9561a3c02bfaad22d6149d9a/ngrampy/__init__.pyc -------------------------------------------------------------------------------- /ngrampy/debug.py: -------------------------------------------------------------------------------- 1 | """ debug 2 | 3 | Utilities for debugging. 
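A rough sketch of how the two helpers defined below are meant to be used (illustrative only):

    y = tap(x)      # print x to stdout and pass it through, handy inside expressions
    @log_calls      # decorator: print each call's bound arguments to stderr before running it
    def f(a, b): ...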
4 | Many ideas taken from funcy, http://github.com/Suor/funcy 5 | 6 | """ 7 | from __future__ import print_function 8 | import sys 9 | import inspect 10 | 11 | def tap(x, end="\n", file=sys.stdout): 12 | print(x, end=end, file=file) 13 | return x 14 | 15 | def log_calls(fn): 16 | def _fn(*args, **kwargs): 17 | binding = inspect.getcallargs(fn, *args, **kwargs) 18 | binding_str = ", ".join("%s=%s" % item for item in binding.iteritems()) 19 | signature = fn.__name__ + "(%s)" % binding_str 20 | print(signature, file=sys.stderr) 21 | return fn(*args, **kwargs) 22 | return _fn 23 | 24 | def myassert(tf, s): 25 | if not tf: 26 | print >>sys.stderr, "*** Assertion fail: ",s 27 | assert tf 28 | -------------------------------------------------------------------------------- /ngrampy/helpers.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | import re 4 | from math import log 5 | import subprocess 6 | from itertools import * 7 | 8 | ECHO_SYSTEM = False 9 | 10 | # Some friendly Regexes. May need to change encoding here for other encodings? 11 | re_SPACE = re.compile(r"\s", re.UNICODE) # for splitting on spaces, etc. 12 | re_underscore = re.compile(r"_[A-Za-z\-\_]+", re.UNICODE) # for filtering out numbers and whitespace 13 | re_collapser = re.compile(r"[\d\s]", re.UNICODE) # for filtering out numbers and whitespace 14 | re_sentence_boundary = re.compile(r"", re.UNICODE) 15 | re_tagstartchar = re.compile(r"(\s|^)_", re.UNICODE) # underscores may be okay at the start of words 16 | non_whitespace_matcher = re.compile(r"[^\s]", re.UNICODE) 17 | 18 | PRINT_LOG = False # should we log each action? 19 | 20 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 21 | # # Some helpful functions 22 | # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 23 | 24 | def printlog(x): 25 | if PRINT_LOG: 26 | print >>sys.stderr, x 27 | 28 | def read_and_parse(inn, keys): 29 | """ 30 | Read a line and parse it by tabs, returning the line, the tab parts, and some columns 31 | """ 32 | line = next(inn).strip() 33 | if not line: 34 | return line, None, None 35 | else: 36 | parts = line.split() 37 | return line, parts, "\t".join([parts[x] for x in keys]) 38 | 39 | def systemcall(cmd, echo=ECHO_SYSTEM): 40 | if echo: 41 | print >>sys.stderr, cmd 42 | p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=None) 43 | output, _ = p.communicate() 44 | return output 45 | 46 | def ifelse(x, y, z): 47 | return y if x else z 48 | 49 | def listifnot(x): 50 | return x if isinstance(x, list) else [x] 51 | 52 | def log2(x): 53 | return log(x,2.) 54 | 55 | def c2H(counts): 56 | """ Normalize counts and compute entropy 57 | 58 | Counts can be a generator. 59 | Doesn't depend on numpy. 60 | 61 | """ 62 | total = 0.0 63 | clogc = 0.0 64 | for c in counts: 65 | total += c 66 | clogc += c * log(c) 67 | return -(clogc/total - log(total)) / log(2) 68 | 69 | def chunks(iterable, size): 70 | """ Chunks 71 | 72 | Break an iterable into chunks of specified size. 73 | 74 | Params: 75 | iterable: An iterable 76 | size: An integer size. 77 | 78 | Yields: 79 | Tuples of size less than or equal to n, chunks of the input iterable. 
80 | 81 | Examples: 82 | >>> lst = ['foo', 'bar', 'baz', 'qux', 'zim', 'cat', 'dog'] 83 | >>> list(chunks(lst, 3)) 84 | [('foo', 'bar', 'baz'), ('qux', 'zim', 'cat'), ('dog',)] 85 | 86 | """ 87 | it = iter(iterable) 88 | while True: 89 | chunk = islice(it, None, size) 90 | probe = next(chunk) # raises StopIteration if nothing's there 91 | yield chain([probe], chunk) 92 | -------------------------------------------------------------------------------- /process-google.py: -------------------------------------------------------------------------------- 1 | """ 2 | 3 | Note: This appears to be quicker than using "import gzip", even if we copy 4 | the gziped file to a SSD before reading. It's also faster than trying to change buffering on stdin 5 | 6 | pypy is MUCH faster 7 | 8 | TODO: 9 | - Clean this up to make play nicer with Tags. We can make it give NAs to nonexistent words! 10 | - Make this handle unicode correctly -- is it even wrong? 11 | Changes: 12 | - Sep 20 2013: this now outputs the tags as their own column, with NA for missing tags 13 | """ 14 | 15 | import os 16 | import re 17 | import sys 18 | import itertools 19 | import codecs 20 | import gzip 21 | import glob 22 | 23 | import argparse 24 | parser = argparse.ArgumentParser(description='Process google ngrams into year bins') 25 | parser.add_argument('--in', dest='in', type=str, default=None, nargs="?", help='The file name for input') 26 | parser.add_argument('--out', dest='out', type=str, default="/tmp/", nargs="?", help='The file name for output (year appended)') 27 | parser.add_argument('--year-bin', dest='year-bin', type=int, default=10, nargs="?", help='How to bin the years') 28 | parser.add_argument('--quiet', dest='quiet', default=False, action="store_true", help='Output tossed lines?') 29 | parser.add_argument('--N', dest='N', default=3, nargs="?", help="Order of the ngram") 30 | args = vars(parser.parse_args()) 31 | 32 | YEAR_BIN = int(args['year-bin']) 33 | BUFSIZE = int(1e6) # We can allow huge buffers if we want... 34 | ENCODING = 'utf-8' 35 | LINE_N = int(args['N'])+3 # three extra columns 36 | 37 | prev_year,prev_ngram = None, None 38 | count = 0 39 | 40 | year2file = dict() 41 | part_count = None 42 | 43 | # python is not much slower than perl if we pre-compile regexes 44 | 45 | #cleanup = re.compile(r"(_[A-Za-z\_\-]+)|(\")") # The old way -- delete tags and quotes 46 | line_splitter = re.compile(r"\n", re.U) 47 | cleanup_quotes = re.compile(r"(\")", re.U) # kill quotes 48 | #column_splitter = re.compile(r"[\s]", re.U) # split on tabs OR spaces, since some of google seems to use one or the other. 
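# A worked sketch of what the main loop below does to one line (the values are
# invented for illustration, not taken from the corpus): with --N=2, an input
# line like
#
#     the_DET man_NOUN    1987    42    7
#
# has LINE_N = 5 whitespace-separated parts. The volume count (7) is dropped,
# the count is 42, the year 1987 is binned to 1980 under the default
# --year-bin=10, tagify() splits each token into word and tag, and consecutive
# identical (ngram, binned-year) rows have their counts summed before being
# written to the per-year output file as "the<TAB>DET<TAB>man<TAB>NOUN<TAB><count>".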
49 | 
50 | tag_match = re.compile(r"^(.+?)(_[A-Z\_\-\.\,\;\:]+)?$", re.U) # match a tag at the end of a word (assumes tags are uppercase and attached with an underscore)
51 | def tagify(x):
52 | """
53 | Take a word with a tag ("man_NOUN") and give back ("man","NOUN") with "NA" if the tag or word is not there
54 | """
55 | m = tag_match.match(x)
56 | if m:
57 | g = m.groups()
58 | 
59 | word = (g[0] if g[0] is not None else "NA")
60 | tag = (g[1] if g[1] is not None else "NA")
61 | return (word,tag)
62 | #if g[1] is None: return (g[0], "NA")
63 | #else: return g
64 | else: return []
65 | 
66 | def chain(args):
67 | a = []
68 | for x in args: a.extend(x)
69 | return a
70 | 
71 | 
72 | for f in glob.glob(args['in']):
73 | 
74 | # Unzip and encode
75 | inputfile = gzip.open(f, 'r')
76 | for l in inputfile:
77 | #l = l.decode(ENCODING)
78 | 
79 | l = l.strip() ## strip the trailing newline and surrounding whitespace
80 | l = cleanup_quotes.sub("", l) # remove quotes
81 | 
82 | #print >>sys.stderr, l
83 | 
84 | #parts = column_splitter.split(l)
85 | parts = l.split() # by default this splits on whitespace, which is much friendlier with unicode
86 | 
87 | # Our check on the number of parts -- we require this to be passed in (otherwise it's hard to parse)
88 | if len(parts) != LINE_N:
89 | if not args['quiet']: print "Wrong number of items on line: skipping ", l, parts, " IN FILE ", f
90 | continue # skip this line if it's garbage NOTE: this may mess up with some unicode chars?
91 | #print parts
92 | # parts[-1] is the number of books -- ignored here
93 | c = int(parts[-2]) # the count
94 | year = int(int(parts[-3]) / YEAR_BIN) * YEAR_BIN # round the year down to its bin
95 | ngram_ar = chain(map(tagify,parts[0:-3]))
96 | #print ngram_ar
97 | #if all([x != "NA" for x in ngram_ar]): # Chuck lines that don't have all things tagged
98 | #else: continue
99 | ngram = "\t".join(chain(map(tagify,parts[0:-3]))) # join everything else, including the tags separated out
100 | 
101 | # output if different
102 | if year != prev_year or ngram != prev_ngram:
103 | 
104 | if prev_year is not None:
105 | if prev_year_s not in year2file:
106 | year2file[prev_year_s] = open(args['out']+".%i"%prev_year, 'w', BUFSIZE)
107 | year2file[prev_year_s].write( "%s\t%i\n" % (prev_ngram,count) ) # write to the year file TODO: this may need proper unicode handling?
108 | 
109 | prev_ngram = ngram
110 | prev_year = year
111 | prev_year_s = str(prev_year)
112 | count = c
113 | else:
114 | count += c
115 | 
116 | # And write the last accumulated line if we didn't already!
117 | if year == prev_year and ngram == prev_ngram:
118 | if prev_year_s not in year2file:
119 | year2file[prev_year_s] = open(args['out']+".%i"%prev_year, 'w', BUFSIZE)
120 | year2file[prev_year_s].write( "%s\t%i\n" % (prev_ngram,count) ) # write to the year file TODO: this may need proper unicode handling?
121 | 
122 | inputfile.close()
123 | 
124 | # And close everything
125 | for year in year2file.keys():
126 | year2file[year].close()
127 | 
128 | 
129 | 
130 | 
--------------------------------------------------------------------------------
/process-initial.sh:
--------------------------------------------------------------------------------
1 | 
2 | 
3 | # A script to initially process google data. This significantly speeds up everything later.
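# Directory layout assumed by the loop below (DIR is the author's path; adjust it
# for your machine):
#   $DIR/$L/$N/*            -- raw gzipped google files for language $L, ngram order $N
#   $DIR/Processed/$L/$N/   -- output of process-google.py: processed-google.<year>, one file per year bin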
4 | 5 | DIR=/home/piantado/Corpus/GoogleBooks/ 6 | #for L in eng-us-all fre-all heb-all ita-all rus-all spa-all ger-all; do 7 | for L in eng-us-all fre-all heb-all; do 8 | for N in 1 2 3; do 9 | myDIR=$DIR/Processed/$L/$N 10 | 11 | mkdir $DIR/Processed/$L 12 | mkdir $myDIR 13 | 14 | pypy process-google.py --in=$DIR/$L/$N/* --out=$myDIR/processed-google --N=$N --quiet & ## TODO: IF YOU CHANGE N, CHANGE THE 15 | done 16 | done 17 | 18 | -------------------------------------------------------------------------------- /tests/ngrampy_tests.py: -------------------------------------------------------------------------------- 1 | import os, os.path 2 | import random 3 | import math 4 | 5 | from nose.tools import * 6 | 7 | from ngrampy.LineFile import LineFile 8 | from ngrampy.LineFileInMemory import LineFileInMemory 9 | 10 | try: 11 | os.mkdir("tests/tmp") 12 | except OSError: 13 | pass 14 | 15 | 16 | def test_basics(): 17 | G = LineFile("tests/smallcorpus.txt.bz2", header="foo bar baz qux".split(), 18 | path="tests/tmp/testcorpus") 19 | assert_equal(G.header, "foo bar baz qux".split()) 20 | assert_equal(G.files, ["tests/smallcorpus.txt.bz2"]) 21 | assert_equal(G.path, "tests/tmp/testcorpus") 22 | assert_equal(G.tmppath, "tests/tmp/testcorpus.tmp") 23 | assert_equal(os.path.isfile("tests/tmp/testcorpus"), True) 24 | 25 | G_copy = G.copy() 26 | copy_path = G_copy.path 27 | assert_not_equal(copy_path, G.path) 28 | 29 | G_copy.mv_tmp() 30 | assert_equal(os.path.isfile(G_copy.path + ".tmp"), True) 31 | #G_copy.delete_tmp() 32 | #assert_equal(os.path.isfile(G_copy.path + ".tmp"), False) 33 | 34 | G.make_column("quux", lambda x, y, z, w: "cat", "foo bar baz qux".split()) 35 | assert_equal(G.header, "foo bar baz qux quux".split()) 36 | for line in G.lines(parts=False): 37 | assert_equal(G.extract_columns(line, "quux"), ["cat"]) 38 | 39 | G.delete_columns("quux") 40 | assert_equal(G.header, "foo bar baz qux".split()) 41 | 42 | G.copy_column("quux", "qux") 43 | assert_equal(G.header, "foo bar baz qux quux".split()) 44 | for line in G.lines(parts=False): 45 | assert_equal(G.extract_columns(line, "qux"), 46 | G.extract_columns(line, "quux") 47 | ) 48 | 49 | G.delete() 50 | assert_equal(os.path.isfile("tests/tmp/testcorpus"), False) 51 | 52 | def test_clean(): 53 | G = LineFile("tests/smallcorpus-malformed.txt.bz2", header="foo bar baz qux".split(), 54 | path="tests/tmp/testcorpus") 55 | len_G = len(G) 56 | G.clean(columns=4, lower=False, alphanumeric=False, count_columns=True, 57 | nounderscores=False, echo_toss=True) 58 | assert_equal(len(G), len_G - 2) 59 | G.delete() 60 | 61 | G = LineFile("tests/smallcorpus-malformed.txt.bz2", header="foo bar baz qux".split(), 62 | path="tests/tmp/testcorpus") 63 | G.clean(lower=True, alphanumeric=True, count_columns=False, echo_toss=True) 64 | assert_equal(len(G), 8562) 65 | G.delete() 66 | 67 | G = LineFile("tests/smallcorpus-malformed.txt.bz2", header="foo bar baz qux".split(), 68 | path="tests/tmp/testcorpus") 69 | G.clean(lower=True, alphanumeric=True, count_columns=False, echo_toss=True, 70 | filter_fn=lambda x: False) 71 | assert_equal(len(G), 0) 72 | G.delete() 73 | 74 | G = LineFile("tests/smallcorpus-malformed.txt.bz2", header="foo bar baz qux".split(), 75 | path="tests/tmp/testcorpus") 76 | G.clean(lower=True, alphanumeric=False, count_columns=False, echo_toss=True, 77 | modifier_fn=lambda x: "hello") 78 | assert_equal(len(G), len_G) 79 | for line in G.lines(): 80 | assert_equal(line, "hello") 81 | G.delete() 82 | 83 | 84 | def test_clean_lazy(): 85 | G = 
LineFile("tests/smallcorpus-malformed.txt.bz2", header="foo bar baz qux".split(), 86 | path="tests/tmp/testcorpus") 87 | len_G = len(G) 88 | G.clean(columns=4, lower=False, alphanumeric=False, count_columns=True, 89 | nounderscores=False, echo_toss=True, lazy=True) 90 | assert_equal(len(G), len_G - 2) 91 | G.delete() 92 | 93 | G = LineFile("tests/smallcorpus-malformed.txt.bz2", header="foo bar baz qux".split(), 94 | path="tests/tmp/testcorpus") 95 | G.clean(lower=True, alphanumeric=True, count_columns=False, echo_toss=True, lazy=True) 96 | assert_equal(len(G), 8562) 97 | G.delete() 98 | 99 | G = LineFile("tests/smallcorpus-malformed.txt.bz2", header="foo bar baz qux".split(), 100 | path="tests/tmp/testcorpus") 101 | G.clean(lower=True, alphanumeric=True, count_columns=False, echo_toss=True, 102 | filter_fn=lambda x: False, lazy=True) 103 | assert_equal(len(G), 0) 104 | G.delete() 105 | 106 | G = LineFile("tests/smallcorpus-malformed.txt.bz2", header="foo bar baz qux".split(), 107 | path="tests/tmp/testcorpus") 108 | G.clean(lower=True, alphanumeric=False, count_columns=False, echo_toss=True, 109 | modifier_fn=lambda x: "hello", lazy=True) 110 | for line in G.lines(parts=False): 111 | assert_equal(line, "hello") 112 | G.delete() 113 | 114 | def test_resum_equal(): 115 | G = LineFile("tests/smallcorpus.txt.bz2", header="foo bar baz qux".split(), 116 | path="tests/tmp/testcorpus") 117 | len_G = len(G) 118 | total = G.sum_column("qux") 119 | G.resum_equal("foo", "qux", assert_sorted=True, keep_all=False) 120 | assert_equal(len(G), 1) 121 | for line in G.lines(): 122 | assert_equal(int(G.extract_columns(line, "qux")[0]), total) 123 | G.delete() 124 | 125 | G = LineFile("tests/smallcorpus.txt.bz2", header="foo bar baz qux".split(), 126 | path="tests/tmp/testcorpus") 127 | G.resum_equal("foo", "qux", assert_sorted=True, keep_all=True) 128 | assert_equal(len(G), len_G) 129 | for line in G.lines(): 130 | assert_equal(int(G.extract_columns(line, "qux")[0]), total) 131 | G.delete() 132 | 133 | def test_resum_equal_lazy(): 134 | G = LineFile("tests/smallcorpus.txt.bz2", header="foo bar baz qux".split(), 135 | path="tests/tmp/testcorpus") 136 | len_G = len(G) 137 | total = G.sum_column("qux") 138 | G.resum_equal("foo", "qux", assert_sorted=True, keep_all=False, lazy=True) 139 | for line in G.lines(): 140 | assert_equal(int(G.extract_columns(line, "qux")[0]), total) 141 | G.delete() 142 | 143 | G = LineFile("tests/smallcorpus.txt.bz2", header="foo bar baz qux".split(), 144 | path="tests/tmp/testcorpus") 145 | G.resum_equal("foo", "qux", assert_sorted=True, keep_all=True, lazy=True) 146 | for line in G.lines(): 147 | assert_equal(int(G.extract_columns(line, "qux")[0]), total) 148 | G.delete() 149 | 150 | 151 | """ 152 | def test_avg_surprisal(): 153 | G = LineFile("tests/smallcorpus.txt.bz2", header="foo bar baz qux", 154 | path="tests/tmp/testcorpus") 155 | G.make_marginal_column("quux", "foo bar", "qux") 156 | G.sort("baz") 157 | for line in G.average_surprisal("baz", "qux", "quux", assert_sorted=True): 158 | # TODO this test 159 | pass 160 | 161 | G.delete() 162 | """ 163 | 164 | def test_unicode(): 165 | """ test unicode 166 | 167 | replace every word in the test corpus with random unicode 168 | and see if we get the same surprisal scores. 
169 | 170 | """ 171 | def generate_random_unicode(): 172 | for _ in xrange(5): 173 | yield unichr(random.choice((0x300, 0x9999)) + random.randint(0, 0xff)) 174 | 175 | scramblemap = {} 176 | 177 | G = LineFile("tests/smallcorpus.txt.bz2", header="foo bar baz qux".split(), 178 | path="tests/tmp/testcorpus") 179 | G.clean(lower=True, alphanumeric=False, count_columns=False, echo_toss=True, lazy=True) 180 | G.make_marginal_column("quux", "foo bar".split(), "qux", lazy=True) 181 | G.sort("baz") 182 | len_G = len(G) 183 | sum_counts = G.sum_column("quux") 184 | sum_surprisal = math.fsum(line[2] for line in G.average_surprisal("baz", "qux", "quux", assert_sorted=True)) 185 | G.delete() 186 | 187 | G = LineFile("tests/smallcorpus.txt.bz2", header="foo bar baz qux".split(), 188 | path="tests/tmp/testcorpus") 189 | 190 | def scramble(line): 191 | words = line.split()[:3] 192 | count = line.split()[-1] 193 | for i, word in enumerate(words): 194 | if word in scramblemap: 195 | words[i] = scramblemap[word] 196 | else: 197 | garbage = u"".join(generate_random_unicode()) 198 | words[i] = garbage 199 | scramblemap[word] = garbage 200 | 201 | return "\t".join(words + [count]) 202 | 203 | G.clean(lower=True, alphanumeric=False, count_columns=False, echo_toss=True, 204 | modifier_fn=scramble) 205 | G.make_marginal_column("quux", "foo bar".split(), "qux") 206 | G.sort("baz") 207 | sum_counts_scrambled = G.sum_column("quux") 208 | assert_equal(sum_counts, sum_counts_scrambled) 209 | assert_equal(len_G, len(G)) 210 | sum_surprisal_scrambled = math.fsum(line[2] for line in G.average_surprisal("baz", "qux", "quux", assert_sorted=True)) 211 | G.delete() 212 | 213 | assert_equal(sum_surprisal, sum_surprisal_scrambled) 214 | 215 | def test_basics_in_memory(): 216 | G = LineFileInMemory("tests/smallcorpus.txt.bz2", header="foo bar baz qux", 217 | path="tests/tmp/testcorpus") 218 | assert_equal(G.header, "foo bar baz qux".split()) 219 | assert_equal(G.files, ["tests/smallcorpus.txt.bz2"]) 220 | 221 | G.make_column("quux", lambda x, y, z, w: "cat", "foo bar baz qux") 222 | assert_equal(G.header, "foo bar baz qux quux".split()) 223 | for line in G.lines(parts=False): 224 | assert_equal(G.extract_columns(line, "quux"), ["cat"]) 225 | 226 | G.delete_columns("quux") 227 | assert_equal(G.header, "foo bar baz qux".split()) 228 | 229 | G.copy_column("quux", "qux") 230 | assert_equal(G.header, "foo bar baz qux quux".split()) 231 | for line in G.lines(parts=False): 232 | assert_equal(G.extract_columns(line, "qux"), 233 | G.extract_columns(line, "quux") 234 | ) 235 | 236 | def test_clean_in_memory(): 237 | G = LineFileInMemory("tests/smallcorpus-malformed.txt.bz2", header="foo bar baz qux".split(), 238 | path="tests/tmp/testcorpus") 239 | len_G = len(G) 240 | G.clean(columns=4, lower=False, alphanumeric=False, count_columns=True, 241 | nounderscores=False, echo_toss=True) 242 | assert_equal(len(G), len_G - 2) 243 | 244 | G = LineFileInMemory("tests/smallcorpus-malformed.txt.bz2", header="foo bar baz qux".split(), 245 | path="tests/tmp/testcorpus") 246 | G.clean(lower=True, alphanumeric=True, count_columns=False, echo_toss=True) 247 | assert_equal(len(G), 8562) 248 | 249 | G = LineFileInMemory("tests/smallcorpus-malformed.txt.bz2", header="foo bar baz qux".split(), 250 | path="tests/tmp/testcorpus") 251 | G.clean(lower=True, alphanumeric=True, count_columns=False, echo_toss=True, 252 | filter_fn=lambda x: False) 253 | assert_equal(len(G), 0) 254 | 255 | G = LineFileInMemory("tests/smallcorpus-malformed.txt.bz2", header="foo bar 
baz qux".split(), 256 | path="tests/tmp/testcorpus") 257 | G.clean(lower=True, alphanumeric=False, count_columns=False, echo_toss=True, 258 | modifier_fn=lambda x: "hello") 259 | assert_equal(len(G), len_G) 260 | for line in G.lines(parts=False): 261 | assert_equal(line, "hello") 262 | 263 | def test_resum_equal_in_memory(): 264 | G = LineFileInMemory("tests/smallcorpus.txt.bz2", header="foo bar baz qux".split(), 265 | path="tests/tmp/testcorpus") 266 | len_G = len(G) 267 | total = G.sum_column("qux") 268 | G.resum_equal("foo", "qux", assert_sorted=True, keep_all=False) 269 | assert_equal(len(G), 1) 270 | for line in G.lines(): 271 | assert_equal(int(G.extract_columns(line, "qux")[0]), total) 272 | 273 | G = LineFileInMemory("tests/smallcorpus.txt.bz2", header="foo bar baz qux".split(), 274 | path="tests/tmp/testcorpus") 275 | G.resum_equal("foo", "qux", assert_sorted=True, keep_all=True) 276 | assert_equal(len(G), len_G) 277 | for line in G.lines(): 278 | assert_equal(int(G.extract_columns(line, "qux")[0]), total) 279 | 280 | def test_unicode_in_memory(): 281 | def generate_random_unicode(): 282 | for _ in xrange(5): 283 | yield unichr(random.choice((0x300, 0x9999)) + random.randint(0, 0xff)) 284 | 285 | scramblemap = {} 286 | 287 | G = LineFileInMemory("tests/smallcorpus.txt.bz2", header="foo bar baz qux".split(), 288 | path="tests/tmp/testcorpus") 289 | G.clean(lower=True, alphanumeric=False, count_columns=False, echo_toss=True, lazy=True) 290 | G.make_marginal_column("quux", "foo bar".split(), "qux", lazy=True) 291 | G.sort("baz") 292 | len_G = len(G) 293 | sum_counts = G.sum_column("quux") 294 | sum_surprisal = math.fsum(line[2] for line in G.average_surprisal("baz", "qux", "quux", assert_sorted=True)) 295 | 296 | 297 | G = LineFileInMemory("tests/smallcorpus.txt.bz2", header="foo bar baz qux".split(), 298 | path="tests/tmp/testcorpus") 299 | 300 | def scramble(line): 301 | words = line.split()[:3] 302 | count = line.split()[-1] 303 | for i, word in enumerate(words): 304 | if word in scramblemap: 305 | words[i] = scramblemap[word] 306 | else: 307 | garbage = u"".join(generate_random_unicode()) 308 | words[i] = garbage 309 | scramblemap[word] = garbage 310 | 311 | return "\t".join(words + [count]) 312 | 313 | G.clean(lower=True, alphanumeric=False, count_columns=False, echo_toss=True, 314 | modifier_fn=scramble, lazy=True) 315 | G.make_marginal_column("quux", "foo bar".split(), "qux", lazy=True) 316 | G.sort("baz") 317 | sum_counts_scrambled = G.sum_column("quux") 318 | assert_equal(sum_counts, sum_counts_scrambled) 319 | assert_equal(len_G, len(G)) 320 | sum_surprisal_scrambled = math.fsum(line[2] for line in G.average_surprisal("baz", "qux", "quux", assert_sorted=True)) 321 | 322 | assert_equal(sum_surprisal, sum_surprisal_scrambled) 323 | 324 | -------------------------------------------------------------------------------- /tests/smallcorpus-malformed.txt.bz2: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/piantado/ngrampy/792b25e3293f06ac9561a3c02bfaad22d6149d9a/tests/smallcorpus-malformed.txt.bz2 -------------------------------------------------------------------------------- /tests/smallcorpus.txt.bz2: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/piantado/ngrampy/792b25e3293f06ac9561a3c02bfaad22d6149d9a/tests/smallcorpus.txt.bz2 --------------------------------------------------------------------------------