├── .gitignore ├── Makefile ├── README.txt ├── data ├── cmudict │ ├── 00README_FIRST.txt │ ├── cmudict.0.7a.gz │ └── cmudict.0.7a.phones └── corpus │ ├── big.txt.bz2 │ └── google-ngrams │ ├── .gitignore │ ├── Makefile │ ├── extract.py │ ├── fetch.py │ ├── import2bin-ngram.c │ ├── import2bin-word.py │ ├── ngram3bin-compact.c │ ├── ngram3bin.c │ ├── ngram3bin.h │ ├── ngram3binpy.c │ ├── scratch │ ├── benchmark-str-to-id.py │ └── debug-multiprocessing-dict.py │ ├── setup.py │ └── testbin.py ├── doc ├── algorithm.txt └── things-that-can-go-wrong-language-wise.txt ├── src ├── algo.py ├── chick.py ├── corpus.py ├── doc.py ├── gram.py ├── grambin.py ├── ngramdiff.py ├── phon.py ├── test.py ├── util.py ├── web │ ├── .gitignore │ ├── code.py │ ├── conf │ │ └── apache2.conf.add │ ├── static │ │ ├── img │ │ │ ├── chick.png │ │ │ ├── chick16.png │ │ │ ├── chick16.png.ico │ │ │ └── chick32.png │ │ └── js │ │ │ └── spill-chick.js │ └── templates │ │ ├── base.html │ │ └── check.html └── word.py └── test ├── Ode-To-My-Spell-Checker-correct.txt ├── Ode-To-My-Spell-Checker-original.txt ├── Spill-Chick-Yore-Dock-You-Mints-correct.txt ├── Spill-Chick-Yore-Dock-You-Mints-original.txt └── test.txt /.gitignore: -------------------------------------------------------------------------------- 1 | *.swp 2 | *.pyc 3 | src/test.prof 4 | data/corpus/*.gz 5 | data/cmudict/cmudict.0.7a 6 | misc/ 7 | src/scratch 8 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | all: data 2 | 3 | data: ngrams 4 | 5 | ngrams: 6 | $(MAKE) -C data/corpus/google-ngrams 7 | -------------------------------------------------------------------------------- /README.txt: -------------------------------------------------------------------------------- 1 | 2 | Author: Ryan Flynn 3 | 4 | spill-chick is a context-sensitive language checker designed to 5 | correct spelling and grammar errors which pass existing checkers. 6 | 7 | There are all sorts of typing errors one can make. 8 | 9 | transcription error .................... speling is hard 10 | transposition error .................... causal Friday 11 | homophone error ........................ peace of crap 12 | grammatical error ...................... your right! 13 | word merging/splitting ................. always miss spelling stuff 14 | botched idioms ......................... for all intensive purposes 15 | word omission .......................... oops, I the word 16 | word duplication ....................... and it does does also 17 | inconsistency of proper nouns .......... Julius Seizure 18 | 19 | It is inspired by 'Ode To My Spell Checker', which contains no spelling 20 | errors, is perfectly readable and yet is very incorrect. It begins: 21 | 22 | Eye halve a spelling chequer 23 | It came with my pea sea 24 | It plainly marques four my revue 25 | Miss steaks eye kin knot sea. 26 | 27 | Progress: 28 | I have a spelling checker 29 | It came with my pc 30 | It plainly marks for my review 31 | Mistakes i did not see. 
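The underlying idea is to score candidate phrases against 3-gram
frequencies rather than a dictionary alone. A rough sketch, using the
ngram3bin module built under data/corpus/google-ngrams (see testbin.py
for real usage; the phrases here are only illustrative):

    from ngram3bin import ngram3bin
    ng = ngram3bin('word.bin', 'ngram3.bin')
    ng.freq(*[ng.word2id(w) for w in ['peace', 'of', 'crap']])  # rare
    ng.freq(*[ng.word2id(w) for w in ['piece', 'of', 'crap']])  # far more common

Both phrases pass an ordinary spellchecker; only the frequency gap exposes
the homophone error.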
32 | 33 | -------------------------------------------------------------------------------- /data/cmudict/00README_FIRST.txt: -------------------------------------------------------------------------------- 1 | 2 | CMUdict 3 | ------- 4 | 5 | CMUdict (the Carnegie Mellon Pronouncing Dictionary) is a free 6 | pronouncing dictionary of English, suitable for uses in speech 7 | technology and is maintained by the Speech Group in the School of 8 | Computer Science at Carnegie Mellon University. 9 | 10 | The Carnegie Mellon Speech Group does not guarantee the accuracy of 11 | this dictionary, nor its suitability for any specific purpose. In 12 | fact, we expect a number of errors, omissions and inconsistencies to 13 | remain in the dictionary. We intend to continually update the 14 | dictionary by correction existing entries and by adding new ones. From 15 | time to time a new major version will be released. 16 | 17 | We welcome input from users: Please send email to Alex Rudnicky 18 | (air+cmudict@cs.cmu.edu). 19 | 20 | The Carnegie Mellon Pronouncing Dictionary, in its current and 21 | previous versions is Copyright (C) 1993-2008 by Carnegie Mellon 22 | University. Use of this dictionary for any research or commercial 23 | purpose is completely unrestricted. If you make use of or 24 | redistribute this material we request that you acknowledge its 25 | origin in your descriptions. 26 | 27 | If you add words to or correct words in your version of this 28 | dictionary, we would appreciate it if you could send these additions 29 | and corrections to us (air+cmudict@cs.cmu.edu) for consideration in a 30 | subsequent version. All submissions will be reviewed and approved by 31 | the current maintainer, Alex Rudnicky at Carnegie Mellon. 32 | 33 | ------------------------------------------------------------------ 34 | The current version of cmudict is cmudict.0.7a 35 | [First released October 29, 2007] 36 | 37 | -------------------------------------------------------------------------------- /data/cmudict/cmudict.0.7a.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rflynn/spill-chick/430257c25369908f243a08d33caa268e8e398aeb/data/cmudict/cmudict.0.7a.gz -------------------------------------------------------------------------------- /data/cmudict/cmudict.0.7a.phones: -------------------------------------------------------------------------------- 1 | AA 2 | AA0 3 | AA1 4 | AA2 5 | AE 6 | AE0 7 | AE1 8 | AE2 9 | AH 10 | AH0 11 | AH1 12 | AH2 13 | AO 14 | AO0 15 | AO1 16 | AO2 17 | AW 18 | AW0 19 | AW1 20 | AW2 21 | AY 22 | AY0 23 | AY1 24 | AY2 25 | B 26 | CH 27 | D 28 | DH 29 | EH 30 | EH0 31 | EH1 32 | EH2 33 | ER 34 | ER0 35 | ER1 36 | ER2 37 | EY 38 | EY0 39 | EY1 40 | EY2 41 | F 42 | G 43 | HH 44 | IH 45 | IH0 46 | IH1 47 | IH2 48 | IY 49 | IY0 50 | IY1 51 | IY2 52 | JH 53 | K 54 | L 55 | M 56 | N 57 | NG 58 | OW 59 | OW0 60 | OW1 61 | OW2 62 | OY 63 | OY0 64 | OY1 65 | OY2 66 | P 67 | R 68 | S 69 | SH 70 | T 71 | TH 72 | UH 73 | UH0 74 | UH1 75 | UH2 76 | UW 77 | UW0 78 | UW1 79 | UW2 80 | V 81 | W 82 | Y 83 | Z 84 | ZH 85 | -------------------------------------------------------------------------------- /data/corpus/big.txt.bz2: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rflynn/spill-chick/430257c25369908f243a08d33caa268e8e398aeb/data/corpus/big.txt.bz2 -------------------------------------------------------------------------------- 
/data/corpus/google-ngrams/.gitignore: -------------------------------------------------------------------------------- 1 | *.list.gz 2 | *.csv.zip 3 | *.csv.gz 4 | *.csv 5 | *-2008.ids.gz 6 | *.bin 7 | build/* 8 | *.o 9 | *.s 10 | ngram3.bin.* 11 | ngram3bin 12 | ngram3bin-compact 13 | import2bin-ngram 14 | cscope.out 15 | -------------------------------------------------------------------------------- /data/corpus/google-ngrams/Makefile: -------------------------------------------------------------------------------- 1 | CP = cp 2 | 3 | build: ngram3bin.h ngram3bin.c ngram3binpy.c build-py build-py3 4 | 5 | build-py: 6 | python setup.py build 7 | sudo python setup.py install 8 | 9 | build-py3: 10 | python3 setup.py build 11 | sudo python3 setup.py install 12 | 13 | # googlebooks-eng-all-3gram-20090715-#.csv.zip 14 | # -> fetch -> *-2008-list.gz (word,word,word,freq) 15 | # -> extract -> *-2008.ids.gz (id,id,id,freq) 16 | # -> word.csv.gz (wid,word) 17 | # -> import2bin-word.py -> word.bin (id,word utf8 binary padded) 18 | # -> import2bin-ngram -> ngram3.bin (id,id,id,freq binary) 19 | # -> ngram3bin-compact -> ngram3.bin.sort 20 | data: import2bin-ngram ngram3bin-compact 21 | ./fetch.py --run 22 | ./extract.py 23 | ./import2bin-word.py 24 | $(RM) ngram3.bin 25 | gzip -dc *.ids.gz | ./import2bin-ngram > ngram3.bin 26 | ./ngram3bin-compact 27 | $(RM) ngram3.bin 28 | ln -s ngram3.bin.sort ngram3.bin 29 | 30 | all: ngram3bin 31 | ngram3bin: ngram3bin.o 32 | import2bin-ngram: import2bin-ngram.o 33 | ngram3bin-compact: ngram3bin-compact.o ngram3bin.o 34 | -------------------------------------------------------------------------------- /data/corpus/google-ngrams/extract.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """ 4 | once fetch.py has grabbed a set of ngrams, I parse out a subset and generate CSV. 5 | given our lists of 'x y z\tcnt', extract, parse and dump 6 | """ 7 | 8 | import os, re, sys 9 | from glob import glob 10 | from time import time 11 | import multiprocessing as mp 12 | import Queue 13 | 14 | """ 15 | filter n-grams with a freq < MinFreq. range should be somewhere between 20 and 100. 16 | 17 | we do this to substantially reduce the number of n-grams considered by our program, 18 | and improve the quality of our results. by definition we are trying to reduce the 19 | document entropy, and need only consider n-grams with a certain frequency. 20 | 21 | the population of n-grams is ~inversely proportional to its frequency. 22 | approximately 1/2 have freq <= 2, 1/4 have freq >2 and <=4 etc. 23 | 24 | by filtering we eliminate ~90% 25 | 26 | we must maintain a balance between accepting garbage typos that appear a few times 27 | globally and glossing over legitimate but infrequent phrases. 
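rough arithmetic under that assumption: freq <= 16 already accounts for about
1/2 + 1/4 + 1/8 + 1/16 ~= 94% of distinct n-grams, so a cutoff of MinFreq = 20
keeps only a few percent of the entries while preserving the frequent phrases
we actually match against.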
28 | """ 29 | MinFreq = 20 30 | 31 | # one megabyte, 4*MB more clear than 4*1024*1024 or 4000000 32 | MB = 1024 ** 2 33 | 34 | Ids = {} 35 | Ids['UNKNOWN'] = 0 36 | Ids['$PROPERNOUN'] = 1 37 | Ids['$NUMBER'] = 2 38 | 39 | # translate each unique token into a unique numeric id 40 | # must be thread-safe on write 41 | def tokid(key): 42 | global Ids 43 | if key not in Ids: 44 | # create a new id, must be unique per key and linear 45 | Ids[key] = len(Ids) 46 | return Ids[key] 47 | 48 | # gunzip 'filename', translate string tokens into ids and gzip write to 'dst' 49 | def extractfile(nth, total, filename, dst, ids): 50 | global Ids 51 | start = time() 52 | with os.popen('gunzip -dc ' + filename, 'r') as gunzip: 53 | contents = '\n' + gunzip.read().lower() 54 | with os.popen('gzip -c - > ' + dst, 'wb', 4*MB) as gz: 55 | #for x,y,z,cnt in re.findall('\n([^\d\W]+) ([^\d\W]+) ([^\d\W]+)\t(\d+)', contents): 56 | #for m in re.finditer('\n([^ ]+) ([^ ]+) ([^ ]+)\t(\d+)', contents): 57 | # regexes are more expensive than string splitting but allow us a finer control over 58 | # what we accept which means we can reasonably skip exception setup. 59 | # turns out not setting up an exception for each of 200M lines shaves ~2/3x of our time(!) 60 | # include periods and apostrophes 61 | for m in re.finditer('\n([\w\']+) ([\w\']+) ([\w\']+)\t(\d+)',contents): 62 | x,y,z,cnt = m.groups() 63 | cnt = int(cnt) 64 | if cnt >= MinFreq: 65 | gz.write('%u,%u,%u,%u\n' % \ 66 | (tokid(x), tokid(y), tokid(z), cnt)) 67 | print '%3u/%3u %s (%.1f sec) ids:%u' % (nth, total, dst, time() - start, len(Ids)) 68 | 69 | # pulls filenames out of the queue and hand parameters off 70 | # when we run out of items to process we timeout and return 71 | def worker(q, ids): 72 | while True: 73 | try: 74 | nth,total,filename,dst = q.get(timeout=1) 75 | extractfile(nth,total,filename,dst, ids) 76 | except Queue.Empty: 77 | break 78 | 79 | Q = mp.Queue() 80 | 81 | # build queue of files to process 82 | filenames = sorted(glob('*-2008.list.gz')) 83 | total = len(filenames) 84 | for nth,filename in enumerate(filenames): 85 | dst = str.replace(filename,'list.gz','ids.gz') 86 | if os.path.exists(dst): 87 | continue 88 | Q.put((nth+1,total,filename,dst)) 89 | 90 | if Q.qsize(): 91 | print 'Queued %u files.' % (Q.qsize(),) 92 | 93 | # multiprocessing is great, except with 2 CPUs the overhead from manager 94 | # overcomes the benefit of keeping both CPUs busy, bummer. with 4+ CPUs it 95 | # might be a different story, I don't know. 96 | # for now, single CPU with dict() is the fastest 97 | DoMP = False 98 | if DoMP: 99 | manager = mp.Manager() 100 | Ids = mp.dict() 101 | Ids['UNKNOWN'] = 0 102 | Ids['$PROPERNOUN'] = 1 103 | Ids['$NUMBER'] = 2 104 | 105 | # create workers, run, wait for completion 106 | W = [ mp.Process(target=worker, args=(Q, Ids)) 107 | for _ in range(mp.cpu_count()) ] 108 | for w in W: w.start() 109 | for w in W: w.join() 110 | else: 111 | worker(Q, None) 112 | 113 | print 'len(Ids)=', len(Ids) 114 | assert len(Ids) > 3 115 | 116 | with os.popen('gzip -c - > word.csv.gz', 'wb') as gz: 117 | Ids = sorted(Ids.items(), key=lambda x:x[1]) 118 | for word,wid in Ids: 119 | gz.write('%s,%s\n' % (wid, word)) 120 | 121 | -------------------------------------------------------------------------------- /data/corpus/google-ngrams/fetch.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """ 4 | fetch Google Books' 3-ary ngrams. 
5 | run me, then extract.py 6 | enumerate, download, extract, filter and delete files 7 | """ 8 | 9 | """ 10 | Traceback (most recent call last): 11 | File "./fetch.py", line 70, in 12 | download(url, dst) 13 | File "./fetch.py", line 22, in download 14 | chunk = req.read(CHUNK) 15 | File "/usr/lib/python2.6/socket.py", line 353, in read 16 | data = self._sock.recv(left) 17 | File "/usr/lib/python2.6/httplib.py", line 538, in read 18 | s = self.fp.read(amt) 19 | File "/usr/lib/python2.6/socket.py", line 353, in read 20 | data = self._sock.recv(left) 21 | socket.error: [Errno 104] Connection reset by peer 22 | make: *** [data] Error 1 23 | """ 24 | 25 | import datetime 26 | 27 | def log(what, msg): 28 | print('%s %s %s' % (datetime.datetime.now(), what, msg)) 29 | 30 | import urllib2 31 | 32 | def download(url, dst): 33 | log(dst, 'download') 34 | CHUNK = 2 * 1024 * 1024 35 | while True: 36 | try: 37 | req = urllib2.urlopen(url) 38 | with open(dst, 'wb') as fp: 39 | while 1: 40 | chunk = req.read(CHUNK) 41 | if not chunk: break 42 | fp.write(chunk) 43 | break 44 | except socket.error: 45 | log(dst, 'error, continuing...') 46 | continue 47 | return dst 48 | 49 | import re, collections 50 | import os 51 | 52 | def extract(filename, gzfile): 53 | log(filename, 'extract') 54 | CHUNK = 8 * 1024 * 1024 55 | with os.popen('unzip -p ' + filename) as fd: 56 | d = {} 57 | while 1: 58 | txt = fd.read(CHUNK) 59 | if not txt: break 60 | # ! ! Along 2008 4 61 | d.update(re.findall('\n([^\t]+)\t2008\t(\d+)', txt)) 62 | with os.popen('gzip -c - > ' + gzfile, 'wb') as out: 63 | for k in sorted(d.keys()): 64 | out.write('%s\t%s\n' % (k, d[k])) 65 | 66 | def delete(filename): 67 | try: 68 | if os.path.exists(filename): 69 | log(filename, 'delete') 70 | os.remove(filename) 71 | except: 72 | pass 73 | 74 | def urls(): 75 | for n in range(0, 200): 76 | yield 'http://commondatastorage.googleapis.com/books/ngrams/books/googlebooks-eng-all-3gram-20090715-' + str(n) + '.csv.zip' 77 | 78 | def url2filename(url): 79 | return url[url.rfind('/')+1:] 80 | 81 | def filename2gz(filename): 82 | return filename + '-2008.list.gz' 83 | 84 | if __name__ == '__main__': 85 | import sys 86 | if len(sys.argv) > 1 and sys.argv[1] == '--run': 87 | for url in urls(): 88 | try: 89 | dst = url2filename(url) 90 | dstgz = filename2gz(dst) 91 | if not os.path.exists(dstgz): 92 | if not os.path.exists(dst): 93 | download(url, dst) 94 | extract(dst, dstgz) 95 | except urllib2.HTTPError, e: 96 | print(e.reason) 97 | print('continuing...') 98 | finally: 99 | delete(dst) # delete either partial and or complete 100 | 101 | -------------------------------------------------------------------------------- /data/corpus/google-ngrams/import2bin-ngram.c: -------------------------------------------------------------------------------- 1 | // ex: set ts=8 noet: 2 | 3 | // Convert text-based CSV format "x,y,z,freq" to packed little-endian binary format 4 | // 5 | // Usage: gzip -dc *.ids.gz | ./import2bin-ngram > ngram3.bin.orig.c 6 | // 7 | // Port from import2bin.py; it was just too slow. We're >10x faster. 
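//
// A minimal reader sketch (not part of this tool, shown only to document the
// format): assuming the same packed 16-byte record defined below, the output
// can be read back with
//
//   ngram3 ng;
//   while (fread(&ng, sizeof ng, 1, fin) == 1)
//       printf("%u,%u,%u,%u\n", ng.id[0], ng.id[1], ng.id[2], ng.freq);
//
// Records are written in the host's byte order, so the .bin files are not
// portable between machines of different endianness.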
8 | 9 | #include 10 | #include 11 | #include 12 | #include 13 | #include 14 | #include 15 | #include 16 | #include 17 | 18 | typedef struct { 19 | #pragma pack(push, 1) 20 | uint32_t id[3], 21 | freq; 22 | #pragma pack(pop) 23 | } ngram3; 24 | 25 | // "id0,id1,id2,freq" -> ngram3 26 | int line2ng(const wchar_t *line, ngram3 *ng) 27 | { 28 | return swscanf(line, 29 | L"%" SCNu32 ",%" SCNu32 ",%" SCNu32 ",%" SCNu32 "\n", 30 | ng->id+0, ng->id+1, ng->id+2, &ng->freq) == 4; 31 | } 32 | 33 | // stdin -> [ngram3(...),...] 34 | int main(void) 35 | { 36 | // we're going to be writing out 100s of MB in a batch; use a large buffer 37 | # define BUFLEN 32 * 1024 * 1024L 38 | static wchar_t line[1024]; 39 | char *buf = malloc(BUFLEN); 40 | ngram3 ng; 41 | 42 | assert(sizeof ng == 16 && "ensure packing"); 43 | 44 | if (!setlocale(LC_CTYPE, "")) 45 | { 46 | fprintf(stderr, "Can't set the specified locale! Check LANG, LC_CTYPE, LC_ALL.\n"); 47 | return 1; 48 | } 49 | 50 | // fully buffer stdout 51 | setvbuf(stdout, buf, _IOFBF, BUFLEN); 52 | 53 | // parse lines from stdin, write packed binary ngram to stdout, errors to stderr 54 | while (fgetws(line, sizeof line / sizeof line[0], stdin)) 55 | { 56 | if (line2ng(line, &ng)) 57 | { 58 | fwrite(&ng, sizeof ng, 1, stdout); 59 | } 60 | else 61 | { 62 | fprintf(stderr, "invalid line '%ls'\n", line); 63 | } 64 | } 65 | 66 | free(buf); 67 | 68 | return 0; 69 | } 70 | 71 | -------------------------------------------------------------------------------- /data/corpus/google-ngrams/import2bin-word.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import os 4 | 5 | """ 6 | extract.py generated: 7 | word.csv.gz: master word file (word id,word) 8 | *-2008.ids.gz: files of 3-ary ngrams (id0,id1,id2,freq) 9 | 10 | we take word.csv.gz, which is already in sorted order by id ascending, 11 | and compact the words into binary format and write to word.bin 12 | """ 13 | 14 | from struct import pack,unpack 15 | with os.popen('gzip -dc word.csv.gz', 'r') as gz: 16 | with open('word.bin', 'wb') as bin: 17 | for line in gz: 18 | wid,word = line.rstrip().split(',', 1) 19 | wid = int(wid) 20 | bword = bytes(word, 'utf-8') 21 | wlen = len(bword) 22 | # pad bword with enough \0 to make next string start with alignment=4 23 | bword += b'\0' * (1 + ((len(bword)+1) % 4)) 24 | """ 25 | write [uint32_t len][word ... \0\0?\0?\0?] 26 | we use fields that are multiples of 4 bytes to keep the &word[0] 32-bit aligned 27 | which improves read performance 28 | """ 29 | bin.write(pack(' 4 | * 5 | * google's data has duplicate ngrams(!) 
6 | * sort our ngram.bin file's entries, then merge/sum 7 | * 8 | */ 9 | 10 | #include 11 | #include 12 | #include 13 | #include 14 | #include 15 | #include 16 | #include 17 | #include 18 | #include 19 | #include "ngram3bin.h" 20 | 21 | static void sortfile(const struct ngram3map *m) 22 | { 23 | size_t nmemb = m->size / sizeof(ngram3); 24 | printf("%s:%u qsort(%p, %zu, %zu, %p);\n", 25 | __func__, __LINE__, (void*)m->m, nmemb, sizeof(ngram3), (void*)ngram3cmp); 26 | qsort(m->m, nmemb, sizeof(ngram3), ngram3cmp); 27 | } 28 | 29 | /* 30 | * ngram3map.m is a big mmap array of ngram3 31 | * it's been sorted, we want to merge consecutive identical ids into a single one, summing the freq field 32 | */ 33 | static void mergefile(const struct ngram3map *m) 34 | { 35 | char *buf = malloc(1024 * 1024); 36 | ngram3 *rd = ngram3map_start(m); 37 | const ngram3 *end = ngram3map_end(m); 38 | unsigned long uniqcnt = 1; 39 | FILE *f = fopen("ngram3.bin.sort", "w"); 40 | ngram3 wr = *rd; 41 | perror("fopen"); 42 | rd++; 43 | setvbuf(f, buf, _IOFBF, 1024 * 1024); 44 | perror("setvbuf"); 45 | while (rd < end) 46 | { 47 | if (rd->id[0] == wr.id[0] && 48 | rd->id[1] == wr.id[1] && 49 | rd->id[2] == wr.id[2]) 50 | { 51 | wr.freq += rd->freq; 52 | } 53 | else 54 | { 55 | fwrite(&wr, sizeof wr, 1, f); 56 | wr = *rd; 57 | uniqcnt++; 58 | } 59 | rd++; 60 | } 61 | printf("%s:%u\n", __func__, __LINE__); 62 | 63 | printf("merged into %lu ngram3s...\n", uniqcnt); 64 | printf("saving...\n"); 65 | 66 | fclose(f); 67 | perror("fclose"); 68 | free(buf); 69 | } 70 | 71 | int main(void) 72 | { 73 | const char *path = "ngram3.bin"; 74 | struct ngram3map m = ngram3bin_init(path, 1); 75 | printf("map %llu bytes (%llu ngram3s)\n", m.size, m.size / sizeof(ngram3)); 76 | printf("sorting...\n"); 77 | sortfile(&m); 78 | printf("merging...\n"); 79 | mergefile(&m); 80 | printf("done.\n"); 81 | ngram3bin_fini(m); 82 | return 0; 83 | } 84 | 85 | -------------------------------------------------------------------------------- /data/corpus/google-ngrams/ngram3bin.c: -------------------------------------------------------------------------------- 1 | /* ex: set ts=8 noet: */ 2 | /* 3 | * Copyright 2011 Ryan Flynn 4 | * 5 | * our 3-ary ngrams are in binary format in ngram3.bin 6 | */ 7 | 8 | #include 9 | #include 10 | #include 11 | #include 12 | #include 13 | #include 14 | #include 15 | #include 16 | #include "ngram3bin.h" 17 | 18 | void ngram3bin_str(const struct ngram3map m, FILE *f) 19 | { 20 | fprintf(f, "ngram3map(size=%llu)", m.size); 21 | } 22 | 23 | struct ngram3map ngram3bin_init(const char *path, int write) 24 | { 25 | struct stat st; 26 | struct ngram3map m = { NULL, -1, 0 }; 27 | if (!stat(path, &st)) 28 | { 29 | //printf("stat(\"%s\") size=%llu\n", path, (unsigned long long)st.st_size); 30 | if (-1 != (m.fd = open(path, write ? O_RDWR : O_RDONLY))) 31 | { 32 | m.size = st.st_size; 33 | m.m = mmap(NULL, m.size, PROT_READ | (write ? PROT_WRITE : 0), MAP_SHARED, m.fd, 0); 34 | if (MAP_FAILED == m.m) 35 | { 36 | perror("mmap"); 37 | m.m = NULL; 38 | } 39 | } 40 | } 41 | return m; 42 | } 43 | 44 | // ng is 'cnt' items long; we need to ensure at least 1 more ngram3 in it 45 | ngram3 * ngram3_find_spacefor1more(ngram3 *ng, unsigned long cnt) 46 | { 47 | // allocate space for results on every power of 2 48 | // 0->1, 1->2, 2->4, 4->8, etc. 49 | if ((cnt & (cnt - 1)) == 0) 50 | { 51 | unsigned long alloc = cnt ? 
cnt * 2 : 1; 52 | ngram3 *tmp = realloc(ng, alloc * sizeof *ng); 53 | if (tmp) 54 | { 55 | ng = tmp; 56 | } 57 | else 58 | { 59 | free(ng); 60 | ng = NULL; 61 | } 62 | } 63 | return ng; 64 | } 65 | 66 | /* 67 | * map contains the mmap'ed contents of a dictionary file 68 | * the dictionary file is a list of variable-length entries in the form 69 | * [uint32_t id][uint32_t len][utf-8 encoded string of bytes length 'len'] 70 | */ 71 | struct ngramword ngramword_load(const struct ngram3map m) 72 | { 73 | ngramwordcursor *cursor = m.m; 74 | ngramwordcursor *end = (void *)((char *)m.m + m.size); 75 | unsigned long maxpossible = m.size / 6 + 1; 76 | struct ngramword w; 77 | w.word = calloc(maxpossible, sizeof *w.word); 78 | w.cnt = 0; 79 | while (cursor < end) 80 | { 81 | const char *str = ngramwordcursor_str(cursor); 82 | w.word[w.cnt].len = cursor->len; 83 | w.word[w.cnt].str = str; 84 | w.cnt++; 85 | cursor = ngramwordcursor_next(cursor); 86 | } 87 | w.word = realloc(w.word, w.cnt * sizeof *w.word); 88 | return w; 89 | } 90 | 91 | /* 92 | * FIXME: O(n) 93 | * note: this is mitigated by the python module by using a dict 94 | * if i added a stage at the beginning of this whole process and sorted words then we could 95 | * reduce this to O(log n) 96 | * we can also reduce the impact of this by converting all tokens in a document to their ids 97 | * once for the duration of the process; currently we're being lazy and repeatedly translating 98 | */ 99 | const unsigned long ngramword_word2id(const char *word, unsigned len, const struct ngramword w) 100 | { 101 | unsigned long id = 0; 102 | printf("ngramword_word2id(word=\"%s\", w={%lu,%p})\n", word, w.cnt, w.word); 103 | while (id < w.cnt) 104 | { 105 | if (w.word[id].len == len && 0 == memcmp(word, w.word[id].str, len)) 106 | break; 107 | id++; 108 | } 109 | if (id == w.cnt) 110 | id = 0; 111 | return id; 112 | } 113 | 114 | const char * ngramword_id2word(unsigned long id, const struct ngramword w) 115 | { 116 | if (id < w.cnt) 117 | return w.word[id].str; 118 | return NULL; 119 | } 120 | 121 | void ngramword_fini(struct ngramword w) 122 | { 123 | free(w.word); 124 | } 125 | 126 | /* 127 | * ngram3 comparison callback 128 | * ascending order 129 | */ 130 | int ngram3cmp(const void *va, const void *vb) 131 | { 132 | const ngram3 *a = va, 133 | *b = vb; 134 | if (a->id[0] != b->id[0]) return (int)(a->id[0] - b->id[0]); 135 | if (a->id[1] != b->id[1]) return (int)(a->id[1] - b->id[1]); 136 | if (a->id[2] != b->id[2]) return (int)(a->id[2] - b->id[2]); 137 | return 0; 138 | } 139 | 140 | /* 141 | * 142 | */ 143 | unsigned long ngram3bin_freq(ngram3 find, const struct ngram3map *m) 144 | { 145 | ngram3 *base = m->m; 146 | size_t nmemb = m->size / sizeof *base; 147 | const ngram3 *res = bsearch(&find, base, nmemb, sizeof *base, ngram3cmp); 148 | return res ? 
res->freq : 0; 149 | } 150 | 151 | /* 152 | * given find (x,y) sum the occurences of (x,y,_) and (_,x,y) 153 | */ 154 | unsigned long ngram3bin_freq2(ngram3 find, const struct ngram3map *m) 155 | { 156 | unsigned long freq = 0; 157 | ngram3 *cur = m->m; 158 | const ngram3 *end = (ngram3 *)((char *)cur + m->size); 159 | while (cur < end) 160 | { 161 | if (cur->id[0] == find.id[0] && 162 | cur->id[1] == find.id[1]) 163 | { 164 | freq += cur->freq; 165 | } 166 | else 167 | if (cur->id[1] == find.id[0] && 168 | cur->id[2] == find.id[1]) 169 | { 170 | freq += cur->freq; 171 | } 172 | cur++; 173 | } 174 | return freq; 175 | } 176 | 177 | 178 | /* 179 | * given an id 3-gram (x,y,z) and a list of ngram frequencies 180 | * return matches (_,y,z) or (x,_,z) or (x,y,_) 181 | */ 182 | ngram3 * ngram3bin_like(ngram3 find, const struct ngram3map *m) 183 | { 184 | unsigned long ngcnt = 0; 185 | ngram3 *cur = m->m; 186 | const ngram3 *end = (ngram3*)((char*)cur + m->size); 187 | ngram3 *res = NULL; 188 | while (cur < end) 189 | { 190 | if (((cur->id[0] == find.id[0]) + 191 | (cur->id[1] == find.id[1]) + 192 | (cur->id[2] == find.id[2])) == 2) 193 | { 194 | res = ngram3_find_spacefor1more(res, ngcnt); 195 | if (!res) 196 | break; 197 | res[ngcnt] = *cur; /* copy result */ 198 | ngcnt++; 199 | } 200 | cur++; 201 | } 202 | if (res) 203 | { 204 | if ((res = ngram3_find_spacefor1more(res, ngcnt))) 205 | res[ngcnt].freq = 0; // sentinel 206 | } 207 | return res; 208 | } 209 | 210 | static unsigned long ngram3bin_like_xy_(ngram3 find, const struct ngram3map *m, ngram3 **res, unsigned long rescnt); 211 | static unsigned long ngram3bin_like_x_z(ngram3 find, const struct ngram3map *m, ngram3 **res, unsigned long rescnt); 212 | static unsigned long ngram3bin_like__yz(ngram3 find, const struct ngram3map *m, ngram3 **res, unsigned long rescnt, 213 | ngram3bin_index *idx); 214 | 215 | /* 216 | * given an id 3-gram (x,y,z) and a list of ngram frequencies 217 | * return matches (_,y,z) or (x,_,z) or (x,y,_) 218 | * 219 | * note: this is really the crux of the application: finding ngram-based context. 220 | * this function will be run thousands of times for every page of text our application checks. 221 | * 'm' represents 10s of millions of records totalling 100s of MBs. 222 | * efficiency is critical. 223 | * 224 | * note: upgrade of ngram3bin_like(), which performed a sequential scan of the entire 'm' every time. 225 | * this was simple and effective but just too inefficient. 226 | * so, we broke up the 3 types of matches performed into separate functions which incorporate binary 227 | * searches, which should reduce CPU-memory traffic considerably. 228 | * update: preliminary profiling suggests this is ~40x faster. 
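 *
 * rough cost picture (N = records in m, given the (x,y,z) sort order):
 *   (x,y,_) : one bsearch plus a scan of the contiguous matches   -> O(log N + matches)
 *   (x,_,z) : bsearch to the (x,_,_) range, then scan that range  -> O(log N + span(x))
 *   (_,y,z) : no helpful sort order, so we walk the per-x span index
 *             (ngram3bin_index) and bsearch within each large span instead
 *             of touching every record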
229 | */ 230 | ngram3 * ngram3bin_like_better(ngram3 find, const struct ngram3map *m, ngram3bin_index *idx) 231 | { 232 | ngram3 *res = NULL; 233 | unsigned long rescnt = 0; 234 | rescnt = ngram3bin_like_xy_(find, m, &res, rescnt); 235 | rescnt = ngram3bin_like_x_z(find, m, &res, rescnt); 236 | rescnt = ngram3bin_like__yz(find, m, &res, rescnt, idx); 237 | if (res) 238 | { 239 | if ((res = ngram3_find_spacefor1more(res, rescnt))) 240 | res[rescnt].freq = 0; // sentinel 241 | } 242 | return res; 243 | } 244 | 245 | static int ngram3cmp_xy_(const void *va, const void *vb) 246 | { 247 | const ngram3 *a = va, 248 | *b = vb; 249 | if (a->id[0] != b->id[0]) return (int)(a->id[0] - b->id[0]); 250 | if (a->id[1] != b->id[1]) return (int)(a->id[1] - b->id[1]); 251 | return 0; 252 | } 253 | 254 | /* 255 | * find entries in m matching (x,y,_) from find 256 | * because m's contents are sorted we can use bsearch 257 | */ 258 | static unsigned long ngram3bin_like_xy_(ngram3 find, const struct ngram3map *m, ngram3 **res, unsigned long rescnt) 259 | { 260 | const ngram3 *base = m->m; 261 | const size_t nmemb = m->size / sizeof *base; 262 | const ngram3 *bs = bsearch(&find, base, nmemb, sizeof *base, ngram3cmp_xy_); 263 | if (bs) 264 | { 265 | const ngram3 *end = (ngram3*)((char*)m->m + m->size); 266 | // at least one x,y_ exists, but many may exist and we can't be certain 267 | // where in that range we have landed 268 | // rewind to the beginning of the range... 269 | while (bs > base && (bs-1)->id[0] == find.id[0] && (bs-1)->id[1] == find.id[1]) 270 | bs--; 271 | // ...and then seek forward, capturing all (contiguous) matches 272 | while (bs < end && bs->id[0] == find.id[0] && bs->id[1] == find.id[1]) 273 | { 274 | *res = ngram3_find_spacefor1more(*res, rescnt); 275 | if (!*res) 276 | break; 277 | (*res)[rescnt] = *bs; 278 | rescnt++; 279 | bs++; 280 | } 281 | } 282 | return rescnt; 283 | } 284 | 285 | static int ngram3cmp_x__(const void *va, const void *vb) 286 | { 287 | const ngram3 *a = va, 288 | *b = vb; 289 | if (a->id[0] != b->id[0]) return (int)(a->id[0] - b->id[0]); 290 | return 0; 291 | } 292 | 293 | /* 294 | * find entries in m matching (x,_,z) from find 295 | * because m's contents are sorted we can use bsearch 296 | */ 297 | static unsigned long ngram3bin_like_x_z(ngram3 find, const struct ngram3map *m, ngram3 **res, unsigned long rescnt) 298 | { 299 | const ngram3 *base = m->m; 300 | const size_t nmemb = m->size / sizeof *base; 301 | const ngram3 *bs = bsearch(&find, base, nmemb, sizeof *base, ngram3cmp_x__); 302 | if (bs) 303 | { 304 | const ngram3 *end = (ngram3*)((char*)m->m + m->size); 305 | // rewind to the beginning of (x,_,_) range... 306 | while (bs > base && (bs-1)->id[0] == find.id[0]) 307 | bs--; 308 | // and then seek forward through all (x,_,_), 309 | // recording any (x,_,z) matches 310 | while (bs < end && bs->id[0] == find.id[0]) 311 | { 312 | if (bs->id[2] == find.id[2]) 313 | { 314 | *res = ngram3_find_spacefor1more(*res, rescnt); 315 | if (!*res) 316 | break; 317 | (*res)[rescnt] = *bs; 318 | rescnt++; 319 | } 320 | bs++; 321 | } 322 | } 323 | return rescnt; 324 | } 325 | 326 | /* 327 | * given find (x,y,z), search m for all matches of (_,y,z) with help of the index 328 | * m entries are sorted by (x,y,z) 329 | * idx is a length of the spans of entries with the same (x,_,_) 330 | * search through m by idx[] records at a time. 
331 | * search sequential for small spans, bsearch large ones 332 | */ 333 | static unsigned long ngram3bin_like__yz(ngram3 find, const struct ngram3map *m, 334 | ngram3 **res, unsigned long rescnt, 335 | ngram3bin_index *idx) 336 | { 337 | # define SPAN_LARGE 16 // arbitrary, somewhat-reasonable number 338 | uint32_t *span = idx->span; 339 | const ngram3 *mcur = m->m; 340 | while (*span) 341 | { 342 | if (*span < SPAN_LARGE) 343 | { 344 | // small span, search sequentially 345 | const ngram3 *mend = mcur + *span; 346 | while (mcur < mend) 347 | { 348 | if (mcur->id[1] == find.id[1] && 349 | mcur->id[2] == find.id[2]) 350 | { 351 | if ((*res = ngram3_find_spacefor1more(*res, rescnt))) 352 | (*res)[rescnt++] = *mcur; 353 | mcur = mend; 354 | break; 355 | } 356 | mcur++; 357 | } 358 | } 359 | else 360 | { 361 | // large span, bsearch 362 | const ngram3 *bs; 363 | find.id[0] = mcur->id[0]; // first id must match(!) 364 | if ((bs = bsearch(&find, mcur, *span, sizeof *mcur, ngram3cmp))) 365 | { 366 | if ((*res = ngram3_find_spacefor1more(*res, rescnt))) 367 | (*res)[rescnt++] = *bs; 368 | } 369 | mcur += *span; 370 | } 371 | // mcur set to previous mcur + *span by this point 372 | span++; 373 | } 374 | return rescnt; 375 | } 376 | 377 | /* 378 | * sum ngram3 word frequencies in w.word[n].freq 379 | */ 380 | void ngramword_totalfreqs(struct ngramword w, const struct ngram3map *m) 381 | { 382 | ngram3 *cur = m->m; 383 | const ngram3 *end = (ngram3*)((char*)cur + m->size); 384 | while (cur < end) 385 | { 386 | if (cur->id[0] < w.cnt) w.word[cur->id[0]].freq += cur->freq; 387 | if (cur->id[1] < w.cnt) w.word[cur->id[1]].freq += cur->freq; 388 | if (cur->id[2] < w.cnt) w.word[cur->id[2]].freq += cur->freq; 389 | cur++; 390 | } 391 | { 392 | unsigned long i, cnt = w.cnt; 393 | for (i = 0; i < cnt; i++) 394 | w.word[i].freq /= 2; 395 | } 396 | } 397 | 398 | /* 399 | * build an index that speeds out searches of (_,y,z) searches 400 | * count the spans of consecutive id[0]s in m 401 | * e.g. 
[(x,_,_),(x,_,_),(y,_,_),(z,_,_),(z,_,_),(z,_,_)] 402 | * |_______| | |_______________| 403 | * 2 1 3 404 | */ 405 | int ngram3bin_index_init(ngram3bin_index *idx, const struct ngram3map *m, const struct ngramword *w) 406 | { 407 | /* 408 | * allocate enough space to hold a counter for every existing unique word, 409 | * even though not every word may necessarily be present in id[0] 410 | */ 411 | idx->span = malloc((w->cnt + 1) * sizeof *idx->span); 412 | if (idx->span) 413 | { 414 | unsigned long spanidx = 0, 415 | spancnt = 1; 416 | const ngram3 *cur = m->m; 417 | const ngram3 *end = (ngram3*)((char*)cur + m->size); 418 | const ngram3 *nxt = cur+1; 419 | while (nxt < end) 420 | { 421 | if (cur->id[0] == nxt->id[0]) 422 | { 423 | spancnt++; 424 | } 425 | else 426 | { 427 | idx->span[spanidx] = spancnt; 428 | spanidx++; 429 | spancnt = 1; 430 | } 431 | cur = nxt; 432 | nxt++; 433 | } 434 | idx->span[spanidx] = spancnt; 435 | idx->span[spanidx+1] = 0; // sentinel 436 | } 437 | return !!idx->span; 438 | } 439 | 440 | void ngram3bin_index_fini(ngram3bin_index *idx) 441 | { 442 | free(idx->span); 443 | } 444 | 445 | void ngram3bin_fini(struct ngram3map m) 446 | { 447 | munmap(m.m, m.size); 448 | close(m.fd); 449 | } 450 | 451 | /* 452 | * sort descending by frequency 453 | */ 454 | static int follows_cmp(const void *va, const void *vb) 455 | { 456 | const ngram3 *a = va, 457 | *b = vb; 458 | return (int)(b->freq - a->freq); 459 | } 460 | 461 | /* 462 | * given a single word, return a list of words follow and their frequency 463 | */ 464 | ngram3 * ngram3bin_follows(const ngram3 *find, const struct ngram3map *m) 465 | { 466 | uint32_t fid = find->id[0]; 467 | unsigned long ngcnt = 0; 468 | ngram3 *cur = m->m; 469 | const ngram3 *end = (ngram3*)((char*)cur + m->size); 470 | ngram3 *res = NULL; 471 | while (cur < end) 472 | { 473 | int foundindex; 474 | if (cur->id[0] == fid) 475 | foundindex = 1; 476 | else if (cur->id[1] == fid) 477 | foundindex = 2; 478 | else 479 | foundindex = 0; 480 | 481 | if (foundindex) 482 | { 483 | int i; 484 | // linear scan for already found... 
485 | for (i = 0; i < ngcnt; i++) 486 | { 487 | if (res[i].id[0] == cur->id[foundindex]) 488 | { 489 | res[i].freq++; 490 | if (i > 0 && res[i].freq > res[i-1].freq * 2) 491 | { 492 | // bring most common entries to front of list 493 | ngram3 tmp = res[i]; 494 | res[i] = res[i-1]; 495 | res[i-1] = tmp; 496 | } 497 | break; 498 | } 499 | } 500 | // didn't find, add another entry to list 501 | if (i == ngcnt) 502 | { 503 | res = ngram3_find_spacefor1more(res, ngcnt); 504 | if (!res) 505 | break; 506 | res[ngcnt].id[0] = cur->id[foundindex]; 507 | res[ngcnt].freq = 1; 508 | ngcnt++; 509 | } 510 | } 511 | cur++; 512 | } 513 | if (res) 514 | { 515 | res = ngram3_find_spacefor1more(res, ngcnt); 516 | if (res) 517 | { 518 | res[ngcnt].freq = 0; // sentinel 519 | // sort results 520 | qsort(res, ngcnt, sizeof *res, follows_cmp); 521 | } 522 | } 523 | return res; 524 | } 525 | 526 | #ifdef TEST 527 | 528 | /* 529 | * dump binary entries for sanity checking 530 | */ 531 | static void ngram3bin_dump(const struct ngram3map *m, const struct ngramword w) 532 | { 533 | const ngram3 *cur = m->m; 534 | const ngram3 *end = (ngram3*)((char*)cur + m->size); 535 | while (cur < end) 536 | { 537 | printf("%6lu:%-16s %6lu:%-16s %6lu:%-16s %8lu\n", 538 | (unsigned long)cur->id[0], ngramword_id2word(cur->id[0], w), 539 | (unsigned long)cur->id[1], ngramword_id2word(cur->id[1], w), 540 | (unsigned long)cur->id[2], ngramword_id2word(cur->id[2], w), 541 | (unsigned long)cur->freq); 542 | cur++; 543 | } 544 | } 545 | 546 | int main(void) 547 | { 548 | struct ngram3map mb = ngram3bin_init("ngram3.bin", 0); 549 | struct ngram3map mw = ngram3bin_init("word.bin", 0); 550 | struct ngramword w = ngramword_load(mw); 551 | const ngram3 find = { 5, 29835, 22, 0 }; // am fond of 552 | // googlebooks-eng-all-3gram-20090715-24.csv.zip-2008.list.gz 553 | // 552621:am fond of 3170 554 | // $ zcat googlebooks-eng-all-3gram-20090715-24.csv.zip-2008.ids.gz | grep -In '^5,29835,22,3170$' 555 | // 427250:5,29835,22,3170 556 | printf("map %llu bytes (%llu ngram3s)\n", mb.size, mb.size / sizeof find); 557 | printf("freq of %lu.%lu.%lu: %lu\n", 558 | (unsigned long)find.id[0], 559 | (unsigned long)find.id[1], 560 | (unsigned long)find.id[2], 561 | ngram3bin_freq(find, &mb)); 562 | ngram3bin_dump(&mb, w); 563 | ngram3bin_fini(mb); 564 | ngramword_fini(w); 565 | ngram3bin_fini(mw); 566 | return 0; 567 | } 568 | 569 | #endif 570 | 571 | -------------------------------------------------------------------------------- /data/corpus/google-ngrams/ngram3bin.h: -------------------------------------------------------------------------------- 1 | /* ex: set ts=8 noet: */ 2 | /* 3 | * Copyright 2011 Ryan Flynn 4 | */ 5 | 6 | #ifndef NGRAM3BIN_H 7 | #define NGRAM3BIN_H 8 | 9 | #include 10 | #include 11 | 12 | #define UNKNOWN_ID (0) 13 | #define IMPOSSIBLE_ID (~0) 14 | 15 | struct ngram3map 16 | { 17 | void *m; 18 | int fd; 19 | unsigned long long size; 20 | }; 21 | 22 | #define ngram3map_start(map) ((ngram3*)((map)->m)) 23 | #define ngram3map_end(map) ((ngram3*)(((char *)((map)->m)) + (map)->size)) 24 | 25 | struct ngramword 26 | { 27 | unsigned long cnt; 28 | struct wordlen { 29 | unsigned len; 30 | unsigned freq; 31 | const char *str; 32 | } *word; 33 | }; 34 | 35 | #pragma pack(push, 1) 36 | struct ngramwordcursor { 37 | uint32_t len; 38 | }; 39 | #pragma pack(pop) 40 | typedef struct ngramwordcursor ngramwordcursor; 41 | 42 | #define ngramwordcursor_str(cur) ((char *)(cur) + sizeof *(cur)) 43 | #define ngramwordcursor_next(cur) (void *)((char 
*)(ngramwordcursor_str(cur) + ((cur)->len + (1 + ((cur)->len+1) % 4)))) 44 | 45 | #pragma pack(push, 1) 46 | typedef struct 47 | { 48 | uint32_t id[3], 49 | freq; 50 | } ngram3; 51 | #pragma pack(pop) 52 | 53 | /* 54 | * ngram3 is a sorted array of 3-grams (x,y,z) 55 | * for each unique x, count the number of sequential records (x,_,_) 56 | * this allows us to more efficiently search for (_,y,z) 57 | * 58 | * note: we don't need to track which id each span represents, we 59 | * can retrieve it when necessary; we just need the number of records 60 | */ 61 | typedef struct 62 | { 63 | uint32_t *span; 64 | } ngram3bin_index; 65 | 66 | struct ngramword ngramword_load(const struct ngram3map); 67 | const unsigned long ngramword_word2id(const char *word, unsigned len, const struct ngramword); 68 | const char * ngramword_id2word(unsigned long id, const struct ngramword); 69 | void ngramword_totalfreqs(struct ngramword, const struct ngram3map *); 70 | void ngramword_fini(struct ngramword); 71 | 72 | struct ngram3map ngram3bin_init(const char *path, int write); 73 | unsigned long ngram3bin_freq(ngram3 find, const struct ngram3map *); 74 | unsigned long ngram3bin_freq2(ngram3 find, const struct ngram3map *); 75 | ngram3 * ngram3bin_like(ngram3 find, const struct ngram3map *); 76 | ngram3 * ngram3bin_like_better(ngram3 find, const struct ngram3map *, ngram3bin_index *); 77 | void ngram3bin_str (const struct ngram3map, FILE *); 78 | void ngram3bin_fini(struct ngram3map); 79 | ngram3 * ngram3bin_follows(const ngram3 *, const struct ngram3map *); 80 | 81 | int ngram3bin_index_init(ngram3bin_index *, const struct ngram3map *, const struct ngramword *); 82 | void ngram3bin_index_fini(ngram3bin_index *); 83 | 84 | int ngram3cmp(const void *, const void *); 85 | 86 | #endif /* NGRAM3BIN_H */ 87 | 88 | -------------------------------------------------------------------------------- /data/corpus/google-ngrams/ngram3binpy.c: -------------------------------------------------------------------------------- 1 | /* ex: set ts=8 noet: */ 2 | /* 3 | * Copyright 2011 Ryan Flynn 4 | * 5 | * ngram3bin python bindings 6 | * 7 | * Reference: http://starship.python.net/crew/arcege/extwriting/pyext.html 8 | * http://docs.python.org/release/2.5.2/ext/callingPython.html 9 | * http://www.fnal.gov/docs/products/python/v1_5_2/ext/buildValue.html 10 | */ 11 | 12 | #include 13 | #include 14 | #include "ngram3bin.h" 15 | 16 | #if PY_MAJOR_VERSION >= 3 17 | #define PY3K 18 | #endif 19 | 20 | /* 21 | * obj PyObject wrapper 22 | */ 23 | typedef struct { 24 | PyObject_HEAD 25 | struct ngram3map wordmap; 26 | struct ngram3map ngramap; 27 | struct ngramword word; 28 | ngram3bin_index ngramap_index; 29 | PyObject *worddict; 30 | } ngram3bin; 31 | 32 | static void ngram3bin_dealloc(PyObject *self); 33 | static int ngram3bin_print (PyObject *self, FILE *fp, int flags); 34 | #ifndef PY3K 35 | static PyObject *ngram3bin_getattr(PyObject *self, char *attr); 36 | #endif 37 | 38 | static PyObject *ngram3bin_new (PyObject *self, PyObject *args); 39 | static PyObject *ngram3binpy_word2id(PyObject *self, PyObject *args); 40 | static PyObject *ngram3binpy_id2word(PyObject *self, PyObject *args); 41 | static PyObject *ngram3binpy_id2freq(PyObject *self, PyObject *args); 42 | static PyObject *ngram3binpy_wordfreq(PyObject *self, PyObject *args); 43 | static PyObject *ngram3binpy_freq (PyObject *self, PyObject *args); 44 | static PyObject *ngram3binpy_like (PyObject *self, PyObject *args); 45 | static PyObject *ngram3binpy_follows(PyObject *self, 
PyObject *args); 46 | 47 | static struct PyMethodDef ngram3bin_Methods[] = { 48 | { "word2id", (PyCFunction) ngram3binpy_word2id, METH_VARARGS, NULL }, 49 | { "id2word", (PyCFunction) ngram3binpy_id2word, METH_VARARGS, NULL }, 50 | { "id2freq", (PyCFunction) ngram3binpy_id2freq, METH_VARARGS, NULL }, 51 | { "wordfreq", (PyCFunction) ngram3binpy_wordfreq, METH_VARARGS, NULL }, 52 | { "freq", (PyCFunction) ngram3binpy_freq, METH_VARARGS, NULL }, 53 | { "like", (PyCFunction) ngram3binpy_like, METH_VARARGS, NULL }, 54 | { "ngram3bin", (PyCFunction) ngram3bin_new, METH_VARARGS, NULL }, 55 | { "follows", (PyCFunction) ngram3binpy_follows, METH_VARARGS, NULL }, 56 | { NULL, NULL, 0, NULL } 57 | }; 58 | 59 | /* 60 | * ngram3bin type-builtin methods 61 | */ 62 | PyTypeObject ngram3bin_Type = { 63 | #ifdef PY3K 64 | PyVarObject_HEAD_INIT(NULL, 0) 65 | #else 66 | PyObject_HEAD_INIT(NULL) 67 | 0, /* ob_size */ 68 | #endif 69 | "ngram3bin", /* char *tp_name; */ 70 | sizeof(ngram3bin), /* int tp_basicsize; */ 71 | 0, /* int tp_itemsize; not used much */ 72 | ngram3bin_dealloc, /* destructor tp_dealloc; */ 73 | ngram3bin_print, /* printfunc tp_print; */ 74 | #ifdef PY3K 75 | 0, /* getattrfunc tp_getattr; __getattr__ */ 76 | #else 77 | ngram3bin_getattr, /* getattrfunc tp_getattr; __getattr__ */ 78 | #endif 79 | 0, /* setattrfunc tp_setattr; __setattr__ */ 80 | 0, /* cmpfunc tp_compare; __cmp__ */ 81 | 0, /* reprfunc tp_repr; __repr__ */ 82 | 0, /* PyNumberMethods *tp_as_number; */ 83 | 0, /* PySequenceMethods *tp_as_sequence; */ 84 | 0, /* PyMappingMethods *tp_as_mapping; */ 85 | 0, /* hashfunc tp_hash; __hash__ */ 86 | 0, /* ternaryfunc tp_call; __call__ */ 87 | 0, /* reprfunc tp_str; __str__ */ 88 | #ifdef PY3K 89 | PyObject_GenericGetAttr,/* tp_getattro */ 90 | 0, /* tp_setattro */ 91 | 0, /* tp_as_buffer */ 92 | Py_TPFLAGS_DEFAULT, /* tp_flags */ 93 | 0, /* tp_doc */ 94 | 0, /* tp_traverse */ 95 | 0, /* tp_clear */ 96 | 0, /* tp_richcompare */ 97 | 0, /* tp_weaklistoffset */ 98 | 0, /* tp_iter */ 99 | 0, /* tp_iternext */ 100 | ngram3bin_Methods, /* tp_methods */ 101 | 0, /* tp_members */ 102 | 0, /* tp_getset */ 103 | 0, /* tp_base */ 104 | 0, /* tp_dict */ 105 | 0, /* tp_descr_get */ 106 | 0, /* tp_descr_set */ 107 | 0, /* tp_dictoffset */ 108 | 0, /* tp_init */ 109 | 0, /* tp_alloc */ 110 | 0, /* tp_new */ 111 | #endif 112 | }; 113 | 114 | struct module_state { 115 | PyObject *error; 116 | }; 117 | 118 | #if PY_MAJOR_VERSION >= 3 119 | #define GETSTATE(m) ((struct module_state*)PyModule_GetState(m)) 120 | #else 121 | #define GETSTATE(m) (&_state) 122 | static struct module_state _state; 123 | #endif 124 | 125 | #if 0 126 | static PyObject * error_out(PyObject *m) 127 | { 128 | struct module_state *st = GETSTATE(m); 129 | PyErr_SetString(st->error, "something bad happened"); 130 | return NULL; 131 | } 132 | #endif 133 | 134 | #if PY_MAJOR_VERSION >= 3 135 | 136 | static int ngram3bin_traverse(PyObject *m, visitproc visit, void *arg) 137 | { 138 | Py_VISIT(GETSTATE(m)->error); 139 | return 0; 140 | } 141 | 142 | static int ngram3bin_clear(PyObject *m) 143 | { 144 | Py_CLEAR(GETSTATE(m)->error); 145 | return 0; 146 | } 147 | 148 | static struct PyModuleDef moduledef = 149 | { 150 | PyModuleDef_HEAD_INIT, 151 | "ngram3bin", 152 | NULL, 153 | sizeof(struct module_state), 154 | ngram3bin_Methods, 155 | NULL, 156 | ngram3bin_traverse, 157 | ngram3bin_clear, 158 | NULL 159 | }; 160 | 161 | #define INITERROR return NULL 162 | 163 | PyObject * 164 | PyInit_ngram3bin(void) 165 | 166 | #else 167 | #define 
INITERROR return 168 | 169 | void 170 | initngram3bin(void) 171 | #endif 172 | { 173 | #ifdef PY3K 174 | PyObject *module = PyModule_Create(&moduledef); 175 | #else 176 | PyObject *module = Py_InitModule("ngram3bin", ngram3bin_Methods); 177 | #endif 178 | 179 | if (module == NULL) 180 | INITERROR; 181 | struct module_state *st = GETSTATE(module); 182 | 183 | st->error = PyErr_NewException("ngram3bin.Error", NULL, NULL); 184 | if (st->error == NULL) 185 | { 186 | Py_DECREF(module); 187 | INITERROR; 188 | } 189 | 190 | #ifdef PY3K 191 | return module; 192 | #endif 193 | } 194 | 195 | #ifndef PY3K 196 | PyObject *ngram3bin_getattr(PyObject *self, char *attr) 197 | { 198 | PyObject *res = Py_FindMethod(ngram3bin_Methods, self, attr); 199 | return res; 200 | } 201 | #endif 202 | 203 | static PyObject * ngram3bin_NEW(void) 204 | { 205 | ngram3bin *obj = PyObject_NEW(ngram3bin, &ngram3bin_Type); 206 | obj->wordmap.m = NULL; 207 | obj->ngramap.m = NULL; 208 | obj->wordmap.fd = -1; 209 | obj->ngramap.fd = -1; 210 | obj->wordmap.size = 0; 211 | obj->ngramap.size = 0; 212 | return (PyObject *)obj; 213 | } 214 | 215 | static PyObject * worddict_new(struct ngramword w) 216 | { 217 | PyObject *d = PyDict_New(); 218 | struct wordlen *wl = w.word; 219 | unsigned long id; 220 | for (id = 0; id < w.cnt; id++, wl++) 221 | { 222 | PyObject *v = PyLong_FromUnsignedLong(id); 223 | PyObject *k = PyBytes_FromStringAndSize(wl->str, wl->len); 224 | (void)PyDict_SetItem(d, k, v); 225 | } 226 | return d; 227 | } 228 | 229 | static PyObject * ngram3bin_new(PyObject *self, PyObject *args) 230 | { 231 | ngram3bin *obj = (ngram3bin *)ngram3bin_NEW(); 232 | char *wordpath = NULL; 233 | char *ngrampath = NULL; 234 | if (PyArg_ParseTuple(args, "ss", &wordpath, &ngrampath)) 235 | { 236 | obj->wordmap = ngram3bin_init(wordpath, 0); 237 | obj->word = ngramword_load(obj->wordmap); 238 | obj->ngramap = ngram3bin_init(ngrampath, 0); 239 | obj->worddict = worddict_new(obj->word); 240 | ngramword_totalfreqs(obj->word, &obj->ngramap); 241 | ngram3bin_index_init(&obj->ngramap_index, &obj->ngramap, &obj->word); 242 | Py_INCREF(obj->worddict); 243 | } 244 | Py_INCREF(obj); 245 | return (PyObject *)obj; 246 | } 247 | 248 | static void ngram3bin_dealloc(PyObject *self) 249 | { 250 | ngram3bin *obj = (ngram3bin *)self; 251 | ngram3bin_fini(obj->wordmap); 252 | ngramword_fini(obj->word); 253 | ngram3bin_fini(obj->ngramap); 254 | PyMem_FREE(self); 255 | } 256 | 257 | static int ngram3bin_print(PyObject *self, FILE *fp, int flags) 258 | { 259 | ngram3bin *obj = (ngram3bin *)self; 260 | ngram3bin_str(obj->wordmap, fp); 261 | ngram3bin_str(obj->ngramap, fp); 262 | return 0; 263 | } 264 | 265 | static PyObject *ngram3binpy_word2id(PyObject *self, PyObject *args) 266 | { 267 | PyObject *res = NULL; 268 | Py_UNICODE *u = NULL; 269 | int l = 0; 270 | if (PyArg_ParseTuple(args, "u#", &u, &l)) 271 | { 272 | PyObject *key = PyUnicode_EncodeUTF8(u, l, NULL); 273 | if (key) 274 | { 275 | ngram3bin *obj = (ngram3bin *)self; 276 | res = PyDict_GetItem(obj->worddict, key); 277 | } 278 | } 279 | if (!res) 280 | res = PyLong_FromLong(UNKNOWN_ID); 281 | Py_INCREF(res); 282 | return res; 283 | } 284 | 285 | static PyObject *ngram3binpy_id2word(PyObject *self, PyObject *args) 286 | { 287 | PyObject *res = NULL; 288 | ngram3bin *obj = (ngram3bin *)self; 289 | unsigned long id = 0; 290 | if (PyArg_ParseTuple(args, "i", &id)) 291 | { 292 | const char *word = ngramword_id2word(id, obj->word); 293 | if (word) 294 | res = PyUnicode_FromStringAndSize(word, 
strlen(word)); 295 | else 296 | res = PyErr_NewException("ngram3bin.Error", NULL, NULL); 297 | Py_INCREF(res); 298 | } 299 | return res; 300 | } 301 | 302 | static PyObject *ngram3binpy_id2freq(PyObject *self, PyObject *args) 303 | { 304 | PyObject *res = NULL; 305 | ngram3bin *obj = (ngram3bin *)self; 306 | unsigned long id = 0; 307 | if (PyArg_ParseTuple(args, "i", &id)) 308 | { 309 | if (id < obj->word.cnt) 310 | res = PyLong_FromUnsignedLong(obj->word.word[id].freq); 311 | else 312 | res = PyLong_FromLong(0); 313 | Py_INCREF(res); 314 | } 315 | return res; 316 | } 317 | 318 | /* 319 | * equivalent of id2freq(word2id(word)) 320 | */ 321 | static PyObject *ngram3binpy_wordfreq(PyObject *self, PyObject *args) 322 | { 323 | PyObject *res = NULL; 324 | ngram3bin *obj = (ngram3bin *)self; 325 | unsigned long id = 0; 326 | Py_UNICODE *u = NULL; 327 | int l = 0; 328 | if (PyArg_ParseTuple(args, "u#", &u, &l)) 329 | { 330 | PyObject *key = PyUnicode_EncodeUTF8(u, l, NULL); 331 | if (key) 332 | { 333 | res = PyDict_GetItem(obj->worddict, key); 334 | if (res) 335 | id = PyLong_AsLong(res); 336 | } 337 | } 338 | if (id < obj->word.cnt) 339 | res = PyLong_FromUnsignedLong(obj->word.word[id].freq); 340 | else 341 | res = PyLong_FromLong(0); 342 | Py_INCREF(res); 343 | return res; 344 | } 345 | 346 | /* 347 | * find frequency of (x,y,z) 348 | */ 349 | static PyObject *ngram3binpy_freq(PyObject *self, PyObject *args) 350 | { 351 | PyObject *res = NULL; 352 | ngram3bin *obj = (ngram3bin *)self; 353 | ngram3 find; 354 | find.id[2] = IMPOSSIBLE_ID; 355 | unsigned long freq = 0; 356 | if (PyArg_ParseTuple(args, "ii|i", find.id+0, find.id+1, find.id+2)) 357 | { 358 | if (find.id[2] == IMPOSSIBLE_ID) 359 | freq = ngram3bin_freq2(find, &obj->ngramap); 360 | else 361 | freq = ngram3bin_freq(find, &obj->ngramap); 362 | } 363 | res = PyLong_FromUnsignedLong(freq); 364 | Py_INCREF(res); 365 | return res; 366 | } 367 | 368 | /* 369 | * given the results of an ngram3_find() call, 370 | * import them into a python list of 4-tuples [(x,y,z,freq),...] 371 | */ 372 | static PyObject * ngram3_find_res2py(const ngram3 *f) 373 | { 374 | PyObject *res = PyList_New(0); 375 | Py_INCREF(res); 376 | if (f) 377 | { 378 | const ngram3 *c = f; 379 | while (c->freq) 380 | { 381 | PyObject *o, *t = PyTuple_New(4); 382 | int i; 383 | for (i = 0; i < 3; i++) 384 | { 385 | o = PyLong_FromUnsignedLong(c->id[i]); 386 | PyTuple_SetItem(t, i, o); 387 | Py_INCREF(o); 388 | } 389 | o = PyLong_FromUnsignedLong(c->freq); 390 | PyTuple_SetItem(t, 3, o); 391 | Py_INCREF(o); 392 | PyList_Append(res, t); 393 | Py_INCREF(t); 394 | c++; 395 | } 396 | } 397 | return res; 398 | } 399 | 400 | static PyObject *ngram3binpy_like(PyObject *self, PyObject *args) 401 | { 402 | PyObject *res = NULL; 403 | ngram3bin *obj = (ngram3bin *)self; 404 | ngram3 find; 405 | if (PyArg_ParseTuple(args, "iii", find.id+0, find.id+1, find.id+2)) 406 | { 407 | if (obj->ngramap.m) 408 | { 409 | ngram3 *f = ngram3bin_like_better(find, &obj->ngramap, &obj->ngramap_index); 410 | res = ngram3_find_res2py(f); 411 | free(f); 412 | } 413 | } 414 | else 415 | { 416 | res = PyList_New(0); 417 | } 418 | return res; 419 | } 420 | 421 | /* 422 | * given the results of an ngram3_follows() call, 423 | * import them into a python list of 2-tuples [(word_id,freq),...] 
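 * (the input array carries no explicit length; ngram3bin_follows() terminates
 * it with a sentinel entry whose freq field is 0, which is why the loop below
 * runs while c->freq is non-zero)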
424 | */ 425 | static PyObject * ngram3_follows_res2py(const ngram3 *f) 426 | { 427 | PyObject *res = PyList_New(0); 428 | Py_INCREF(res); 429 | if (f) 430 | { 431 | const ngram3 *c = f; 432 | while (c->freq) 433 | { 434 | PyObject *o, *t; 435 | t = PyTuple_New(2); 436 | o = PyLong_FromUnsignedLong(c->id[0]); 437 | PyTuple_SetItem(t, 0, o); 438 | Py_INCREF(o); 439 | o = PyLong_FromUnsignedLong(c->freq); 440 | PyTuple_SetItem(t, 1, o); 441 | Py_INCREF(o); 442 | PyList_Append(res, t); 443 | Py_INCREF(t); 444 | c++; 445 | } 446 | } 447 | return res; 448 | } 449 | 450 | static PyObject *ngram3binpy_follows(PyObject *self, PyObject *args) 451 | { 452 | PyObject *res = NULL; 453 | ngram3bin *obj = (ngram3bin *)self; 454 | ngram3 find; 455 | if (PyArg_ParseTuple(args, "i", find.id+0)) 456 | { 457 | if (obj->ngramap.m) 458 | { 459 | ngram3 *f = ngram3bin_follows(&find, &obj->ngramap); 460 | res = ngram3_follows_res2py(f); 461 | free(f); 462 | } 463 | } 464 | else 465 | { 466 | res = PyList_New(0); 467 | } 468 | return res; 469 | } 470 | 471 | -------------------------------------------------------------------------------- /data/corpus/google-ngrams/scratch/benchmark-str-to-id.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """ 4 | benchmark fastest implementation of key generating algorithm for tokens. 5 | gets called ~600 million times in CPU-bound task. 6 | 7 | >>> def doit1(): 8 | ... import string 9 | ... string.lower('Python') 10 | ... 11 | >>> import string 12 | >>> def doit2(): 13 | ... string.lower('Python') 14 | ... 15 | >>> import timeit 16 | >>> t = timeit.Timer(setup='from __main__ import doit1', stmt='doit1()') 17 | >>> t.timeit() 18 | 11.479144930839539 19 | >>> t = timeit.Timer(setup='from __main__ import doit2', stmt='doit2()') 20 | >>> t.timeit() 21 | 4.6661689281463623 22 | """ 23 | 24 | def id1(d, key): 25 | try: 26 | return d[key] 27 | except KeyError: 28 | cnt = len(d) 29 | d[key] = cnt 30 | return cnt 31 | 32 | def id2(d, key): 33 | ld = len(d) 34 | val = d.get(key, ld) 35 | if val == ld: 36 | d[key] = val 37 | return val 38 | 39 | def id3(d, key): 40 | if key in d: 41 | return d[key] 42 | else: 43 | cnt = len(d) 44 | d[key] = cnt 45 | return cnt 46 | 47 | Id = {} 48 | def id4(_, key): 49 | global Id 50 | if key in Id: 51 | return Id[key] 52 | else: 53 | cnt = len(Id) 54 | Id[key] = cnt 55 | return cnt 56 | 57 | from random import randint 58 | 59 | def foo(f): 60 | d = {} 61 | for _ in range(1,1000): 62 | f(d, randint(0, 20)) 63 | 64 | import timeit 65 | for n in range(1,5): 66 | print '%d:%s' % (n, timeit.Timer(setup='from __main__ import foo,id%d' % n, stmt='foo(id%d)' % n).timeit(number=1000)) 67 | 68 | -------------------------------------------------------------------------------- /data/corpus/google-ngrams/scratch/debug-multiprocessing-dict.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | 4 | """ 5 | a dict() must be shared between worker processes and whose contents, 6 | written by the workers, must be accessible after they are done. 7 | 8 | this is a contrived example exploring this. 
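note: a plain dict() would not work here -- each forked worker would only
mutate its own copy. the manager.dict() proxy below forwards writes back to
the manager process, so the parent still sees them after join().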
9 | """ 10 | 11 | import multiprocessing as mp 12 | import Queue 13 | from time import sleep 14 | manager = mp.Manager() 15 | Ids = manager.dict() 16 | Q = mp.Queue() 17 | 18 | # build queue 19 | for n in range(10): 20 | Q.put(n) 21 | 22 | # mirror my getid() function 23 | def setid(d, key, val): d[key] = val 24 | 25 | def worker(my_id, q, ids): 26 | while True: 27 | try: 28 | k = q.get(timeout=1) 29 | setid(ids, k, my_id) 30 | sleep(0.01) # let other worker have a chance 31 | except Queue.Empty: 32 | return 33 | 34 | W = [ mp.Process(target=worker, args=(i, Q, Ids)) 35 | for i in range(2) ] 36 | for w in W: w.start() 37 | for w in W: w.join() 38 | 39 | # figure out who did what 40 | from operator import itemgetter as ig 41 | from itertools import groupby 42 | for k,g in groupby(sorted(Ids.items(), key=ig(1)), key=ig(1)): 43 | print 'worker', k, 'wrote keys', [x[0] for x in g] 44 | 45 | -------------------------------------------------------------------------------- /data/corpus/google-ngrams/setup.py: -------------------------------------------------------------------------------- 1 | """ 2 | How To Use 3 | 4 | $ python setup.py build 5 | $ sudo python setup.py install 6 | $ time python3 -i < testbin.py 7 | 8 | """ 9 | 10 | from distutils.core import setup, Extension 11 | 12 | setup(name = 'ngram3bin', 13 | version = '1.0', 14 | ext_modules = [Extension('ngram3bin', ['ngram3bin.c','ngram3binpy.c'])]) 15 | -------------------------------------------------------------------------------- /data/corpus/google-ngrams/testbin.py: -------------------------------------------------------------------------------- 1 | 2 | # Usage: python3 -i testbin.py 3 | 4 | from ngram3bin import ngram3bin 5 | #ng = ngram3bin('xxx') # too few parameters 6 | ng = ngram3bin('word.bin','ngram3.bin') 7 | ng.word2id('freq') 8 | ng.word2id('FDWD#$#$@#@') 9 | list(map(ng.word2id, ['activities','as','buddhist'])) 10 | [ng.id2freq(ng.word2id(w)) for w in ['activities','as','buddhist']][:30] 11 | [ng.id2word(ng.word2id(w)) for w in ['activities','as','buddhist']][:30] 12 | ng.freq(4,22,215) 13 | ng.like(5,6,7) 14 | # convert to ids, search, convert back to words 15 | [(ng.id2word(x), ng.id2word(y), ng.id2word(z), freq) 16 | for x,y,z,freq in ng.like(*[ng.word2id(w) for w in ['activities','as','buddhist']])] 17 | ng.freq(1,2) 18 | #ng.like(3,4) 19 | print('idknow') 20 | idknow = ng.word2id('know') 21 | print('word(idknow)') 22 | ng.id2word(idknow) 23 | assert 'know' == ng.id2word(ng.word2id('know')) 24 | assert ng.id2freq(ng.word2id('know')) == ng.wordfreq('know') 25 | 26 | # "bridge" missing made find a bug 27 | print('id(bridge)=', ng.word2id('bridge')) 28 | print('id2freq(bridge)=', ng.id2freq(ng.word2id('bridge'))) 29 | print('wordfreq(bridge)=', ng.wordfreq('bridge')) 30 | 31 | # "didn" seems to be missing but shouldn't be... 
32 | print('wordfreq(didn)=', ng.wordfreq('didn')) 33 | 34 | [(w,ng.word2id(w)) for w in ['didn','t','know']] 35 | ng.freq(*[ng.word2id(w) for w in ['didn','t','know']]) 36 | 37 | # freq2 38 | (('didn','t'), ng.freq(*[ng.word2id(w) for w in ['didn','t']])) 39 | (('and','that'), ng.freq(*[ng.word2id(w) for w in ['and','that']])) 40 | (('a','mistake'), ng.freq(*[ng.word2id(w) for w in ['a','mistake']])) 41 | 42 | Test = [ 43 | 'am fond of', 44 | 'am found of', 45 | 'i now that', 46 | 'i know that', 47 | 'is now that', 48 | 'future would undoubtedly', 49 | 'it it did', 50 | 'if it did', 51 | 'and then it', 52 | 'the united states', 53 | 'cheese burger', 54 | 'cheeseburger', 55 | 'don t', 56 | "don ' t", 57 | 'don', 58 | 'dont', 59 | "don't", 60 | 'i was alluding', 61 | 'spill chick', 62 | 'spell check', 63 | 'spillchick', 64 | 'spellcheck', 65 | 'of the art', 66 | 'the - art', 67 | ] 68 | for s in Test: 69 | t = s.lower().split() 70 | ids = [ng.word2id(w) for w in t] 71 | frfunc = ng.freq if len(ids) > 1 else ng.id2freq 72 | print((t, 'freq:', frfunc(*ids), 'ids:', ids)) 73 | assert all(ng.id2word(ng.word2id(w)) == w or ng.word2id(w) == 0 for w in t) 74 | 75 | for foo in ['don','dont']: 76 | [(foo,ng.id2word(x), y) for x,y in ng.follows(ng.word2id(foo))[:100]] 77 | 78 | -------------------------------------------------------------------------------- /doc/algorithm.txt: -------------------------------------------------------------------------------- 1 | 2 | Goal: Maximize consistency of the language within a document. 3 | 4 | To do so we use an n-gram-based language model. 5 | 6 | We don't want to be too heavy-handed in our language model though; 7 | we want to incorporate local language use as well. 8 | 9 | We begin with a pre-fabricated sourced from an external "global" corpus, 10 | in this case we use Google Books' 3-ary n-grams. 11 | 12 | Upon initialization we incorporate the "local" corpus of documents into our 13 | language model, likely by parsing documents in the current and parent folders. 14 | 15 | It is this local model we should use first against new documents. This 16 | allows our checker to tailor its behavior to its environment, whether the 17 | documents are legal documents, school book reports, bad sci-fi novels, etc. 18 | 19 | http://en.wikipedia.org/wiki/Text_corpus 20 | http://en.wikipedia.org/wiki/Language_model#N-gram_models 21 | http://en.wikipedia.org/wiki/N-gram 22 | http://ngrams.googlelabs.com/datasets 23 | 24 | overhere -> overhear (x -> x') 25 | over,here -> over,here (x,y -> x,y) 26 | over,hear -> overhear (x,y -> x') 27 | i,now,the -> i,know,the (x,y,z -> x,y',z) 28 | than,you,very,much -> thank,you,very,much (x,y,z,zz -> x',y,z,zz) 29 | thank,yo -> thank,you (x,y -> x,y') 30 | 31 | Consider: 32 | fingerprinting words by content: hello = e:1,h:1,l:2,o:1 33 | 34 | Algorithm: 35 | AutoRevise(doc): 36 | Target the smallest, least-known ngrams first. 37 | List alternatives 38 | Begin with cheap, straight-forward, common alternatives and progress to more expensive/complex iff necessary 39 | Try to solve individual, unknown tokens first 40 | Preserve token boundaries (cheap) 41 | Edit distance 1, edit distance 2 42 | Phonetic similarities 43 | Disregard token boundaries (expensive) 44 | Parse all possible token sequences 45 | 46 | For each alternative 47 | Score its effectiveness by evaluating the complete repercussions 48 | Retain the best alternatives 49 | Propose revisions unobtrusively. 50 | Never modify without the user's permission. 
http://en.wikipedia.org/wiki/Cupertino_effect 51 | Record revision selection. 52 | Incorporate into future decisions. 53 | If revision is selected: 54 | Update document and all statistics/ngrams to reflect the change 55 | 56 | parse/load base corpus of target language 57 | parse/load local corpus 58 | 59 | calculate frequency of all ngrams 1..n 60 | sort ngrams on size:asc, freq:asc 61 | for ng in ngrams below some threshold: 62 | calculate feasible permutations for ng 63 | note: focus only on one area at a time, as the resulting change will modify the rest of the document 64 | for tok in ng: 65 | calculate list of permutations: spelling edits, pronunciation 66 | account for merging/splitting of tokens, etc. 67 | 68 | 69 | conduct re search : conduct research 70 | hitherehowareyou : hi there how are you 71 | 72 | 73 | 74 | -------------------------------------------------------------------------------- /doc/things-that-can-go-wrong-language-wise.txt: -------------------------------------------------------------------------------- 1 | 2 | How You Fuck Up How We Can Detect/Fix It 3 | ------------------------------- ---------------------------------------- 4 | 5 | word mis-spelling standard spellchecker 6 | resulting in a non-word with a dictionary (aspell, ispell, etc.) 7 | 'hello' -> 'helo' 8 | 9 | word mis-spelling ? 10 | resulting in another word try: word sequence mapping and levenshtein 11 | 'hello there' -> 'hell there' 12 | 13 | word transposition 14 | 'foo bar' -> 'bar foo' 15 | 16 | grammar screw up grammar checkers(?) 17 | various 18 | 'i am.' -> 'i is.' try: tense association mapping am/is/are 19 | 20 | homophone confusion 21 | '24 caret' try: map pronunciation 22 | '24 carrot' 23 | 'composed' -> 'come posed' 24 | 25 | botched idiom ? 26 | 'intents and purposes' -> try: idiom identification and word->pronunciation mapping 27 | 'intensive purposes' question: is this really any different thhan above? 28 | 29 | incorrect Proper Noun ? 30 | 'Mr. Johnson' -> 'Mr. Jonson' try: hmm, contextual proper noun mapping(?) 31 | 32 | slang/pronunciation 33 | 'hello' -> 'yello' 34 | 35 | word omission 36 | 'oops, i the word' 37 | 38 | -------------------------------------------------------------------------------- /src/algo.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | 4 | """ 5 | Goal: maximize self-consistency of a corpus of documents 6 | calculate frequency of all ngrams 1..n 7 | sort ngrams on freq:asc, size:asc 8 | for ng in ngrams below some threshold: 9 | calculate feasible permutations for ng 10 | note: focus only on one area at a time, as the resulting change will modify the rest of the document 11 | for tok in ng: 12 | calculate list of permutations: spelling edits, pronunciation 13 | account for merging/splitting of tokens, etc. 14 | """ 15 | 16 | import re 17 | from math import log,sqrt 18 | from collections import defaultdict 19 | from itertools import product 20 | 21 | def tokenize(text): return re.findall('[a-z]+', text.lower()) 22 | 23 | Freq = { 24 | 'does':1, 'it':1, 'use':1, 25 | 'i':1, 'know':1, 'right':1, 26 | 'fuck':1, 27 | 'conduct':2, 'research':2, 'search':2, 'con':1, 'duct':1, 28 | 'hi':3, 'there':2, 'hit':2, 'here':2, 'how':3, 'are':3, 'you':3, 29 | 'ho':1, 30 | 'a':3, 'them':2, 'anathema':1, 31 | } 32 | 33 | """ 34 | given a list of tokens, yield all possible permutations of joining two or more tokens together 35 | i.e. 
joins([a,b,c,d]) -> [[a,b,c,d],[a,b,cd],[a,bc,d],[ab,c,d],[a,bcd],[abc,d],[abcd]] 36 | 37 | AHA, i realize now that i'm simply trying to list sum permutations: 38 | i.e. joins([1,1,1,1]) -> [[1,1,1,1],[1,1,2],[1,2,1],[2,1,1],[1,3],[3,1],[4]] 39 | complexity: 2**(len(toks)-1) 40 | """ 41 | def joins(toks): 42 | if len(toks) < 2: 43 | yield toks 44 | else: 45 | for i in range(len(toks)): 46 | for j in range(i+1, len(toks)-i+1): 47 | pref = toks[:i] + [''.join(toks[i:i+j])] 48 | for suf in joins(toks[i+j:]): 49 | yield pref + suf 50 | 51 | """ 52 | find first substring str[x:y] where exists freq[str[x:y]] where y >= l 53 | return tuple (prefix before substring, the substring, the rest of the string) 54 | """ 55 | def nextword(str, ng1, l=1): 56 | for i in range(len(str)): 57 | for j in range(i+l, min(i+18, len(str))): 58 | if str[i:j] in ng1: 59 | return (str[:i], str[i:j], str[j:]) 60 | return (str,'','') 61 | 62 | """ 63 | given a string of one or more valid substring words, yield a list of permutations 64 | freq is a dict() of all recognized words in str 65 | """ 66 | def spl(str, ng1): 67 | if len(str) < 2: 68 | yield [str] 69 | else: 70 | i = 0 71 | while i <= len(str): 72 | pref,word,suf = nextword(str, ng1, i) 73 | #print((i,str,pref,word,suf)) 74 | if not word: 75 | #if i == 0 or freq.get(pref): 76 | # on subsequent loops we accumulate garbage non-word-suffixes 77 | yield [pref] 78 | break 79 | else: 80 | w = [] 81 | if pref: w.append(pref) 82 | w.append(word) 83 | for sufx in spl(suf, ng1): 84 | if sufx: 85 | yield w + sufx 86 | i += len(word) + 1 87 | 88 | """ 89 | given a list of tokens, yield all possible permutations via splitting 90 | """ 91 | def splits(toks, freq, g): 92 | score = dict() 93 | # list all possible substrings that are known words 94 | str = ''.join(toks) 95 | for i in range(len(str)+1): 96 | for j in range(i+1, len(str)+1): 97 | w = str[i:j] 98 | sc = freq.get(w, 0) 99 | if sc > 0: 100 | score[w] = sc 101 | print(' splits score=',score) 102 | 103 | # use ngrams to determine which words are seen next to each other; 104 | # use that information to more efficiently parse 105 | # find all permutations that contain at least one word 106 | 107 | ngrams = [] 108 | for x,y in product(score.keys(), score.keys()): 109 | # ensure adjacency and order 110 | xi = str.index(x) + len(x) 111 | if str[xi:xi+len(y)] != y: 112 | continue 113 | ng = (x,y) 114 | sc = g.freq(ng) 115 | if sc > 0: 116 | ngrams.append(ng) 117 | print(' splits ngrams=',ngrams) 118 | 119 | # all of the words that can begin an ngram 120 | ng1 = set([x for x,y in ngrams]) 121 | ng2 = set([y for x,y in ngrams]) 122 | 123 | def toks2ngrams(toks, size): 124 | size = min(len(toks), size) 125 | for ng in zip(*[toks[i:] for i in range(size)]): 126 | yield ng 127 | 128 | # sort splits by ngram score 129 | pop = [] 130 | for s in spl(str, ng1): 131 | freq = sum([g.freq(x) for x in toks2ngrams(s, 3)]) 132 | if freq: 133 | pop.append((tuple(s), freq)) 134 | pop = sorted(pop, key=lambda x:x[1], reverse=True) 135 | print(' splits() pop=',pop) 136 | for p,_ in pop: 137 | yield p 138 | 139 | def weight(tok): 140 | factor = 1 + len(tok) 141 | return round(Freq.get(tok,0) * factor, 1) 142 | 143 | def correct(str): 144 | toks = tokenize(str) 145 | """ 146 | j = frozenset(tuple(t) for t in joins(toks)) 147 | print('j=',j) 148 | """ 149 | s = list(splits(toks, Freq)) 150 | print('s=',s[:4]) 151 | js0 = list(s)# + list(j) 152 | js1 = [(k, sum(map(weight, k))) for k in js0] 153 | js2 = sorted(js1, key=lambda x:x[1], reverse=True) 
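	# js2 holds (token-tuple, summed token weight) pairs, best-scoring split first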
154 | print('js=',js2[:5]) 155 | guess = str 156 | if js2 != []: 157 | guess,gscore = js2[0] 158 | oscore = sum(map(weight, toks)) 159 | print('gscore=',gscore,'oscore=',oscore) 160 | if gscore > oscore * 2: # FIXME: there is no good way to do this 161 | guess = ' '.join(guess) 162 | else: 163 | guess = str 164 | return guess 165 | 166 | if __name__ == '__main__': 167 | Tests = [ 168 | 'iknowright : i know right', 169 | 'f u c k y o u : fuck you', 170 | 'xxxhowareyouxxx : xxx how are you xxx', 171 | 'con duct re search : conduct research', 172 | 'hitherehowareyou : hi there how are you', 173 | 'hithe re : hi there', 174 | 'anathema : anathema' # unlikely but valid word 175 | ] 176 | passcnt = 0 177 | for t in Tests: 178 | str,exp = t.strip().split(' : ') 179 | print(str) 180 | res = correct(str) 181 | if res == exp: 182 | passcnt += 1 183 | else: 184 | print('*** FAIL: %s -> %s (%s)' % (str,res,exp)) 185 | print('Tests %u/%u.' % (passcnt, len(Tests))) 186 | 187 | -------------------------------------------------------------------------------- /src/chick.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # ex: set ts=8 noet: 4 | # Copyright 2011 Ryan Flynn 5 | 6 | """ 7 | Word/grammar checking algorithm 8 | 9 | Phon ✕ Word ✕ NGramDiff ✕ Doc 10 | 11 | Facts 12 | * the corpus is not perfect. it contains errors. 13 | * not every valid ngram will exist in the corpus. 14 | * infrequent but valid ngrams are sometimes very similar to very frequent ones 15 | 16 | Mutations 17 | * insertion : additional item 18 | * duplication : correct item incorrectly number of times 19 | * split (its) -> (it,',s) 20 | * merge (miss,spelling) -> (misspelling) 21 | * deletion : item missing 22 | * transposition : correct items, incorrect order 23 | * letters wap 24 | 25 | TODO: 26 | * figure out how to handle apostrophes 27 | * pre-calculate token joining and merging 28 | 29 | """ 30 | 31 | from util import * 32 | from ngramdiff import TokenDiff,NGramDiff,NGramDiffScore 33 | 34 | import logging 35 | 36 | logger = logging.getLogger('spill-chick') 37 | hdlr = logging.FileHandler('/var/tmp/spill-chick.log') 38 | logger.addHandler(hdlr) 39 | logger.setLevel(logging.DEBUG) 40 | 41 | def handleError(self, record): 42 | raise 43 | logging.Handler.handleError = handleError 44 | 45 | from math import log 46 | from itertools import takewhile, dropwhile, product, cycle, chain 47 | from collections import defaultdict 48 | import bz2, sys, re, os 49 | import copy 50 | from word import Words,NGram3BinWordCounter 51 | from phon import Phon 52 | from gram import Grams 53 | from grambin import GramsBin 54 | from doc import Doc 55 | 56 | logger.debug('sys.version=' + sys.version) 57 | 58 | """ 59 | 60 | sentence: "if it did the future would undoubtedly be changed" 61 | 62 | "the future would" and "would undoubtedly be" have high scores, 63 | but the connector, "future would undoubtedly", has zero. 
64 | we need to be aware that every valid 3-gram will not be in our database, 65 | but that if the surrounding, overlapping ones are then it's probably ok 66 | 67 | sugg did the future 156 68 | sugg the future would 3162 69 | sugg future would undoubtedly 0 70 | sugg would undoubtedly be 3111 71 | sugg undoubtedly be changed 0 72 | 73 | sugg i did the 12284 74 | sugg it did the 4279 75 | sugg i did then 1654 76 | sugg it did then 690 77 | sugg i hid the 646 78 | sugg did the future 156 79 | sugg hid the future 38 80 | sugg aid the future 30 81 | sugg the future would 3162 82 | sugg the future world 2640 83 | sugg the future could 934 84 | sugg future wood and 0 85 | sugg future wood undoubtedly 0 86 | sugg future would and 0 87 | sugg future would undoubtedly 0 88 | sugg would undoubtedly be 3111 89 | sugg could undoubtedly be 152 90 | sugg undoubtedly be changed 0 91 | 92 | """ 93 | 94 | import inspect 95 | def lineno(): 96 | """Returns the current line number in our program.""" 97 | return inspect.currentframe().f_back.f_lineno 98 | 99 | # TODO: modify levenshtein to weight score based on what has changed; 100 | # - transpositions should count less than insertions/deletions 101 | # - changes near the front of the word should count more than the end 102 | # - for latin alphabets changes to vowels should count less than consonants 103 | def levenshtein(a,b): 104 | "Calculates the Levenshtein distance between a and b." 105 | n, m = len(a), len(b) 106 | if n > m: 107 | # Make sure n <= m, to use O(min(n,m)) space 108 | a,b = b,a 109 | n,m = m,n 110 | 111 | current = range(n+1) 112 | for i in range(1,m+1): 113 | previous, current = current, [i]+[0]*n 114 | for j in range(1,n+1): 115 | add, delete = previous[j]+1, current[j-1]+1 116 | change = previous[j-1] 117 | if a[j-1] != b[i-1]: 118 | change = change + 1 119 | current[j] = min(add, delete, change) 120 | 121 | return current[n] 122 | 123 | def list2ngrams(l, size): 124 | """ 125 | split l into overlapping ngrams of size 126 | [x,y,z] -> [(x,y),(y,z)] 127 | """ 128 | if size >= len(l): 129 | return [tuple(l)] 130 | return [tuple(l[i:i+size]) for i in range(len(l)-size+1)] 131 | 132 | class Chick: 133 | def __init__(self): 134 | # initialize all "global" data 135 | logger.debug('loading...') 136 | logger.debug(' corpus...') 137 | # FIXME: using absolute paths is the easiest way to make us work from cmdline and invoked 138 | # in a web app. perhaps we could set up softlinks in /var/ to make this slightly more respectable. 
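		# (one possible alternative, not implemented: resolve the data files relative
		#  to this module, e.g. os.path.join(os.path.dirname(__file__), '..', 'data',
		#  'corpus', 'google-ngrams', 'word.bin'), or take a base directory from an
		#  environment variable.)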
139 | self.g = GramsBin( 140 | '/home/pizza/proj/spill-chick/data/corpus/google-ngrams/word.bin', 141 | '/home/pizza/proj/spill-chick/data/corpus/google-ngrams/ngram3.bin') 142 | self.w = Words(NGram3BinWordCounter(self.g.ng)) 143 | logger.debug(' phon') 144 | self.p = Phon(self.w, self.g) 145 | logger.debug('done.') 146 | # sanity-check junk 147 | """ 148 | logger.debug('w.correct(naieve)=%s' % self.w.correct(u'naieve')) 149 | logger.debug('w.correct(refridgerator)=%s' % self.w.correct(u'refridgerator')) 150 | logger.debug('g.freqs(refridgerator)=%s' % self.g.freqs(u'refridgerator')) 151 | logger.debug('g.freqs(refrigerator)=%s' % self.g.freqs(u'refrigerator')) 152 | logger.debug('g.freq((didn))=%s' % self.g.freq((u'didn',))) 153 | logger.debug('g.freq((a,mistake))=%s' % self.g.freq((u'a',u'mistake'))) 154 | logger.debug('g.freq((undoubtedly,be,changed))=%s' % self.g.freq((u'undoubtedly',u'be',u'changed'))) 155 | logger.debug('g.freq((undoubtedly,be))=%s' % self.g.freq((u'undoubtedly',u'be'))) 156 | logger.debug('g.freq((be,changed))=%s' % self.g.freq((u'be',u'changed'))) 157 | logger.debug('g.freq((it,it,did))=%s' % self.g.freq((u'it',u'it',u'did'))) 158 | logger.debug('g.freq((it,it))=%s' % self.g.freq((u'it',u'it'))) 159 | logger.debug('g.freq((it,did))=%s' % self.g.freq((u'it',u'did'))) 160 | logger.debug('g.freq((hello,there,sir))=%s' % self.g.freq((u'hello',u'there',u'sir'))) 161 | logger.debug('g.freq((hello,there))=%s' % self.g.freq((u'hello',u'there'))) 162 | logger.debug('g.freq((hello,there,,))=%s' % self.g.freq((u'hello',u'there',u','))) 163 | logger.debug('g.freq((they,\',re))=%s' % self.g.freq((u'they',u"'",u're'))) 164 | """ 165 | 166 | # FIXME: soundsToWords is expensive and should only be run as a last resort 167 | def phonGuess(self, toks, minfreq): 168 | """ 169 | given a list of tokens search for a list of words with similar pronunciation 170 | having g.freq(x) > minfreq 171 | """ 172 | # create a phonetic signature of the ngram 173 | phonsig = self.p.phraseSound(toks) 174 | logger.debug('phonsig=%s' % phonsig) 175 | phonwords = list(self.p.soundsToWords(phonsig)) 176 | logger.debug('phonwords=%s' % (phonwords,)) 177 | if phonwords == [[]]: 178 | phonpop = [] 179 | else: 180 | # remove any words that do not meet the minimum frequency; 181 | # they cannot possibly be part of the answer 182 | phonwords2 = [[[w for w in p if self.g.freq(tuple(w)) > minfreq] 183 | for p in pw] 184 | for pw in phonwords] 185 | logger.debug('phonwords2 lengths=%s product=%u' % \ 186 | (' '.join([str(len(p)) for p in phonwords2[0]]), 187 | reduce(lambda x,y:x*y, [len(p) for p in phonwords2[0]]))) 188 | if not all(phonwords2): 189 | return [] 190 | #logger.debug('phonwords2=(%u)%s...' % (len(phonwords2), phonwords2[:10],)) 191 | # remove any signatures that contain completely empty items after previous 192 | phonwords3 = phonwords2 193 | #logger.debug('phonwords3=(%u)%s...' % (len(phonwords3), phonwords3)) 194 | # FIXME: product() function is handy in this case but is potentially hazardous. 195 | # we should force a limit to the length of any list passed to it to ensure 196 | # the avoidance of any pathological, memory-filling, swap-inducing behavior 197 | phonwords4 = list(flatten([list(product(*pw)) for pw in phonwords3])) 198 | logger.debug('phonwords4=(%u)%s...' 
% (len(phonwords4), phonwords4[:20])) 199 | # look up ngram popularity, toss anything not more popular than original and sort 200 | phonwordsx = [tuple(flatten(p)) for p in phonwords4] 201 | 202 | phonpop = rsort1([(pw, self.g.freq(pw, min)) for pw in phonwordsx]) 203 | #logger.debug('phonpop=(%u)%s...' % (len(phonpop), phonpop[:10])) 204 | phonpop = list(takewhile(lambda x:x[1] > minfreq, phonpop)) 205 | #logger.debug('phonpop=%s...' % (phonpop[:10],)) 206 | if phonpop == []: 207 | return [] 208 | best = phonpop[0][0] 209 | return [[x] for x in best] 210 | 211 | """ 212 | return a list of ngrampos permutations where each token has been replaced by a word with 213 | similar pronunciation, and g.freqs(word) > minfreq 214 | """ 215 | def permphon(self, ngrampos, minfreq): 216 | perms = [] 217 | for i in range(len(ngrampos)): 218 | prefix = ngrampos[:i] 219 | suffix = ngrampos[i+1:] 220 | tokpos = ngrampos[i] 221 | tok = tokpos[0] 222 | sounds = self.p.word[tok] 223 | if not sounds: 224 | continue 225 | #logger.debug('tok=%s sounds=%s' % (tok, sounds)) 226 | for sound in sounds: 227 | soundslikes = self.p.phon[sound] 228 | #logger.debug('tok=%s soundslikes=%s' % (tok, soundslikes)) 229 | for soundslike in soundslikes: 230 | if len(soundslike) > 1: 231 | continue 232 | soundslike = soundslike[0] 233 | if soundslike == tok: 234 | continue 235 | #logger.debug('soundslike %s -> %s' % (tok, soundslike)) 236 | if self.g.freqs(soundslike) <= minfreq: 237 | continue 238 | newtok = (soundslike,) + tokpos[1:] 239 | damlev = damerau_levenshtein(tok, soundslike) 240 | td = TokenDiff([tokpos], [newtok], damlev) 241 | perms.append(NGramDiff(prefix, td, suffix, self.g, soundalike=True)) 242 | return perms 243 | 244 | @staticmethod 245 | def ngrampos_merge(x, y): 246 | return (x[0]+y[0], x[1], x[2], x[3]) 247 | 248 | def permjoin(self, l, minfreq): 249 | """ 250 | given a list of strings, produce permutations by joining two tokens together 251 | example [a,b,c,d] -> [[ab,c,d],[a,bc,d],[a,b,cd] 252 | """ 253 | perms = [] 254 | if len(l) > 1: 255 | for i in range(len(l)-1): 256 | joined = Chick.ngrampos_merge(l[i],l[i+1]) 257 | if self.g.freqs(joined[0]) > minfreq: 258 | td = TokenDiff(l[i:i+2], [joined], 1) 259 | ngd = NGramDiff(l[:i], td, l[i+2:], self.g) 260 | perms.append(ngd) 261 | return perms 262 | 263 | @staticmethod 264 | def ngrampos_split_back(x, y): 265 | return (x[0]+y[0][:1], x[1], x[2], x[3]), (y[0][1:], y[1], y[2], y[3]) 266 | 267 | @staticmethod 268 | def ngrampos_split_forward(x, y): 269 | return (x[0][:-1], x[1], x[2], x[3]), (x[0][-1:]+y[0], y[1], y[2], y[3]) 270 | 271 | def intertoken_letterswap(self, l, target_freq): 272 | # generate permutations of token list with the beginning and ending letter of each 273 | # token swapped between adjacent tokens 274 | if len(l) < 2: 275 | return [] 276 | perms = [] 277 | for i in range(len(l)-1): 278 | if len(l[i][0]) > 1: 279 | x,y = Chick.ngrampos_split_forward(l[i], l[i+1]) 280 | if self.g.freq((x[0],y[0])) >= target_freq: 281 | td = TokenDiff(l[i:i+2], [x,y], 0) 282 | ngd = NGramDiff(l[:i], td, l[i+2:], self.g) 283 | perms.append(ngd) 284 | if len(l[i+1][0]) > 1: 285 | x,y = Chick.ngrampos_split_back(l[i], l[i+1]) 286 | if self.g.freq((x[0],y[0])) >= target_freq: 287 | td = TokenDiff(l[i:i+2], [x,y], 0) 288 | ngd = NGramDiff(l[:i], td, l[i+2:], self.g) 289 | perms.append(ngd) 290 | #print 'intertoken_letterswap=',perms 291 | return perms 292 | 293 | def do_suggest(self, target_ngram, target_freq, ctx, d, max_suggest=5): 294 | """ 295 | given an 
infrequent ngram from a document, attempt to calculate a more frequent one 296 | that is similar textually and/or phonetically but is more frequent 297 | """ 298 | 299 | target_ngram = list(target_ngram) 300 | part = [] 301 | 302 | # permutations via token joining 303 | # expense: cheap, though rarely useful 304 | # TODO: smarter token joining; pre-calculate based on tokens 305 | part += self.permjoin(target_ngram, target_freq) 306 | #logger.debug('permjoin(%s)=%s' % (target_ngram, part,)) 307 | 308 | part += self.intertoken_letterswap(target_ngram, target_freq) 309 | 310 | part += self.permphon(target_ngram, target_freq) 311 | 312 | part += self.g.ngram_like(target_ngram, target_freq) 313 | 314 | logger.debug('part after ngram_like=(%u)%s...' % (len(part), part[:5],)) 315 | 316 | # calculate the closest, best ngram in part 317 | sim = sorted([NGramDiffScore(ngd, self.p) for ngd in part]) 318 | for s in sim[:25]: 319 | logger.debug('sim %4.1f %2u %u %6u %6u %s' % \ 320 | (s.score, s.ediff, s.sl, s.ngd.oldfreq, s.ngd.newfreq, ' '.join(s.ngd.newtoks()))) 321 | 322 | best = list(takewhile(lambda s:s.score > 0, sim))[:max_suggest] 323 | for b in best: 324 | logger.debug('best %s' % (b,)) 325 | return best 326 | 327 | def ngram_suggest(self, target_ngram, target_freq, d, max_suggest=1): 328 | """ 329 | we calculate ngram context and collect solutions for each context 330 | containing the target, then merge them into a cohesive, best suggestion. 331 | c d e 332 | a b c d e f g 333 | given ngram (c,d,e), calculate context and solve: 334 | [S(a,b,c), S(b,c,d), S(c,d,e), S(d,e,f), S(e,f,g)] 335 | """ 336 | 337 | logger.debug('target_ngram=%s' % (target_ngram,)) 338 | tlen = len(target_ngram) 339 | 340 | context = list(d.ngram_context(target_ngram, tlen)) 341 | logger.debug('context=%s' % (context,)) 342 | ctoks = [c[0] for c in context] 343 | clen = len(context) 344 | 345 | logger.debug('tlen=%d clen=%d' % (tlen, clen)) 346 | context_ngrams = list2ngrams(context, tlen) 347 | logger.debug('context_ngrams=%s' % (context_ngrams,)) 348 | 349 | # gather suggestions for each ngram overlapping target_ngram 350 | sugg = [(ng, self.do_suggest(ng, self.g.freq([x[0] for x in ng]), context_ngrams, d)) 351 | for ng in [target_ngram]] #context_ngrams] 352 | 353 | for ng,su in sugg: 354 | for s in su: 355 | logger.debug('sugg %s' % (s,)) 356 | 357 | """ 358 | previously we leaned heavily on ngram frequencies and the sums of them for 359 | evaluating suggestions in context. 360 | instead, we will focus specifically on making the smallest changes which have the 361 | largest improvements, and in trying to normalize a document, i.e. 362 | "filling in the gaps" of as many 0-freq ngrams as possible. 363 | """ 364 | 365 | # merge suggestions based on what they change 366 | realdiff = {} 367 | for ng,su in sugg: 368 | for s in su: 369 | rstr = ' '.join(s.ngd.newtoks()) 370 | if rstr in realdiff: 371 | realdiff[rstr] += s 372 | else: 373 | realdiff[rstr] = s 374 | logger.debug('real %s %s' % (rstr, realdiff[rstr])) 375 | 376 | # sort the merged suggestions based on their combined score 377 | rdbest = sorted(realdiff.values(), key=lambda x:x.score, reverse=True) 378 | 379 | # finally, allow frequency to overcome small differences in score, but only 380 | # for scores that are within 1 to begin with. 
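		# (implementation note: the cmp-style list.sort() used below is Python 2
		#  only; a Python 3 port would need functools.cmp_to_key or a key function.)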
381 | # if we account for frequency too much the common language idioms always crush 382 | # valid but less common phrases; if we don't account for frequency at all we often 383 | # recommend very similar but uncommon and weird phrases. this attempts to strike a balance. 384 | rdbest.sort(lambda x,y: 385 | y.score - x.score if abs(x.score - y.score) > 1 \ 386 | else (y.score + int(log(y.ngd.newfreq))) - \ 387 | (x.score + int(log(x.ngd.newfreq)))) 388 | 389 | for ngds in rdbest: 390 | logger.debug('best %s' % (ngds,)) 391 | 392 | return rdbest 393 | 394 | def suggest(self, txt, max_suggest=1, skip=[]): 395 | """ 396 | given a string, run suggest() and apply the first suggestion 397 | """ 398 | logger.debug('Chick.suggest(txt=%s max_suggest=%s, skip=%s)' % (txt, max_suggest, skip)) 399 | 400 | d = Doc(txt, self.w) 401 | logger.debug('doc=%s' % d) 402 | 403 | """ 404 | locate uncommon n-gram sequences which may indicate grammatical errors 405 | see if we can determine better replacements for them given their context 406 | """ 407 | 408 | # order n-grams by unpopularity 409 | ngsize = min(3, d.totalTokens()) 410 | logger.debug('ngsize=%s d.totalTokens()=%s' % (ngsize, d.totalTokens())) 411 | logger.debug('ngram(1) freq=%s' % list(d.ngramfreqctx(self.g,1))) 412 | 413 | # locate the least-common ngrams 414 | # TODO: in some cases an ngram is unpopular, but overlapping ngrams on either side 415 | # are relatively popular. 416 | # is this useful in differentiating between uncommon but valid phrases from invalid ones? 417 | """ 418 | sugg did the future 156 419 | sugg the future would 3162 420 | sugg future would undoubtedly 0 421 | sugg would undoubtedly be 3111 422 | sugg undoubtedly be changed 0 423 | """ 424 | 425 | least_common = sort1(d.ngramfreqctx(self.g, ngsize)) 426 | logger.debug('least_common=%s' % least_common[:20]) 427 | # remove any ngrams present in 'skip' 428 | least_common = list(dropwhile(lambda x: x[0] in skip, least_common)) 429 | # filter ngrams containing numeric tokens or periods, they generate too many poor suggestions 430 | least_common = list(filter( 431 | lambda ng: not any(re.match('^(?:\d+|\.)$', n[0][0], re.U) 432 | for n in ng[0]), 433 | least_common)) 434 | 435 | # FIXME: limit to reduce work 436 | least_common = least_common[:max(20, len(least_common)/2)] 437 | 438 | # gather all suggestions for all least_common ngrams 439 | suggestions = [] 440 | for target_ngram,target_freq in least_common: 441 | suggs = self.ngram_suggest(target_ngram, target_freq, d, max_suggest) 442 | if suggs: 443 | suggestions.append(suggs) 444 | 445 | if not suggestions: 446 | """ 447 | """ 448 | ut = list(d.unknownToks()) 449 | logger.debug('unknownToks=%s' % ut) 450 | utChanges = [(u, (self.w.correct(u[0]), u[1], u[2], u[3])) for u in ut] 451 | logger.debug('utChanges=%s' % utChanges) 452 | utChanges2 = list(filter(lambda x: x not in skip, utChanges)) 453 | for old,new in utChanges2: 454 | td = TokenDiff([old], [new], damerau_levenshtein(old[0], new[0])) 455 | ngd = NGramDiff([], td, [], self.g) 456 | ngds = NGramDiffScore(ngd, None, 1) 457 | suggestions.append([ngds]) 458 | 459 | logger.debug('------------') 460 | logger.debug('suggestions=%s' % (suggestions,)) 461 | suggs = filter(lambda x:x and x[0].ngd.newfreq != x[0].ngd.oldfreq, suggestions) 462 | logger.debug('suggs=%s' % (suggs,)) 463 | # sort suggestions by their score, highest first 464 | bestsuggs = rsort(suggs, key=lambda x: x[0].score) 465 | # by total new frequency... 
466 | bestsuggs = rsort(bestsuggs, key=lambda x: x[0].ngd.newfreq) 467 | # then by improvement pct. for infinite improvements this results in 468 | # the most frequent recommendation coming to the top 469 | bestsuggs = rsort(bestsuggs, key=lambda x: x[0].improve_pct()) 470 | 471 | # finally, allow frequency to overcome small differences in score, but only 472 | # for scores that are within 1 to begin with. 473 | # if we account for frequency too much the common language idioms always crush 474 | # valid but less common phrases; if we don't account for frequency at all we often 475 | # recommend very similar but uncommon and weird phrases. this attempts to strike a balance. 476 | """ 477 | bestsuggs.sort(lambda x,y: 478 | x[0].score - y[0].score if abs(x[0].score - y[0].score) > 1 \ 479 | else \ 480 | (y[0].score + int(log(y[0].ngd.newfreq))) - \ 481 | (x[0].score + int(log(x[0].ngd.newfreq)))) 482 | """ 483 | 484 | for bs in bestsuggs: 485 | for bss in bs: 486 | logger.debug('bestsugg %6.2f %2u %2u %7u %6.0f%% %s' % \ 487 | (bss.score, bss.ediff, bss.ngd.diff.damlev, 488 | bss.ngd.newfreq, bss.improve_pct(), ' '.join(bss.ngd.newtoks()))) 489 | 490 | for bs in bestsuggs: 491 | logger.debug('> bs=%s' % (bs,)) 492 | yield bs 493 | 494 | # TODO: now the trick is to a) associate these together based on target_ngram 495 | # to make them persist along with the document 496 | # and to recalculate them as necessary when a change is applied to the document that 497 | # affects anything they overlap 498 | 499 | def correct(self, txt): 500 | """ 501 | given a string, identify the least-common n-gram not present in 'skip' 502 | and return a list of suggested replacements 503 | """ 504 | d = Doc(txt, self.w) 505 | changes = list(self.suggest(d, 1)) 506 | for ch in changes: 507 | logger.debug('ch=%s' % (ch,)) 508 | change = [ch[0].ngd] 509 | logger.debug('change=%s' % (change,)) 510 | d.applyChanges(change) 511 | logger.debug('change=%s after applyChanges d=%s' % (change, d)) 512 | d = Doc(d, self.w) 513 | break # FIXME: loops forever 514 | changes = list(self.suggest(d, 1)) 515 | res = str(d).decode('utf8') 516 | logger.debug('correct res=%s %s' % (type(res),res)) 517 | return res 518 | 519 | -------------------------------------------------------------------------------- /src/corpus.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | 4 | """ 5 | 6 | """ 7 | 8 | from collections import defaultdict 9 | import re 10 | import sys 11 | import traceback 12 | import os 13 | from gram import Grams 14 | # NOTE: tried subprocess module but doesn't seem to be able to do per-line output... 15 | 16 | def shell_escape(str): 17 | return str.replace(' ', '\\ ').replace("'", "\\'") 18 | 19 | def cat_cmd(filename): 20 | l = filename.lower() 21 | if l.endswith('.bz2'): 22 | return 'bzcat %s' % (shell_escape(filename),) 23 | elif l.endswith('.tar.gz') or l.endswith('.tgz'): 24 | return 'zcat %s | tar xfO -' % (shell_escape(filename),) 25 | else: 26 | return 'cat %s' % (shell_escape(filename),) 27 | 28 | def corpus(name='gutenberg'): 29 | dir = '../data/corpus/'+name+'/' 30 | for file in os.popen('ls ' + dir + '|head -n 3', 'r'): 31 | file = file.strip() 32 | print('%s...' 
% (file,)) 33 | p = os.popen(cat_cmd(dir+file), 'r') 34 | yield p 35 | 36 | # wikipedia markup filter generator 37 | class wikipedia_lines: 38 | def __init__(self, p): 39 | self.p = p 40 | def __iter__(self): 41 | for line in self.p: 42 | # find article start 43 | for line in self.p: 44 | if '' in line: 47 | # go until article end 48 | for line in self.p: 49 | if '' in line: 50 | break 51 | # FIXME: this regex crap is 90% of our processing time 52 | line = re.sub('</?ref.*?>?', '', line) # ref crap 53 | line = re.sub('{{.*(?:}})?', '', line) # citation crap 54 | line = re.sub('!--.*?--', '', line) # comments 55 | line = re.sub('\[\[.*]]', '', line) # interior link 56 | line = re.sub('&\S+;?', '', line) # entity crap 57 | line = re.sub('&\w+;?|!--.*?--|.*}}', '', line) # &entity; 58 | line = re.sub("''wikt:(.*?)''", '\\1', line) # wiktionary link 59 | line = re.sub('\[http.*?]', '', line) # exterior link 60 | line = re.sub('(?:File|Image|Category):\S+', '', line) # exterior link 61 | #line = re.sub('.*}}', '', line) # multi-line citation 62 | if re.match('^[a-z]{2,3}:\S+', line): 63 | continue 64 | line = line.strip() 65 | if line == '' or line[0] == '|' or line[0] == '!' or line[0] == '{' or ']]' in line: 66 | continue 67 | #print(line) 68 | yield line 69 | 70 | def corpus_wikipedia(): 71 | p = os.popen('bzcat ../data/corpus/enwiki-latest-pages-articles.xml.bz2 2>/dev/null | head -n 500000', 'r') 72 | yield wikipedia_lines(p) 73 | 74 | class email_lines: 75 | def __init__(self, p): 76 | self.p = p 77 | def __iter__(self): 78 | for line in self.p: 79 | if line.startswith('X-') or \ 80 | line.startswith('=09') or \ 81 | re.match('^(Content-Transfer-Encoding|Message-ID|Date|From|To|Subject|Cc|Mime-Version|Content-Type|Bcc):', line): 82 | continue 83 | yield line 84 | 85 | def corpus_enron(): 86 | p = os.popen('zcat ../data/corpus/enron_mail_20110402.tgz | tar xfO - 2>/dev/null', 'r') 87 | yield email_lines(p) 88 | 89 | def parse_corpus(c): 90 | g = Grams() 91 | for p in c: 92 | g.add(p) 93 | return g 94 | 95 | def ngram_match(tok, w2id, ngrams): 96 | if tok not in w2id: 97 | return [] 98 | id = w2id[tok] 99 | print('%s -> %s' % (tok, id)) 100 | return [n for n in ngrams.keys() if id in n] 101 | 102 | import pickle 103 | 104 | if __name__ == '__main__': 105 | f = ['a b c','d e f'] 106 | g = Grams(f) 107 | print(g) 108 | 109 | w = parse_corpus(corpus_enron()) 110 | pop = sorted(w.ngrams.items(), key=lambda x:x[1], reverse=True)[:200] 111 | popw = [(tuple(w.id2w[id] for id in n),cnt) for n,cnt in pop] 112 | print(popw) 113 | print('len(pickle(w2id))=%s' % (len(pickle.dumps(w.w2id)),)) 114 | 115 | #print([tuple(id2w[id] for id in ng) for ng in ngram_match('the', w2id, ngrams)[:100]]) 116 | 117 | -------------------------------------------------------------------------------- /src/doc.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | 4 | """ 5 | Doc represents a document being checked against existing Grams 6 | """ 7 | 8 | import collections 9 | import unittest 10 | import math 11 | import gram 12 | from ngramdiff import TokenDiff,NGramDiff,NGramDiffScore 13 | import copy 14 | 15 | import logging 16 | logger = logging.getLogger('spill-chick') 17 | 18 | """ 19 | Tokenized contents of a single file 20 | Tokens associated with positional data to faciliate changes 21 | """ 22 | class Doc: 23 | 24 | def __init__(self, f, w): 25 | self.words = w # global tokens 26 | #self.docwords = collections.Counter() # 
local {token:freq} 27 | self.tokenize(f) 28 | 29 | self.sugg = [] # list of suggestions by [line][ngram]suggestions... 30 | # suggs are aligned with fixed-size ngrams, so for ngrams size 3 31 | # sugg[0][0] refers to ngram of line=0 tokens[0,1,2] 32 | # lines that have fewer than ngram len tokens are ignored at this time 33 | 34 | def __str__(self): 35 | return unicode(self).encode('utf-8') 36 | 37 | def __unicode__(self): 38 | s = unicode('\n'.join(self.lines)) 39 | return s 40 | 41 | def __repr__(self): 42 | return 'Doc(%s)' % str(self) 43 | 44 | def __iter__(self): 45 | return iter(self.lines) 46 | 47 | def tokenize(self, f): 48 | self.lines = [] 49 | self.tok = [] 50 | for lcnt,line in enumerate(f): 51 | self.lines.append(line) 52 | line = line.lower() # used for index below 53 | toks = gram.tokenize(line) 54 | if toks and toks[-1] == '\n': 55 | toks.pop() 56 | #self.docwords.update(toks) # add words to local dictionary 57 | tpos = 0 58 | ll = [] 59 | for t in toks: 60 | tpos = line.index(t, tpos) 61 | ll.append((t, lcnt, len(ll), tpos)) 62 | tpos += len(t) 63 | self.tok.append(ll) 64 | 65 | def totalTokens(self): 66 | return sum(len(ts) for ts in self.tok) 67 | #return sum(self.docwords.values()) 68 | 69 | def unknownToks(self): 70 | for tok in self.tok: 71 | for t in tok: 72 | if self.words.freq(t[0]) == 0: 73 | yield t 74 | 75 | # given token t supply surrounding token ngram (x, tok, y) 76 | def surroundTok(self, t): 77 | line = self.tok[t[1]] 78 | idx = line.index(t) 79 | if idx > 0 and idx < len(line)-1: 80 | return tuple(line[idx-1:idx+2]) 81 | return None 82 | 83 | def ngrams(self, size): 84 | for tok in self.tok: 85 | for i in range(0, len(tok)+1-size): 86 | yield tuple(tok[i:i+size]) 87 | 88 | def ngramfreq(self, g, size): 89 | for ng in self.ngrams(size): 90 | ng2 = tuple(t[0] for t in ng) 91 | yield (ng, g.freq(ng2)) 92 | 93 | def ngramfreqctx(self, g, size): 94 | """ 95 | return each ngram in document, and the sum of the frequencies 96 | of all overlapping ngrams 97 | """ 98 | for toks in self.tok: 99 | if not toks: 100 | continue 101 | ngs = [tuple(t[0] for t in toks[i:i+size]) 102 | for i in range(max(1, len(toks)-size+1))] 103 | for i in range(len(ngs)): 104 | ctx = ngs[max(0,i-size-1):i+size] 105 | freq = sum(map(g.freq,ctx)) / len(ctx) 106 | yield (toks[i:i+size], freq) 107 | 108 | def ngram_prev(self, ngpos): 109 | _,line,index,_ = ngpos 110 | if index == 0: 111 | if line == 0: 112 | return None 113 | line -= 1 114 | while line >= 0 and self.tok[line] == []: 115 | line -= 1 116 | if line == -1: 117 | return None 118 | index = len(self.tok[line]) - 1 119 | else: 120 | index -= 1 121 | if index >= len(self.tok[line]): 122 | # if the first line is empty we need this 123 | return None 124 | return self.tok[line][index] 125 | 126 | def ngram_next(self, ngpos): 127 | _,line,index,_ = ngpos 128 | if line >= len(self.tok): 129 | return None 130 | if index >= len(self.tok[line]): 131 | line += 1 132 | while line < len(self.tok) and self.tok[line] == []: 133 | line += 1 134 | if line >= len(self.tok): 135 | return None 136 | index = 0 137 | else: 138 | index += 1 139 | if index >= len(self.tok[line]): 140 | # if the last line is empty we need this 141 | return None 142 | return self.tok[line][index] 143 | 144 | def ngram_context(self, ngpos, size): 145 | """ 146 | given an ngram and a size, return a list of ngrams that contain 147 | one or more members of ngram 148 | c d e 149 | a b c d e f g 150 | """ 151 | before, ng = [], ngpos[0] 152 | for i in range(size-1): 153 | ng = 
self.ngram_prev(ng) 154 | if not ng: 155 | break 156 | before.insert(0, ng) 157 | after, ng = [], ngpos[-1] 158 | for i in range(size-1): 159 | ng = self.ngram_next(ng) 160 | if not ng: 161 | break 162 | after.append(ng) 163 | return before + list(ngpos) + after 164 | 165 | @staticmethod 166 | def matchCap(x, y): 167 | """ 168 | Modify replacement word 'y' to match the capitalization of existing word 'x' 169 | (foo,bar) -> bar 170 | (Foo,bar) -> Bar 171 | (FOO,bar) -> BAR 172 | """ 173 | if x == x.lower(): 174 | return y 175 | elif x == x.capitalize(): 176 | return y.capitalize() 177 | elif x == x.upper(): 178 | return y.upper() 179 | return y 180 | 181 | def applyChange(self, lines, ngd, off): 182 | """ 183 | given an ngram containing position data, replace corresponding data 184 | in lines with 'mod' 185 | """ 186 | d = ngd.diff # ngd.diff=TokenDiff(([(u'cheese', 0, 2, 9), (u'burger', 0, 3, 16)],[(u'cheeseburger', 0, 2, 9)])) 187 | # FIXME: deal with insertion 188 | # FIXME: treat new/old as separate sequences, instead of 1-to-1-ish 189 | old = copy.deepcopy(d.old) 190 | for mod in d.newtoks(): 191 | #print 'ngd.diff=%s' % (ngd.diff,) 192 | o,l,idx,pos = old.pop(0) 193 | pos += off[l] 194 | end = pos + len(o) 195 | #print 'o=%s l=%s idx=%s pos=%s end=%s old=%s' % (o,l,idx,pos,end,old) 196 | ow = lines[l][pos:end] 197 | if not mod and pos > 0 and lines[l][pos-1] in (' ','\t','\r','\n'): 198 | # if we've removed a token and it was preceded by whitespace, 199 | # nuke that whitespace as well 200 | pos -= 1 201 | cap = Doc.matchCap(ow, mod) 202 | #print 'cap=%s' % (cap,) 203 | lines[l] = lines[l][:pos] + cap + lines[l][end:] 204 | off[l] += len(cap) - len(o) 205 | # FIXME: over-simplified; consider multi-token change 206 | #self.docwords[ow] -= 1 207 | if mod: 208 | pass 209 | #self.docwords[mod] += 1 210 | return (lines, off) 211 | 212 | def demoChanges(self, changes): 213 | """ 214 | given a list of positional ngrams and a list of replacements, 215 | apply the changes and return a copy of the updated file 216 | """ 217 | logger.debug('Doc.demoChanges changes=%s' % (changes,)) 218 | lines = self.lines[:] 219 | off = [0] * len(lines) 220 | for ngd in changes: 221 | lines, off = self.applyChange(lines, ngd, off) 222 | return lines 223 | 224 | def applyChanges(self, changes): 225 | self.tokenize(self.demoChanges(changes)) 226 | 227 | class DocTest(unittest.TestCase): 228 | def test_change(self): 229 | pass 230 | 231 | if __name__ == '__main__': 232 | unittest.main() 233 | 234 | -------------------------------------------------------------------------------- /src/gram.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | 4 | from collections import defaultdict 5 | try: 6 | from collections import Counter 7 | except ImportError: # python 2.6? 8 | pass 9 | import re, sys, traceback 10 | import unittest 11 | from operator import itemgetter 12 | 13 | """ 14 | Tokenizing regular expression 15 | Group: 16 | letters 17 | numbers and any punctuation 18 | group things like dates, times, ip addresses, etc. 
into a single token 19 | """ 20 | TokRgxNL = re.compile('\d+(?:[^\w\s]+\d+)*|\w+|\.|\n', re.UNICODE) 21 | TokRgx = re.compile('\d+(?:[^\w\s]+\d+)*|\w+|\.', re.UNICODE) 22 | def tokenize(str): 23 | return re.findall(TokRgxNL, str.lower()) 24 | def tokenize_no_nl(str): 25 | return re.findall(TokRgx, str.lower()) 26 | 27 | class TokenizerTest(unittest.TestCase): 28 | def test_tokenize(self): 29 | Expect = [ 30 | ('', []), 31 | ('a', ['a']), 32 | ('A', ['a']), 33 | ('Aa', ['aa']), 34 | ('a b', ['a','b']), 35 | ] 36 | for s,xp in Expect: 37 | res = tokenize(s) 38 | self.assertEqual(xp, res) 39 | 40 | """ 41 | store corpus ngrams 42 | """ 43 | class Grams: 44 | def __init__(self, w, ngmax=3, f=None): 45 | self.words = w 46 | self.ngmax = ngmax 47 | self.ngrams = ( # ngram id -> frequency 48 | None, 49 | None, 50 | Counter(), 51 | Counter(), 52 | Counter(), 53 | ) 54 | if f: 55 | self.add(f) 56 | def freq(self, ng): 57 | #assert type(ng) == tuple 58 | if len(ng) == 1: 59 | return self.words.freq(ng[0]) 60 | if ng == (): # FIXME: shouldn't need this 61 | return 0 62 | return self.ngrams[len(ng)][ng] 63 | def freqs(self, s): 64 | return self.words.freq(s) 65 | # given an iterable 'f', tokenize and produce a {word:id} mapping and ngram frequency count 66 | def add(self, f): 67 | if type(f) == list: 68 | contents = '\n'.join(f) 69 | else: 70 | try: 71 | contents = f.read(1 * 1024 * 1024) # FIXME 72 | if type(contents) == bytes: 73 | contents = contents.decode('utf8') 74 | except UnicodeDecodeError: 75 | t,v,tb = sys.exc_info() 76 | traceback.print_tb(tb) 77 | toks = tokenize_no_nl(contents) 78 | self.words.addl(toks) 79 | self.ngrams[2].update(zip(toks, toks[1:])) 80 | self.ngrams[3].update(zip(toks, toks[1:], toks[2:])) 81 | self.ngrams[4].update(zip(toks, toks[1:], toks[2:], toks[3:])) 82 | print(' ngrams[2] %8u' % len(self.ngrams[2])) 83 | print(' ngrams[3] %8u' % len(self.ngrams[3])) 84 | print(' ngrams[4] %8u' % len(self.ngrams[4])) 85 | 86 | """ 87 | given ngram of arity n, return all known ngrams containing n-1 matches; 88 | that is, where all but one of the tokens match. 89 | this is obviously O(n) and because it is exhaustive it is inefficient. 
90 | consider eventually either moving ngrams into an sqlite database or a 91 | custom in-memory structure in C 92 | select x,y,z 93 | from ngram3 94 | where (x = 'x') + (y = 'y') + (z = 'z') = 2 95 | order by freq desc 96 | """ 97 | def ngram_like(self, ng): 98 | if len(ng) <= 1: 99 | return [] 100 | assert len(ng) in (2,3) 101 | def uniq(s0,n): 102 | d = dict([(s[0][n],s[1]) for s in s0]) 103 | s = sorted(d.items(), key=itemgetter(1), reverse=True) 104 | return [x for x,y in s] 105 | if len(ng) == 2: 106 | f = lambda x: x[0] == ng[0] or \ 107 | x[1] == ng[1] 108 | elif len(ng) == 3: 109 | f = lambda x:(x[0] == ng[0]) + \ 110 | (x[1] == ng[1]) + \ 111 | (x[2] == ng[2]) == 2 112 | lng = len(ng) 113 | s0 = filter(f, self.ngrams[lng].keys()) 114 | s1 = [(k,self.ngrams[lng][k]) for k in s0] 115 | cnt = tuple(uniq(s1,n) for n in range(lng)) 116 | return cnt 117 | 118 | import pickle 119 | 120 | if __name__ == '__main__': 121 | unittest.main() 122 | 123 | -------------------------------------------------------------------------------- /src/grambin.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | 4 | from operator import itemgetter 5 | import sys 6 | from ngram3bin import ngram3bin 7 | from ngramdiff import TokenDiff,NGramDiff,NGramDiffScore 8 | from util import * 9 | 10 | """ 11 | Grams-interface to our binary ngram database 12 | """ 13 | class GramsBin: 14 | 15 | def __init__(self, wordpath, ngrampath): 16 | self.ng = ngram3bin(wordpath, ngrampath) 17 | 18 | def freq(self, ng, sum_=sum): 19 | #print 'freq()=',ng 20 | l = len(ng) 21 | if l > 1: 22 | ids = map(self.ng.word2id, ng) 23 | if l > 3: 24 | # chop up id list into ngram3-sized chunks 25 | smaller = [tuple(ids[i:i+3]) for i in range(len(ids)-3+1)] 26 | fr = sum_(self.ng.freq(*s) for s in smaller) 27 | else: 28 | fr = self.ng.freq(*ids) 29 | return fr 30 | else: 31 | return self.ng.wordfreq(ng[0]) 32 | 33 | def freqs(self, s): 34 | #print('freq(s)=',s) 35 | return self.ng.wordfreq(s) 36 | 37 | def ngram_like(self, ng, ngfreq): 38 | """ 39 | given an ngram (x,y,z), return a list of ngrams sharing all but one element, i.e. 
40 | (_,y,z) 41 | (x,_,z) 42 | (x,y,_) 43 | """ 44 | if len(ng) != 3: 45 | return [] 46 | #print 'like()=',ng 47 | ids = tuple(map(self.ng.word2id, [n[0] for n in ng])) 48 | #print('like(ids)=',ids) 49 | like = self.ng.like(*ids) 50 | #print 'like(',ng,')=',like 51 | like2 = [] 52 | for l in set(like): 53 | t,tfreq = tuple(map(self.ng.id2word, l[:3])), l[3] 54 | # calculate the single differing token and build an NGramDiff 55 | di = 0 if l[0] != ids[0] else 1 if l[1] != ids[1] else 2 56 | # do not bother with words that are of grossly different 57 | # length than our target 58 | if abs(len(t[di]) - len(ng[di][0])) > 3: 59 | continue 60 | newtok = (t[di],) + ng[di][1:] 61 | damlev = damerau_levenshtein(ng[di][0], t[di]) 62 | ngd = NGramDiff(ng[:di], 63 | TokenDiff(ng[di:di+1], [newtok], damlev), 64 | ng[di+1:], self, ngfreq, tfreq) 65 | like2.append(ngd) 66 | like3 = sorted(like2, reverse=True) 67 | return like2 68 | 69 | -------------------------------------------------------------------------------- /src/ngramdiff.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | """ 5 | token and ngram comparison 6 | """ 7 | 8 | from math import sqrt,log 9 | from util import damerau_levenshtein 10 | 11 | class TokenDiff: 12 | """ 13 | represent the modification of zero or more 'old' (original) tokens and their 14 | 'new' (proposed) replacement. solves the problem of tracking inter-token changes. 15 | change: TokenDiff([tok], [tok']) 16 | insert: TokenDiff([], [tok']) 17 | delete: TokenDiff([tok], []) 18 | split: TokenDiff([tok], [tok',tok']) 19 | merge: TokenDiff([tok,tok], [tok']) 20 | """ 21 | def __init__(self, old, new, damlev): 22 | self.old = list(old) 23 | self.new = list(new) 24 | self.damlev = damlev # Damerau-Levenshtein distance 25 | def oldtoks(self): return [t[0] for t in self.old] 26 | def newtoks(self): return [t[0] for t in self.new] 27 | def __str__(self): 28 | return 'TokenDiff((%s,%s))' % (self.old, self.new) 29 | def __repr__(self): 30 | return str(self) 31 | def __eq__(self, other): 32 | return self.old == other.old and \ 33 | self.new == other.new 34 | 35 | class NGramDiff: 36 | """ 37 | represent a list of tokens that contain a single change, represented by a TokenDiff. 
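    for example, correcting "peace" -> "piece" inside the ngram (a, peace, of)
    splits as prefix=[a], diff=TokenDiff([peace], [piece], damlev), suffix=[of]
    (positional fields and the grams handle are omitted in this illustration).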
38 | alternative, think of it as an acyclic directed graph with a single branch and merge 39 | conceptually: 40 | prefix diff suffix 41 | O---O---O---O---O---O---O 42 | \ / 43 | `-O-' 44 | """ 45 | def __init__(self, prefix, diff, suffix, g, oldfreq=None, newfreq=None, soundalike=False): 46 | self.prefix = list(prefix) 47 | self.diff = diff 48 | self.suffix = list(suffix) 49 | self.oldfreq = g.freq(self.oldtoks()) if oldfreq is None else oldfreq 50 | self.newfreq = g.freq(self.newtoks()) if newfreq is None else newfreq 51 | self.soundalike = soundalike 52 | def old(self): return self.prefix + self.diff.old + self.suffix 53 | def new(self): return self.prefix + self.diff.new + self.suffix 54 | def oldtoks(self): return [t[0] for t in self.old()] 55 | def newtoks(self): return [t[0] for t in self.new()] 56 | def __repr__(self): 57 | return str(self) 58 | def __str__(self): 59 | return 'NGramDiff(%s,%s,%s)' % (self.prefix, self.diff, self.suffix) 60 | def __eq__(self, other): 61 | return self.diff == other.diff and \ 62 | self.prefix == other.prefix and \ 63 | self.suffix == other.suffix 64 | def __lt__(self, other): 65 | return other.newfreq < self.newfreq 66 | 67 | class NGramDiffScore: 68 | # based on our logarithmic scoring below 69 | DECENT_SCORE = 3.0 70 | GOOD_SCORE = 5.0 71 | """ 72 | decorate an NGramDiff obj with scoring 73 | """ 74 | def __init__(self, ngd, p, score=None): 75 | self.ngd = ngd 76 | self.sl = ngd.diff.new and ngd.diff.old and ngd.diff.new[0][0][0] == ngd.diff.old[0][0][0] 77 | if score: 78 | self.score = score 79 | self.ediff = score 80 | else: 81 | self.score = self.calc_score(ngd, p) 82 | def calc_score(self, ngd, p): 83 | ediff = self.similarity(ngd, p) 84 | self.ediff = ediff 85 | if ngd.newfreq == 0: 86 | score = -float('inf') 87 | else: 88 | # weigh edit distance much more heavily than frequency 89 | score = 10 - (2 + ediff + (not self.sl)) 90 | return score 91 | def improve_pct(self): 92 | """How much of an improvement is the new from the old?""" 93 | if self.ngd.oldfreq == 0: 94 | return float('inf') 95 | return self.ngd.newfreq / self.ngd.oldfreq 96 | def __str__(self): 97 | return 'NGramDiffScore(score=%4.1f ngd=%s)' % (self.score, self.ngd) 98 | def __repr__(self): 99 | return str(self) 100 | def __eq__(self, other): 101 | return other.score == self.score 102 | def __lt__(self, other): 103 | return other.score < self.score 104 | def __add__(self, other): 105 | return NGramDiffScore(self.ngd, None, self.score + other.score) 106 | @staticmethod 107 | def overlap(s1, s2): 108 | """ 109 | given a list of sound()s, count the number that do not match 110 | 1 2 3 4 5 6 111 | 'T AH0 M AA1 R OW2' 112 | 'T UW1 M' 113 | = = 114 | 6 - 2 = 4 115 | """ 116 | mlen = max(len(s1), len(s2)) 117 | neq = sum(map(lambda x: x[0] != x[1], zip(s1, s2))) 118 | return mlen - neq 119 | def similarity(self, ngd, p): 120 | """ 121 | return tuple (effective difference, absolute distance) 122 | given a string x, calculate a similarity distance for y [0, +inf). 123 | smaller means more similar. the goal is to identify promising 124 | alternatives for a given token within a document; we need to consider 125 | the wide range of possible errors that may have been made 126 | """ 127 | if ngd.soundalike: 128 | return 0 129 | x = ' '.join(ngd.diff.oldtoks()) 130 | y = ' '.join(ngd.diff.newtoks()) 131 | # tokens identical 132 | if x == y: 133 | return 0 134 | damlev = ngd.diff.damlev 135 | sx,sy = p.phraseSound([x]),p.phraseSound([y]) 136 | if sx == sy and sx: 137 | # sound the same, e.g. 
there/their. consider these equal. 138 | return damlev 139 | # otherwise, calculate phonic/edit difference 140 | return max(damlev, 141 | min(NGramDiffScore.overlap(sx, sy), 142 | abs(len(x)-len(y)))) 143 | 144 | if __name__ == '__main__': 145 | import sys 146 | sys.path.append('..') 147 | from grambin import GramsBin 148 | from word import Words,NGram3BinWordCounter 149 | from phon import Phon 150 | import logging 151 | 152 | logging.basicConfig(stream=sys.stderr, level=logging.DEBUG) 153 | logging.debug('loading...') 154 | g = GramsBin( 155 | '/home/pizza/proj/spill-chick/data/corpus/google-ngrams/word.bin', 156 | '/home/pizza/proj/spill-chick/data/corpus/google-ngrams/ngram3.bin') 157 | w = Words(NGram3BinWordCounter(g.ng)) 158 | p = Phon(w,g) 159 | logging.debug('loaded.') 160 | 161 | pass 162 | 163 | -------------------------------------------------------------------------------- /src/phon.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | """ 5 | Handle phonetics; i.e. the way things sound 6 | """ 7 | 8 | import collections, re, sys, gzip, pickle, os, mmap 9 | from word import Words 10 | from gram import tokenize 11 | 12 | class Phon: 13 | def __init__(self, w, g): 14 | self.words = w 15 | self.word = collections.defaultdict(list) 16 | self.phon = collections.defaultdict(list) 17 | self.load(g) 18 | def load(self, g): 19 | dictpath ='/home/pizza/proj/spill-chick/data/cmudict/cmudict.0.7a' 20 | # extract file if necessary 21 | if not os.path.exists(dictpath): 22 | with open(dictpath, 'wb') as dst: 23 | with gzip.open(dictpath + '.gz', 'rb') as src: 24 | dst.write(src.read()) 25 | # TODO: loading this ~130,000 line dictionary in python represents the majority 26 | # of the program's initialization time. move it over to C. 27 | with open(dictpath, 'r') as f: 28 | for line in f: 29 | if line.startswith(';;;'): 30 | continue 31 | line = line.decode('utf8') 32 | line = line.strip().lower() 33 | word, phon = line.split(' ') 34 | """ 35 | skip any words that do not appear in our ngrams. 36 | this makes a significant difference when trying to reconstruct phrases 37 | phonetically; small decreases in terms have large decreases in products. 38 | note: you may think that every word in a dictionary would appear 39 | at least once in a large corpus, but we truncate corpus n-grams at a 40 | certain minimum frequency which may exclude very obscure words from ultimately 41 | appearing at all. 42 | """ 43 | 44 | # TODO: what i really should do is eliminate all words that appear less 45 | # than some statistically significant time; the vast majority of the 46 | # phonetic phrases I currently try are filled with short obscure words 47 | # and are a complete waste 48 | # FIXME: instead of hard-coding frequency, calculate statistically 49 | if word.count("'") == 0 and g.freqs(word) < 500: 50 | continue 51 | """ 52 | implement a very rough phonic fuzzy-matching 53 | phonic codes consist of a list of sounds such as: 54 | REVIEW R IY2 V Y UW1 55 | we simplify this to 56 | REVIEW R I V Y U 57 | this allows words with close but imperfectly sounding matches to 58 | be identified. for example: 59 | REVUE R IH0 V Y UW1 60 | REVIEW R IY2 V Y UW1 61 | is close but not a perfect match. 
after regex: 62 | REVUE R I V Y U 63 | REVIEW R I V Y U 64 | """ 65 | phon = re.sub('(\S)(\S+)', r'\1', phon) 66 | # now merge leading vowels except 'o' and 'u' 67 | if len(phon) > 1: 68 | phon = re.sub('^[aei]', '*', phon) 69 | self.words.add(word) 70 | self.word[word].append(phon) 71 | toks = tokenize(word) 72 | self.phon[phon].append(toks) 73 | 74 | """ 75 | return a list of words that sound like 'word', as long as they appear in ng 76 | """ 77 | def soundsLike(self, word, ng): 78 | l = [] 79 | for w in self.word[word]: 80 | for x in self.phon[w]: 81 | fr = ng.freqs(x) 82 | if fr > 0: 83 | l.append((x,fr)) 84 | return [w for w,fr in sorted(l, key=lambda x:x[1], reverse=True)] 85 | 86 | def phraseSound(self, toks): 87 | """ 88 | given a list of tokens produce a normalize list of their component sound 89 | an unknown token generates None 90 | TODO: ideally we would be able to "guess" the sound of unknown words. 91 | this would be a huge improvement! 92 | given 'waisting' we should be able to break it into 'waist' 'ing' 93 | """ 94 | def head(l): 95 | return l[0] if l else None 96 | s = [head(self.word.get(t,[''])) for t in toks] 97 | #print('phraseSound(',toks,')=',s) 98 | if not all(s): 99 | return [] 100 | # nuke numbers, join into one string 101 | t = ' '.join([re.sub('\d+', '', x) for x in s]) 102 | # nuke consecutive duplicate sounds 103 | u = re.sub('(\S+) \\1 ', '\\1 ', t) 104 | v = u.split() 105 | #print('phraseSound2=',v) 106 | return v 107 | 108 | def soundsToWords(self, snd): 109 | if snd == []: 110 | yield [] 111 | for j in range(1, len(snd)+1): 112 | t = ' '.join(snd[:j]) 113 | words = self.phon.get(t) 114 | if words: 115 | for s in self.soundsToWords(snd[j:]): 116 | yield [words] + s 117 | 118 | if __name__ == '__main__': 119 | 120 | def words(str): 121 | return re.findall('[a-z\']+', str.lower()) 122 | 123 | def pron(wl, wd): 124 | print(' '.join([str(wd[w][0]) if w in wd else '<%s>' % (w,) for w in wl])) 125 | 126 | P = Phon(Words()) 127 | for a in sys.argv[1:]: 128 | pron(words(a), P.W) 129 | 130 | print(P.word['there']) 131 | print(P.phon[P.word['there'][0]]) 132 | 133 | P.phraseSound(['making','mistake']) 134 | P.phraseSound(['may','king','mist','ache']) 135 | x = P.phraseSound(['making','miss','steak']) 136 | from itertools import product 137 | for f in P.soundsToWords(x): 138 | print(f) 139 | #print(list(product(*f))) 140 | 141 | -------------------------------------------------------------------------------- /src/test.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # ex: set ts=8 noet: 4 | # Copyright 2011 Ryan Flynn 5 | 6 | """ 7 | chick.py ✕ test.txt 8 | """ 9 | 10 | import sys, re, logging 11 | from chick import Chick 12 | 13 | logger = logging.getLogger('spill-chick') 14 | hdlr = logging.StreamHandler(sys.stderr) 15 | logger.addHandler(hdlr) 16 | logger.setLevel(logging.DEBUG) 17 | 18 | def load_tests(): 19 | # load test cases 20 | Tests = [] 21 | with open('../test/test.txt','r') as f: 22 | for l in f: 23 | l = l.decode('utf8').strip() 24 | if l == '#--end--': 25 | break 26 | if len(l) > 1 and l[0] != '#': 27 | before, after = l.split(' : ') 28 | after = re.sub('\s*#.*', '', after.rstrip(), re.U) # replace comments 29 | Tests.append(([before],after)) 30 | return Tests 31 | 32 | # TODO: Word() and Grams() should be merged, they're essentially the same 33 | 34 | def test(): 35 | """ 36 | run our tests. 
initialize resources and tests, run each test and 37 | figure out what works and what doesn't. 38 | """ 39 | chick = Chick() 40 | Tests = load_tests() 41 | passcnt = 0 42 | for str,exp in Tests: 43 | logger.debug('Test str=%s exp=%s' % (str, exp)) 44 | res = chick.correct(str) 45 | logger.debug('exp=%s(%s) res=%s(%s)' % (exp, type(exp), res, type(res))) 46 | passcnt += res == exp 47 | logger.debug('----------- %s -------------' % ('pass' if res == exp else 'fail',)) 48 | logger.debug('Tests %u/%u passed.' % (passcnt, len(Tests))) 49 | 50 | def profile_test(): 51 | import cProfile, pstats 52 | cProfile.run('test()', 'test.prof') 53 | st = pstats.Stats('test.prof') 54 | st.sort_stats('time') 55 | st.print_stats() 56 | 57 | if __name__ == '__main__': 58 | 59 | from sys import argv 60 | if len(argv) > 1 and argv[1] == '--profile': 61 | profile_test() 62 | else: 63 | test() 64 | 65 | -------------------------------------------------------------------------------- /src/util.py: -------------------------------------------------------------------------------- 1 | 2 | """ 3 | classes and utility functions that are used by everyone 4 | """ 5 | 6 | from operator import itemgetter 7 | from math import sqrt,log 8 | from itertools import chain 9 | # convenience functions 10 | def rsort(l, **kw): return sorted(l, reverse=True, **kw) 11 | def rsort1(l): return rsort(l, key=itemgetter(1)) 12 | def rsort2(l): return rsort(l, key=itemgetter(2)) 13 | def sort1(l): return sorted(l, key=itemgetter(1)) 14 | def sort2(l): return sorted(l, key=itemgetter(2)) 15 | def flatten(ll): return chain.from_iterable(ll) 16 | def zip_longest(x, y, pad=None): 17 | x, y = list(x), list(y) 18 | lx, ly = len(x), len(y) 19 | if lx < ly: 20 | x += [pad] * (ly-lx) 21 | elif ly < lx: 22 | y += [pad] * (lx-ly) 23 | return zip(x, y) 24 | 25 | def damerau_levenshtein(seq1, seq2): 26 | """Calculate the Damerau-Levenshtein distance between sequences. 27 | 28 | This distance is the number of additions, deletions, substitutions, 29 | and transpositions needed to transform the first sequence into the 30 | second. Although generally used with strings, any sequences of 31 | comparable objects will work. 32 | 33 | Transpositions are exchanges of *consecutive* characters; all other 34 | operations are self-explanatory. 35 | 36 | This implementation is O(N*M) time and O(M) space, for N and M the 37 | lengths of the two sequences. 38 | 39 | >>> damerau_levenshtein('ba', 'abc') 40 | 2 41 | >>> damerau_levenshtein('fee', 'deed') 42 | 2 43 | 44 | It works with arbitrary sequences too: 45 | >>> damerau_levenshtein('abcd', ['b', 'a', 'c', 'd', 'e']) 46 | 2 47 | """ 48 | # codesnippet:D0DE4716-B6E6-4161-9219-2903BF8F547F 49 | # Conceptually, this is based on a (len(seq1) + 1) x (len(seq2) + 1) matrix. 50 | # However, only the current and two previous rows are needed at once, 51 | # so we only store those. 52 | oneago = None 53 | thisrow = range(1, len(seq2) + 1) + [0] 54 | for x in xrange(len(seq1)): 55 | # Python lists wrap around for negative indices, so put the 56 | # leftmost column at the *end* of the list. This matches with 57 | # the zero-indexed strings and saves extra calculation. 
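For example, with len(seq2) == 3 the first thisrow is [1, 2, 3, 0]: the trailing element stands in for the matrix's leftmost column, so the negative index in thisrow[y - 1] and oneago[y - 1] wraps around to it when y == 0.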
58 | twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1] 59 | for y in xrange(len(seq2)): 60 | delcost = oneago[y] + 1 61 | addcost = thisrow[y - 1] + 1 62 | subcost = oneago[y - 1] + (seq1[x] != seq2[y]) 63 | thisrow[y] = min(delcost, addcost, subcost) 64 | # This block deals with transpositions 65 | if (x > 0 and y > 0 and seq1[x] == seq2[y - 1] and seq1[x-1] == seq2[y] and seq1[x] != seq2[y]): 66 | thisrow[y] = min(thisrow[y], twoago[y - 2] + 1) 67 | return thisrow[len(seq2) - 1] 68 | 69 | -------------------------------------------------------------------------------- /src/web/.gitignore: -------------------------------------------------------------------------------- 1 | session/* 2 | static/tmp 3 | -------------------------------------------------------------------------------- /src/web/code.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | """ 5 | web-based spill-chick front-end 6 | 7 | setup: 8 | * mkdir session/ 9 | (I tried adding it to the project but git can't hold empty directories and session/.gitignore kludge got deleted by the webserver, apparently) 10 | * ensure webserver user has 11 | * read access to ngram3.bin and word.bin files 12 | * write access to session/ directory 13 | """ 14 | 15 | def abspath(localpath): 16 | return os.path.join(os.path.dirname(__file__), localpath) 17 | 18 | import os, sys 19 | from itertools import dropwhile 20 | from time import time 21 | import web 22 | from web import form 23 | import logging 24 | 25 | logger = logging.getLogger('spill-chick') 26 | 27 | sys.path.append(abspath('..')) 28 | from chick import Chick 29 | from doc import Doc 30 | 31 | web.config.debug = True 32 | 33 | urls = ( '/.*', 'check' ) 34 | 35 | app = web.application(urls, globals()) 36 | session = web.session.Session(app, web.session.DiskStore(abspath('session')), 37 | initializer={'target':None, 'skip':[], 'replacements':[], 'suggestions':[]}) 38 | render = web.template.render(abspath('templates/'), base='base', globals=globals(), cache=False) 39 | application = app.wsgifunc() 40 | chick = Chick() 41 | 42 | class check: 43 | 44 | def GET(self): 45 | session.kill() 46 | return render.check('', [], [], 0, []) 47 | 48 | def POST(self): 49 | start_time = time() 50 | text = unicode(web.input().get('text', '')) 51 | lines = text.split('\r\n') 52 | 53 | act = web.input().get('act', '') 54 | if act == 'Replace': 55 | # FIXME: if replacement takes place, update location/offsets 56 | # of all remaining session['suggestions'] 57 | replacement_index = int(web.input().get('replacement_index', '0')) 58 | if replacement_index: 59 | d = Doc(lines, chick.w) 60 | replacements = session.get('replacements') 61 | if replacement_index <= len(replacements): 62 | replacement = replacements[replacement_index-1] 63 | d.applyChanges([replacement]) 64 | text = str(d) 65 | lines = d.lines 66 | logger.debug('after replacement lines=%s' % (lines,)) 67 | session['suggestions'].pop(0) 68 | elif act == 'Skip to next...': 69 | session['skip'].append(session['target']) 70 | session['suggestions'].pop(0) 71 | elif act == 'Done': 72 | # nuke target, replacements, skip, etc. 
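(session.kill() drops the DiskStore entry for this visitor, so the next request starts over from the Session initializer defaults.)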
73 | session.kill() 74 | 75 | sugg2 = [] 76 | suggs = [] 77 | suggestions = [] 78 | replacements = [] 79 | 80 | if act and act != 'Done': 81 | suggestions = session['suggestions'] 82 | if not suggestions: 83 | logger.debug('suggest(lines=%s)' % (lines,)) 84 | suggestions = list(chick.suggest(lines, 5, session['skip'])) 85 | if not suggestions: 86 | target,suggs,sugg2 = None,[],[] 87 | else: 88 | # calculate offsets based on line length so we can highlight target substring in 89 | off = [len(l)+1 for l in lines] 90 | lineoff = [0]+[sum(off[:i]) for i in range(1,len(off)+1)] 91 | changes = suggestions[0] 92 | target = changes[0].ngd.oldtoks() 93 | for ch in changes: 94 | ngd = ch.ngd 95 | replacements.append(ngd) 96 | o = ngd.old() 97 | r = ngd.new() 98 | linestart = o[0][1] 99 | lineend = o[-1][1] 100 | start = o[0][3] 101 | end = o[-1][3] + len(o[-1][0]) 102 | sugg2.append((' '.join(ngd.newtoks()), 103 | lineoff[linestart] + start, 104 | lineoff[lineend] + end)) 105 | session['target'] = target 106 | session['replacements'] = replacements 107 | session['suggestions'] = suggestions 108 | 109 | elapsed = round(time() - start_time, 2) 110 | return render.check(text, sugg2, lines, elapsed, suggestions) 111 | 112 | if __name__ == '__main__': 113 | app.run() 114 | 115 | -------------------------------------------------------------------------------- /src/web/conf/apache2.conf.add: -------------------------------------------------------------------------------- 1 | 2 | # append something like this to apache2.conf to get our web.py app running 3 | 4 | # spill-chick web.py 5 | #LoadModule wsgi_module modules/mod_wsgi.so 6 | WSGIScriptAlias /spill-chick /var/www/spill-chick/code.py 7 | Alias /spill-chick/static /var/www/spill-chick/static/ 8 | Alias /spill-chick/templates /var/www/spill-chick/templates/ 9 | AddType text/html .py 10 | 11 | Order deny,allow 12 | Allow from all 13 | 14 | 15 | -------------------------------------------------------------------------------- /src/web/static/img/chick.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rflynn/spill-chick/430257c25369908f243a08d33caa268e8e398aeb/src/web/static/img/chick.png -------------------------------------------------------------------------------- /src/web/static/img/chick16.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rflynn/spill-chick/430257c25369908f243a08d33caa268e8e398aeb/src/web/static/img/chick16.png -------------------------------------------------------------------------------- /src/web/static/img/chick16.png.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rflynn/spill-chick/430257c25369908f243a08d33caa268e8e398aeb/src/web/static/img/chick16.png.ico -------------------------------------------------------------------------------- /src/web/static/img/chick32.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rflynn/spill-chick/430257c25369908f243a08d33caa268e8e398aeb/src/web/static/img/chick32.png -------------------------------------------------------------------------------- /src/web/static/js/spill-chick.js: -------------------------------------------------------------------------------- 1 | 2 | function textboxSelect(oTextbox, iStart, iEnd) 3 | { 4 | switch(arguments.length) 5 | { 6 | case 1: 7 | oTextbox.select(); 8 | break; 9 | case 2: 10 | 
iEnd = oTextbox.value.length; 11 | /* falls through */ 12 | case 3: 13 | if (oTextbox.createTextRange) 14 | { 15 | var oRange = oTextbox.createTextRange(); 16 | oRange.moveStart("character", iStart); 17 | oRange.moveEnd("character", - oTextbox.value.length + iEnd); 18 | oRange.select(); 19 | oTextbox.scrollTop = oRange.boundingTop 20 | } 21 | else if (oTextbox.setSelectionRange) 22 | { 23 | oTextbox.setSelectionRange(iStart, iEnd); 24 | } 25 | } 26 | } 27 | -------------------------------------------------------------------------------- /src/web/templates/base.html: -------------------------------------------------------------------------------- 1 | $def with (page) 2 | 3 | 4 | 5 | $if page.has_key('title'): 6 | $page.title 7 | $else: 8 | Spill-Chick 9 | 21 | 22 | 23 | 24 | 25 |
26 | Spill-Chick 27 |
28 | 29 | $:page 30 | 31 |
32 | Session: $session 33 |
34 | web.input: $web.input() 35 | 36 | 37 | -------------------------------------------------------------------------------- /src/web/templates/check.html: -------------------------------------------------------------------------------- 1 | $def with (text, replacements, lines, elapsed, suggestions) 2 | 3 |
4 | 5 | 6 |
7 | $if replacements: 8 |
9 | 10 | 14 |
15 |
16 |
17 | 18 |
19 | $else: 20 | 21 |
22 |
23 | 24 | 25 | $if not replacements: 26 | 27 |
28 |
29 |
30 | 31 |
32 | Elapsed: $elapsed seconds 33 |
34 | suggs: $suggestions 35 |
36 | Lines: $lines 37 |
38 | Replacements: $replacements 39 | 40 | -------------------------------------------------------------------------------- /src/word.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | 4 | import collections 5 | 6 | Alphabet = 'abcdefghijklmnopqrstuvwxyz' 7 | 8 | def edits1(word): 9 | splits = [(word[:i], word[i:]) for i in range(len(word) + 1)] 10 | deletes = [a + b[1:] for a, b in splits if b] 11 | transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1] 12 | replaces = [a + c + b[1:] for a, b in splits for c in Alphabet if b] 13 | inserts = [a + c + b for a, b in splits for c in Alphabet] 14 | return set(deletes + transposes + replaces + inserts) 15 | 16 | """ 17 | imitate the interface of a Counter() that Words is expecting 18 | so we can use ngram3bin without him knowing 19 | """ 20 | class NGram3BinWordCounter: 21 | def __init__(self, ng): 22 | self.ng = ng 23 | def __contains__(self, word): 24 | # foo in me 25 | return self.ng.word2id(word) != 0 26 | def get(self, word, default=0): 27 | try: 28 | return self.ng.wordfreq(word) 29 | except (ValueError, TypeError): 30 | raise KeyError 31 | def __getitem__(self, word): 32 | # me[key] 33 | if type(word) == int: 34 | raise IndexError 35 | try: 36 | return self.ng.wordfreq(word) 37 | except: 38 | raise KeyError 39 | def __setitem__(self, word, val): 40 | # me[key] = val 41 | pass 42 | def update(self, wordfreqlist): 43 | pass 44 | 45 | """ 46 | Word statistics 47 | """ 48 | class Words: 49 | 50 | def __init__(self, frq=None):#collections.Counter()): 51 | self.frq = frq 52 | 53 | def add(self, word): 54 | self.frq[word] += 1 55 | 56 | def addl(self, words): 57 | self.frq.update(words) 58 | 59 | def freq(self, word): 60 | return self.frq[word] 61 | 62 | def known_edits2(self, word): 63 | return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in self.frq) 64 | 65 | def known(self, words): return set(w for w in words if w in self.frq) 66 | 67 | # FIXME: this does not always work 68 | # example: 'passified' becomes 'assified' instead of 'pacified' 69 | # TODO: lots of mis-spellings are phonetic; we should attempt to "sound out" 70 | # unknown words, possibly by breaking them into pieces and trying to assemble the sound 71 | # from existing words 72 | # FIXME: douce -> douse 73 | # FIXME: iv -> ivy 74 | def correct(self, word): 75 | candidates = self.known([word]) | self.known(edits1(word)) or self.known_edits2(word) or [word] 76 | return max(candidates, key=self.frq.get) 77 | 78 | # FIXME: bid -> big 79 | # FIXME: hungreh -> hungry 80 | def similar(self, word): 81 | e = self.known(edits1(word)) 82 | # FIXME: this is just the trickiest line in the whole thing. 83 | # flexibility at an expensive price... 
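each edits1() call yields roughly 54*len(word)+25 candidate strings and known_edits2() explores the square of that, so we only pay for the second edit on longer words, where a single edit rarely reaches the intended spelling.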
84 | if len(word) > 6: 85 | e |= self.known_edits2(word) 86 | return e 87 | 88 | @staticmethod 89 | def signature(word): 90 | "sorted list of ('letter',frequency) for all letters in word" 91 | return [(c,len(list(l))) for c,l in groupby(sorted(word))] 92 | 93 | if __name__ == '__main__': 94 | pass 95 | 96 | -------------------------------------------------------------------------------- /test/Ode-To-My-Spell-Checker-correct.txt: -------------------------------------------------------------------------------- 1 | 2 | I have a spelling checker 3 | It came with my PC 4 | It plainly marks for my review 5 | Mistakes I cannot see 6 | 7 | I strike a key and type a word 8 | And wait for it to say 9 | Whether I am wrong or right 10 | It shows me straight away 11 | 12 | As soon as a mistake is made 13 | It knows before too long 14 | And I can put the error right 15 | It’s rarely ever wrong 16 | 17 | I have run this poem through it 18 | I am sure you’re pleased to know 19 | It’s letter perfect in its way 20 | My checker told me so 21 | 22 | -------------------------------------------------------------------------------- /test/Ode-To-My-Spell-Checker-original.txt: -------------------------------------------------------------------------------- 1 | 2 | Eye halve a spelling chequer 3 | It came with my pea sea 4 | It plainly marques four my revue 5 | Miss steaks eye kin knot sea. 6 | 7 | Eye strike a quay and type a word 8 | And weight four it two say 9 | Weather eye am wrong oar write 10 | It shows me strait a weigh. 11 | 12 | As soon as a mist ache is maid 13 | It nose bee fore two long 14 | And eye can put the error rite 15 | Its really ever wrong. 16 | 17 | Eye have run this poem threw it 18 | I am shore your pleased two no 19 | Its letter perfect in it's weigh 20 | My chequer tolled me sew. 21 | 22 | -------------------------------------------------------------------------------- /test/Spill-Chick-Yore-Dock-You-Mints-correct.txt: -------------------------------------------------------------------------------- 1 | You can always spellcheck your documents 2 | 3 | Spellcheckers United 4 | 1234 Doughnut Street 5 | Sault Ste Marie, Michigan 49599 6 | 7 | November 7, 2000 8 | 9 | Miss Spellbound 10 | Spelling Checkers, Inc. 11 | 1259 Broadway 12 | New York, NY 11012 13 | 14 | Dear Mrs. Spellbound: 15 | 16 | You might have used some spell checker which came with your computer. It's great at putting marks for all to find mistakes I cannot see. I done put your lines right through by typing so carefully. It's going to be perfect and I know it's going to be because our computer told me and you and everybody around here so. 17 | 18 | Sincerely, 19 | 20 | Thomas Jackson 21 | -------------------------------------------------------------------------------- /test/Spill-Chick-Yore-Dock-You-Mints-original.txt: -------------------------------------------------------------------------------- 1 | Ewe Kin Awl Weighs Spill Chick Yore Dock You Mints 2 | 3 | Spill Chiggers Ewe Knighted 4 | 1234 Doe Nuts Treat 5 | Sue Saint Mary, MI 49599 6 | 7 | No member 7, 2000 8 | 9 | Mist Spill Bond 10 | Spilling Check Hers, Ink. 11 | 1259 Board Weigh 12 | Gnu Yoke, NY 11012 13 | 14 | Dare Misses Spill Bond: 15 | 16 | Ewe mite hove use sum spill check her witch came mitt yore come pewter. Its grate at pudding marts fore awl two fined miss steaks aye kin knot sea. Aye dun putt you're lions write threw bye tie ping sew care fully. 
Its go wing too be purr fit an eye no its gone a bee be cause are calm pewter tolled me an yew an every buddy a round hear sew. 17 | 18 | Sin sear Lee, 19 | 20 | Tom Us Jack Sun 21 | -------------------------------------------------------------------------------- /test/test.txt: -------------------------------------------------------------------------------- 1 | # unit tests 2 | # 3 | 4 | # FIXME: we don't handle apostrophes correctly 5 | Even so, I'm open minded. : Even so, I'm open minded. 6 | that's and advantage. : that's an advantage. 7 | 8 | # FAIL - 9 | # FIXME: can't handle 2-word idioms nor "sound out" sentences 10 | funky farm : funny farm # wrong first token not handled 11 | untied stats : united states # tricky -- both tokens wrong, no extra context 12 | # issue: two-token not handled 13 | Windows PX : Windows XP 14 | a miss steak : a mistake 15 | dry rum : dry run 16 | # issue: punctuation 17 | hell there. how are you? : hello there. how are you? # needs punctuation 18 | bag apple : big apple 19 | beg apple : big apple 20 | 21 | # FIXME: can't fix 4-token idiom 22 | state-of-the-are : state-of-the-art 23 | 24 | #--end-- 25 | 26 | 27 | ######## PASS ############ 28 | bridge the gas : bridge the gap 29 | you are waisting your time : you are wasting your time 30 | # test simple change, but... 31 | # there are several similar variants that are more popular than the correct answer 32 | i new that! : i knew that! 33 | # test phonetic match and token removal 34 | their is no : there is no 35 | i no you : i know you 36 | an IV league school : an IVY league school 37 | # 38 | win or loose : win or lose 39 | Wet your appetite : Whet your appetite 40 | Try and fry again : Try and try again 41 | I am very tried : I am very tired 42 | garden of eating : garden of eden 43 | garden of eatin : garden of eden 44 | I am found of you : I am fond of you 45 | their is : there is 46 | their it is : there it is 47 | # no-ops 48 | i think so : i think so 49 | i now know : i now know 50 | # phonetic 51 | Summer is almost hear. : Summer is almost here. 52 | i am hear : i am here 53 | no, i was write : no, i was right # metallic 54 | peace of shit : piece of shit 55 | # 56 | we have a bid backyard : we have a big backyard 57 | a double cheese burger in : a double cheeseburger in 58 | i have a spelling chequer : i have a spelling checker 59 | 60 | we'll touch bass : we'll touch base 61 | i didn't no : i didn't know 62 | # issue: slang 63 | nope, i was write : nope, i was right # metallic # slang, doesn't know 'nope' 64 | bridge the gap. bridge the gas. : bridge the gap. bridge the gap. # multi-sentence problem; ignores "gas." 65 | 66 | # avoid making suggestions for numbers 67 | # perhaps transpositions, but in most cases we don't want to replace whole numbers... 68 | for over 35 years we bridge the gas : for over 35 years we bridge the gap 69 | 70 | That's not a every impressive claim to make. : That's not a very impressive claim to make. 71 | Long Island, New York, state-of-the-are facility : Long Island, New York, state-of-the-art facility 72 | That is pretty much what I was eluding : That is pretty much what I was alluding 73 | 74 | #--end-- 75 | 76 | ######### FAIL 77 | 78 | While the post author claims that using a SQL backend doesn't make much sense, according to the fossil web page (http://fossil-scm.org/) that's and advantage. : While the post author claims that using a SQL backend doesn't make much sense, according to the fossil web page (http://fossil-scm.org/) that's an advantage. 
79 | 80 | #--end-- 81 | 82 | # FIXME: we calculate the diff of eluding -> alluding as 0 because their sounds match 83 | # but we must differentiate between a phonic change and an actual change that does nothing, i.e. 84 | # does not change the text at all; we should value the former more highly 85 | #That is pretty much what I was eluding : That is pretty much what I was alluding 86 | ##--end-- 87 | 88 | # FIXME: this is a shortcoming of the unknown token corrector, a separate but important 89 | # part of our program that runs before all the other parts. 90 | # we must do a more thorough job of picking apart unknown tokens, try splitting/merging them with their surroundings 91 | #spillchick : spellcheck 92 | ##--end-- 93 | 94 | # test whether we're smart enough to prioritize "win or loose" -> "win or lose", which is a good fix 95 | #In 2005 we win or loose : In 2005 we win or lose 96 | ##--end-- 97 | #their coming to : they're coming to 98 | ##--end-- 99 | 100 | # bestsugg 8.74 3 931301 that it is 101 | # bestsugg 7.95 0 21036 there it is 102 | # the "correct" solution is a close second because of the overwhelming frequency of the first 103 | # we need to more heavily weight the improvement of a diff of 0 (phonic difference) over higher 104 | # frequency. 105 | 106 | #--end-- 107 | 108 | I would have won if had one! : I would have one if I had won. 109 | I would have one if I had won! : I would have one if I had won! 110 | I would have one if I had one. : I would have one if I had one. 111 | I would have won if I had won. : I would have one if I had won. 112 | 113 | #I would have won two if had one too! 114 | #I would have one too if I had won one! 115 | #I would have one too if I had one too. 116 | #I would have won too if I had won one. 117 | 118 | # FIXME: these take ages and always fail 119 | doe sit use machien learning : does it use machine learning 120 | dose it use machien learning : does it use machine learning 121 | doze it use machien learning : does it use machine learning 122 | ##--end-- 123 | 124 | # FIXME: I'm not sure but either I'm picking bad examples or something; 125 | # what i expect is not what the ngrams suggest. strange. 126 | #in the sample place : in the same place 127 | 128 | could care less : couldn't care less 129 | ##--end-- 130 | 131 | create a passified country : create a pacified country # urbandictionary 132 | someone douce me in chocolate syrup : someone douse me in chocolate syrup 133 | Downloading copywritten movies : Downloading copyrighted movies 134 | # needs to join ('cheese','burger') -> 'cheeseburger' 135 | I still have a double cheese burger in the refridgerator : I still have a double cheeseburger in the refrigerator 136 | ##--end-- 137 | 138 | This is all very tenative. : This is all very tentative. 
139 | someone otther than yourself : someone other than yourself 140 | #--end-- 141 | 142 | 143 | ########## BOTCHED IDIOMS 144 | # 145 | Coming down the pipe : Coming down the pike 146 | Through the ringer : Through the wringer 147 | touch basis : touch bases 148 | # 149 | #800-pond gorilla : 800-pound gorilla 150 | could care less : couldn't care less 151 | #oh de colone : eau de cologne 152 | #two in the hand is worth one in the bush : one in the hand is worth two in the bush 153 | # these two would benefit from trying edit distance 2 if we're unable to find a change the first time 154 | scotch free : scot-free 155 | never cry wool : never cry wolf 156 | # these are too many edits away 157 | # perhaps i could do it by filling in the blanks 158 | pushing up days : pushing up daisies 159 | ##--end-- 160 | 161 | 162 | ####### SORT OF WORKS ############ 163 | # this would benefit if we weighted consonant changes more heavily than vowel changes 164 | spill chick : spell check # actually ok... ['still thick','spell check',...] 165 | # issue: almost works. we get 'pay' instead of 'paid'. 'paid' is second. 166 | get what you payed for : get what you paid for 167 | ##--end-- 168 | 169 | 170 | # this is a tricky one. "the dog was" is immensely frequent, 171 | # but "dog was dense" isn't. "fog was dense" is more frequent than "dog was dense", 172 | # but when the ngram frequencies are simply summed "the dog" still wins 173 | The dog was dense : The fog was dense 174 | 175 | # almost works, but apostrophe still 176 | worth it's salt : worth its salt 177 | 178 | ##--end-- 179 | 180 | It it did, the future would undoubtedly be changed : If it did, the future would undoubtedly be changed # Foundation, Isaac Asimov p. 33 181 | 182 | ##--end-- 183 | 184 | ####### BROKEN ########## 185 | overhere : overhear 186 | USB-to-serail driver : USB-to-serial driver # technical term not in ngrams 187 | # big test; requires token expansion (their) -> (they,re) 188 | Their coming too sea if its reel. : They're coming to see if it's real. 189 | # nope, phonic stuff doesn't do fuzzy matching 190 | all intensive purposes : all intents and purposes 191 | say "good riddens" to : say "good riddance" to # fuzzy phonic matching 192 | spill check : spell check 193 | # duplicated word 'does' 194 | the action does does come with : the action does come with # slashdot 195 | 196 | # this is a tricky one. "the dog was" is immensely frequent, 197 | # but "dog was dense" isn't. 
"fog was dense" is more frequent than "dog was dense", 198 | # but when the ngram frequencies are simply summed "the dog" still wins 199 | The dog was dense : The fog was dense 200 | 201 | ####### UNEXPECTED NON-FIXES ########## 202 | right over their : right over there # hmm "fix" is less than twice as frequent 203 | 204 | #--end-- 205 | 206 | #soyouneedtomakethatvariable : so you need to make that variable 207 | 208 | #over hear : overhear # not sure about this one 209 | over here : over here 210 | 211 | #--end-- 212 | 213 | # misspellings: non-words 214 | naieve : naive 215 | #bazillion : billion 216 | #bajillion : billion 217 | inztrnlazti : international 218 | joyd ivision : joy division 219 | Insturctions: : Instructions: 220 | descently well : decently well 221 | I'm leary of it : I'm leery of it 222 | a pthon library : a python library 223 | #santimoniousness : sanctimoniousness 224 | integeter division : integer division 225 | 226 | # transpositions resulting in words 227 | The dog was dense : The fog was dense 228 | I am very tried : I am very tired 229 | whatever remains, whoever improbable, must be the truth. : whatever remains, however improbable, must be the truth. 230 | It it did, the future would undoubtedly be changed in some minor respects. : If it did, the future would undoubtedly be changed in some minor respects. # Foundation, Isaac Asimov p. 33 231 | 232 | # correct non-fixes 233 | I love non-sequiturs. : I love non-sequiturs. 234 | 235 | # misspellings resulting in words 236 | your right dude. : you're right dude. 237 | 238 | # transcriptions resulting in non-words 239 | Johsia : Joshua 240 | 241 | # transpositions resulting in non-words 242 | Gergory : Gregory 243 | 23rd of Auguts : 23rd of August 244 | Johsua : Joshua 245 | 246 | # misspellings resulting in words 247 | a shallow accent angle. : a shallow ascent angle. 248 | someone otter than yourself : someone other than yourself 249 | now it makes perfect sensor : now it makes perfect sense 250 | I would appreciate and alternative to : I would appreciate an alternative to 251 | "Yes, yes. I now the theorem." : "Yes, yes. I know the theorem." # Second Foundation, Isaac Asmiov, p. 105 252 | Humans many simply be too stupid : Humans may simply be too stupid 253 | At first it was effecting our sex life : At first it was affecting our sex life 254 | #pointers to the UINT type will through away the significant bits : pointers to the UINT type will throw away the significant bits 255 | #I think they call that a sentence now days. : I think they call that a sentence nowadays. 256 | 257 | 258 | # phonetic errors 259 | oic : oh i see 260 | f u c k : fuck 261 | hell-o : hello 262 | o i c : oh i see 263 | orly : oh really 264 | faux king hill : fucking hell 265 | in the sample place : in the same place 266 | hungreh. wants soo shee : hungry. want sushi 267 | goan jump off a bridge : go and jump off a bridge 268 | 269 | # mixed 270 | #did he steel you ice cream? : did he steal your ice cream? 271 | you are backpaddling from a smartass slapdown :-) : you are backpedaling from a smartass slapdown :-) 272 | 273 | # intentional typos 274 | #concise unlike the verbosity of Java and Erlong. : concise unlike the verbosity of Java and Erlang. 275 | 276 | 277 | # grammatical errors 278 | That it. : That's it. 279 | #You have less followers then him : You have fewer followers than him 280 | 281 | # missing words 282 | #I doubt we'll this any time soon. : I doubt we'll do this any time soon. 
283 | #production on hold across the country to allow to watch the match. : put production on hold across the country to allow employees to watch the match. 284 | #Microsoft is obsessed Websockets : Microsoft is obsessed with Websockets 285 | 286 | ### OTHERS 287 | # splits 288 | # we *can* tease this out, but the cost of doing so is just too high right now. in the future perhps we can fall back to more expensive methods when appropriate 289 | ifit'snotpurethecompilercan'toptimizeitlikeyouwant : if it's not pure the compiler can't optimize it like you want 290 | 291 | I've been doing this for a very long time and I think I have have encountered each of the bugs listed in this list. : I've been doing this for a very long time and I think I have encountered each of the bugs listed in this list. 292 | 293 | Software's inherit ability to adapt is part of what drives this differentiating factor. : Software's inherent ability to adapt is part of what drives this differentiating factor. 294 | 295 | Now is the time for all good people to come to the aid of there country : Now is the time for all good people to come to the aid of their country 296 | 297 | # real world examples that should be easily fixable 298 | hat are some example of public datasets that have randomized instruments? : what are some example of public datasets that have randomized instruments? 299 | I invite women over so that I have the motivation to stop being such a fucking slob for 12 seconds in the vein attempt at getting laid. : I invite women over so that I have the motivation to stop being such a fucking slob for 12 seconds in the vain attempt at getting laid. 300 | 301 | # phonetic numbers 302 | Thanks a lot m8. : Thanks a lot mate. 303 | I h8 it! : I hate it! 304 | I 8 it! : I ate it! 305 | 306 | # transpositions resulting in logical impossiblities 307 | 32rd of August : 23rd of August 308 | 309 | # unclassified real-world 310 | #A group of 21 volunteers from Tokyo and Saitama brought Sunday 2,000 meals for about over 500 evacuees at the shelter. 311 | #Use Reddit to decide what to tool use. : Use Reddit to decide what tool to use. # token swap x,y -> y,x 312 | # Math is not sexy. Statistics are not sext. : Math is not sexy. Statistics are not sexy. 313 | # best font for coding : best font for coding 314 | It it did, the future would undoubtedly be changed in some minor respects. : If it did, the future would undoubtedly be changed in some minor respects. # Foundation, Isaac Asimov p. 33 315 | not weather you win or loose it's how you ply the gale : not whether you win or lose it's how you play the game 316 | primitives are not implement as a direct call : primitives are not implemented as a direct call # Efficient Parallel Programming in Poly/ML and Isabelle/ML 317 | feel apart of something : feel a part of something 318 | I can't bring myself to an android phone : I can't bring myself to get an android phone # comcor 319 | 320 | # not so sure about this one... 321 | Im pretty sure T-Rexs weren't that big. : I'm pretty sure T-Rexs weren't that big. 322 | 323 | # this is a real-world example of an uncommon n-gram transposition. 324 | # the only way we can detect these sorts of errors is to adapt our corpus 325 | # to handle context in a personalized way, by building a corpus out of local documents. 326 | nyc sing company : nyc sign company 327 | 328 | But it is a lot bulkier, and i teats batteries. : But it is a lot bulkier, and it eats batteries. 
# ycombinator on calculators 329 | 330 | # omission: trying to the -> trying to get the 331 | Norvig says that no one is listening to your calls on Google Voice — it is simply their servers trying to the translation right. : Norvig says that no one is listening to your calls on Google Voice — it is simply their servers trying to get the translation right. # slashdot 332 | 333 | # transcription: affectiveness -> effectiveness 334 | Part of their work is checking that server's affectiveness, too. : Part of their work is checking that server's effectiveness, too. # slashdot 335 | 336 | 337 | # transcription: Europe -> Europa 338 | Due to their size, atmospheric drag would slow them down without burning them up, allowing them to study the uppermost atmosphere of wherever they are deployed next: Venus, Titan, Europe, and Jupiter are all possibilities. : Due to their size, atmospheric drag would slow them down without burning them up, allowing them to study the uppermost atmosphere of wherever they are deployed next: Venus, Titan, Europa, and Jupiter are all possibilities. # slashdot 339 | 340 | # a moderate paragraph with absolutely nothing wrong with it 341 | # source: http://www.propublica.org/article/all-the-magnetar-trade-how-one-hedge-fund-helped-keep-the-housing-bubble 342 | In late 2005, the booming U.S. housing market seemed to be slowing. The Federal Reserve had begun raising interest rates. Subprime mortgage company shares were falling. Investors began to balk at buying complex mortgage securities. The housing bubble, which had propelled a historic growth in home prices, seemed poised to deflate. And if it had, the great financial crisis of 2008, which produced the Great Recession of 2008-09, might have come sooner and been less severe. : In late 2005, the booming U.S. housing market seemed to be slowing. The Federal Reserve had begun raising interest rates. Subprime mortgage company shares were falling. Investors began to balk at buying complex mortgage securities. The housing bubble, which had propelled a historic growth in home prices, seemed poised to deflate. And if it had, the great financial crisis of 2008, which produced the Great Recession of 2008-09, might have come sooner and been less severe. 343 | --------------------------------------------------------------------------------