├── .gitignore ├── Makefile ├── README.txt ├── data ├── cmudict │ ├── 00README_FIRST.txt │ ├── cmudict.0.7a.gz │ └── cmudict.0.7a.phones └── corpus │ ├── big.txt.bz2 │ └── google-ngrams │ ├── .gitignore │ ├── Makefile │ ├── extract.py │ ├── fetch.py │ ├── import2bin-ngram.c │ ├── import2bin-word.py │ ├── ngram3bin-compact.c │ ├── ngram3bin.c │ ├── ngram3bin.h │ ├── ngram3binpy.c │ ├── scratch │ ├── benchmark-str-to-id.py │ └── debug-multiprocessing-dict.py │ ├── setup.py │ └── testbin.py ├── doc ├── algorithm.txt └── things-that-can-go-wrong-language-wise.txt ├── src ├── algo.py ├── chick.py ├── corpus.py ├── doc.py ├── gram.py ├── grambin.py ├── ngramdiff.py ├── phon.py ├── test.py ├── util.py ├── web │ ├── .gitignore │ ├── code.py │ ├── conf │ │ └── apache2.conf.add │ ├── static │ │ ├── img │ │ │ ├── chick.png │ │ │ ├── chick16.png │ │ │ ├── chick16.png.ico │ │ │ └── chick32.png │ │ └── js │ │ │ └── spill-chick.js │ └── templates │ │ ├── base.html │ │ └── check.html └── word.py └── test ├── Ode-To-My-Spell-Checker-correct.txt ├── Ode-To-My-Spell-Checker-original.txt ├── Spill-Chick-Yore-Dock-You-Mints-correct.txt ├── Spill-Chick-Yore-Dock-You-Mints-original.txt └── test.txt /.gitignore: -------------------------------------------------------------------------------- 1 | *.swp 2 | *.pyc 3 | src/test.prof 4 | data/corpus/*.gz 5 | data/cmudict/cmudict.0.7a 6 | misc/ 7 | src/scratch 8 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | all: data 2 | 3 | data: ngrams 4 | 5 | ngrams: 6 | $(MAKE) -C data/corpus/google-ngrams 7 | -------------------------------------------------------------------------------- /README.txt: -------------------------------------------------------------------------------- 1 | 2 | Author: Ryan Flynn 3 | 4 | spill-chick is a context-sensitive language checker designed to 5 | correct spelling and grammar errors which pass existing checkers. 6 | 7 | There are all sorts of typing errors one can make. 8 | 9 | transcription error .................... speling is hard 10 | transposition error .................... causal Friday 11 | homophone error ........................ peace of crap 12 | grammatical error ...................... your right! 13 | word merging/splitting ................. always miss spelling stuff 14 | botched idioms ......................... for all intensive purposes 15 | word omission .......................... oops, I the word 16 | word duplication ....................... and it does does also 17 | inconsistency of proper nouns .......... Julius Seizure 18 | 19 | It is inspired by 'Ode To My Spell Checker', which contains no spelling 20 | errors, is perfectly readable and yet is very incorrect. It begins: 21 | 22 | Eye halve a spelling chequer 23 | It came with my pea sea 24 | It plainly marques four my revue 25 | Miss steaks eye kin knot sea. 26 | 27 | Progress: 28 | I have a spelling checker 29 | It came with my pc 30 | It plainly marks for my review 31 | Mistakes i did not see. 
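The underlying idea is to score candidate phrases against 3-gram
frequencies rather than a dictionary alone. A rough sketch, using the
ngram3bin module built under data/corpus/google-ngrams (see testbin.py
for real usage; the phrases here are only illustrative):

    from ngram3bin import ngram3bin
    ng = ngram3bin('word.bin', 'ngram3.bin')
    ng.freq(*[ng.word2id(w) for w in ['peace', 'of', 'crap']])  # rare
    ng.freq(*[ng.word2id(w) for w in ['piece', 'of', 'crap']])  # far more common

Both phrases pass an ordinary spellchecker; only the frequency gap exposes
the homophone error.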
32 | 33 | -------------------------------------------------------------------------------- /data/cmudict/00README_FIRST.txt: -------------------------------------------------------------------------------- 1 | 2 | CMUdict 3 | ------- 4 | 5 | CMUdict (the Carnegie Mellon Pronouncing Dictionary) is a free 6 | pronouncing dictionary of English, suitable for uses in speech 7 | technology and is maintained by the Speech Group in the School of 8 | Computer Science at Carnegie Mellon University. 9 | 10 | The Carnegie Mellon Speech Group does not guarantee the accuracy of 11 | this dictionary, nor its suitability for any specific purpose. In 12 | fact, we expect a number of errors, omissions and inconsistencies to 13 | remain in the dictionary. We intend to continually update the 14 | dictionary by correction existing entries and by adding new ones. From 15 | time to time a new major version will be released. 16 | 17 | We welcome input from users: Please send email to Alex Rudnicky 18 | (air+cmudict@cs.cmu.edu). 19 | 20 | The Carnegie Mellon Pronouncing Dictionary, in its current and 21 | previous versions is Copyright (C) 1993-2008 by Carnegie Mellon 22 | University. Use of this dictionary for any research or commercial 23 | purpose is completely unrestricted. If you make use of or 24 | redistribute this material we request that you acknowledge its 25 | origin in your descriptions. 26 | 27 | If you add words to or correct words in your version of this 28 | dictionary, we would appreciate it if you could send these additions 29 | and corrections to us (air+cmudict@cs.cmu.edu) for consideration in a 30 | subsequent version. All submissions will be reviewed and approved by 31 | the current maintainer, Alex Rudnicky at Carnegie Mellon. 32 | 33 | ------------------------------------------------------------------ 34 | The current version of cmudict is cmudict.0.7a 35 | [First released October 29, 2007] 36 | 37 | -------------------------------------------------------------------------------- /data/cmudict/cmudict.0.7a.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rflynn/spill-chick/430257c25369908f243a08d33caa268e8e398aeb/data/cmudict/cmudict.0.7a.gz -------------------------------------------------------------------------------- /data/cmudict/cmudict.0.7a.phones: -------------------------------------------------------------------------------- 1 | AA 2 | AA0 3 | AA1 4 | AA2 5 | AE 6 | AE0 7 | AE1 8 | AE2 9 | AH 10 | AH0 11 | AH1 12 | AH2 13 | AO 14 | AO0 15 | AO1 16 | AO2 17 | AW 18 | AW0 19 | AW1 20 | AW2 21 | AY 22 | AY0 23 | AY1 24 | AY2 25 | B 26 | CH 27 | D 28 | DH 29 | EH 30 | EH0 31 | EH1 32 | EH2 33 | ER 34 | ER0 35 | ER1 36 | ER2 37 | EY 38 | EY0 39 | EY1 40 | EY2 41 | F 42 | G 43 | HH 44 | IH 45 | IH0 46 | IH1 47 | IH2 48 | IY 49 | IY0 50 | IY1 51 | IY2 52 | JH 53 | K 54 | L 55 | M 56 | N 57 | NG 58 | OW 59 | OW0 60 | OW1 61 | OW2 62 | OY 63 | OY0 64 | OY1 65 | OY2 66 | P 67 | R 68 | S 69 | SH 70 | T 71 | TH 72 | UH 73 | UH0 74 | UH1 75 | UH2 76 | UW 77 | UW0 78 | UW1 79 | UW2 80 | V 81 | W 82 | Y 83 | Z 84 | ZH 85 | -------------------------------------------------------------------------------- /data/corpus/big.txt.bz2: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rflynn/spill-chick/430257c25369908f243a08d33caa268e8e398aeb/data/corpus/big.txt.bz2 -------------------------------------------------------------------------------- 
/data/corpus/google-ngrams/.gitignore: -------------------------------------------------------------------------------- 1 | *.list.gz 2 | *.csv.zip 3 | *.csv.gz 4 | *.csv 5 | *-2008.ids.gz 6 | *.bin 7 | build/* 8 | *.o 9 | *.s 10 | ngram3.bin.* 11 | ngram3bin 12 | ngram3bin-compact 13 | import2bin-ngram 14 | cscope.out 15 | -------------------------------------------------------------------------------- /data/corpus/google-ngrams/Makefile: -------------------------------------------------------------------------------- 1 | CP = cp 2 | 3 | build: ngram3bin.h ngram3bin.c ngram3binpy.c build-py build-py3 4 | 5 | build-py: 6 | python setup.py build 7 | sudo python setup.py install 8 | 9 | build-py3: 10 | python3 setup.py build 11 | sudo python3 setup.py install 12 | 13 | # googlebooks-eng-all-3gram-20090715-#.csv.zip 14 | # -> fetch -> *-2008-list.gz (word,word,word,freq) 15 | # -> extract -> *-2008.ids.gz (id,id,id,freq) 16 | # -> word.csv.gz (wid,word) 17 | # -> import2bin-word.py -> word.bin (id,word utf8 binary padded) 18 | # -> import2bin-ngram -> ngram3.bin (id,id,id,freq binary) 19 | # -> ngram3bin-compact -> ngram3.bin.sort 20 | data: import2bin-ngram ngram3bin-compact 21 | ./fetch.py --run 22 | ./extract.py 23 | ./import2bin-word.py 24 | $(RM) ngram3.bin 25 | gzip -dc *.ids.gz | ./import2bin-ngram > ngram3.bin 26 | ./ngram3bin-compact 27 | $(RM) ngram3.bin 28 | ln -s ngram3.bin.sort ngram3.bin 29 | 30 | all: ngram3bin 31 | ngram3bin: ngram3bin.o 32 | import2bin-ngram: import2bin-ngram.o 33 | ngram3bin-compact: ngram3bin-compact.o ngram3bin.o 34 | -------------------------------------------------------------------------------- /data/corpus/google-ngrams/extract.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """ 4 | once fetch.py has grabbed a set of ngrams, I parse out a subset and generate CSV. 5 | given our lists of 'x y z\tcnt', extract, parse and dump 6 | """ 7 | 8 | import os, re, sys 9 | from glob import glob 10 | from time import time 11 | import multiprocessing as mp 12 | import Queue 13 | 14 | """ 15 | filter n-grams with a freq < MinFreq. range should be somewhere between 20 and 100. 16 | 17 | we do this to substantially reduce the number of n-grams considered by our program, 18 | and improve the quality of our results. by definition we are trying to reduce the 19 | document entropy, and need only consider n-grams with a certain frequency. 20 | 21 | the population of n-grams is ~inversely proportional to its frequency. 22 | approximately 1/2 have freq <= 2, 1/4 have freq >2 and <=4 etc. 23 | 24 | by filtering we eliminate ~90% 25 | 26 | we must maintain a balance between accepting garbage typos that appear a few times 27 | globally and glossing over legitimate but infrequent phrases. 
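rough arithmetic under that assumption: freq <= 16 already accounts for about
1/2 + 1/4 + 1/8 + 1/16 ~= 94% of distinct n-grams, so a cutoff of MinFreq = 20
keeps only a few percent of the entries while preserving the frequent phrases
we actually match against.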
28 | """ 29 | MinFreq = 20 30 | 31 | # one megabyte, 4*MB more clear than 4*1024*1024 or 4000000 32 | MB = 1024 ** 2 33 | 34 | Ids = {} 35 | Ids['UNKNOWN'] = 0 36 | Ids['$PROPERNOUN'] = 1 37 | Ids['$NUMBER'] = 2 38 | 39 | # translate each unique token into a unique numeric id 40 | # must be thread-safe on write 41 | def tokid(key): 42 | global Ids 43 | if key not in Ids: 44 | # create a new id, must be unique per key and linear 45 | Ids[key] = len(Ids) 46 | return Ids[key] 47 | 48 | # gunzip 'filename', translate string tokens into ids and gzip write to 'dst' 49 | def extractfile(nth, total, filename, dst, ids): 50 | global Ids 51 | start = time() 52 | with os.popen('gunzip -dc ' + filename, 'r') as gunzip: 53 | contents = '\n' + gunzip.read().lower() 54 | with os.popen('gzip -c - > ' + dst, 'wb', 4*MB) as gz: 55 | #for x,y,z,cnt in re.findall('\n([^\d\W]+) ([^\d\W]+) ([^\d\W]+)\t(\d+)', contents): 56 | #for m in re.finditer('\n([^ ]+) ([^ ]+) ([^ ]+)\t(\d+)', contents): 57 | # regexes are more expensive than string splitting but allow us a finer control over 58 | # what we accept which means we can reasonably skip exception setup. 59 | # turns out not setting up an exception for each of 200M lines shaves ~2/3x of our time(!) 60 | # include periods and apostrophes 61 | for m in re.finditer('\n([\w\']+) ([\w\']+) ([\w\']+)\t(\d+)',contents): 62 | x,y,z,cnt = m.groups() 63 | cnt = int(cnt) 64 | if cnt >= MinFreq: 65 | gz.write('%u,%u,%u,%u\n' % \ 66 | (tokid(x), tokid(y), tokid(z), cnt)) 67 | print '%3u/%3u %s (%.1f sec) ids:%u' % (nth, total, dst, time() - start, len(Ids)) 68 | 69 | # pulls filenames out of the queue and hand parameters off 70 | # when we run out of items to process we timeout and return 71 | def worker(q, ids): 72 | while True: 73 | try: 74 | nth,total,filename,dst = q.get(timeout=1) 75 | extractfile(nth,total,filename,dst, ids) 76 | except Queue.Empty: 77 | break 78 | 79 | Q = mp.Queue() 80 | 81 | # build queue of files to process 82 | filenames = sorted(glob('*-2008.list.gz')) 83 | total = len(filenames) 84 | for nth,filename in enumerate(filenames): 85 | dst = str.replace(filename,'list.gz','ids.gz') 86 | if os.path.exists(dst): 87 | continue 88 | Q.put((nth+1,total,filename,dst)) 89 | 90 | if Q.qsize(): 91 | print 'Queued %u files.' % (Q.qsize(),) 92 | 93 | # multiprocessing is great, except with 2 CPUs the overhead from manager 94 | # overcomes the benefit of keeping both CPUs busy, bummer. with 4+ CPUs it 95 | # might be a different story, I don't know. 96 | # for now, single CPU with dict() is the fastest 97 | DoMP = False 98 | if DoMP: 99 | manager = mp.Manager() 100 | Ids = mp.dict() 101 | Ids['UNKNOWN'] = 0 102 | Ids['$PROPERNOUN'] = 1 103 | Ids['$NUMBER'] = 2 104 | 105 | # create workers, run, wait for completion 106 | W = [ mp.Process(target=worker, args=(Q, Ids)) 107 | for _ in range(mp.cpu_count()) ] 108 | for w in W: w.start() 109 | for w in W: w.join() 110 | else: 111 | worker(Q, None) 112 | 113 | print 'len(Ids)=', len(Ids) 114 | assert len(Ids) > 3 115 | 116 | with os.popen('gzip -c - > word.csv.gz', 'wb') as gz: 117 | Ids = sorted(Ids.items(), key=lambda x:x[1]) 118 | for word,wid in Ids: 119 | gz.write('%s,%s\n' % (wid, word)) 120 | 121 | -------------------------------------------------------------------------------- /data/corpus/google-ngrams/fetch.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """ 4 | fetch Google Books' 3-ary ngrams. 
5 | run me, then extract.py 6 | enumerate, download, extract, filter and delete files 7 | """ 8 | 9 | """ 10 | Traceback (most recent call last): 11 | File "./fetch.py", line 70, in 12 | download(url, dst) 13 | File "./fetch.py", line 22, in download 14 | chunk = req.read(CHUNK) 15 | File "/usr/lib/python2.6/socket.py", line 353, in read 16 | data = self._sock.recv(left) 17 | File "/usr/lib/python2.6/httplib.py", line 538, in read 18 | s = self.fp.read(amt) 19 | File "/usr/lib/python2.6/socket.py", line 353, in read 20 | data = self._sock.recv(left) 21 | socket.error: [Errno 104] Connection reset by peer 22 | make: *** [data] Error 1 23 | """ 24 | 25 | import datetime 26 | 27 | def log(what, msg): 28 | print('%s %s %s' % (datetime.datetime.now(), what, msg)) 29 | 30 | import urllib2 31 | 32 | def download(url, dst): 33 | log(dst, 'download') 34 | CHUNK = 2 * 1024 * 1024 35 | while True: 36 | try: 37 | req = urllib2.urlopen(url) 38 | with open(dst, 'wb') as fp: 39 | while 1: 40 | chunk = req.read(CHUNK) 41 | if not chunk: break 42 | fp.write(chunk) 43 | break 44 | except socket.error: 45 | log(dst, 'error, continuing...') 46 | continue 47 | return dst 48 | 49 | import re, collections 50 | import os 51 | 52 | def extract(filename, gzfile): 53 | log(filename, 'extract') 54 | CHUNK = 8 * 1024 * 1024 55 | with os.popen('unzip -p ' + filename) as fd: 56 | d = {} 57 | while 1: 58 | txt = fd.read(CHUNK) 59 | if not txt: break 60 | # ! ! Along 2008 4 61 | d.update(re.findall('\n([^\t]+)\t2008\t(\d+)', txt)) 62 | with os.popen('gzip -c - > ' + gzfile, 'wb') as out: 63 | for k in sorted(d.keys()): 64 | out.write('%s\t%s\n' % (k, d[k])) 65 | 66 | def delete(filename): 67 | try: 68 | if os.path.exists(filename): 69 | log(filename, 'delete') 70 | os.remove(filename) 71 | except: 72 | pass 73 | 74 | def urls(): 75 | for n in range(0, 200): 76 | yield 'http://commondatastorage.googleapis.com/books/ngrams/books/googlebooks-eng-all-3gram-20090715-' + str(n) + '.csv.zip' 77 | 78 | def url2filename(url): 79 | return url[url.rfind('/')+1:] 80 | 81 | def filename2gz(filename): 82 | return filename + '-2008.list.gz' 83 | 84 | if __name__ == '__main__': 85 | import sys 86 | if len(sys.argv) > 1 and sys.argv[1] == '--run': 87 | for url in urls(): 88 | try: 89 | dst = url2filename(url) 90 | dstgz = filename2gz(dst) 91 | if not os.path.exists(dstgz): 92 | if not os.path.exists(dst): 93 | download(url, dst) 94 | extract(dst, dstgz) 95 | except urllib2.HTTPError, e: 96 | print(e.reason) 97 | print('continuing...') 98 | finally: 99 | delete(dst) # delete either partial and or complete 100 | 101 | -------------------------------------------------------------------------------- /data/corpus/google-ngrams/import2bin-ngram.c: -------------------------------------------------------------------------------- 1 | // ex: set ts=8 noet: 2 | 3 | // Convert text-based CSV format "x,y,z,freq" to packed little-endian binary format 4 | // 5 | // Usage: gzip -dc *.ids.gz | ./import2bin-ngram > ngram3.bin.orig.c 6 | // 7 | // Port from import2bin.py; it was just too slow. We're >10x faster. 
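//
// A minimal reader sketch (not part of this tool, shown only to document the
// format): assuming the same packed 16-byte record defined below, the output
// can be read back with
//
//   ngram3 ng;
//   while (fread(&ng, sizeof ng, 1, fin) == 1)
//       printf("%u,%u,%u,%u\n", ng.id[0], ng.id[1], ng.id[2], ng.freq);
//
// Records are written in the host's byte order, so the .bin files are not
// portable between machines of different endianness.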
8 | 9 | #include 10 | #include 11 | #include 12 | #include 13 | #include 14 | #include 15 | #include 16 | #include 17 | 18 | typedef struct { 19 | #pragma pack(push, 1) 20 | uint32_t id[3], 21 | freq; 22 | #pragma pack(pop) 23 | } ngram3; 24 | 25 | // "id0,id1,id2,freq" -> ngram3 26 | int line2ng(const wchar_t *line, ngram3 *ng) 27 | { 28 | return swscanf(line, 29 | L"%" SCNu32 ",%" SCNu32 ",%" SCNu32 ",%" SCNu32 "\n", 30 | ng->id+0, ng->id+1, ng->id+2, &ng->freq) == 4; 31 | } 32 | 33 | // stdin -> [ngram3(...),...] 34 | int main(void) 35 | { 36 | // we're going to be writing out 100s of MB in a batch; use a large buffer 37 | # define BUFLEN 32 * 1024 * 1024L 38 | static wchar_t line[1024]; 39 | char *buf = malloc(BUFLEN); 40 | ngram3 ng; 41 | 42 | assert(sizeof ng == 16 && "ensure packing"); 43 | 44 | if (!setlocale(LC_CTYPE, "")) 45 | { 46 | fprintf(stderr, "Can't set the specified locale! Check LANG, LC_CTYPE, LC_ALL.\n"); 47 | return 1; 48 | } 49 | 50 | // fully buffer stdout 51 | setvbuf(stdout, buf, _IOFBF, BUFLEN); 52 | 53 | // parse lines from stdin, write packed binary ngram to stdout, errors to stderr 54 | while (fgetws(line, sizeof line / sizeof line[0], stdin)) 55 | { 56 | if (line2ng(line, &ng)) 57 | { 58 | fwrite(&ng, sizeof ng, 1, stdout); 59 | } 60 | else 61 | { 62 | fprintf(stderr, "invalid line '%ls'\n", line); 63 | } 64 | } 65 | 66 | free(buf); 67 | 68 | return 0; 69 | } 70 | 71 | -------------------------------------------------------------------------------- /data/corpus/google-ngrams/import2bin-word.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import os 4 | 5 | """ 6 | extract.py generated: 7 | word.csv.gz: master word file (word id,word) 8 | *-2008.ids.gz: files of 3-ary ngrams (id0,id1,id2,freq) 9 | 10 | we take word.csv.gz, which is already in sorted order by id ascending, 11 | and compact the words into binary format and write to word.bin 12 | """ 13 | 14 | from struct import pack,unpack 15 | with os.popen('gzip -dc word.csv.gz', 'r') as gz: 16 | with open('word.bin', 'wb') as bin: 17 | for line in gz: 18 | wid,word = line.rstrip().split(',', 1) 19 | wid = int(wid) 20 | bword = bytes(word, 'utf-8') 21 | wlen = len(bword) 22 | # pad bword with enough \0 to make next string start with alignment=4 23 | bword += b'\0' * (1 + ((len(bword)+1) % 4)) 24 | """ 25 | write [uint32_t len][word ... \0\0?\0?\0?] 26 | we use fields that are multiples of 4 bytes to keep the &word[0] 32-bit aligned 27 | which improves read performance 28 | """ 29 | bin.write(pack(' 4 | * 5 | * google's data has duplicate ngrams(!) 
6 | * sort our ngram.bin file's entries, then merge/sum 7 | * 8 | */ 9 | 10 | #include 11 | #include 12 | #include 13 | #include 14 | #include 15 | #include 16 | #include 17 | #include 18 | #include 19 | #include "ngram3bin.h" 20 | 21 | static void sortfile(const struct ngram3map *m) 22 | { 23 | size_t nmemb = m->size / sizeof(ngram3); 24 | printf("%s:%u qsort(%p, %zu, %zu, %p);\n", 25 | __func__, __LINE__, (void*)m->m, nmemb, sizeof(ngram3), (void*)ngram3cmp); 26 | qsort(m->m, nmemb, sizeof(ngram3), ngram3cmp); 27 | } 28 | 29 | /* 30 | * ngram3map.m is a big mmap array of ngram3 31 | * it's been sorted, we want to merge consecutive identical ids into a single one, summing the freq field 32 | */ 33 | static void mergefile(const struct ngram3map *m) 34 | { 35 | char *buf = malloc(1024 * 1024); 36 | ngram3 *rd = ngram3map_start(m); 37 | const ngram3 *end = ngram3map_end(m); 38 | unsigned long uniqcnt = 1; 39 | FILE *f = fopen("ngram3.bin.sort", "w"); 40 | ngram3 wr = *rd; 41 | perror("fopen"); 42 | rd++; 43 | setvbuf(f, buf, _IOFBF, 1024 * 1024); 44 | perror("setvbuf"); 45 | while (rd < end) 46 | { 47 | if (rd->id[0] == wr.id[0] && 48 | rd->id[1] == wr.id[1] && 49 | rd->id[2] == wr.id[2]) 50 | { 51 | wr.freq += rd->freq; 52 | } 53 | else 54 | { 55 | fwrite(&wr, sizeof wr, 1, f); 56 | wr = *rd; 57 | uniqcnt++; 58 | } 59 | rd++; 60 | } 61 | printf("%s:%u\n", __func__, __LINE__); 62 | 63 | printf("merged into %lu ngram3s...\n", uniqcnt); 64 | printf("saving...\n"); 65 | 66 | fclose(f); 67 | perror("fclose"); 68 | free(buf); 69 | } 70 | 71 | int main(void) 72 | { 73 | const char *path = "ngram3.bin"; 74 | struct ngram3map m = ngram3bin_init(path, 1); 75 | printf("map %llu bytes (%llu ngram3s)\n", m.size, m.size / sizeof(ngram3)); 76 | printf("sorting...\n"); 77 | sortfile(&m); 78 | printf("merging...\n"); 79 | mergefile(&m); 80 | printf("done.\n"); 81 | ngram3bin_fini(m); 82 | return 0; 83 | } 84 | 85 | -------------------------------------------------------------------------------- /data/corpus/google-ngrams/ngram3bin.c: -------------------------------------------------------------------------------- 1 | /* ex: set ts=8 noet: */ 2 | /* 3 | * Copyright 2011 Ryan Flynn 4 | * 5 | * our 3-ary ngrams are in binary format in ngram3.bin 6 | */ 7 | 8 | #include 9 | #include 10 | #include 11 | #include 12 | #include 13 | #include 14 | #include 15 | #include 16 | #include "ngram3bin.h" 17 | 18 | void ngram3bin_str(const struct ngram3map m, FILE *f) 19 | { 20 | fprintf(f, "ngram3map(size=%llu)", m.size); 21 | } 22 | 23 | struct ngram3map ngram3bin_init(const char *path, int write) 24 | { 25 | struct stat st; 26 | struct ngram3map m = { NULL, -1, 0 }; 27 | if (!stat(path, &st)) 28 | { 29 | //printf("stat(\"%s\") size=%llu\n", path, (unsigned long long)st.st_size); 30 | if (-1 != (m.fd = open(path, write ? O_RDWR : O_RDONLY))) 31 | { 32 | m.size = st.st_size; 33 | m.m = mmap(NULL, m.size, PROT_READ | (write ? PROT_WRITE : 0), MAP_SHARED, m.fd, 0); 34 | if (MAP_FAILED == m.m) 35 | { 36 | perror("mmap"); 37 | m.m = NULL; 38 | } 39 | } 40 | } 41 | return m; 42 | } 43 | 44 | // ng is 'cnt' items long; we need to ensure at least 1 more ngram3 in it 45 | ngram3 * ngram3_find_spacefor1more(ngram3 *ng, unsigned long cnt) 46 | { 47 | // allocate space for results on every power of 2 48 | // 0->1, 1->2, 2->4, 4->8, etc. 49 | if ((cnt & (cnt - 1)) == 0) 50 | { 51 | unsigned long alloc = cnt ? 
cnt * 2 : 1; 52 | ngram3 *tmp = realloc(ng, alloc * sizeof *ng); 53 | if (tmp) 54 | { 55 | ng = tmp; 56 | } 57 | else 58 | { 59 | free(ng); 60 | ng = NULL; 61 | } 62 | } 63 | return ng; 64 | } 65 | 66 | /* 67 | * map contains the mmap'ed contents of a dictionary file 68 | * the dictionary file is a list of variable-length entries in the form 69 | * [uint32_t id][uint32_t len][utf-8 encoded string of bytes length 'len'] 70 | */ 71 | struct ngramword ngramword_load(const struct ngram3map m) 72 | { 73 | ngramwordcursor *cursor = m.m; 74 | ngramwordcursor *end = (void *)((char *)m.m + m.size); 75 | unsigned long maxpossible = m.size / 6 + 1; 76 | struct ngramword w; 77 | w.word = calloc(maxpossible, sizeof *w.word); 78 | w.cnt = 0; 79 | while (cursor < end) 80 | { 81 | const char *str = ngramwordcursor_str(cursor); 82 | w.word[w.cnt].len = cursor->len; 83 | w.word[w.cnt].str = str; 84 | w.cnt++; 85 | cursor = ngramwordcursor_next(cursor); 86 | } 87 | w.word = realloc(w.word, w.cnt * sizeof *w.word); 88 | return w; 89 | } 90 | 91 | /* 92 | * FIXME: O(n) 93 | * note: this is mitigated by the python module by using a dict 94 | * if i added a stage at the beginning of this whole process and sorted words then we could 95 | * reduce this to O(log n) 96 | * we can also reduce the impact of this by converting all tokens in a document to their ids 97 | * once for the duration of the process; currently we're being lazy and repeatedly translating 98 | */ 99 | const unsigned long ngramword_word2id(const char *word, unsigned len, const struct ngramword w) 100 | { 101 | unsigned long id = 0; 102 | printf("ngramword_word2id(word=\"%s\", w={%lu,%p})\n", word, w.cnt, w.word); 103 | while (id < w.cnt) 104 | { 105 | if (w.word[id].len == len && 0 == memcmp(word, w.word[id].str, len)) 106 | break; 107 | id++; 108 | } 109 | if (id == w.cnt) 110 | id = 0; 111 | return id; 112 | } 113 | 114 | const char * ngramword_id2word(unsigned long id, const struct ngramword w) 115 | { 116 | if (id < w.cnt) 117 | return w.word[id].str; 118 | return NULL; 119 | } 120 | 121 | void ngramword_fini(struct ngramword w) 122 | { 123 | free(w.word); 124 | } 125 | 126 | /* 127 | * ngram3 comparison callback 128 | * ascending order 129 | */ 130 | int ngram3cmp(const void *va, const void *vb) 131 | { 132 | const ngram3 *a = va, 133 | *b = vb; 134 | if (a->id[0] != b->id[0]) return (int)(a->id[0] - b->id[0]); 135 | if (a->id[1] != b->id[1]) return (int)(a->id[1] - b->id[1]); 136 | if (a->id[2] != b->id[2]) return (int)(a->id[2] - b->id[2]); 137 | return 0; 138 | } 139 | 140 | /* 141 | * 142 | */ 143 | unsigned long ngram3bin_freq(ngram3 find, const struct ngram3map *m) 144 | { 145 | ngram3 *base = m->m; 146 | size_t nmemb = m->size / sizeof *base; 147 | const ngram3 *res = bsearch(&find, base, nmemb, sizeof *base, ngram3cmp); 148 | return res ? 
res->freq : 0; 149 | } 150 | 151 | /* 152 | * given find (x,y) sum the occurences of (x,y,_) and (_,x,y) 153 | */ 154 | unsigned long ngram3bin_freq2(ngram3 find, const struct ngram3map *m) 155 | { 156 | unsigned long freq = 0; 157 | ngram3 *cur = m->m; 158 | const ngram3 *end = (ngram3 *)((char *)cur + m->size); 159 | while (cur < end) 160 | { 161 | if (cur->id[0] == find.id[0] && 162 | cur->id[1] == find.id[1]) 163 | { 164 | freq += cur->freq; 165 | } 166 | else 167 | if (cur->id[1] == find.id[0] && 168 | cur->id[2] == find.id[1]) 169 | { 170 | freq += cur->freq; 171 | } 172 | cur++; 173 | } 174 | return freq; 175 | } 176 | 177 | 178 | /* 179 | * given an id 3-gram (x,y,z) and a list of ngram frequencies 180 | * return matches (_,y,z) or (x,_,z) or (x,y,_) 181 | */ 182 | ngram3 * ngram3bin_like(ngram3 find, const struct ngram3map *m) 183 | { 184 | unsigned long ngcnt = 0; 185 | ngram3 *cur = m->m; 186 | const ngram3 *end = (ngram3*)((char*)cur + m->size); 187 | ngram3 *res = NULL; 188 | while (cur < end) 189 | { 190 | if (((cur->id[0] == find.id[0]) + 191 | (cur->id[1] == find.id[1]) + 192 | (cur->id[2] == find.id[2])) == 2) 193 | { 194 | res = ngram3_find_spacefor1more(res, ngcnt); 195 | if (!res) 196 | break; 197 | res[ngcnt] = *cur; /* copy result */ 198 | ngcnt++; 199 | } 200 | cur++; 201 | } 202 | if (res) 203 | { 204 | if ((res = ngram3_find_spacefor1more(res, ngcnt))) 205 | res[ngcnt].freq = 0; // sentinel 206 | } 207 | return res; 208 | } 209 | 210 | static unsigned long ngram3bin_like_xy_(ngram3 find, const struct ngram3map *m, ngram3 **res, unsigned long rescnt); 211 | static unsigned long ngram3bin_like_x_z(ngram3 find, const struct ngram3map *m, ngram3 **res, unsigned long rescnt); 212 | static unsigned long ngram3bin_like__yz(ngram3 find, const struct ngram3map *m, ngram3 **res, unsigned long rescnt, 213 | ngram3bin_index *idx); 214 | 215 | /* 216 | * given an id 3-gram (x,y,z) and a list of ngram frequencies 217 | * return matches (_,y,z) or (x,_,z) or (x,y,_) 218 | * 219 | * note: this is really the crux of the application: finding ngram-based context. 220 | * this function will be run thousands of times for every page of text our application checks. 221 | * 'm' represents 10s of millions of records totalling 100s of MBs. 222 | * efficiency is critical. 223 | * 224 | * note: upgrade of ngram3bin_like(), which performed a sequential scan of the entire 'm' every time. 225 | * this was simple and effective but just too inefficient. 226 | * so, we broke up the 3 types of matches performed into separate functions which incorporate binary 227 | * searches, which should reduce CPU-memory traffic considerably. 228 | * update: preliminary profiling suggests this is ~40x faster. 
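 *
 * rough cost picture (N = records in m, given the (x,y,z) sort order):
 *   (x,y,_) : one bsearch plus a scan of the contiguous matches   -> O(log N + matches)
 *   (x,_,z) : bsearch to the (x,_,_) range, then scan that range  -> O(log N + span(x))
 *   (_,y,z) : no helpful sort order, so we walk the per-x span index
 *             (ngram3bin_index) and bsearch within each large span instead
 *             of touching every record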
229 | */ 230 | ngram3 * ngram3bin_like_better(ngram3 find, const struct ngram3map *m, ngram3bin_index *idx) 231 | { 232 | ngram3 *res = NULL; 233 | unsigned long rescnt = 0; 234 | rescnt = ngram3bin_like_xy_(find, m, &res, rescnt); 235 | rescnt = ngram3bin_like_x_z(find, m, &res, rescnt); 236 | rescnt = ngram3bin_like__yz(find, m, &res, rescnt, idx); 237 | if (res) 238 | { 239 | if ((res = ngram3_find_spacefor1more(res, rescnt))) 240 | res[rescnt].freq = 0; // sentinel 241 | } 242 | return res; 243 | } 244 | 245 | static int ngram3cmp_xy_(const void *va, const void *vb) 246 | { 247 | const ngram3 *a = va, 248 | *b = vb; 249 | if (a->id[0] != b->id[0]) return (int)(a->id[0] - b->id[0]); 250 | if (a->id[1] != b->id[1]) return (int)(a->id[1] - b->id[1]); 251 | return 0; 252 | } 253 | 254 | /* 255 | * find entries in m matching (x,y,_) from find 256 | * because m's contents are sorted we can use bsearch 257 | */ 258 | static unsigned long ngram3bin_like_xy_(ngram3 find, const struct ngram3map *m, ngram3 **res, unsigned long rescnt) 259 | { 260 | const ngram3 *base = m->m; 261 | const size_t nmemb = m->size / sizeof *base; 262 | const ngram3 *bs = bsearch(&find, base, nmemb, sizeof *base, ngram3cmp_xy_); 263 | if (bs) 264 | { 265 | const ngram3 *end = (ngram3*)((char*)m->m + m->size); 266 | // at least one x,y_ exists, but many may exist and we can't be certain 267 | // where in that range we have landed 268 | // rewind to the beginning of the range... 269 | while (bs > base && (bs-1)->id[0] == find.id[0] && (bs-1)->id[1] == find.id[1]) 270 | bs--; 271 | // ...and then seek forward, capturing all (contiguous) matches 272 | while (bs < end && bs->id[0] == find.id[0] && bs->id[1] == find.id[1]) 273 | { 274 | *res = ngram3_find_spacefor1more(*res, rescnt); 275 | if (!*res) 276 | break; 277 | (*res)[rescnt] = *bs; 278 | rescnt++; 279 | bs++; 280 | } 281 | } 282 | return rescnt; 283 | } 284 | 285 | static int ngram3cmp_x__(const void *va, const void *vb) 286 | { 287 | const ngram3 *a = va, 288 | *b = vb; 289 | if (a->id[0] != b->id[0]) return (int)(a->id[0] - b->id[0]); 290 | return 0; 291 | } 292 | 293 | /* 294 | * find entries in m matching (x,_,z) from find 295 | * because m's contents are sorted we can use bsearch 296 | */ 297 | static unsigned long ngram3bin_like_x_z(ngram3 find, const struct ngram3map *m, ngram3 **res, unsigned long rescnt) 298 | { 299 | const ngram3 *base = m->m; 300 | const size_t nmemb = m->size / sizeof *base; 301 | const ngram3 *bs = bsearch(&find, base, nmemb, sizeof *base, ngram3cmp_x__); 302 | if (bs) 303 | { 304 | const ngram3 *end = (ngram3*)((char*)m->m + m->size); 305 | // rewind to the beginning of (x,_,_) range... 306 | while (bs > base && (bs-1)->id[0] == find.id[0]) 307 | bs--; 308 | // and then seek forward through all (x,_,_), 309 | // recording any (x,_,z) matches 310 | while (bs < end && bs->id[0] == find.id[0]) 311 | { 312 | if (bs->id[2] == find.id[2]) 313 | { 314 | *res = ngram3_find_spacefor1more(*res, rescnt); 315 | if (!*res) 316 | break; 317 | (*res)[rescnt] = *bs; 318 | rescnt++; 319 | } 320 | bs++; 321 | } 322 | } 323 | return rescnt; 324 | } 325 | 326 | /* 327 | * given find (x,y,z), search m for all matches of (_,y,z) with help of the index 328 | * m entries are sorted by (x,y,z) 329 | * idx is a length of the spans of entries with the same (x,_,_) 330 | * search through m by idx[] records at a time. 
331 | * search sequential for small spans, bsearch large ones 332 | */ 333 | static unsigned long ngram3bin_like__yz(ngram3 find, const struct ngram3map *m, 334 | ngram3 **res, unsigned long rescnt, 335 | ngram3bin_index *idx) 336 | { 337 | # define SPAN_LARGE 16 // arbitrary, somewhat-reasonable number 338 | uint32_t *span = idx->span; 339 | const ngram3 *mcur = m->m; 340 | while (*span) 341 | { 342 | if (*span < SPAN_LARGE) 343 | { 344 | // small span, search sequentially 345 | const ngram3 *mend = mcur + *span; 346 | while (mcur < mend) 347 | { 348 | if (mcur->id[1] == find.id[1] && 349 | mcur->id[2] == find.id[2]) 350 | { 351 | if ((*res = ngram3_find_spacefor1more(*res, rescnt))) 352 | (*res)[rescnt++] = *mcur; 353 | mcur = mend; 354 | break; 355 | } 356 | mcur++; 357 | } 358 | } 359 | else 360 | { 361 | // large span, bsearch 362 | const ngram3 *bs; 363 | find.id[0] = mcur->id[0]; // first id must match(!) 364 | if ((bs = bsearch(&find, mcur, *span, sizeof *mcur, ngram3cmp))) 365 | { 366 | if ((*res = ngram3_find_spacefor1more(*res, rescnt))) 367 | (*res)[rescnt++] = *bs; 368 | } 369 | mcur += *span; 370 | } 371 | // mcur set to previous mcur + *span by this point 372 | span++; 373 | } 374 | return rescnt; 375 | } 376 | 377 | /* 378 | * sum ngram3 word frequencies in w.word[n].freq 379 | */ 380 | void ngramword_totalfreqs(struct ngramword w, const struct ngram3map *m) 381 | { 382 | ngram3 *cur = m->m; 383 | const ngram3 *end = (ngram3*)((char*)cur + m->size); 384 | while (cur < end) 385 | { 386 | if (cur->id[0] < w.cnt) w.word[cur->id[0]].freq += cur->freq; 387 | if (cur->id[1] < w.cnt) w.word[cur->id[1]].freq += cur->freq; 388 | if (cur->id[2] < w.cnt) w.word[cur->id[2]].freq += cur->freq; 389 | cur++; 390 | } 391 | { 392 | unsigned long i, cnt = w.cnt; 393 | for (i = 0; i < cnt; i++) 394 | w.word[i].freq /= 2; 395 | } 396 | } 397 | 398 | /* 399 | * build an index that speeds out searches of (_,y,z) searches 400 | * count the spans of consecutive id[0]s in m 401 | * e.g. 
[(x,_,_),(x,_,_),(y,_,_),(z,_,_),(z,_,_),(z,_,_)] 402 | * |_______| | |_______________| 403 | * 2 1 3 404 | */ 405 | int ngram3bin_index_init(ngram3bin_index *idx, const struct ngram3map *m, const struct ngramword *w) 406 | { 407 | /* 408 | * allocate enough space to hold a counter for every existing unique word, 409 | * even though not every word may necessarily be present in id[0] 410 | */ 411 | idx->span = malloc((w->cnt + 1) * sizeof *idx->span); 412 | if (idx->span) 413 | { 414 | unsigned long spanidx = 0, 415 | spancnt = 1; 416 | const ngram3 *cur = m->m; 417 | const ngram3 *end = (ngram3*)((char*)cur + m->size); 418 | const ngram3 *nxt = cur+1; 419 | while (nxt < end) 420 | { 421 | if (cur->id[0] == nxt->id[0]) 422 | { 423 | spancnt++; 424 | } 425 | else 426 | { 427 | idx->span[spanidx] = spancnt; 428 | spanidx++; 429 | spancnt = 1; 430 | } 431 | cur = nxt; 432 | nxt++; 433 | } 434 | idx->span[spanidx] = spancnt; 435 | idx->span[spanidx+1] = 0; // sentinel 436 | } 437 | return !!idx->span; 438 | } 439 | 440 | void ngram3bin_index_fini(ngram3bin_index *idx) 441 | { 442 | free(idx->span); 443 | } 444 | 445 | void ngram3bin_fini(struct ngram3map m) 446 | { 447 | munmap(m.m, m.size); 448 | close(m.fd); 449 | } 450 | 451 | /* 452 | * sort descending by frequency 453 | */ 454 | static int follows_cmp(const void *va, const void *vb) 455 | { 456 | const ngram3 *a = va, 457 | *b = vb; 458 | return (int)(b->freq - a->freq); 459 | } 460 | 461 | /* 462 | * given a single word, return a list of words follow and their frequency 463 | */ 464 | ngram3 * ngram3bin_follows(const ngram3 *find, const struct ngram3map *m) 465 | { 466 | uint32_t fid = find->id[0]; 467 | unsigned long ngcnt = 0; 468 | ngram3 *cur = m->m; 469 | const ngram3 *end = (ngram3*)((char*)cur + m->size); 470 | ngram3 *res = NULL; 471 | while (cur < end) 472 | { 473 | int foundindex; 474 | if (cur->id[0] == fid) 475 | foundindex = 1; 476 | else if (cur->id[1] == fid) 477 | foundindex = 2; 478 | else 479 | foundindex = 0; 480 | 481 | if (foundindex) 482 | { 483 | int i; 484 | // linear scan for already found... 
485 | for (i = 0; i < ngcnt; i++) 486 | { 487 | if (res[i].id[0] == cur->id[foundindex]) 488 | { 489 | res[i].freq++; 490 | if (i > 0 && res[i].freq > res[i-1].freq * 2) 491 | { 492 | // bring most common entries to front of list 493 | ngram3 tmp = res[i]; 494 | res[i] = res[i-1]; 495 | res[i-1] = tmp; 496 | } 497 | break; 498 | } 499 | } 500 | // didn't find, add another entry to list 501 | if (i == ngcnt) 502 | { 503 | res = ngram3_find_spacefor1more(res, ngcnt); 504 | if (!res) 505 | break; 506 | res[ngcnt].id[0] = cur->id[foundindex]; 507 | res[ngcnt].freq = 1; 508 | ngcnt++; 509 | } 510 | } 511 | cur++; 512 | } 513 | if (res) 514 | { 515 | res = ngram3_find_spacefor1more(res, ngcnt); 516 | if (res) 517 | { 518 | res[ngcnt].freq = 0; // sentinel 519 | // sort results 520 | qsort(res, ngcnt, sizeof *res, follows_cmp); 521 | } 522 | } 523 | return res; 524 | } 525 | 526 | #ifdef TEST 527 | 528 | /* 529 | * dump binary entries for sanity checking 530 | */ 531 | static void ngram3bin_dump(const struct ngram3map *m, const struct ngramword w) 532 | { 533 | const ngram3 *cur = m->m; 534 | const ngram3 *end = (ngram3*)((char*)cur + m->size); 535 | while (cur < end) 536 | { 537 | printf("%6lu:%-16s %6lu:%-16s %6lu:%-16s %8lu\n", 538 | (unsigned long)cur->id[0], ngramword_id2word(cur->id[0], w), 539 | (unsigned long)cur->id[1], ngramword_id2word(cur->id[1], w), 540 | (unsigned long)cur->id[2], ngramword_id2word(cur->id[2], w), 541 | (unsigned long)cur->freq); 542 | cur++; 543 | } 544 | } 545 | 546 | int main(void) 547 | { 548 | struct ngram3map mb = ngram3bin_init("ngram3.bin", 0); 549 | struct ngram3map mw = ngram3bin_init("word.bin", 0); 550 | struct ngramword w = ngramword_load(mw); 551 | const ngram3 find = { 5, 29835, 22, 0 }; // am fond of 552 | // googlebooks-eng-all-3gram-20090715-24.csv.zip-2008.list.gz 553 | // 552621:am fond of 3170 554 | // $ zcat googlebooks-eng-all-3gram-20090715-24.csv.zip-2008.ids.gz | grep -In '^5,29835,22,3170$' 555 | // 427250:5,29835,22,3170 556 | printf("map %llu bytes (%llu ngram3s)\n", mb.size, mb.size / sizeof find); 557 | printf("freq of %lu.%lu.%lu: %lu\n", 558 | (unsigned long)find.id[0], 559 | (unsigned long)find.id[1], 560 | (unsigned long)find.id[2], 561 | ngram3bin_freq(find, &mb)); 562 | ngram3bin_dump(&mb, w); 563 | ngram3bin_fini(mb); 564 | ngramword_fini(w); 565 | ngram3bin_fini(mw); 566 | return 0; 567 | } 568 | 569 | #endif 570 | 571 | -------------------------------------------------------------------------------- /data/corpus/google-ngrams/ngram3bin.h: -------------------------------------------------------------------------------- 1 | /* ex: set ts=8 noet: */ 2 | /* 3 | * Copyright 2011 Ryan Flynn 4 | */ 5 | 6 | #ifndef NGRAM3BIN_H 7 | #define NGRAM3BIN_H 8 | 9 | #include 10 | #include 11 | 12 | #define UNKNOWN_ID (0) 13 | #define IMPOSSIBLE_ID (~0) 14 | 15 | struct ngram3map 16 | { 17 | void *m; 18 | int fd; 19 | unsigned long long size; 20 | }; 21 | 22 | #define ngram3map_start(map) ((ngram3*)((map)->m)) 23 | #define ngram3map_end(map) ((ngram3*)(((char *)((map)->m)) + (map)->size)) 24 | 25 | struct ngramword 26 | { 27 | unsigned long cnt; 28 | struct wordlen { 29 | unsigned len; 30 | unsigned freq; 31 | const char *str; 32 | } *word; 33 | }; 34 | 35 | #pragma pack(push, 1) 36 | struct ngramwordcursor { 37 | uint32_t len; 38 | }; 39 | #pragma pack(pop) 40 | typedef struct ngramwordcursor ngramwordcursor; 41 | 42 | #define ngramwordcursor_str(cur) ((char *)(cur) + sizeof *(cur)) 43 | #define ngramwordcursor_next(cur) (void *)((char 
*)(ngramwordcursor_str(cur) + ((cur)->len + (1 + ((cur)->len+1) % 4)))) 44 | 45 | #pragma pack(push, 1) 46 | typedef struct 47 | { 48 | uint32_t id[3], 49 | freq; 50 | } ngram3; 51 | #pragma pack(pop) 52 | 53 | /* 54 | * ngram3 is a sorted array of 3-grams (x,y,z) 55 | * for each unique x, count the number of sequential records (x,_,_) 56 | * this allows us to more efficiently search for (_,y,z) 57 | * 58 | * note: we don't need to track which id each span represents, we 59 | * can retrieve it when necessary; we just need the number of records 60 | */ 61 | typedef struct 62 | { 63 | uint32_t *span; 64 | } ngram3bin_index; 65 | 66 | struct ngramword ngramword_load(const struct ngram3map); 67 | const unsigned long ngramword_word2id(const char *word, unsigned len, const struct ngramword); 68 | const char * ngramword_id2word(unsigned long id, const struct ngramword); 69 | void ngramword_totalfreqs(struct ngramword, const struct ngram3map *); 70 | void ngramword_fini(struct ngramword); 71 | 72 | struct ngram3map ngram3bin_init(const char *path, int write); 73 | unsigned long ngram3bin_freq(ngram3 find, const struct ngram3map *); 74 | unsigned long ngram3bin_freq2(ngram3 find, const struct ngram3map *); 75 | ngram3 * ngram3bin_like(ngram3 find, const struct ngram3map *); 76 | ngram3 * ngram3bin_like_better(ngram3 find, const struct ngram3map *, ngram3bin_index *); 77 | void ngram3bin_str (const struct ngram3map, FILE *); 78 | void ngram3bin_fini(struct ngram3map); 79 | ngram3 * ngram3bin_follows(const ngram3 *, const struct ngram3map *); 80 | 81 | int ngram3bin_index_init(ngram3bin_index *, const struct ngram3map *, const struct ngramword *); 82 | void ngram3bin_index_fini(ngram3bin_index *); 83 | 84 | int ngram3cmp(const void *, const void *); 85 | 86 | #endif /* NGRAM3BIN_H */ 87 | 88 | -------------------------------------------------------------------------------- /data/corpus/google-ngrams/ngram3binpy.c: -------------------------------------------------------------------------------- 1 | /* ex: set ts=8 noet: */ 2 | /* 3 | * Copyright 2011 Ryan Flynn 4 | * 5 | * ngram3bin python bindings 6 | * 7 | * Reference: http://starship.python.net/crew/arcege/extwriting/pyext.html 8 | * http://docs.python.org/release/2.5.2/ext/callingPython.html 9 | * http://www.fnal.gov/docs/products/python/v1_5_2/ext/buildValue.html 10 | */ 11 | 12 | #include 13 | #include 14 | #include "ngram3bin.h" 15 | 16 | #if PY_MAJOR_VERSION >= 3 17 | #define PY3K 18 | #endif 19 | 20 | /* 21 | * obj PyObject wrapper 22 | */ 23 | typedef struct { 24 | PyObject_HEAD 25 | struct ngram3map wordmap; 26 | struct ngram3map ngramap; 27 | struct ngramword word; 28 | ngram3bin_index ngramap_index; 29 | PyObject *worddict; 30 | } ngram3bin; 31 | 32 | static void ngram3bin_dealloc(PyObject *self); 33 | static int ngram3bin_print (PyObject *self, FILE *fp, int flags); 34 | #ifndef PY3K 35 | static PyObject *ngram3bin_getattr(PyObject *self, char *attr); 36 | #endif 37 | 38 | static PyObject *ngram3bin_new (PyObject *self, PyObject *args); 39 | static PyObject *ngram3binpy_word2id(PyObject *self, PyObject *args); 40 | static PyObject *ngram3binpy_id2word(PyObject *self, PyObject *args); 41 | static PyObject *ngram3binpy_id2freq(PyObject *self, PyObject *args); 42 | static PyObject *ngram3binpy_wordfreq(PyObject *self, PyObject *args); 43 | static PyObject *ngram3binpy_freq (PyObject *self, PyObject *args); 44 | static PyObject *ngram3binpy_like (PyObject *self, PyObject *args); 45 | static PyObject *ngram3binpy_follows(PyObject *self, 
PyObject *args); 46 | 47 | static struct PyMethodDef ngram3bin_Methods[] = { 48 | { "word2id", (PyCFunction) ngram3binpy_word2id, METH_VARARGS, NULL }, 49 | { "id2word", (PyCFunction) ngram3binpy_id2word, METH_VARARGS, NULL }, 50 | { "id2freq", (PyCFunction) ngram3binpy_id2freq, METH_VARARGS, NULL }, 51 | { "wordfreq", (PyCFunction) ngram3binpy_wordfreq, METH_VARARGS, NULL }, 52 | { "freq", (PyCFunction) ngram3binpy_freq, METH_VARARGS, NULL }, 53 | { "like", (PyCFunction) ngram3binpy_like, METH_VARARGS, NULL }, 54 | { "ngram3bin", (PyCFunction) ngram3bin_new, METH_VARARGS, NULL }, 55 | { "follows", (PyCFunction) ngram3binpy_follows, METH_VARARGS, NULL }, 56 | { NULL, NULL, 0, NULL } 57 | }; 58 | 59 | /* 60 | * ngram3bin type-builtin methods 61 | */ 62 | PyTypeObject ngram3bin_Type = { 63 | #ifdef PY3K 64 | PyVarObject_HEAD_INIT(NULL, 0) 65 | #else 66 | PyObject_HEAD_INIT(NULL) 67 | 0, /* ob_size */ 68 | #endif 69 | "ngram3bin", /* char *tp_name; */ 70 | sizeof(ngram3bin), /* int tp_basicsize; */ 71 | 0, /* int tp_itemsize; not used much */ 72 | ngram3bin_dealloc, /* destructor tp_dealloc; */ 73 | ngram3bin_print, /* printfunc tp_print; */ 74 | #ifdef PY3K 75 | 0, /* getattrfunc tp_getattr; __getattr__ */ 76 | #else 77 | ngram3bin_getattr, /* getattrfunc tp_getattr; __getattr__ */ 78 | #endif 79 | 0, /* setattrfunc tp_setattr; __setattr__ */ 80 | 0, /* cmpfunc tp_compare; __cmp__ */ 81 | 0, /* reprfunc tp_repr; __repr__ */ 82 | 0, /* PyNumberMethods *tp_as_number; */ 83 | 0, /* PySequenceMethods *tp_as_sequence; */ 84 | 0, /* PyMappingMethods *tp_as_mapping; */ 85 | 0, /* hashfunc tp_hash; __hash__ */ 86 | 0, /* ternaryfunc tp_call; __call__ */ 87 | 0, /* reprfunc tp_str; __str__ */ 88 | #ifdef PY3K 89 | PyObject_GenericGetAttr,/* tp_getattro */ 90 | 0, /* tp_setattro */ 91 | 0, /* tp_as_buffer */ 92 | Py_TPFLAGS_DEFAULT, /* tp_flags */ 93 | 0, /* tp_doc */ 94 | 0, /* tp_traverse */ 95 | 0, /* tp_clear */ 96 | 0, /* tp_richcompare */ 97 | 0, /* tp_weaklistoffset */ 98 | 0, /* tp_iter */ 99 | 0, /* tp_iternext */ 100 | ngram3bin_Methods, /* tp_methods */ 101 | 0, /* tp_members */ 102 | 0, /* tp_getset */ 103 | 0, /* tp_base */ 104 | 0, /* tp_dict */ 105 | 0, /* tp_descr_get */ 106 | 0, /* tp_descr_set */ 107 | 0, /* tp_dictoffset */ 108 | 0, /* tp_init */ 109 | 0, /* tp_alloc */ 110 | 0, /* tp_new */ 111 | #endif 112 | }; 113 | 114 | struct module_state { 115 | PyObject *error; 116 | }; 117 | 118 | #if PY_MAJOR_VERSION >= 3 119 | #define GETSTATE(m) ((struct module_state*)PyModule_GetState(m)) 120 | #else 121 | #define GETSTATE(m) (&_state) 122 | static struct module_state _state; 123 | #endif 124 | 125 | #if 0 126 | static PyObject * error_out(PyObject *m) 127 | { 128 | struct module_state *st = GETSTATE(m); 129 | PyErr_SetString(st->error, "something bad happened"); 130 | return NULL; 131 | } 132 | #endif 133 | 134 | #if PY_MAJOR_VERSION >= 3 135 | 136 | static int ngram3bin_traverse(PyObject *m, visitproc visit, void *arg) 137 | { 138 | Py_VISIT(GETSTATE(m)->error); 139 | return 0; 140 | } 141 | 142 | static int ngram3bin_clear(PyObject *m) 143 | { 144 | Py_CLEAR(GETSTATE(m)->error); 145 | return 0; 146 | } 147 | 148 | static struct PyModuleDef moduledef = 149 | { 150 | PyModuleDef_HEAD_INIT, 151 | "ngram3bin", 152 | NULL, 153 | sizeof(struct module_state), 154 | ngram3bin_Methods, 155 | NULL, 156 | ngram3bin_traverse, 157 | ngram3bin_clear, 158 | NULL 159 | }; 160 | 161 | #define INITERROR return NULL 162 | 163 | PyObject * 164 | PyInit_ngram3bin(void) 165 | 166 | #else 167 | #define 
INITERROR return 168 | 169 | void 170 | initngram3bin(void) 171 | #endif 172 | { 173 | #ifdef PY3K 174 | PyObject *module = PyModule_Create(&moduledef); 175 | #else 176 | PyObject *module = Py_InitModule("ngram3bin", ngram3bin_Methods); 177 | #endif 178 | 179 | if (module == NULL) 180 | INITERROR; 181 | struct module_state *st = GETSTATE(module); 182 | 183 | st->error = PyErr_NewException("ngram3bin.Error", NULL, NULL); 184 | if (st->error == NULL) 185 | { 186 | Py_DECREF(module); 187 | INITERROR; 188 | } 189 | 190 | #ifdef PY3K 191 | return module; 192 | #endif 193 | } 194 | 195 | #ifndef PY3K 196 | PyObject *ngram3bin_getattr(PyObject *self, char *attr) 197 | { 198 | PyObject *res = Py_FindMethod(ngram3bin_Methods, self, attr); 199 | return res; 200 | } 201 | #endif 202 | 203 | static PyObject * ngram3bin_NEW(void) 204 | { 205 | ngram3bin *obj = PyObject_NEW(ngram3bin, &ngram3bin_Type); 206 | obj->wordmap.m = NULL; 207 | obj->ngramap.m = NULL; 208 | obj->wordmap.fd = -1; 209 | obj->ngramap.fd = -1; 210 | obj->wordmap.size = 0; 211 | obj->ngramap.size = 0; 212 | return (PyObject *)obj; 213 | } 214 | 215 | static PyObject * worddict_new(struct ngramword w) 216 | { 217 | PyObject *d = PyDict_New(); 218 | struct wordlen *wl = w.word; 219 | unsigned long id; 220 | for (id = 0; id < w.cnt; id++, wl++) 221 | { 222 | PyObject *v = PyLong_FromUnsignedLong(id); 223 | PyObject *k = PyBytes_FromStringAndSize(wl->str, wl->len); 224 | (void)PyDict_SetItem(d, k, v); 225 | } 226 | return d; 227 | } 228 | 229 | static PyObject * ngram3bin_new(PyObject *self, PyObject *args) 230 | { 231 | ngram3bin *obj = (ngram3bin *)ngram3bin_NEW(); 232 | char *wordpath = NULL; 233 | char *ngrampath = NULL; 234 | if (PyArg_ParseTuple(args, "ss", &wordpath, &ngrampath)) 235 | { 236 | obj->wordmap = ngram3bin_init(wordpath, 0); 237 | obj->word = ngramword_load(obj->wordmap); 238 | obj->ngramap = ngram3bin_init(ngrampath, 0); 239 | obj->worddict = worddict_new(obj->word); 240 | ngramword_totalfreqs(obj->word, &obj->ngramap); 241 | ngram3bin_index_init(&obj->ngramap_index, &obj->ngramap, &obj->word); 242 | Py_INCREF(obj->worddict); 243 | } 244 | Py_INCREF(obj); 245 | return (PyObject *)obj; 246 | } 247 | 248 | static void ngram3bin_dealloc(PyObject *self) 249 | { 250 | ngram3bin *obj = (ngram3bin *)self; 251 | ngram3bin_fini(obj->wordmap); 252 | ngramword_fini(obj->word); 253 | ngram3bin_fini(obj->ngramap); 254 | PyMem_FREE(self); 255 | } 256 | 257 | static int ngram3bin_print(PyObject *self, FILE *fp, int flags) 258 | { 259 | ngram3bin *obj = (ngram3bin *)self; 260 | ngram3bin_str(obj->wordmap, fp); 261 | ngram3bin_str(obj->ngramap, fp); 262 | return 0; 263 | } 264 | 265 | static PyObject *ngram3binpy_word2id(PyObject *self, PyObject *args) 266 | { 267 | PyObject *res = NULL; 268 | Py_UNICODE *u = NULL; 269 | int l = 0; 270 | if (PyArg_ParseTuple(args, "u#", &u, &l)) 271 | { 272 | PyObject *key = PyUnicode_EncodeUTF8(u, l, NULL); 273 | if (key) 274 | { 275 | ngram3bin *obj = (ngram3bin *)self; 276 | res = PyDict_GetItem(obj->worddict, key); 277 | } 278 | } 279 | if (!res) 280 | res = PyLong_FromLong(UNKNOWN_ID); 281 | Py_INCREF(res); 282 | return res; 283 | } 284 | 285 | static PyObject *ngram3binpy_id2word(PyObject *self, PyObject *args) 286 | { 287 | PyObject *res = NULL; 288 | ngram3bin *obj = (ngram3bin *)self; 289 | unsigned long id = 0; 290 | if (PyArg_ParseTuple(args, "i", &id)) 291 | { 292 | const char *word = ngramword_id2word(id, obj->word); 293 | if (word) 294 | res = PyUnicode_FromStringAndSize(word, 
strlen(word)); 295 | else 296 | res = PyErr_NewException("ngram3bin.Error", NULL, NULL); 297 | Py_INCREF(res); 298 | } 299 | return res; 300 | } 301 | 302 | static PyObject *ngram3binpy_id2freq(PyObject *self, PyObject *args) 303 | { 304 | PyObject *res = NULL; 305 | ngram3bin *obj = (ngram3bin *)self; 306 | unsigned long id = 0; 307 | if (PyArg_ParseTuple(args, "i", &id)) 308 | { 309 | if (id < obj->word.cnt) 310 | res = PyLong_FromUnsignedLong(obj->word.word[id].freq); 311 | else 312 | res = PyLong_FromLong(0); 313 | Py_INCREF(res); 314 | } 315 | return res; 316 | } 317 | 318 | /* 319 | * equivalent of id2freq(word2id(word)) 320 | */ 321 | static PyObject *ngram3binpy_wordfreq(PyObject *self, PyObject *args) 322 | { 323 | PyObject *res = NULL; 324 | ngram3bin *obj = (ngram3bin *)self; 325 | unsigned long id = 0; 326 | Py_UNICODE *u = NULL; 327 | int l = 0; 328 | if (PyArg_ParseTuple(args, "u#", &u, &l)) 329 | { 330 | PyObject *key = PyUnicode_EncodeUTF8(u, l, NULL); 331 | if (key) 332 | { 333 | res = PyDict_GetItem(obj->worddict, key); 334 | if (res) 335 | id = PyLong_AsLong(res); 336 | } 337 | } 338 | if (id < obj->word.cnt) 339 | res = PyLong_FromUnsignedLong(obj->word.word[id].freq); 340 | else 341 | res = PyLong_FromLong(0); 342 | Py_INCREF(res); 343 | return res; 344 | } 345 | 346 | /* 347 | * find frequency of (x,y,z) 348 | */ 349 | static PyObject *ngram3binpy_freq(PyObject *self, PyObject *args) 350 | { 351 | PyObject *res = NULL; 352 | ngram3bin *obj = (ngram3bin *)self; 353 | ngram3 find; 354 | find.id[2] = IMPOSSIBLE_ID; 355 | unsigned long freq = 0; 356 | if (PyArg_ParseTuple(args, "ii|i", find.id+0, find.id+1, find.id+2)) 357 | { 358 | if (find.id[2] == IMPOSSIBLE_ID) 359 | freq = ngram3bin_freq2(find, &obj->ngramap); 360 | else 361 | freq = ngram3bin_freq(find, &obj->ngramap); 362 | } 363 | res = PyLong_FromUnsignedLong(freq); 364 | Py_INCREF(res); 365 | return res; 366 | } 367 | 368 | /* 369 | * given the results of an ngram3_find() call, 370 | * import them into a python list of 4-tuples [(x,y,z,freq),...] 371 | */ 372 | static PyObject * ngram3_find_res2py(const ngram3 *f) 373 | { 374 | PyObject *res = PyList_New(0); 375 | Py_INCREF(res); 376 | if (f) 377 | { 378 | const ngram3 *c = f; 379 | while (c->freq) 380 | { 381 | PyObject *o, *t = PyTuple_New(4); 382 | int i; 383 | for (i = 0; i < 3; i++) 384 | { 385 | o = PyLong_FromUnsignedLong(c->id[i]); 386 | PyTuple_SetItem(t, i, o); 387 | Py_INCREF(o); 388 | } 389 | o = PyLong_FromUnsignedLong(c->freq); 390 | PyTuple_SetItem(t, 3, o); 391 | Py_INCREF(o); 392 | PyList_Append(res, t); 393 | Py_INCREF(t); 394 | c++; 395 | } 396 | } 397 | return res; 398 | } 399 | 400 | static PyObject *ngram3binpy_like(PyObject *self, PyObject *args) 401 | { 402 | PyObject *res = NULL; 403 | ngram3bin *obj = (ngram3bin *)self; 404 | ngram3 find; 405 | if (PyArg_ParseTuple(args, "iii", find.id+0, find.id+1, find.id+2)) 406 | { 407 | if (obj->ngramap.m) 408 | { 409 | ngram3 *f = ngram3bin_like_better(find, &obj->ngramap, &obj->ngramap_index); 410 | res = ngram3_find_res2py(f); 411 | free(f); 412 | } 413 | } 414 | else 415 | { 416 | res = PyList_New(0); 417 | } 418 | return res; 419 | } 420 | 421 | /* 422 | * given the results of an ngram3_follows() call, 423 | * import them into a python list of 2-tuples [(word_id,freq),...] 
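 * (the input array carries no explicit length; ngram3bin_follows() terminates
 * it with a sentinel entry whose freq field is 0, which is why the loop below
 * runs while c->freq is non-zero)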
424 | */ 425 | static PyObject * ngram3_follows_res2py(const ngram3 *f) 426 | { 427 | PyObject *res = PyList_New(0); 428 | Py_INCREF(res); 429 | if (f) 430 | { 431 | const ngram3 *c = f; 432 | while (c->freq) 433 | { 434 | PyObject *o, *t; 435 | t = PyTuple_New(2); 436 | o = PyLong_FromUnsignedLong(c->id[0]); 437 | PyTuple_SetItem(t, 0, o); 438 | Py_INCREF(o); 439 | o = PyLong_FromUnsignedLong(c->freq); 440 | PyTuple_SetItem(t, 1, o); 441 | Py_INCREF(o); 442 | PyList_Append(res, t); 443 | Py_INCREF(t); 444 | c++; 445 | } 446 | } 447 | return res; 448 | } 449 | 450 | static PyObject *ngram3binpy_follows(PyObject *self, PyObject *args) 451 | { 452 | PyObject *res = NULL; 453 | ngram3bin *obj = (ngram3bin *)self; 454 | ngram3 find; 455 | if (PyArg_ParseTuple(args, "i", find.id+0)) 456 | { 457 | if (obj->ngramap.m) 458 | { 459 | ngram3 *f = ngram3bin_follows(&find, &obj->ngramap); 460 | res = ngram3_follows_res2py(f); 461 | free(f); 462 | } 463 | } 464 | else 465 | { 466 | res = PyList_New(0); 467 | } 468 | return res; 469 | } 470 | 471 | -------------------------------------------------------------------------------- /data/corpus/google-ngrams/scratch/benchmark-str-to-id.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """ 4 | benchmark fastest implementation of key generating algorithm for tokens. 5 | gets called ~600 million times in CPU-bound task. 6 | 7 | >>> def doit1(): 8 | ... import string 9 | ... string.lower('Python') 10 | ... 11 | >>> import string 12 | >>> def doit2(): 13 | ... string.lower('Python') 14 | ... 15 | >>> import timeit 16 | >>> t = timeit.Timer(setup='from __main__ import doit1', stmt='doit1()') 17 | >>> t.timeit() 18 | 11.479144930839539 19 | >>> t = timeit.Timer(setup='from __main__ import doit2', stmt='doit2()') 20 | >>> t.timeit() 21 | 4.6661689281463623 22 | """ 23 | 24 | def id1(d, key): 25 | try: 26 | return d[key] 27 | except KeyError: 28 | cnt = len(d) 29 | d[key] = cnt 30 | return cnt 31 | 32 | def id2(d, key): 33 | ld = len(d) 34 | val = d.get(key, ld) 35 | if val == ld: 36 | d[key] = val 37 | return val 38 | 39 | def id3(d, key): 40 | if key in d: 41 | return d[key] 42 | else: 43 | cnt = len(d) 44 | d[key] = cnt 45 | return cnt 46 | 47 | Id = {} 48 | def id4(_, key): 49 | global Id 50 | if key in Id: 51 | return Id[key] 52 | else: 53 | cnt = len(Id) 54 | Id[key] = cnt 55 | return cnt 56 | 57 | from random import randint 58 | 59 | def foo(f): 60 | d = {} 61 | for _ in range(1,1000): 62 | f(d, randint(0, 20)) 63 | 64 | import timeit 65 | for n in range(1,5): 66 | print '%d:%s' % (n, timeit.Timer(setup='from __main__ import foo,id%d' % n, stmt='foo(id%d)' % n).timeit(number=1000)) 67 | 68 | -------------------------------------------------------------------------------- /data/corpus/google-ngrams/scratch/debug-multiprocessing-dict.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | 4 | """ 5 | a dict() must be shared between worker processes and whose contents, 6 | written by the workers, must be accessible after they are done. 7 | 8 | this is a contrived example exploring this. 
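note: a plain dict() would not work here -- each forked worker would only
mutate its own copy. the manager.dict() proxy below forwards writes back to
the manager process, so the parent still sees them after join().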
9 | """ 10 | 11 | import multiprocessing as mp 12 | import Queue 13 | from time import sleep 14 | manager = mp.Manager() 15 | Ids = manager.dict() 16 | Q = mp.Queue() 17 | 18 | # build queue 19 | for n in range(10): 20 | Q.put(n) 21 | 22 | # mirror my getid() function 23 | def setid(d, key, val): d[key] = val 24 | 25 | def worker(my_id, q, ids): 26 | while True: 27 | try: 28 | k = q.get(timeout=1) 29 | setid(ids, k, my_id) 30 | sleep(0.01) # let other worker have a chance 31 | except Queue.Empty: 32 | return 33 | 34 | W = [ mp.Process(target=worker, args=(i, Q, Ids)) 35 | for i in range(2) ] 36 | for w in W: w.start() 37 | for w in W: w.join() 38 | 39 | # figure out who did what 40 | from operator import itemgetter as ig 41 | from itertools import groupby 42 | for k,g in groupby(sorted(Ids.items(), key=ig(1)), key=ig(1)): 43 | print 'worker', k, 'wrote keys', [x[0] for x in g] 44 | 45 | -------------------------------------------------------------------------------- /data/corpus/google-ngrams/setup.py: -------------------------------------------------------------------------------- 1 | """ 2 | How To Use 3 | 4 | $ python setup.py build 5 | $ sudo python setup.py install 6 | $ time python3 -i < testbin.py 7 | 8 | """ 9 | 10 | from distutils.core import setup, Extension 11 | 12 | setup(name = 'ngram3bin', 13 | version = '1.0', 14 | ext_modules = [Extension('ngram3bin', ['ngram3bin.c','ngram3binpy.c'])]) 15 | -------------------------------------------------------------------------------- /data/corpus/google-ngrams/testbin.py: -------------------------------------------------------------------------------- 1 | 2 | # Usage: python3 -i testbin.py 3 | 4 | from ngram3bin import ngram3bin 5 | #ng = ngram3bin('xxx') # too few parameters 6 | ng = ngram3bin('word.bin','ngram3.bin') 7 | ng.word2id('freq') 8 | ng.word2id('FDWD#$#$@#@') 9 | list(map(ng.word2id, ['activities','as','buddhist'])) 10 | [ng.id2freq(ng.word2id(w)) for w in ['activities','as','buddhist']][:30] 11 | [ng.id2word(ng.word2id(w)) for w in ['activities','as','buddhist']][:30] 12 | ng.freq(4,22,215) 13 | ng.like(5,6,7) 14 | # convert to ids, search, convert back to words 15 | [(ng.id2word(x), ng.id2word(y), ng.id2word(z), freq) 16 | for x,y,z,freq in ng.like(*[ng.word2id(w) for w in ['activities','as','buddhist']])] 17 | ng.freq(1,2) 18 | #ng.like(3,4) 19 | print('idknow') 20 | idknow = ng.word2id('know') 21 | print('word(idknow)') 22 | ng.id2word(idknow) 23 | assert 'know' == ng.id2word(ng.word2id('know')) 24 | assert ng.id2freq(ng.word2id('know')) == ng.wordfreq('know') 25 | 26 | # "bridge" missing made find a bug 27 | print('id(bridge)=', ng.word2id('bridge')) 28 | print('id2freq(bridge)=', ng.id2freq(ng.word2id('bridge'))) 29 | print('wordfreq(bridge)=', ng.wordfreq('bridge')) 30 | 31 | # "didn" seems to be missing but shouldn't be... 
32 | print('wordfreq(didn)=', ng.wordfreq('didn')) 33 | 34 | [(w,ng.word2id(w)) for w in ['didn','t','know']] 35 | ng.freq(*[ng.word2id(w) for w in ['didn','t','know']]) 36 | 37 | # freq2 38 | (('didn','t'), ng.freq(*[ng.word2id(w) for w in ['didn','t']])) 39 | (('and','that'), ng.freq(*[ng.word2id(w) for w in ['and','that']])) 40 | (('a','mistake'), ng.freq(*[ng.word2id(w) for w in ['a','mistake']])) 41 | 42 | Test = [ 43 | 'am fond of', 44 | 'am found of', 45 | 'i now that', 46 | 'i know that', 47 | 'is now that', 48 | 'future would undoubtedly', 49 | 'it it did', 50 | 'if it did', 51 | 'and then it', 52 | 'the united states', 53 | 'cheese burger', 54 | 'cheeseburger', 55 | 'don t', 56 | "don ' t", 57 | 'don', 58 | 'dont', 59 | "don't", 60 | 'i was alluding', 61 | 'spill chick', 62 | 'spell check', 63 | 'spillchick', 64 | 'spellcheck', 65 | 'of the art', 66 | 'the - art', 67 | ] 68 | for s in Test: 69 | t = s.lower().split() 70 | ids = [ng.word2id(w) for w in t] 71 | frfunc = ng.freq if len(ids) > 1 else ng.id2freq 72 | print((t, 'freq:', frfunc(*ids), 'ids:', ids)) 73 | assert all(ng.id2word(ng.word2id(w)) == w or ng.word2id(w) == 0 for w in t) 74 | 75 | for foo in ['don','dont']: 76 | [(foo,ng.id2word(x), y) for x,y in ng.follows(ng.word2id(foo))[:100]] 77 | 78 | -------------------------------------------------------------------------------- /doc/algorithm.txt: -------------------------------------------------------------------------------- 1 | 2 | Goal: Maximize consistency of the language within a document. 3 | 4 | To do so we use an n-gram-based language model. 5 | 6 | We don't want to be too heavy-handed in our language model though; 7 | we want to incorporate local language use as well. 8 | 9 | We begin with a pre-fabricated sourced from an external "global" corpus, 10 | in this case we use Google Books' 3-ary n-grams. 11 | 12 | Upon initialization we incorporate the "local" corpus of documents into our 13 | language model, likely by parsing documents in the current and parent folders. 14 | 15 | It is this local model we should use first against new documents. This 16 | allows our checker to tailor its behavior to its environment, whether the 17 | documents are legal documents, school book reports, bad sci-fi novels, etc. 18 | 19 | http://en.wikipedia.org/wiki/Text_corpus 20 | http://en.wikipedia.org/wiki/Language_model#N-gram_models 21 | http://en.wikipedia.org/wiki/N-gram 22 | http://ngrams.googlelabs.com/datasets 23 | 24 | overhere -> overhear (x -> x') 25 | over,here -> over,here (x,y -> x,y) 26 | over,hear -> overhear (x,y -> x') 27 | i,now,the -> i,know,the (x,y,z -> x,y',z) 28 | than,you,very,much -> thank,you,very,much (x,y,z,zz -> x',y,z,zz) 29 | thank,yo -> thank,you (x,y -> x,y') 30 | 31 | Consider: 32 | fingerprinting words by content: hello = e:1,h:1,l:2,o:1 33 | 34 | Algorithm: 35 | AutoRevise(doc): 36 | Target the smallest, least-known ngrams first. 37 | List alternatives 38 | Begin with cheap, straight-forward, common alternatives and progress to more expensive/complex iff necessary 39 | Try to solve individual, unknown tokens first 40 | Preserve token boundaries (cheap) 41 | Edit distance 1, edit distance 2 42 | Phonetic similarities 43 | Disregard token boundaries (expensive) 44 | Parse all possible token sequences 45 | 46 | For each alternative 47 | Score its effectiveness by evaluating the complete repercussions 48 | Retain the best alternatives 49 | Propose revisions unobtrusively. 50 | Never modify without the user's permission. 
http://en.wikipedia.org/wiki/Cupertino_effect 51 | Record revision selection. 52 | Incorporate into future decisions. 53 | If revision is selected: 54 | Update document and all statistics/ngrams to reflect the change 55 | 56 | parse/load base corpus of target language 57 | parse/load local corpus 58 | 59 | calculate frequency of all ngrams 1..n 60 | sort ngrams on size:asc, freq:asc 61 | for ng in ngrams below some threshold: 62 | calculate feasible permutations for ng 63 | note: focus only on one area at a time, as the resulting change will modify the rest of the document 64 | for tok in ng: 65 | calculate list of permutations: spelling edits, pronunciation 66 | account for merging/splitting of tokens, etc. 67 | 68 | 69 | conduct re search : conduct research 70 | hitherehowareyou : hi there how are you 71 | 72 | 73 | 74 | -------------------------------------------------------------------------------- /doc/things-that-can-go-wrong-language-wise.txt: -------------------------------------------------------------------------------- 1 | 2 | How You Fuck Up How We Can Detect/Fix It 3 | ------------------------------- ---------------------------------------- 4 | 5 | word mis-spelling standard spellchecker 6 | resulting in a non-word with a dictionary (aspell, ispell, etc.) 7 | 'hello' -> 'helo' 8 | 9 | word mis-spelling ? 10 | resulting in another word try: word sequence mapping and levenshtein 11 | 'hello there' -> 'hell there' 12 | 13 | word transposition 14 | 'foo bar' -> 'bar foo' 15 | 16 | grammar screw up grammar checkers(?) 17 | various 18 | 'i am.' -> 'i is.' try: tense association mapping am/is/are 19 | 20 | homophone confusion 21 | '24 caret' try: map pronunciation 22 | '24 carrot' 23 | 'composed' -> 'come posed' 24 | 25 | botched idiom ? 26 | 'intents and purposes' -> try: idiom identification and word->pronunciation mapping 27 | 'intensive purposes' question: is this really any different thhan above? 28 | 29 | incorrect Proper Noun ? 30 | 'Mr. Johnson' -> 'Mr. Jonson' try: hmm, contextual proper noun mapping(?) 31 | 32 | slang/pronunciation 33 | 'hello' -> 'yello' 34 | 35 | word omission 36 | 'oops, i the word' 37 | 38 | -------------------------------------------------------------------------------- /src/algo.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | 4 | """ 5 | Goal: maximize self-consistency of a corpus of documents 6 | calculate frequency of all ngrams 1..n 7 | sort ngrams on freq:asc, size:asc 8 | for ng in ngrams below some threshold: 9 | calculate feasible permutations for ng 10 | note: focus only on one area at a time, as the resulting change will modify the rest of the document 11 | for tok in ng: 12 | calculate list of permutations: spelling edits, pronunciation 13 | account for merging/splitting of tokens, etc. 14 | """ 15 | 16 | import re 17 | from math import log,sqrt 18 | from collections import defaultdict 19 | from itertools import product 20 | 21 | def tokenize(text): return re.findall('[a-z]+', text.lower()) 22 | 23 | Freq = { 24 | 'does':1, 'it':1, 'use':1, 25 | 'i':1, 'know':1, 'right':1, 26 | 'fuck':1, 27 | 'conduct':2, 'research':2, 'search':2, 'con':1, 'duct':1, 28 | 'hi':3, 'there':2, 'hit':2, 'here':2, 'how':3, 'are':3, 'you':3, 29 | 'ho':1, 30 | 'a':3, 'them':2, 'anathema':1, 31 | } 32 | 33 | """ 34 | given a list of tokens, yield all possible permutations of joining two or more tokens together 35 | i.e. 
joins([a,b,c,d]) -> [[a,b,c,d],[a,b,cd],[a,bc,d],[ab,c,d],[a,bcd],[abc,d],[abcd]] 36 | 37 | AHA, i realize now that i'm simply trying to list sum permutations: 38 | i.e. joins([1,1,1,1]) -> [[1,1,1,1],[1,1,2],[1,2,1],[2,1,1],[1,3],[3,1],[4]] 39 | complexity: 2**(len(toks)-1) 40 | """ 41 | def joins(toks): 42 | if len(toks) < 2: 43 | yield toks 44 | else: 45 | for i in range(len(toks)): 46 | for j in range(i+1, len(toks)-i+1): 47 | pref = toks[:i] + [''.join(toks[i:i+j])] 48 | for suf in joins(toks[i+j:]): 49 | yield pref + suf 50 | 51 | """ 52 | find first substring str[x:y] where exists freq[str[x:y]] where y >= l 53 | return tuple (prefix before substring, the substring, the rest of the string) 54 | """ 55 | def nextword(str, ng1, l=1): 56 | for i in range(len(str)): 57 | for j in range(i+l, min(i+18, len(str))): 58 | if str[i:j] in ng1: 59 | return (str[:i], str[i:j], str[j:]) 60 | return (str,'','') 61 | 62 | """ 63 | given a string of one or more valid substring words, yield a list of permutations 64 | freq is a dict() of all recognized words in str 65 | """ 66 | def spl(str, ng1): 67 | if len(str) < 2: 68 | yield [str] 69 | else: 70 | i = 0 71 | while i <= len(str): 72 | pref,word,suf = nextword(str, ng1, i) 73 | #print((i,str,pref,word,suf)) 74 | if not word: 75 | #if i == 0 or freq.get(pref): 76 | # on subsequent loops we accumulate garbage non-word-suffixes 77 | yield [pref] 78 | break 79 | else: 80 | w = [] 81 | if pref: w.append(pref) 82 | w.append(word) 83 | for sufx in spl(suf, ng1): 84 | if sufx: 85 | yield w + sufx 86 | i += len(word) + 1 87 | 88 | """ 89 | given a list of tokens, yield all possible permutations via splitting 90 | """ 91 | def splits(toks, freq, g): 92 | score = dict() 93 | # list all possible substrings that are known words 94 | str = ''.join(toks) 95 | for i in range(len(str)+1): 96 | for j in range(i+1, len(str)+1): 97 | w = str[i:j] 98 | sc = freq.get(w, 0) 99 | if sc > 0: 100 | score[w] = sc 101 | print(' splits score=',score) 102 | 103 | # use ngrams to determine which words are seen next to each other; 104 | # use that information to more efficiently parse 105 | # find all permutations that contain at least one word 106 | 107 | ngrams = [] 108 | for x,y in product(score.keys(), score.keys()): 109 | # ensure adjacency and order 110 | xi = str.index(x) + len(x) 111 | if str[xi:xi+len(y)] != y: 112 | continue 113 | ng = (x,y) 114 | sc = g.freq(ng) 115 | if sc > 0: 116 | ngrams.append(ng) 117 | print(' splits ngrams=',ngrams) 118 | 119 | # all of the words that can begin an ngram 120 | ng1 = set([x for x,y in ngrams]) 121 | ng2 = set([y for x,y in ngrams]) 122 | 123 | def toks2ngrams(toks, size): 124 | size = min(len(toks), size) 125 | for ng in zip(*[toks[i:] for i in range(size)]): 126 | yield ng 127 | 128 | # sort splits by ngram score 129 | pop = [] 130 | for s in spl(str, ng1): 131 | freq = sum([g.freq(x) for x in toks2ngrams(s, 3)]) 132 | if freq: 133 | pop.append((tuple(s), freq)) 134 | pop = sorted(pop, key=lambda x:x[1], reverse=True) 135 | print(' splits() pop=',pop) 136 | for p,_ in pop: 137 | yield p 138 | 139 | def weight(tok): 140 | factor = 1 + len(tok) 141 | return round(Freq.get(tok,0) * factor, 1) 142 | 143 | def correct(str): 144 | toks = tokenize(str) 145 | """ 146 | j = frozenset(tuple(t) for t in joins(toks)) 147 | print('j=',j) 148 | """ 149 | s = list(splits(toks, Freq)) 150 | print('s=',s[:4]) 151 | js0 = list(s)# + list(j) 152 | js1 = [(k, sum(map(weight, k))) for k in js0] 153 | js2 = sorted(js1, key=lambda x:x[1], reverse=True) 
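	# js2 holds (token-tuple, summed token weight) pairs, best-scoring split first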
154 | print('js=',js2[:5]) 155 | guess = str 156 | if js2 != []: 157 | guess,gscore = js2[0] 158 | oscore = sum(map(weight, toks)) 159 | print('gscore=',gscore,'oscore=',oscore) 160 | if gscore > oscore * 2: # FIXME: there is no good way to do this 161 | guess = ' '.join(guess) 162 | else: 163 | guess = str 164 | return guess 165 | 166 | if __name__ == '__main__': 167 | Tests = [ 168 | 'iknowright : i know right', 169 | 'f u c k y o u : fuck you', 170 | 'xxxhowareyouxxx : xxx how are you xxx', 171 | 'con duct re search : conduct research', 172 | 'hitherehowareyou : hi there how are you', 173 | 'hithe re : hi there', 174 | 'anathema : anathema' # unlikely but valid word 175 | ] 176 | passcnt = 0 177 | for t in Tests: 178 | str,exp = t.strip().split(' : ') 179 | print(str) 180 | res = correct(str) 181 | if res == exp: 182 | passcnt += 1 183 | else: 184 | print('*** FAIL: %s -> %s (%s)' % (str,res,exp)) 185 | print('Tests %u/%u.' % (passcnt, len(Tests))) 186 | 187 | -------------------------------------------------------------------------------- /src/chick.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # ex: set ts=8 noet: 4 | # Copyright 2011 Ryan Flynn 5 | 6 | """ 7 | Word/grammar checking algorithm 8 | 9 | Phon ✕ Word ✕ NGramDiff ✕ Doc 10 | 11 | Facts 12 | * the corpus is not perfect. it contains errors. 13 | * not every valid ngram will exist in the corpus. 14 | * infrequent but valid ngrams are sometimes very similar to very frequent ones 15 | 16 | Mutations 17 | * insertion : additional item 18 | * duplication : correct item incorrectly number of times 19 | * split (its) -> (it,',s) 20 | * merge (miss,spelling) -> (misspelling) 21 | * deletion : item missing 22 | * transposition : correct items, incorrect order 23 | * letters wap 24 | 25 | TODO: 26 | * figure out how to handle apostrophes 27 | * pre-calculate token joining and merging 28 | 29 | """ 30 | 31 | from util import * 32 | from ngramdiff import TokenDiff,NGramDiff,NGramDiffScore 33 | 34 | import logging 35 | 36 | logger = logging.getLogger('spill-chick') 37 | hdlr = logging.FileHandler('/var/tmp/spill-chick.log') 38 | logger.addHandler(hdlr) 39 | logger.setLevel(logging.DEBUG) 40 | 41 | def handleError(self, record): 42 | raise 43 | logging.Handler.handleError = handleError 44 | 45 | from math import log 46 | from itertools import takewhile, dropwhile, product, cycle, chain 47 | from collections import defaultdict 48 | import bz2, sys, re, os 49 | import copy 50 | from word import Words,NGram3BinWordCounter 51 | from phon import Phon 52 | from gram import Grams 53 | from grambin import GramsBin 54 | from doc import Doc 55 | 56 | logger.debug('sys.version=' + sys.version) 57 | 58 | """ 59 | 60 | sentence: "if it did the future would undoubtedly be changed" 61 | 62 | "the future would" and "would undoubtedly be" have high scores, 63 | but the connector, "future would undoubtedly", has zero. 
64 | we need to be aware that every valid 3-gram will not be in our database, 65 | but that if the surrounding, overlapping ones are then it's probably ok 66 | 67 | sugg did the future 156 68 | sugg the future would 3162 69 | sugg future would undoubtedly 0 70 | sugg would undoubtedly be 3111 71 | sugg undoubtedly be changed 0 72 | 73 | sugg i did the 12284 74 | sugg it did the 4279 75 | sugg i did then 1654 76 | sugg it did then 690 77 | sugg i hid the 646 78 | sugg did the future 156 79 | sugg hid the future 38 80 | sugg aid the future 30 81 | sugg the future would 3162 82 | sugg the future world 2640 83 | sugg the future could 934 84 | sugg future wood and 0 85 | sugg future wood undoubtedly 0 86 | sugg future would and 0 87 | sugg future would undoubtedly 0 88 | sugg would undoubtedly be 3111 89 | sugg could undoubtedly be 152 90 | sugg undoubtedly be changed 0 91 | 92 | """ 93 | 94 | import inspect 95 | def lineno(): 96 | """Returns the current line number in our program.""" 97 | return inspect.currentframe().f_back.f_lineno 98 | 99 | # TODO: modify levenshtein to weight score based on what has changed; 100 | # - transpositions should count less than insertions/deletions 101 | # - changes near the front of the word should count more than the end 102 | # - for latin alphabets changes to vowels should count less than consonants 103 | def levenshtein(a,b): 104 | "Calculates the Levenshtein distance between a and b." 105 | n, m = len(a), len(b) 106 | if n > m: 107 | # Make sure n <= m, to use O(min(n,m)) space 108 | a,b = b,a 109 | n,m = m,n 110 | 111 | current = range(n+1) 112 | for i in range(1,m+1): 113 | previous, current = current, [i]+[0]*n 114 | for j in range(1,n+1): 115 | add, delete = previous[j]+1, current[j-1]+1 116 | change = previous[j-1] 117 | if a[j-1] != b[i-1]: 118 | change = change + 1 119 | current[j] = min(add, delete, change) 120 | 121 | return current[n] 122 | 123 | def list2ngrams(l, size): 124 | """ 125 | split l into overlapping ngrams of size 126 | [x,y,z] -> [(x,y),(y,z)] 127 | """ 128 | if size >= len(l): 129 | return [tuple(l)] 130 | return [tuple(l[i:i+size]) for i in range(len(l)-size+1)] 131 | 132 | class Chick: 133 | def __init__(self): 134 | # initialize all "global" data 135 | logger.debug('loading...') 136 | logger.debug(' corpus...') 137 | # FIXME: using absolute paths is the easiest way to make us work from cmdline and invoked 138 | # in a web app. perhaps we could set up softlinks in /var/ to make this slightly more respectable. 
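		# (one possible alternative, not implemented: resolve the data files relative
		#  to this module, e.g. os.path.join(os.path.dirname(__file__), '..', 'data',
		#  'corpus', 'google-ngrams', 'word.bin'), or take a base directory from an
		#  environment variable.)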
139 | self.g = GramsBin( 140 | '/home/pizza/proj/spill-chick/data/corpus/google-ngrams/word.bin', 141 | '/home/pizza/proj/spill-chick/data/corpus/google-ngrams/ngram3.bin') 142 | self.w = Words(NGram3BinWordCounter(self.g.ng)) 143 | logger.debug(' phon') 144 | self.p = Phon(self.w, self.g) 145 | logger.debug('done.') 146 | # sanity-check junk 147 | """ 148 | logger.debug('w.correct(naieve)=%s' % self.w.correct(u'naieve')) 149 | logger.debug('w.correct(refridgerator)=%s' % self.w.correct(u'refridgerator')) 150 | logger.debug('g.freqs(refridgerator)=%s' % self.g.freqs(u'refridgerator')) 151 | logger.debug('g.freqs(refrigerator)=%s' % self.g.freqs(u'refrigerator')) 152 | logger.debug('g.freq((didn))=%s' % self.g.freq((u'didn',))) 153 | logger.debug('g.freq((a,mistake))=%s' % self.g.freq((u'a',u'mistake'))) 154 | logger.debug('g.freq((undoubtedly,be,changed))=%s' % self.g.freq((u'undoubtedly',u'be',u'changed'))) 155 | logger.debug('g.freq((undoubtedly,be))=%s' % self.g.freq((u'undoubtedly',u'be'))) 156 | logger.debug('g.freq((be,changed))=%s' % self.g.freq((u'be',u'changed'))) 157 | logger.debug('g.freq((it,it,did))=%s' % self.g.freq((u'it',u'it',u'did'))) 158 | logger.debug('g.freq((it,it))=%s' % self.g.freq((u'it',u'it'))) 159 | logger.debug('g.freq((it,did))=%s' % self.g.freq((u'it',u'did'))) 160 | logger.debug('g.freq((hello,there,sir))=%s' % self.g.freq((u'hello',u'there',u'sir'))) 161 | logger.debug('g.freq((hello,there))=%s' % self.g.freq((u'hello',u'there'))) 162 | logger.debug('g.freq((hello,there,,))=%s' % self.g.freq((u'hello',u'there',u','))) 163 | logger.debug('g.freq((they,\',re))=%s' % self.g.freq((u'they',u"'",u're'))) 164 | """ 165 | 166 | # FIXME: soundsToWords is expensive and should only be run as a last resort 167 | def phonGuess(self, toks, minfreq): 168 | """ 169 | given a list of tokens search for a list of words with similar pronunciation 170 | having g.freq(x) > minfreq 171 | """ 172 | # create a phonetic signature of the ngram 173 | phonsig = self.p.phraseSound(toks) 174 | logger.debug('phonsig=%s' % phonsig) 175 | phonwords = list(self.p.soundsToWords(phonsig)) 176 | logger.debug('phonwords=%s' % (phonwords,)) 177 | if phonwords == [[]]: 178 | phonpop = [] 179 | else: 180 | # remove any words that do not meet the minimum frequency; 181 | # they cannot possibly be part of the answer 182 | phonwords2 = [[[w for w in p if self.g.freq(tuple(w)) > minfreq] 183 | for p in pw] 184 | for pw in phonwords] 185 | logger.debug('phonwords2 lengths=%s product=%u' % \ 186 | (' '.join([str(len(p)) for p in phonwords2[0]]), 187 | reduce(lambda x,y:x*y, [len(p) for p in phonwords2[0]]))) 188 | if not all(phonwords2): 189 | return [] 190 | #logger.debug('phonwords2=(%u)%s...' % (len(phonwords2), phonwords2[:10],)) 191 | # remove any signatures that contain completely empty items after previous 192 | phonwords3 = phonwords2 193 | #logger.debug('phonwords3=(%u)%s...' % (len(phonwords3), phonwords3)) 194 | # FIXME: product() function is handy in this case but is potentially hazardous. 195 | # we should force a limit to the length of any list passed to it to ensure 196 | # the avoidance of any pathological, memory-filling, swap-inducing behavior 197 | phonwords4 = list(flatten([list(product(*pw)) for pw in phonwords3])) 198 | logger.debug('phonwords4=(%u)%s...' 
% (len(phonwords4), phonwords4[:20])) 199 | # look up ngram popularity, toss anything not more popular than original and sort 200 | phonwordsx = [tuple(flatten(p)) for p in phonwords4] 201 | 202 | phonpop = rsort1([(pw, self.g.freq(pw, min)) for pw in phonwordsx]) 203 | #logger.debug('phonpop=(%u)%s...' % (len(phonpop), phonpop[:10])) 204 | phonpop = list(takewhile(lambda x:x[1] > minfreq, phonpop)) 205 | #logger.debug('phonpop=%s...' % (phonpop[:10],)) 206 | if phonpop == []: 207 | return [] 208 | best = phonpop[0][0] 209 | return [[x] for x in best] 210 | 211 | """ 212 | return a list of ngrampos permutations where each token has been replaced by a word with 213 | similar pronunciation, and g.freqs(word) > minfreq 214 | """ 215 | def permphon(self, ngrampos, minfreq): 216 | perms = [] 217 | for i in range(len(ngrampos)): 218 | prefix = ngrampos[:i] 219 | suffix = ngrampos[i+1:] 220 | tokpos = ngrampos[i] 221 | tok = tokpos[0] 222 | sounds = self.p.word[tok] 223 | if not sounds: 224 | continue 225 | #logger.debug('tok=%s sounds=%s' % (tok, sounds)) 226 | for sound in sounds: 227 | soundslikes = self.p.phon[sound] 228 | #logger.debug('tok=%s soundslikes=%s' % (tok, soundslikes)) 229 | for soundslike in soundslikes: 230 | if len(soundslike) > 1: 231 | continue 232 | soundslike = soundslike[0] 233 | if soundslike == tok: 234 | continue 235 | #logger.debug('soundslike %s -> %s' % (tok, soundslike)) 236 | if self.g.freqs(soundslike) <= minfreq: 237 | continue 238 | newtok = (soundslike,) + tokpos[1:] 239 | damlev = damerau_levenshtein(tok, soundslike) 240 | td = TokenDiff([tokpos], [newtok], damlev) 241 | perms.append(NGramDiff(prefix, td, suffix, self.g, soundalike=True)) 242 | return perms 243 | 244 | @staticmethod 245 | def ngrampos_merge(x, y): 246 | return (x[0]+y[0], x[1], x[2], x[3]) 247 | 248 | def permjoin(self, l, minfreq): 249 | """ 250 | given a list of strings, produce permutations by joining two tokens together 251 | example [a,b,c,d] -> [[ab,c,d],[a,bc,d],[a,b,cd] 252 | """ 253 | perms = [] 254 | if len(l) > 1: 255 | for i in range(len(l)-1): 256 | joined = Chick.ngrampos_merge(l[i],l[i+1]) 257 | if self.g.freqs(joined[0]) > minfreq: 258 | td = TokenDiff(l[i:i+2], [joined], 1) 259 | ngd = NGramDiff(l[:i], td, l[i+2:], self.g) 260 | perms.append(ngd) 261 | return perms 262 | 263 | @staticmethod 264 | def ngrampos_split_back(x, y): 265 | return (x[0]+y[0][:1], x[1], x[2], x[3]), (y[0][1:], y[1], y[2], y[3]) 266 | 267 | @staticmethod 268 | def ngrampos_split_forward(x, y): 269 | return (x[0][:-1], x[1], x[2], x[3]), (x[0][-1:]+y[0], y[1], y[2], y[3]) 270 | 271 | def intertoken_letterswap(self, l, target_freq): 272 | # generate permutations of token list with the beginning and ending letter of each 273 | # token swapped between adjacent tokens 274 | if len(l) < 2: 275 | return [] 276 | perms = [] 277 | for i in range(len(l)-1): 278 | if len(l[i][0]) > 1: 279 | x,y = Chick.ngrampos_split_forward(l[i], l[i+1]) 280 | if self.g.freq((x[0],y[0])) >= target_freq: 281 | td = TokenDiff(l[i:i+2], [x,y], 0) 282 | ngd = NGramDiff(l[:i], td, l[i+2:], self.g) 283 | perms.append(ngd) 284 | if len(l[i+1][0]) > 1: 285 | x,y = Chick.ngrampos_split_back(l[i], l[i+1]) 286 | if self.g.freq((x[0],y[0])) >= target_freq: 287 | td = TokenDiff(l[i:i+2], [x,y], 0) 288 | ngd = NGramDiff(l[:i], td, l[i+2:], self.g) 289 | perms.append(ngd) 290 | #print 'intertoken_letterswap=',perms 291 | return perms 292 | 293 | def do_suggest(self, target_ngram, target_freq, ctx, d, max_suggest=5): 294 | """ 295 | given an 
infrequent ngram from a document, attempt to calculate a more frequent one 296 | that is similar textually and/or phonetically but is more frequent 297 | """ 298 | 299 | target_ngram = list(target_ngram) 300 | part = [] 301 | 302 | # permutations via token joining 303 | # expense: cheap, though rarely useful 304 | # TODO: smarter token joining; pre-calculate based on tokens 305 | part += self.permjoin(target_ngram, target_freq) 306 | #logger.debug('permjoin(%s)=%s' % (target_ngram, part,)) 307 | 308 | part += self.intertoken_letterswap(target_ngram, target_freq) 309 | 310 | part += self.permphon(target_ngram, target_freq) 311 | 312 | part += self.g.ngram_like(target_ngram, target_freq) 313 | 314 | logger.debug('part after ngram_like=(%u)%s...' % (len(part), part[:5],)) 315 | 316 | # calculate the closest, best ngram in part 317 | sim = sorted([NGramDiffScore(ngd, self.p) for ngd in part]) 318 | for s in sim[:25]: 319 | logger.debug('sim %4.1f %2u %u %6u %6u %s' % \ 320 | (s.score, s.ediff, s.sl, s.ngd.oldfreq, s.ngd.newfreq, ' '.join(s.ngd.newtoks()))) 321 | 322 | best = list(takewhile(lambda s:s.score > 0, sim))[:max_suggest] 323 | for b in best: 324 | logger.debug('best %s' % (b,)) 325 | return best 326 | 327 | def ngram_suggest(self, target_ngram, target_freq, d, max_suggest=1): 328 | """ 329 | we calculate ngram context and collect solutions for each context 330 | containing the target, then merge them into a cohesive, best suggestion. 331 | c d e 332 | a b c d e f g 333 | given ngram (c,d,e), calculate context and solve: 334 | [S(a,b,c), S(b,c,d), S(c,d,e), S(d,e,f), S(e,f,g)] 335 | """ 336 | 337 | logger.debug('target_ngram=%s' % (target_ngram,)) 338 | tlen = len(target_ngram) 339 | 340 | context = list(d.ngram_context(target_ngram, tlen)) 341 | logger.debug('context=%s' % (context,)) 342 | ctoks = [c[0] for c in context] 343 | clen = len(context) 344 | 345 | logger.debug('tlen=%d clen=%d' % (tlen, clen)) 346 | context_ngrams = list2ngrams(context, tlen) 347 | logger.debug('context_ngrams=%s' % (context_ngrams,)) 348 | 349 | # gather suggestions for each ngram overlapping target_ngram 350 | sugg = [(ng, self.do_suggest(ng, self.g.freq([x[0] for x in ng]), context_ngrams, d)) 351 | for ng in [target_ngram]] #context_ngrams] 352 | 353 | for ng,su in sugg: 354 | for s in su: 355 | logger.debug('sugg %s' % (s,)) 356 | 357 | """ 358 | previously we leaned heavily on ngram frequencies and the sums of them for 359 | evaluating suggestions in context. 360 | instead, we will focus specifically on making the smallest changes which have the 361 | largest improvements, and in trying to normalize a document, i.e. 362 | "filling in the gaps" of as many 0-freq ngrams as possible. 363 | """ 364 | 365 | # merge suggestions based on what they change 366 | realdiff = {} 367 | for ng,su in sugg: 368 | for s in su: 369 | rstr = ' '.join(s.ngd.newtoks()) 370 | if rstr in realdiff: 371 | realdiff[rstr] += s 372 | else: 373 | realdiff[rstr] = s 374 | logger.debug('real %s %s' % (rstr, realdiff[rstr])) 375 | 376 | # sort the merged suggestions based on their combined score 377 | rdbest = sorted(realdiff.values(), key=lambda x:x.score, reverse=True) 378 | 379 | # finally, allow frequency to overcome small differences in score, but only 380 | # for scores that are within 1 to begin with. 
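		# (implementation note: the cmp-style list.sort() used below is Python 2
		#  only; a Python 3 port would need functools.cmp_to_key or a key function.)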
381 | # if we account for frequency too much the common language idioms always crush 382 | # valid but less common phrases; if we don't account for frequency at all we often 383 | # recommend very similar but uncommon and weird phrases. this attempts to strike a balance. 384 | rdbest.sort(lambda x,y: 385 | y.score - x.score if abs(x.score - y.score) > 1 \ 386 | else (y.score + int(log(y.ngd.newfreq))) - \ 387 | (x.score + int(log(x.ngd.newfreq)))) 388 | 389 | for ngds in rdbest: 390 | logger.debug('best %s' % (ngds,)) 391 | 392 | return rdbest 393 | 394 | def suggest(self, txt, max_suggest=1, skip=[]): 395 | """ 396 | given a string, run suggest() and apply the first suggestion 397 | """ 398 | logger.debug('Chick.suggest(txt=%s max_suggest=%s, skip=%s)' % (txt, max_suggest, skip)) 399 | 400 | d = Doc(txt, self.w) 401 | logger.debug('doc=%s' % d) 402 | 403 | """ 404 | locate uncommon n-gram sequences which may indicate grammatical errors 405 | see if we can determine better replacements for them given their context 406 | """ 407 | 408 | # order n-grams by unpopularity 409 | ngsize = min(3, d.totalTokens()) 410 | logger.debug('ngsize=%s d.totalTokens()=%s' % (ngsize, d.totalTokens())) 411 | logger.debug('ngram(1) freq=%s' % list(d.ngramfreqctx(self.g,1))) 412 | 413 | # locate the least-common ngrams 414 | # TODO: in some cases an ngram is unpopular, but overlapping ngrams on either side 415 | # are relatively popular. 416 | # is this useful in differentiating between uncommon but valid phrases from invalid ones? 417 | """ 418 | sugg did the future 156 419 | sugg the future would 3162 420 | sugg future would undoubtedly 0 421 | sugg would undoubtedly be 3111 422 | sugg undoubtedly be changed 0 423 | """ 424 | 425 | least_common = sort1(d.ngramfreqctx(self.g, ngsize)) 426 | logger.debug('least_common=%s' % least_common[:20]) 427 | # remove any ngrams present in 'skip' 428 | least_common = list(dropwhile(lambda x: x[0] in skip, least_common)) 429 | # filter ngrams containing numeric tokens or periods, they generate too many poor suggestions 430 | least_common = list(filter( 431 | lambda ng: not any(re.match('^(?:\d+|\.)$', n[0][0], re.U) 432 | for n in ng[0]), 433 | least_common)) 434 | 435 | # FIXME: limit to reduce work 436 | least_common = least_common[:max(20, len(least_common)/2)] 437 | 438 | # gather all suggestions for all least_common ngrams 439 | suggestions = [] 440 | for target_ngram,target_freq in least_common: 441 | suggs = self.ngram_suggest(target_ngram, target_freq, d, max_suggest) 442 | if suggs: 443 | suggestions.append(suggs) 444 | 445 | if not suggestions: 446 | """ 447 | """ 448 | ut = list(d.unknownToks()) 449 | logger.debug('unknownToks=%s' % ut) 450 | utChanges = [(u, (self.w.correct(u[0]), u[1], u[2], u[3])) for u in ut] 451 | logger.debug('utChanges=%s' % utChanges) 452 | utChanges2 = list(filter(lambda x: x not in skip, utChanges)) 453 | for old,new in utChanges2: 454 | td = TokenDiff([old], [new], damerau_levenshtein(old[0], new[0])) 455 | ngd = NGramDiff([], td, [], self.g) 456 | ngds = NGramDiffScore(ngd, None, 1) 457 | suggestions.append([ngds]) 458 | 459 | logger.debug('------------') 460 | logger.debug('suggestions=%s' % (suggestions,)) 461 | suggs = filter(lambda x:x and x[0].ngd.newfreq != x[0].ngd.oldfreq, suggestions) 462 | logger.debug('suggs=%s' % (suggs,)) 463 | # sort suggestions by their score, highest first 464 | bestsuggs = rsort(suggs, key=lambda x: x[0].score) 465 | # by total new frequency... 
466 | bestsuggs = rsort(bestsuggs, key=lambda x: x[0].ngd.newfreq) 467 | # then by improvement pct. for infinite improvements this results in 468 | # the most frequent recommendation coming to the top 469 | bestsuggs = rsort(bestsuggs, key=lambda x: x[0].improve_pct()) 470 | 471 | # finally, allow frequency to overcome small differences in score, but only 472 | # for scores that are within 1 to begin with. 473 | # if we account for frequency too much the common language idioms always crush 474 | # valid but less common phrases; if we don't account for frequency at all we often 475 | # recommend very similar but uncommon and weird phrases. this attempts to strike a balance. 476 | """ 477 | bestsuggs.sort(lambda x,y: 478 | x[0].score - y[0].score if abs(x[0].score - y[0].score) > 1 \ 479 | else \ 480 | (y[0].score + int(log(y[0].ngd.newfreq))) - \ 481 | (x[0].score + int(log(x[0].ngd.newfreq)))) 482 | """ 483 | 484 | for bs in bestsuggs: 485 | for bss in bs: 486 | logger.debug('bestsugg %6.2f %2u %2u %7u %6.0f%% %s' % \ 487 | (bss.score, bss.ediff, bss.ngd.diff.damlev, 488 | bss.ngd.newfreq, bss.improve_pct(), ' '.join(bss.ngd.newtoks()))) 489 | 490 | for bs in bestsuggs: 491 | logger.debug('> bs=%s' % (bs,)) 492 | yield bs 493 | 494 | # TODO: now the trick is to a) associate these together based on target_ngram 495 | # to make them persist along with the document 496 | # and to recalculate them as necessary when a change is applied to the document that 497 | # affects anything they overlap 498 | 499 | def correct(self, txt): 500 | """ 501 | given a string, identify the least-common n-gram not present in 'skip' 502 | and return a list of suggested replacements 503 | """ 504 | d = Doc(txt, self.w) 505 | changes = list(self.suggest(d, 1)) 506 | for ch in changes: 507 | logger.debug('ch=%s' % (ch,)) 508 | change = [ch[0].ngd] 509 | logger.debug('change=%s' % (change,)) 510 | d.applyChanges(change) 511 | logger.debug('change=%s after applyChanges d=%s' % (change, d)) 512 | d = Doc(d, self.w) 513 | break # FIXME: loops forever 514 | changes = list(self.suggest(d, 1)) 515 | res = str(d).decode('utf8') 516 | logger.debug('correct res=%s %s' % (type(res),res)) 517 | return res 518 | 519 | -------------------------------------------------------------------------------- /src/corpus.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | 4 | """ 5 | 6 | """ 7 | 8 | from collections import defaultdict 9 | import re 10 | import sys 11 | import traceback 12 | import os 13 | from gram import Grams 14 | # NOTE: tried subprocess module but doesn't seem to be able to do per-line output... 15 | 16 | def shell_escape(str): 17 | return str.replace(' ', '\\ ').replace("'", "\\'") 18 | 19 | def cat_cmd(filename): 20 | l = filename.lower() 21 | if l.endswith('.bz2'): 22 | return 'bzcat %s' % (shell_escape(filename),) 23 | elif l.endswith('.tar.gz') or l.endswith('.tgz'): 24 | return 'zcat %s | tar xfO -' % (shell_escape(filename),) 25 | else: 26 | return 'cat %s' % (shell_escape(filename),) 27 | 28 | def corpus(name='gutenberg'): 29 | dir = '../data/corpus/'+name+'/' 30 | for file in os.popen('ls ' + dir + '|head -n 3', 'r'): 31 | file = file.strip() 32 | print('%s...' 
% (file,)) 33 | p = os.popen(cat_cmd(dir+file), 'r') 34 | yield p 35 | 36 | # wikipedia markup filter generator 37 | class wikipedia_lines: 38 | def __init__(self, p): 39 | self.p = p 40 | def __iter__(self): 41 | for line in self.p: 42 | # find article start 43 | for line in self.p: 44 | if '' in line: 47 | # go until article end 48 | for line in self.p: 49 | if '' in line: 50 | break 51 | # FIXME: this regex crap is 90% of our processing time 52 | line = re.sub('</?ref.*?>?', '', line) # ref crap 53 | line = re.sub('{{.*(?:}})?', '', line) # citation crap 54 | line = re.sub('!--.*?--', '', line) # comments 55 | line = re.sub('\[\[.*]]', '', line) # interior link 56 | line = re.sub('&\S+;?', '', line) # entity crap 57 | line = re.sub('&\w+;?|!--.*?--|.*}}', '', line) # &entity; 58 | line = re.sub("''wikt:(.*?)''", '\\1', line) # wiktionary link 59 | line = re.sub('\[http.*?]', '', line) # exterior link 60 | line = re.sub('(?:File|Image|Category):\S+', '', line) # exterior link 61 | #line = re.sub('.*}}', '', line) # multi-line citation 62 | if re.match('^[a-z]{2,3}:\S+', line): 63 | continue 64 | line = line.strip() 65 | if line == '' or line[0] == '|' or line[0] == '!' or line[0] == '{' or ']]' in line: 66 | continue 67 | #print(line) 68 | yield line 69 | 70 | def corpus_wikipedia(): 71 | p = os.popen('bzcat ../data/corpus/enwiki-latest-pages-articles.xml.bz2 2>/dev/null | head -n 500000', 'r') 72 | yield wikipedia_lines(p) 73 | 74 | class email_lines: 75 | def __init__(self, p): 76 | self.p = p 77 | def __iter__(self): 78 | for line in self.p: 79 | if line.startswith('X-') or \ 80 | line.startswith('=09') or \ 81 | re.match('^(Content-Transfer-Encoding|Message-ID|Date|From|To|Subject|Cc|Mime-Version|Content-Type|Bcc):', line): 82 | continue 83 | yield line 84 | 85 | def corpus_enron(): 86 | p = os.popen('zcat ../data/corpus/enron_mail_20110402.tgz | tar xfO - 2>/dev/null', 'r') 87 | yield email_lines(p) 88 | 89 | def parse_corpus(c): 90 | g = Grams() 91 | for p in c: 92 | g.add(p) 93 | return g 94 | 95 | def ngram_match(tok, w2id, ngrams): 96 | if tok not in w2id: 97 | return [] 98 | id = w2id[tok] 99 | print('%s -> %s' % (tok, id)) 100 | return [n for n in ngrams.keys() if id in n] 101 | 102 | import pickle 103 | 104 | if __name__ == '__main__': 105 | f = ['a b c','d e f'] 106 | g = Grams(f) 107 | print(g) 108 | 109 | w = parse_corpus(corpus_enron()) 110 | pop = sorted(w.ngrams.items(), key=lambda x:x[1], reverse=True)[:200] 111 | popw = [(tuple(w.id2w[id] for id in n),cnt) for n,cnt in pop] 112 | print(popw) 113 | print('len(pickle(w2id))=%s' % (len(pickle.dumps(w.w2id)),)) 114 | 115 | #print([tuple(id2w[id] for id in ng) for ng in ngram_match('the', w2id, ngrams)[:100]]) 116 | 117 | -------------------------------------------------------------------------------- /src/doc.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | 4 | """ 5 | Doc represents a document being checked against existing Grams 6 | """ 7 | 8 | import collections 9 | import unittest 10 | import math 11 | import gram 12 | from ngramdiff import TokenDiff,NGramDiff,NGramDiffScore 13 | import copy 14 | 15 | import logging 16 | logger = logging.getLogger('spill-chick') 17 | 18 | """ 19 | Tokenized contents of a single file 20 | Tokens associated with positional data to faciliate changes 21 | """ 22 | class Doc: 23 | 24 | def __init__(self, f, w): 25 | self.words = w # global tokens 26 | #self.docwords = collections.Counter() # 
local {token:freq} 27 | self.tokenize(f) 28 | 29 | self.sugg = [] # list of suggestions by [line][ngram]suggestions... 30 | # suggs are aligned with fixed-size ngrams, so for ngrams size 3 31 | # sugg[0][0] refers to ngram of line=0 tokens[0,1,2] 32 | # lines that have fewer than ngram len tokens are ignored at this time 33 | 34 | def __str__(self): 35 | return unicode(self).encode('utf-8') 36 | 37 | def __unicode__(self): 38 | s = unicode('\n'.join(self.lines)) 39 | return s 40 | 41 | def __repr__(self): 42 | return 'Doc(%s)' % str(self) 43 | 44 | def __iter__(self): 45 | return iter(self.lines) 46 | 47 | def tokenize(self, f): 48 | self.lines = [] 49 | self.tok = [] 50 | for lcnt,line in enumerate(f): 51 | self.lines.append(line) 52 | line = line.lower() # used for index below 53 | toks = gram.tokenize(line) 54 | if toks and toks[-1] == '\n': 55 | toks.pop() 56 | #self.docwords.update(toks) # add words to local dictionary 57 | tpos = 0 58 | ll = [] 59 | for t in toks: 60 | tpos = line.index(t, tpos) 61 | ll.append((t, lcnt, len(ll), tpos)) 62 | tpos += len(t) 63 | self.tok.append(ll) 64 | 65 | def totalTokens(self): 66 | return sum(len(ts) for ts in self.tok) 67 | #return sum(self.docwords.values()) 68 | 69 | def unknownToks(self): 70 | for tok in self.tok: 71 | for t in tok: 72 | if self.words.freq(t[0]) == 0: 73 | yield t 74 | 75 | # given token t supply surrounding token ngram (x, tok, y) 76 | def surroundTok(self, t): 77 | line = self.tok[t[1]] 78 | idx = line.index(t) 79 | if idx > 0 and idx < len(line)-1: 80 | return tuple(line[idx-1:idx+2]) 81 | return None 82 | 83 | def ngrams(self, size): 84 | for tok in self.tok: 85 | for i in range(0, len(tok)+1-size): 86 | yield tuple(tok[i:i+size]) 87 | 88 | def ngramfreq(self, g, size): 89 | for ng in self.ngrams(size): 90 | ng2 = tuple(t[0] for t in ng) 91 | yield (ng, g.freq(ng2)) 92 | 93 | def ngramfreqctx(self, g, size): 94 | """ 95 | return each ngram in document, and the sum of the frequencies 96 | of all overlapping ngrams 97 | """ 98 | for toks in self.tok: 99 | if not toks: 100 | continue 101 | ngs = [tuple(t[0] for t in toks[i:i+size]) 102 | for i in range(max(1, len(toks)-size+1))] 103 | for i in range(len(ngs)): 104 | ctx = ngs[max(0,i-size-1):i+size] 105 | freq = sum(map(g.freq,ctx)) / len(ctx) 106 | yield (toks[i:i+size], freq) 107 | 108 | def ngram_prev(self, ngpos): 109 | _,line,index,_ = ngpos 110 | if index == 0: 111 | if line == 0: 112 | return None 113 | line -= 1 114 | while line >= 0 and self.tok[line] == []: 115 | line -= 1 116 | if line == -1: 117 | return None 118 | index = len(self.tok[line]) - 1 119 | else: 120 | index -= 1 121 | if index >= len(self.tok[line]): 122 | # if the first line is empty we need this 123 | return None 124 | return self.tok[line][index] 125 | 126 | def ngram_next(self, ngpos): 127 | _,line,index,_ = ngpos 128 | if line >= len(self.tok): 129 | return None 130 | if index >= len(self.tok[line]): 131 | line += 1 132 | while line < len(self.tok) and self.tok[line] == []: 133 | line += 1 134 | if line >= len(self.tok): 135 | return None 136 | index = 0 137 | else: 138 | index += 1 139 | if index >= len(self.tok[line]): 140 | # if the last line is empty we need this 141 | return None 142 | return self.tok[line][index] 143 | 144 | def ngram_context(self, ngpos, size): 145 | """ 146 | given an ngram and a size, return a list of ngrams that contain 147 | one or more members of ngram 148 | c d e 149 | a b c d e f g 150 | """ 151 | before, ng = [], ngpos[0] 152 | for i in range(size-1): 153 | ng = 
self.ngram_prev(ng) 154 | if not ng: 155 | break 156 | before.insert(0, ng) 157 | after, ng = [], ngpos[-1] 158 | for i in range(size-1): 159 | ng = self.ngram_next(ng) 160 | if not ng: 161 | break 162 | after.append(ng) 163 | return before + list(ngpos) + after 164 | 165 | @staticmethod 166 | def matchCap(x, y): 167 | """ 168 | Modify replacement word 'y' to match the capitalization of existing word 'x' 169 | (foo,bar) -> bar 170 | (Foo,bar) -> Bar 171 | (FOO,bar) -> BAR 172 | """ 173 | if x == x.lower(): 174 | return y 175 | elif x == x.capitalize(): 176 | return y.capitalize() 177 | elif x == x.upper(): 178 | return y.upper() 179 | return y 180 | 181 | def applyChange(self, lines, ngd, off): 182 | """ 183 | given an ngram containing position data, replace corresponding data 184 | in lines with 'mod' 185 | """ 186 | d = ngd.diff # ngd.diff=TokenDiff(([(u'cheese', 0, 2, 9), (u'burger', 0, 3, 16)],[(u'cheeseburger', 0, 2, 9)])) 187 | # FIXME: deal with insertion 188 | # FIXME: treat new/old as separate sequences, instead of 1-to-1-ish 189 | old = copy.deepcopy(d.old) 190 | for mod in d.newtoks(): 191 | #print 'ngd.diff=%s' % (ngd.diff,) 192 | o,l,idx,pos = old.pop(0) 193 | pos += off[l] 194 | end = pos + len(o) 195 | #print 'o=%s l=%s idx=%s pos=%s end=%s old=%s' % (o,l,idx,pos,end,old) 196 | ow = lines[l][pos:end] 197 | if not mod and pos > 0 and lines[l][pos-1] in (' ','\t','\r','\n'): 198 | # if we've removed a token and it was preceded by whitespace, 199 | # nuke that whitespace as well 200 | pos -= 1 201 | cap = Doc.matchCap(ow, mod) 202 | #print 'cap=%s' % (cap,) 203 | lines[l] = lines[l][:pos] + cap + lines[l][end:] 204 | off[l] += len(cap) - len(o) 205 | # FIXME: over-simplified; consider multi-token change 206 | #self.docwords[ow] -= 1 207 | if mod: 208 | pass 209 | #self.docwords[mod] += 1 210 | return (lines, off) 211 | 212 | def demoChanges(self, changes): 213 | """ 214 | given a list of positional ngrams and a list of replacements, 215 | apply the changes and return a copy of the updated file 216 | """ 217 | logger.debug('Doc.demoChanges changes=%s' % (changes,)) 218 | lines = self.lines[:] 219 | off = [0] * len(lines) 220 | for ngd in changes: 221 | lines, off = self.applyChange(lines, ngd, off) 222 | return lines 223 | 224 | def applyChanges(self, changes): 225 | self.tokenize(self.demoChanges(changes)) 226 | 227 | class DocTest(unittest.TestCase): 228 | def test_change(self): 229 | pass 230 | 231 | if __name__ == '__main__': 232 | unittest.main() 233 | 234 | -------------------------------------------------------------------------------- /src/gram.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | 4 | from collections import defaultdict 5 | try: 6 | from collections import Counter 7 | except ImportError: # python 2.6? 8 | pass 9 | import re, sys, traceback 10 | import unittest 11 | from operator import itemgetter 12 | 13 | """ 14 | Tokenizing regular expression 15 | Group: 16 | letters 17 | numbers and any punctuation 18 | group things like dates, times, ip addresses, etc. 
into a single token 19 | """ 20 | TokRgxNL = re.compile('\d+(?:[^\w\s]+\d+)*|\w+|\.|\n', re.UNICODE) 21 | TokRgx = re.compile('\d+(?:[^\w\s]+\d+)*|\w+|\.', re.UNICODE) 22 | def tokenize(str): 23 | return re.findall(TokRgxNL, str.lower()) 24 | def tokenize_no_nl(str): 25 | return re.findall(TokRgx, str.lower()) 26 | 27 | class TokenizerTest(unittest.TestCase): 28 | def test_tokenize(self): 29 | Expect = [ 30 | ('', []), 31 | ('a', ['a']), 32 | ('A', ['a']), 33 | ('Aa', ['aa']), 34 | ('a b', ['a','b']), 35 | ] 36 | for s,xp in Expect: 37 | res = tokenize(s) 38 | self.assertEqual(xp, res) 39 | 40 | """ 41 | store corpus ngrams 42 | """ 43 | class Grams: 44 | def __init__(self, w, ngmax=3, f=None): 45 | self.words = w 46 | self.ngmax = ngmax 47 | self.ngrams = ( # ngram id -> frequency 48 | None, 49 | None, 50 | Counter(), 51 | Counter(), 52 | Counter(), 53 | ) 54 | if f: 55 | self.add(f) 56 | def freq(self, ng): 57 | #assert type(ng) == tuple 58 | if len(ng) == 1: 59 | return self.words.freq(ng[0]) 60 | if ng == (): # FIXME: shouldn't need this 61 | return 0 62 | return self.ngrams[len(ng)][ng] 63 | def freqs(self, s): 64 | return self.words.freq(s) 65 | # given an iterable 'f', tokenize and produce a {word:id} mapping and ngram frequency count 66 | def add(self, f): 67 | if type(f) == list: 68 | contents = '\n'.join(f) 69 | else: 70 | try: 71 | contents = f.read(1 * 1024 * 1024) # FIXME 72 | if type(contents) == bytes: 73 | contents = contents.decode('utf8') 74 | except UnicodeDecodeError: 75 | t,v,tb = sys.exc_info() 76 | traceback.print_tb(tb) 77 | toks = tokenize_no_nl(contents) 78 | self.words.addl(toks) 79 | self.ngrams[2].update(zip(toks, toks[1:])) 80 | self.ngrams[3].update(zip(toks, toks[1:], toks[2:])) 81 | self.ngrams[4].update(zip(toks, toks[1:], toks[2:], toks[3:])) 82 | print(' ngrams[2] %8u' % len(self.ngrams[2])) 83 | print(' ngrams[3] %8u' % len(self.ngrams[3])) 84 | print(' ngrams[4] %8u' % len(self.ngrams[4])) 85 | 86 | """ 87 | given ngram of arity n, return all known ngrams containing n-1 matches; 88 | that is, where all but one of the tokens match. 89 | this is obviously O(n) and because it is exhaustive it is inefficient. 
90 | consider eventually either moving ngrams into an sqlite database or a 91 | custom in-memory structure in C 92 | select x,y,z 93 | from ngram3 94 | where (x = 'x') + (y = 'y') + (z = 'z') = 2 95 | order by freq desc 96 | """ 97 | def ngram_like(self, ng): 98 | if len(ng) <= 1: 99 | return [] 100 | assert len(ng) in (2,3) 101 | def uniq(s0,n): 102 | d = dict([(s[0][n],s[1]) for s in s0]) 103 | s = sorted(d.items(), key=itemgetter(1), reverse=True) 104 | return [x for x,y in s] 105 | if len(ng) == 2: 106 | f = lambda x: x[0] == ng[0] or \ 107 | x[1] == ng[1] 108 | elif len(ng) == 3: 109 | f = lambda x:(x[0] == ng[0]) + \ 110 | (x[1] == ng[1]) + \ 111 | (x[2] == ng[2]) == 2 112 | lng = len(ng) 113 | s0 = filter(f, self.ngrams[lng].keys()) 114 | s1 = [(k,self.ngrams[lng][k]) for k in s0] 115 | cnt = tuple(uniq(s1,n) for n in range(lng)) 116 | return cnt 117 | 118 | import pickle 119 | 120 | if __name__ == '__main__': 121 | unittest.main() 122 | 123 | -------------------------------------------------------------------------------- /src/grambin.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | 4 | from operator import itemgetter 5 | import sys 6 | from ngram3bin import ngram3bin 7 | from ngramdiff import TokenDiff,NGramDiff,NGramDiffScore 8 | from util import * 9 | 10 | """ 11 | Grams-interface to our binary ngram database 12 | """ 13 | class GramsBin: 14 | 15 | def __init__(self, wordpath, ngrampath): 16 | self.ng = ngram3bin(wordpath, ngrampath) 17 | 18 | def freq(self, ng, sum_=sum): 19 | #print 'freq()=',ng 20 | l = len(ng) 21 | if l > 1: 22 | ids = map(self.ng.word2id, ng) 23 | if l > 3: 24 | # chop up id list into ngram3-sized chunks 25 | smaller = [tuple(ids[i:i+3]) for i in range(len(ids)-3+1)] 26 | fr = sum_(self.ng.freq(*s) for s in smaller) 27 | else: 28 | fr = self.ng.freq(*ids) 29 | return fr 30 | else: 31 | return self.ng.wordfreq(ng[0]) 32 | 33 | def freqs(self, s): 34 | #print('freq(s)=',s) 35 | return self.ng.wordfreq(s) 36 | 37 | def ngram_like(self, ng, ngfreq): 38 | """ 39 | given an ngram (x,y,z), return a list of ngrams sharing all but one element, i.e. 
40 | (_,y,z) 41 | (x,_,z) 42 | (x,y,_) 43 | """ 44 | if len(ng) != 3: 45 | return [] 46 | #print 'like()=',ng 47 | ids = tuple(map(self.ng.word2id, [n[0] for n in ng])) 48 | #print('like(ids)=',ids) 49 | like = self.ng.like(*ids) 50 | #print 'like(',ng,')=',like 51 | like2 = [] 52 | for l in set(like): 53 | t,tfreq = tuple(map(self.ng.id2word, l[:3])), l[3] 54 | # calculate the single differing token and build an NGramDiff 55 | di = 0 if l[0] != ids[0] else 1 if l[1] != ids[1] else 2 56 | # do not bother with words that are of grossly different 57 | # length than our target 58 | if abs(len(t[di]) - len(ng[di][0])) > 3: 59 | continue 60 | newtok = (t[di],) + ng[di][1:] 61 | damlev = damerau_levenshtein(ng[di][0], t[di]) 62 | ngd = NGramDiff(ng[:di], 63 | TokenDiff(ng[di:di+1], [newtok], damlev), 64 | ng[di+1:], self, ngfreq, tfreq) 65 | like2.append(ngd) 66 | like3 = sorted(like2, reverse=True) 67 | return like2 68 | 69 | -------------------------------------------------------------------------------- /src/ngramdiff.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | """ 5 | token and ngram comparison 6 | """ 7 | 8 | from math import sqrt,log 9 | from util import damerau_levenshtein 10 | 11 | class TokenDiff: 12 | """ 13 | represent the modification of zero or more 'old' (original) tokens and their 14 | 'new' (proposed) replacement. solves the problem of tracking inter-token changes. 15 | change: TokenDiff([tok], [tok']) 16 | insert: TokenDiff([], [tok']) 17 | delete: TokenDiff([tok], []) 18 | split: TokenDiff([tok], [tok',tok']) 19 | merge: TokenDiff([tok,tok], [tok']) 20 | """ 21 | def __init__(self, old, new, damlev): 22 | self.old = list(old) 23 | self.new = list(new) 24 | self.damlev = damlev # Damerau-Levenshtein distance 25 | def oldtoks(self): return [t[0] for t in self.old] 26 | def newtoks(self): return [t[0] for t in self.new] 27 | def __str__(self): 28 | return 'TokenDiff((%s,%s))' % (self.old, self.new) 29 | def __repr__(self): 30 | return str(self) 31 | def __eq__(self, other): 32 | return self.old == other.old and \ 33 | self.new == other.new 34 | 35 | class NGramDiff: 36 | """ 37 | represent a list of tokens that contain a single change, represented by a TokenDiff. 
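    for example, correcting "peace" -> "piece" inside the ngram (a, peace, of)
    splits as prefix=[a], diff=TokenDiff([peace], [piece], damlev), suffix=[of]
    (positional fields and the grams handle are omitted in this illustration).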
38 | alternative, think of it as an acyclic directed graph with a single branch and merge 39 | conceptually: 40 | prefix diff suffix 41 | O---O---O---O---O---O---O 42 | \ / 43 | `-O-' 44 | """ 45 | def __init__(self, prefix, diff, suffix, g, oldfreq=None, newfreq=None, soundalike=False): 46 | self.prefix = list(prefix) 47 | self.diff = diff 48 | self.suffix = list(suffix) 49 | self.oldfreq = g.freq(self.oldtoks()) if oldfreq is None else oldfreq 50 | self.newfreq = g.freq(self.newtoks()) if newfreq is None else newfreq 51 | self.soundalike = soundalike 52 | def old(self): return self.prefix + self.diff.old + self.suffix 53 | def new(self): return self.prefix + self.diff.new + self.suffix 54 | def oldtoks(self): return [t[0] for t in self.old()] 55 | def newtoks(self): return [t[0] for t in self.new()] 56 | def __repr__(self): 57 | return str(self) 58 | def __str__(self): 59 | return 'NGramDiff(%s,%s,%s)' % (self.prefix, self.diff, self.suffix) 60 | def __eq__(self, other): 61 | return self.diff == other.diff and \ 62 | self.prefix == other.prefix and \ 63 | self.suffix == other.suffix 64 | def __lt__(self, other): 65 | return other.newfreq < self.newfreq 66 | 67 | class NGramDiffScore: 68 | # based on our logarithmic scoring below 69 | DECENT_SCORE = 3.0 70 | GOOD_SCORE = 5.0 71 | """ 72 | decorate an NGramDiff obj with scoring 73 | """ 74 | def __init__(self, ngd, p, score=None): 75 | self.ngd = ngd 76 | self.sl = ngd.diff.new and ngd.diff.old and ngd.diff.new[0][0][0] == ngd.diff.old[0][0][0] 77 | if score: 78 | self.score = score 79 | self.ediff = score 80 | else: 81 | self.score = self.calc_score(ngd, p) 82 | def calc_score(self, ngd, p): 83 | ediff = self.similarity(ngd, p) 84 | self.ediff = ediff 85 | if ngd.newfreq == 0: 86 | score = -float('inf') 87 | else: 88 | # weigh edit distance much more heavily than frequency 89 | score = 10 - (2 + ediff + (not self.sl)) 90 | return score 91 | def improve_pct(self): 92 | """How much of an improvement is the new from the old?""" 93 | if self.ngd.oldfreq == 0: 94 | return float('inf') 95 | return self.ngd.newfreq / self.ngd.oldfreq 96 | def __str__(self): 97 | return 'NGramDiffScore(score=%4.1f ngd=%s)' % (self.score, self.ngd) 98 | def __repr__(self): 99 | return str(self) 100 | def __eq__(self, other): 101 | return other.score == self.score 102 | def __lt__(self, other): 103 | return other.score < self.score 104 | def __add__(self, other): 105 | return NGramDiffScore(self.ngd, None, self.score + other.score) 106 | @staticmethod 107 | def overlap(s1, s2): 108 | """ 109 | given a list of sound()s, count the number that do not match 110 | 1 2 3 4 5 6 111 | 'T AH0 M AA1 R OW2' 112 | 'T UW1 M' 113 | = = 114 | 6 - 2 = 4 115 | """ 116 | mlen = max(len(s1), len(s2)) 117 | neq = sum(map(lambda x: x[0] != x[1], zip(s1, s2))) 118 | return mlen - neq 119 | def similarity(self, ngd, p): 120 | """ 121 | return tuple (effective difference, absolute distance) 122 | given a string x, calculate a similarity distance for y [0, +inf). 123 | smaller means more similar. the goal is to identify promising 124 | alternatives for a given token within a document; we need to consider 125 | the wide range of possible errors that may have been made 126 | """ 127 | if ngd.soundalike: 128 | return 0 129 | x = ' '.join(ngd.diff.oldtoks()) 130 | y = ' '.join(ngd.diff.newtoks()) 131 | # tokens identical 132 | if x == y: 133 | return 0 134 | damlev = ngd.diff.damlev 135 | sx,sy = p.phraseSound([x]),p.phraseSound([y]) 136 | if sx == sy and sx: 137 | # sound the same, e.g. 
there/their. consider these equal. 138 | return damlev 139 | # otherwise, calculate phonic/edit difference 140 | return max(damlev, 141 | min(NGramDiffScore.overlap(sx, sy), 142 | abs(len(x)-len(y)))) 143 | 144 | if __name__ == '__main__': 145 | import sys 146 | sys.path.append('..') 147 | from grambin import GramsBin 148 | from word import Words,NGram3BinWordCounter 149 | from phon import Phon 150 | import logging 151 | 152 | logging.basicConfig(stream=sys.stderr, level=logging.DEBUG) 153 | logging.debug('loading...') 154 | g = GramsBin( 155 | '/home/pizza/proj/spill-chick/data/corpus/google-ngrams/word.bin', 156 | '/home/pizza/proj/spill-chick/data/corpus/google-ngrams/ngram3.bin') 157 | w = Words(NGram3BinWordCounter(g.ng)) 158 | p = Phon(w,g) 159 | logging.debug('loaded.') 160 | 161 | pass 162 | 163 | -------------------------------------------------------------------------------- /src/phon.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | """ 5 | Handle phonetics; i.e. the way things sound 6 | """ 7 | 8 | import collections, re, sys, gzip, pickle, os, mmap 9 | from word import Words 10 | from gram import tokenize 11 | 12 | class Phon: 13 | def __init__(self, w, g): 14 | self.words = w 15 | self.word = collections.defaultdict(list) 16 | self.phon = collections.defaultdict(list) 17 | self.load(g) 18 | def load(self, g): 19 | dictpath ='/home/pizza/proj/spill-chick/data/cmudict/cmudict.0.7a' 20 | # extract file if necessary 21 | if not os.path.exists(dictpath): 22 | with open(dictpath, 'wb') as dst: 23 | with gzip.open(dictpath + '.gz', 'rb') as src: 24 | dst.write(src.read()) 25 | # TODO: loading this ~130,000 line dictionary in python represents the majority 26 | # of the program's initialization time. move it over to C. 27 | with open(dictpath, 'r') as f: 28 | for line in f: 29 | if line.startswith(';;;'): 30 | continue 31 | line = line.decode('utf8') 32 | line = line.strip().lower() 33 | word, phon = line.split(' ') 34 | """ 35 | skip any words that do not appear in our ngrams. 36 | this makes a significant difference when trying to reconstruct phrases 37 | phonetically; small decreases in terms have large decreases in products. 38 | note: you may think that every word in a dictionary would appear 39 | at least once in a large corpus, but we truncate corpus n-grams at a 40 | certain minimum frequency which may exclude very obscure words from ultimately 41 | appearing at all. 42 | """ 43 | 44 | # TODO: what i really should do is eliminate all words that appear less 45 | # than some statistically significant time; the vast majority of the 46 | # phonetic phrases I currently try are filled with short obscure words 47 | # and are a complete waste 48 | # FIXME: instead of hard-coding frequency, calculate statistically 49 | if word.count("'") == 0 and g.freqs(word) < 500: 50 | continue 51 | """ 52 | implement a very rough phonic fuzzy-matching 53 | phonic codes consist of a list of sounds such as: 54 | REVIEW R IY2 V Y UW1 55 | we simplify this to 56 | REVIEW R I V Y U 57 | this allows words with close but imperfectly sounding matches to 58 | be identified. for example: 59 | REVUE R IH0 V Y UW1 60 | REVIEW R IY2 V Y UW1 61 | is close but not a perfect match. 
after regex: 62 | REVUE R I V Y U 63 | REVIEW R I V Y U 64 | """ 65 | phon = re.sub('(\S)(\S+)', r'\1', phon) 66 | # now merge leading vowels except 'o' and 'u' 67 | if len(phon) > 1: 68 | phon = re.sub('^[aei]', '*', phon) 69 | self.words.add(word) 70 | self.word[word].append(phon) 71 | toks = tokenize(word) 72 | self.phon[phon].append(toks) 73 | 74 | """ 75 | return a list of words that sound like 'word', as long as they appear in ng 76 | """ 77 | def soundsLike(self, word, ng): 78 | l = [] 79 | for w in self.word[word]: 80 | for x in self.phon[w]: 81 | fr = ng.freqs(x) 82 | if fr > 0: 83 | l.append((x,fr)) 84 | return [w for w,fr in sorted(l, key=lambda x:x[1], reverse=True)] 85 | 86 | def phraseSound(self, toks): 87 | """ 88 | given a list of tokens produce a normalize list of their component sound 89 | an unknown token generates None 90 | TODO: ideally we would be able to "guess" the sound of unknown words. 91 | this would be a huge improvement! 92 | given 'waisting' we should be able to break it into 'waist' 'ing' 93 | """ 94 | def head(l): 95 | return l[0] if l else None 96 | s = [head(self.word.get(t,[''])) for t in toks] 97 | #print('phraseSound(',toks,')=',s) 98 | if not all(s): 99 | return [] 100 | # nuke numbers, join into one string 101 | t = ' '.join([re.sub('\d+', '', x) for x in s]) 102 | # nuke consecutive duplicate sounds 103 | u = re.sub('(\S+) \\1 ', '\\1 ', t) 104 | v = u.split() 105 | #print('phraseSound2=',v) 106 | return v 107 | 108 | def soundsToWords(self, snd): 109 | if snd == []: 110 | yield [] 111 | for j in range(1, len(snd)+1): 112 | t = ' '.join(snd[:j]) 113 | words = self.phon.get(t) 114 | if words: 115 | for s in self.soundsToWords(snd[j:]): 116 | yield [words] + s 117 | 118 | if __name__ == '__main__': 119 | 120 | def words(str): 121 | return re.findall('[a-z\']+', str.lower()) 122 | 123 | def pron(wl, wd): 124 | print(' '.join([str(wd[w][0]) if w in wd else '<%s>' % (w,) for w in wl])) 125 | 126 | P = Phon(Words()) 127 | for a in sys.argv[1:]: 128 | pron(words(a), P.W) 129 | 130 | print(P.word['there']) 131 | print(P.phon[P.word['there'][0]]) 132 | 133 | P.phraseSound(['making','mistake']) 134 | P.phraseSound(['may','king','mist','ache']) 135 | x = P.phraseSound(['making','miss','steak']) 136 | from itertools import product 137 | for f in P.soundsToWords(x): 138 | print(f) 139 | #print(list(product(*f))) 140 | 141 | -------------------------------------------------------------------------------- /src/test.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # ex: set ts=8 noet: 4 | # Copyright 2011 Ryan Flynn 5 | 6 | """ 7 | chick.py ✕ test.txt 8 | """ 9 | 10 | import sys, re, logging 11 | from chick import Chick 12 | 13 | logger = logging.getLogger('spill-chick') 14 | hdlr = logging.StreamHandler(sys.stderr) 15 | logger.addHandler(hdlr) 16 | logger.setLevel(logging.DEBUG) 17 | 18 | def load_tests(): 19 | # load test cases 20 | Tests = [] 21 | with open('../test/test.txt','r') as f: 22 | for l in f: 23 | l = l.decode('utf8').strip() 24 | if l == '#--end--': 25 | break 26 | if len(l) > 1 and l[0] != '#': 27 | before, after = l.split(' : ') 28 | after = re.sub('\s*#.*', '', after.rstrip(), re.U) # replace comments 29 | Tests.append(([before],after)) 30 | return Tests 31 | 32 | # TODO: Word() and Grams() should be merged, they're essentially the same 33 | 34 | def test(): 35 | """ 36 | run our tests. 
initialize resources and tests, run each test and 37 | figure out what works and what doesn't. 38 | """ 39 | chick = Chick() 40 | Tests = load_tests() 41 | passcnt = 0 42 | for str,exp in Tests: 43 | logger.debug('Test str=%s exp=%s' % (str, exp)) 44 | res = chick.correct(str) 45 | logger.debug('exp=%s(%s) res=%s(%s)' % (exp, type(exp), res, type(res))) 46 | passcnt += res == exp 47 | logger.debug('----------- %s -------------' % ('pass' if res == exp else 'fail',)) 48 | logger.debug('Tests %u/%u passed.' % (passcnt, len(Tests))) 49 | 50 | def profile_test(): 51 | import cProfile, pstats 52 | cProfile.run('test()', 'test.prof') 53 | st = pstats.Stats('test.prof') 54 | st.sort_stats('time') 55 | st.print_stats() 56 | 57 | if __name__ == '__main__': 58 | 59 | from sys import argv 60 | if len(argv) > 1 and argv[1] == '--profile': 61 | profile_test() 62 | else: 63 | test() 64 | 65 | -------------------------------------------------------------------------------- /src/util.py: -------------------------------------------------------------------------------- 1 | 2 | """ 3 | classes and utility functions that are used by everyone 4 | """ 5 | 6 | from operator import itemgetter 7 | from math import sqrt,log 8 | from itertools import chain 9 | # convenience functions 10 | def rsort(l, **kw): return sorted(l, reverse=True, **kw) 11 | def rsort1(l): return rsort(l, key=itemgetter(1)) 12 | def rsort2(l): return rsort(l, key=itemgetter(2)) 13 | def sort1(l): return sorted(l, key=itemgetter(1)) 14 | def sort2(l): return sorted(l, key=itemgetter(2)) 15 | def flatten(ll): return chain.from_iterable(ll) 16 | def zip_longest(x, y, pad=None): 17 | x, y = list(x), list(y) 18 | lx, ly = len(x), len(y) 19 | if lx < ly: 20 | x += [pad] * (ly-lx) 21 | elif ly < lx: 22 | y += [pad] * (lx-ly) 23 | return zip(x, y) 24 | 25 | def damerau_levenshtein(seq1, seq2): 26 | """Calculate the Damerau-Levenshtein distance between sequences. 27 | 28 | This distance is the number of additions, deletions, substitutions, 29 | and transpositions needed to transform the first sequence into the 30 | second. Although generally used with strings, any sequences of 31 | comparable objects will work. 32 | 33 | Transpositions are exchanges of *consecutive* characters; all other 34 | operations are self-explanatory. 35 | 36 | This implementation is O(N*M) time and O(M) space, for N and M the 37 | lengths of the two sequences. 38 | 39 | >>> damerau_levenshtein('ba', 'abc') 40 | 2 41 | >>> damerau_levenshtein('fee', 'deed') 42 | 2 43 | 44 | It works with arbitrary sequences too: 45 | >>> damerau_levenshtein('abcd', ['b', 'a', 'c', 'd', 'e']) 46 | 2 47 | """ 48 | # codesnippet:D0DE4716-B6E6-4161-9219-2903BF8F547F 49 | # Conceptually, this is based on a (len(seq1) + 1) x (len(seq2) + 1) matrix. 50 | # However, only the current and two previous rows are needed at once, 51 | # so we only store those. 52 | oneago = None 53 | thisrow = range(1, len(seq2) + 1) + [0] 54 | for x in xrange(len(seq1)): 55 | # Python lists wrap around for negative indices, so put the 56 | # leftmost column at the *end* of the list. This matches with 57 | # the zero-indexed strings and saves extra calculation. 
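For example, with len(seq2) == 3 the first thisrow is [1, 2, 3, 0]: the trailing element stands in for the matrix's leftmost column, so the negative index in thisrow[y - 1] and oneago[y - 1] wraps around to it when y == 0.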
58 | twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1] 59 | for y in xrange(len(seq2)): 60 | delcost = oneago[y] + 1 61 | addcost = thisrow[y - 1] + 1 62 | subcost = oneago[y - 1] + (seq1[x] != seq2[y]) 63 | thisrow[y] = min(delcost, addcost, subcost) 64 | # This block deals with transpositions 65 | if (x > 0 and y > 0 and seq1[x] == seq2[y - 1] and seq1[x-1] == seq2[y] and seq1[x] != seq2[y]): 66 | thisrow[y] = min(thisrow[y], twoago[y - 2] + 1) 67 | return thisrow[len(seq2) - 1] 68 | 69 | -------------------------------------------------------------------------------- /src/web/.gitignore: -------------------------------------------------------------------------------- 1 | session/* 2 | static/tmp 3 | -------------------------------------------------------------------------------- /src/web/code.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | """ 5 | web-based spill-chick front-end 6 | 7 | setup: 8 | * mkdir session/ 9 | (I tried adding it to the project but git can't hold empty directories and session/.gitignore kludge got deleted by the webserver, apparently) 10 | * ensure webserver user has 11 | * read access to ngram3.bin and word.bin files 12 | * write access to session/ directory 13 | """ 14 | 15 | def abspath(localpath): 16 | return os.path.join(os.path.dirname(__file__), localpath) 17 | 18 | import os, sys 19 | from itertools import dropwhile 20 | from time import time 21 | import web 22 | from web import form 23 | import logging 24 | 25 | logger = logging.getLogger('spill-chick') 26 | 27 | sys.path.append(abspath('..')) 28 | from chick import Chick 29 | from doc import Doc 30 | 31 | web.config.debug = True 32 | 33 | urls = ( '/.*', 'check' ) 34 | 35 | app = web.application(urls, globals()) 36 | session = web.session.Session(app, web.session.DiskStore(abspath('session')), 37 | initializer={'target':None, 'skip':[], 'replacements':[], 'suggestions':[]}) 38 | render = web.template.render(abspath('templates/'), base='base', globals=globals(), cache=False) 39 | application = app.wsgifunc() 40 | chick = Chick() 41 | 42 | class check: 43 | 44 | def GET(self): 45 | session.kill() 46 | return render.check('', [], [], 0, []) 47 | 48 | def POST(self): 49 | start_time = time() 50 | text = unicode(web.input().get('text', '')) 51 | lines = text.split('\r\n') 52 | 53 | act = web.input().get('act', '') 54 | if act == 'Replace': 55 | # FIXME: if replacement takes place, update location/offsets 56 | # of all remaining session['suggestions'] 57 | replacement_index = int(web.input().get('replacement_index', '0')) 58 | if replacement_index: 59 | d = Doc(lines, chick.w) 60 | replacements = session.get('replacements') 61 | if replacement_index <= len(replacements): 62 | replacement = replacements[replacement_index-1] 63 | d.applyChanges([replacement]) 64 | text = str(d) 65 | lines = d.lines 66 | logger.debug('after replacement lines=%s' % (lines,)) 67 | session['suggestions'].pop(0) 68 | elif act == 'Skip to next...': 69 | session['skip'].append(session['target']) 70 | session['suggestions'].pop(0) 71 | elif act == 'Done': 72 | # nuke target, replacements, skip, etc. 
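(session.kill() drops the DiskStore entry for this visitor, so the next request starts over from the Session initializer defaults.)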
73 | session.kill() 74 | 75 | sugg2 = [] 76 | suggs = [] 77 | suggestions = [] 78 | replacements = [] 79 | 80 | if act and act != 'Done': 81 | suggestions = session['suggestions'] 82 | if not suggestions: 83 | logger.debug('suggest(lines=%s)' % (lines,)) 84 | suggestions = list(chick.suggest(lines, 5, session['skip'])) 85 | if not suggestions: 86 | target,suggs,sugg2 = None,[],[] 87 | else: 88 | # calculate offsets based on line length so we can highlight target substring in 89 | off = [len(l)+1 for l in lines] 90 | lineoff = [0]+[sum(off[:i]) for i in range(1,len(off)+1)] 91 | changes = suggestions[0] 92 | target = changes[0].ngd.oldtoks() 93 | for ch in changes: 94 | ngd = ch.ngd 95 | replacements.append(ngd) 96 | o = ngd.old() 97 | r = ngd.new() 98 | linestart = o[0][1] 99 | lineend = o[-1][1] 100 | start = o[0][3] 101 | end = o[-1][3] + len(o[-1][0]) 102 | sugg2.append((' '.join(ngd.newtoks()), 103 | lineoff[linestart] + start, 104 | lineoff[lineend] + end)) 105 | session['target'] = target 106 | session['replacements'] = replacements 107 | session['suggestions'] = suggestions 108 | 109 | elapsed = round(time() - start_time, 2) 110 | return render.check(text, sugg2, lines, elapsed, suggestions) 111 | 112 | if __name__ == '__main__': 113 | app.run() 114 | 115 | -------------------------------------------------------------------------------- /src/web/conf/apache2.conf.add: -------------------------------------------------------------------------------- 1 | 2 | # append something like this to apache2.conf to get our web.py app running 3 | 4 | # spill-chick web.py 5 | #LoadModule wsgi_module modules/mod_wsgi.so 6 | WSGIScriptAlias /spill-chick /var/www/spill-chick/code.py 7 | Alias /spill-chick/static /var/www/spill-chick/static/ 8 | Alias /spill-chick/templates /var/www/spill-chick/templates/ 9 | AddType text/html .py 10 | 11 | Order deny,allow 12 | Allow from all 13 | 14 | 15 | -------------------------------------------------------------------------------- /src/web/static/img/chick.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rflynn/spill-chick/430257c25369908f243a08d33caa268e8e398aeb/src/web/static/img/chick.png -------------------------------------------------------------------------------- /src/web/static/img/chick16.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rflynn/spill-chick/430257c25369908f243a08d33caa268e8e398aeb/src/web/static/img/chick16.png -------------------------------------------------------------------------------- /src/web/static/img/chick16.png.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rflynn/spill-chick/430257c25369908f243a08d33caa268e8e398aeb/src/web/static/img/chick16.png.ico -------------------------------------------------------------------------------- /src/web/static/img/chick32.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rflynn/spill-chick/430257c25369908f243a08d33caa268e8e398aeb/src/web/static/img/chick32.png -------------------------------------------------------------------------------- /src/web/static/js/spill-chick.js: -------------------------------------------------------------------------------- 1 | 2 | function textboxSelect(oTextbox, iStart, iEnd) 3 | { 4 | switch(arguments.length) 5 | { 6 | case 1: 7 | oTextbox.select(); 8 | break; 9 | case 2: 10 | 
iEnd = oTextbox.value.length; 11 | /* falls through */ 12 | case 3: 13 | if (oTextbox.createTextRange) 14 | { 15 | var oRange = oTextbox.createTextRange(); 16 | oRange.moveStart("character", iStart); 17 | oRange.moveEnd("character", - oTextbox.value.length + iEnd); 18 | oRange.select(); 19 | oTextbox.scrollTop = oRange.boundingTop 20 | } 21 | else if (oTextbox.setSelectionRange) 22 | { 23 | oTextbox.setSelectionRange(iStart, iEnd); 24 | } 25 | } 26 | } 27 | -------------------------------------------------------------------------------- /src/web/templates/base.html: -------------------------------------------------------------------------------- 1 | $def with (page) 2 | 3 | 4 | 5 | $if page.has_key('title'): 6 | $page.title 7 | $else: 8 | Spill-Chick 9 | 21 | 22 | 23 | 24 | 25 |
26 | Spill-Chick 27 |
28 | 29 | $:page 30 | 31 |
32 | Session: $session 33 |
34 | web.input: $web.input() 35 | 36 | 37 | -------------------------------------------------------------------------------- /src/web/templates/check.html: -------------------------------------------------------------------------------- 1 | $def with (text, replacements, lines, elapsed, suggestions) 2 | 3 |
4 | 5 | 6 |
7 | $if replacements: 8 |
9 | 10 | 14 |
15 |
16 |
17 | 18 |
19 | $else: 20 | 21 |
22 |
23 | 24 | 25 | $if not replacements: 26 | 27 |
28 |
29 |
30 | 31 |
32 | Elapsed: $elapsed seconds 33 |
34 | suggs: $suggestions 35 |
36 | Lines: $lines 37 |
38 | Replacements: $replacements 39 | 40 | -------------------------------------------------------------------------------- /src/word.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | 4 | import collections 5 | 6 | Alphabet = 'abcdefghijklmnopqrstuvwxyz' 7 | 8 | def edits1(word): 9 | splits = [(word[:i], word[i:]) for i in range(len(word) + 1)] 10 | deletes = [a + b[1:] for a, b in splits if b] 11 | transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1] 12 | replaces = [a + c + b[1:] for a, b in splits for c in Alphabet if b] 13 | inserts = [a + c + b for a, b in splits for c in Alphabet] 14 | return set(deletes + transposes + replaces + inserts) 15 | 16 | """ 17 | imitate the interface of a Counter() that Words is expecting 18 | so we can use ngram3bin without him knowing 19 | """ 20 | class NGram3BinWordCounter: 21 | def __init__(self, ng): 22 | self.ng = ng 23 | def __contains__(self, word): 24 | # foo in me 25 | return self.ng.word2id(word) != 0 26 | def get(self, word, default=0): 27 | try: 28 | return self.ng.wordfreq(word) 29 | except (ValueError, TypeError): 30 | raise KeyError 31 | def __getitem__(self, word): 32 | # me[key] 33 | if type(word) == int: 34 | raise IndexError 35 | try: 36 | return self.ng.wordfreq(word) 37 | except: 38 | raise KeyError 39 | def __setitem__(self, word, val): 40 | # me[key] = val 41 | pass 42 | def update(self, wordfreqlist): 43 | pass 44 | 45 | """ 46 | Word statistics 47 | """ 48 | class Words: 49 | 50 | def __init__(self, frq=None):#collections.Counter()): 51 | self.frq = frq 52 | 53 | def add(self, word): 54 | self.frq[word] += 1 55 | 56 | def addl(self, words): 57 | self.frq.update(words) 58 | 59 | def freq(self, word): 60 | return self.frq[word] 61 | 62 | def known_edits2(self, word): 63 | return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in self.frq) 64 | 65 | def known(self, words): return set(w for w in words if w in self.frq) 66 | 67 | # FIXME: this does not always work 68 | # example: 'passified' becomes 'assified' instead of 'pacified' 69 | # TODO: lots of mis-spellings are phonetic; we should attempt to "sound out" 70 | # unknown words, possibly by breaking them into pieces and trying to assemble the sound 71 | # from existing words 72 | # FIXME: douce -> douse 73 | # FIXME: iv -> ivy 74 | def correct(self, word): 75 | candidates = self.known([word]) | self.known(edits1(word)) or self.known_edits2(word) or [word] 76 | return max(candidates, key=self.frq.get) 77 | 78 | # FIXME: bid -> big 79 | # FIXME: hungreh -> hungry 80 | def similar(self, word): 81 | e = self.known(edits1(word)) 82 | # FIXME: this is just the trickiest line in the whole thing. 83 | # flexibility at an expensive price... 
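each edits1() call yields roughly 54*len(word)+25 candidate strings and known_edits2() explores the square of that, so we only pay for the second edit on longer words, where a single edit rarely reaches the intended spelling.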
84 | if len(word) > 6: 85 | e |= self.known_edits2(word) 86 | return e 87 | 88 | @staticmethod 89 | def signature(word): 90 | "sorted list of ('letter',frequency) for all letters in word" 91 | return [(c,len(list(l))) for c,l in groupby(sorted(word))] 92 | 93 | if __name__ == '__main__': 94 | pass 95 | 96 | -------------------------------------------------------------------------------- /test/Ode-To-My-Spell-Checker-correct.txt: -------------------------------------------------------------------------------- 1 | 2 | I have a spelling checker 3 | It came with my PC 4 | It plainly marks for my review 5 | Mistakes I cannot see 6 | 7 | I strike a key and type a word 8 | And wait for it to say 9 | Whether I am wrong or right 10 | It shows me straight away 11 | 12 | As soon as a mistake is made 13 | It knows before too long 14 | And I can put the error right 15 | It’s rarely ever wrong 16 | 17 | I have run this poem through it 18 | I am sure you’re pleased to know 19 | It’s letter perfect in its way 20 | My checker told me so 21 | 22 | -------------------------------------------------------------------------------- /test/Ode-To-My-Spell-Checker-original.txt: -------------------------------------------------------------------------------- 1 | 2 | Eye halve a spelling chequer 3 | It came with my pea sea 4 | It plainly marques four my revue 5 | Miss steaks eye kin knot sea. 6 | 7 | Eye strike a quay and type a word 8 | And weight four it two say 9 | Weather eye am wrong oar write 10 | It shows me strait a weigh. 11 | 12 | As soon as a mist ache is maid 13 | It nose bee fore two long 14 | And eye can put the error rite 15 | Its really ever wrong. 16 | 17 | Eye have run this poem threw it 18 | I am shore your pleased two no 19 | Its letter perfect in it's weigh 20 | My chequer tolled me sew. 21 | 22 | -------------------------------------------------------------------------------- /test/Spill-Chick-Yore-Dock-You-Mints-correct.txt: -------------------------------------------------------------------------------- 1 | You can always spellcheck your documents 2 | 3 | Spellcheckers United 4 | 1234 Doughnut Street 5 | Sault Ste Marie, Michigan 49599 6 | 7 | November 7, 2000 8 | 9 | Miss Spellbound 10 | Spelling Checkers, Inc. 11 | 1259 Broadway 12 | New York, NY 11012 13 | 14 | Dear Mrs. Spellbound: 15 | 16 | You might have used some spell checker which came with your computer. It's great at putting marks for all to find mistakes I cannot see. I done put your lines right through by typing so carefully. It's going to be perfect and I know it's going to be because our computer told me and you and everybody around here so. 17 | 18 | Sincerely, 19 | 20 | Thomas Jackson 21 | -------------------------------------------------------------------------------- /test/Spill-Chick-Yore-Dock-You-Mints-original.txt: -------------------------------------------------------------------------------- 1 | Ewe Kin Awl Weighs Spill Chick Yore Dock You Mints 2 | 3 | Spill Chiggers Ewe Knighted 4 | 1234 Doe Nuts Treat 5 | Sue Saint Mary, MI 49599 6 | 7 | No member 7, 2000 8 | 9 | Mist Spill Bond 10 | Spilling Check Hers, Ink. 11 | 1259 Board Weigh 12 | Gnu Yoke, NY 11012 13 | 14 | Dare Misses Spill Bond: 15 | 16 | Ewe mite hove use sum spill check her witch came mitt yore come pewter. Its grate at pudding marts fore awl two fined miss steaks aye kin knot sea. Aye dun putt you're lions write threw bye tie ping sew care fully. 
Its go wing too be purr fit an eye no its gone a bee be cause are calm pewter tolled me an yew an every buddy a round hear sew. 17 | 18 | Sin sear Lee, 19 | 20 | Tom Us Jack Sun 21 | -------------------------------------------------------------------------------- /test/test.txt: -------------------------------------------------------------------------------- 1 | # unit tests 2 | # 3 | 4 | # FIXME: we don't handle apostrophes correctly 5 | Even so, I'm open minded. : Even so, I'm open minded. 6 | that's and advantage. : that's an advantage. 7 | 8 | # FAIL - 9 | # FIXME: can't handle 2-word idioms nor "sound out" sentences 10 | funky farm : funny farm # wrong first token not handled 11 | untied stats : united states # tricky -- both tokens wrong, no extra context 12 | # issue: two-token not handled 13 | Windows PX : Windows XP 14 | a miss steak : a mistake 15 | dry rum : dry run 16 | # issue: punctuation 17 | hell there. how are you? : hello there. how are you? # needs punctuation 18 | bag apple : big apple 19 | beg apple : big apple 20 | 21 | # FIXME: can't fix 4-token idiom 22 | state-of-the-are : state-of-the-art 23 | 24 | #--end-- 25 | 26 | 27 | ######## PASS ############ 28 | bridge the gas : bridge the gap 29 | you are waisting your time : you are wasting your time 30 | # test simple change, but... 31 | # there are several similar variants that are more popular than the correct answer 32 | i new that! : i knew that! 33 | # test phonetic match and token removal 34 | their is no : there is no 35 | i no you : i know you 36 | an IV league school : an IVY league school 37 | # 38 | win or loose : win or lose 39 | Wet your appetite : Whet your appetite 40 | Try and fry again : Try and try again 41 | I am very tried : I am very tired 42 | garden of eating : garden of eden 43 | garden of eatin : garden of eden 44 | I am found of you : I am fond of you 45 | their is : there is 46 | their it is : there it is 47 | # no-ops 48 | i think so : i think so 49 | i now know : i now know 50 | # phonetic 51 | Summer is almost hear. : Summer is almost here. 52 | i am hear : i am here 53 | no, i was write : no, i was right # metallic 54 | peace of shit : piece of shit 55 | # 56 | we have a bid backyard : we have a big backyard 57 | a double cheese burger in : a double cheeseburger in 58 | i have a spelling chequer : i have a spelling checker 59 | 60 | we'll touch bass : we'll touch base 61 | i didn't no : i didn't know 62 | # issue: slang 63 | nope, i was write : nope, i was right # metallic # slang, doesn't know 'nope' 64 | bridge the gap. bridge the gas. : bridge the gap. bridge the gap. # multi-sentence problem; ignores "gas." 65 | 66 | # avoid making suggestions for numbers 67 | # perhaps transpositions, but in most cases we don't want to replace whole numbers... 68 | for over 35 years we bridge the gas : for over 35 years we bridge the gap 69 | 70 | That's not a every impressive claim to make. : That's not a very impressive claim to make. 71 | Long Island, New York, state-of-the-are facility : Long Island, New York, state-of-the-art facility 72 | That is pretty much what I was eluding : That is pretty much what I was alluding 73 | 74 | #--end-- 75 | 76 | ######### FAIL 77 | 78 | While the post author claims that using a SQL backend doesn't make much sense, according to the fossil web page (http://fossil-scm.org/) that's and advantage. : While the post author claims that using a SQL backend doesn't make much sense, according to the fossil web page (http://fossil-scm.org/) that's an advantage. 
79 | 80 | #--end-- 81 | 82 | # FIXME: we calculate the diff of eluding -> alluding as 0 because their sounds match 83 | # but we must differentiate between a phonic change and an actual change that does nothing, i.e. 84 | # does not change the text at all; we should value the former more highly 85 | #That is pretty much what I was eluding : That is pretty much what I was alluding 86 | ##--end-- 87 | 88 | # FIXME: this is a shortcoming of the unknown token corrector, a separate but important 89 | # part of our program that runs before all the other parts. 90 | # we must do a more thorough job of picking apart unknown tokens, try splitting/merging them with their surroundings 91 | #spillchick : spellcheck 92 | ##--end-- 93 | 94 | # test whether we're smart enough to prioritize "win or loose" -> "win or lose", which is a good fix 95 | #In 2005 we win or loose : In 2005 we win or lose 96 | ##--end-- 97 | #their coming to : they're coming to 98 | ##--end-- 99 | 100 | # bestsugg 8.74 3 931301 that it is 101 | # bestsugg 7.95 0 21036 there it is 102 | # the "correct" solution is a close second because of the overwhelming frequency of the first 103 | # we need to more heavily weight the improvement of a diff of 0 (phonic difference) over higher 104 | # frequency. 105 | 106 | #--end-- 107 | 108 | I would have won if had one! : I would have one if I had won. 109 | I would have one if I had won! : I would have one if I had won! 110 | I would have one if I had one. : I would have one if I had one. 111 | I would have won if I had won. : I would have one if I had won. 112 | 113 | #I would have won two if had one too! 114 | #I would have one too if I had won one! 115 | #I would have one too if I had one too. 116 | #I would have won too if I had won one. 117 | 118 | # FIXME: these take ages and always fail 119 | doe sit use machien learning : does it use machine learning 120 | dose it use machien learning : does it use machine learning 121 | doze it use machien learning : does it use machine learning 122 | ##--end-- 123 | 124 | # FIXME: I'm not sure but either I'm picking bad examples or something; 125 | # what i expect is not what the ngrams suggest. strange. 126 | #in the sample place : in the same place 127 | 128 | could care less : couldn't care less 129 | ##--end-- 130 | 131 | create a passified country : create a pacified country # urbandictionary 132 | someone douce me in chocolate syrup : someone douse me in chocolate syrup 133 | Downloading copywritten movies : Downloading copyrighted movies 134 | # needs to join ('cheese','burger') -> 'cheeseburger' 135 | I still have a double cheese burger in the refridgerator : I still have a double cheeseburger in the refrigerator 136 | ##--end-- 137 | 138 | This is all very tenative. : This is all very tentative. 
139 | someone otther than yourself : someone other than yourself 140 | #--end-- 141 | 142 | 143 | ########## BOTCHED IDIOMS 144 | # 145 | Coming down the pipe : Coming down the pike 146 | Through the ringer : Through the wringer 147 | touch basis : touch bases 148 | # 149 | #800-pond gorilla : 800-pound gorilla 150 | could care less : couldn't care less 151 | #oh de colone : eau de cologne 152 | #two in the hand is worth one in the bush : one in the hand is worth two in the bush 153 | # these two would benefit from trying edit distance 2 if we're unable to find a change the first time 154 | scotch free : scot-free 155 | never cry wool : never cry wolf 156 | # these are too many edits away 157 | # perhaps i could do it by filling in the blanks 158 | pushing up days : pushing up daisies 159 | ##--end-- 160 | 161 | 162 | ####### SORT OF WORKS ############ 163 | # this would benefit if we weighted consonant changes more heavily than vowel changes 164 | spill chick : spell check # actually ok... ['still thick','spell check',...] 165 | # issue: almost works. we get 'pay' instead of 'paid'. 'paid' is second. 166 | get what you payed for : get what you paid for 167 | ##--end-- 168 | 169 | 170 | # this is a tricky one. "the dog was" is immensely frequent, 171 | # but "dog was dense" isn't. "fog was dense" is more frequent than "dog was dense", 172 | # but when the ngram frequencies are simply summed "the dog" still wins 173 | The dog was dense : The fog was dense 174 | 175 | # almost works, but apostrophe still 176 | worth it's salt : worth its salt 177 | 178 | ##--end-- 179 | 180 | It it did, the future would undoubtedly be changed : If it did, the future would undoubtedly be changed # Foundation, Isaac Asimov p. 33 181 | 182 | ##--end-- 183 | 184 | ####### BROKEN ########## 185 | overhere : overhear 186 | USB-to-serail driver : USB-to-serial driver # technical term not in ngrams 187 | # big test; requires token expansion (their) -> (they,re) 188 | Their coming too sea if its reel. : They're coming to see if it's real. 189 | # nope, phonic stuff doesn't do fuzzy matching 190 | all intensive purposes : all intents and purposes 191 | say "good riddens" to : say "good riddance" to # fuzzy phonic matching 192 | spill check : spell check 193 | # duplicated word 'does' 194 | the action does does come with : the action does come with # slashdot 195 | 196 | # this is a tricky one. "the dog was" is immensely frequent, 197 | # but "dog was dense" isn't. 
"fog was dense" is more frequent than "dog was dense", 198 | # but when the ngram frequencies are simply summed "the dog" still wins 199 | The dog was dense : The fog was dense 200 | 201 | ####### UNEXPECTED NON-FIXES ########## 202 | right over their : right over there # hmm "fix" is less than twice as frequent 203 | 204 | #--end-- 205 | 206 | #soyouneedtomakethatvariable : so you need to make that variable 207 | 208 | #over hear : overhear # not sure about this one 209 | over here : over here 210 | 211 | #--end-- 212 | 213 | # misspellings: non-words 214 | naieve : naive 215 | #bazillion : billion 216 | #bajillion : billion 217 | inztrnlazti : international 218 | joyd ivision : joy division 219 | Insturctions: : Instructions: 220 | descently well : decently well 221 | I'm leary of it : I'm leery of it 222 | a pthon library : a python library 223 | #santimoniousness : sanctimoniousness 224 | integeter division : integer division 225 | 226 | # transpositions resulting in words 227 | The dog was dense : The fog was dense 228 | I am very tried : I am very tired 229 | whatever remains, whoever improbable, must be the truth. : whatever remains, however improbable, must be the truth. 230 | It it did, the future would undoubtedly be changed in some minor respects. : If it did, the future would undoubtedly be changed in some minor respects. # Foundation, Isaac Asimov p. 33 231 | 232 | # correct non-fixes 233 | I love non-sequiturs. : I love non-sequiturs. 234 | 235 | # misspellings resulting in words 236 | your right dude. : you're right dude. 237 | 238 | # transcriptions resulting in non-words 239 | Johsia : Joshua 240 | 241 | # transpositions resulting in non-words 242 | Gergory : Gregory 243 | 23rd of Auguts : 23rd of August 244 | Johsua : Joshua 245 | 246 | # misspellings resulting in words 247 | a shallow accent angle. : a shallow ascent angle. 248 | someone otter than yourself : someone other than yourself 249 | now it makes perfect sensor : now it makes perfect sense 250 | I would appreciate and alternative to : I would appreciate an alternative to 251 | "Yes, yes. I now the theorem." : "Yes, yes. I know the theorem." # Second Foundation, Isaac Asmiov, p. 105 252 | Humans many simply be too stupid : Humans may simply be too stupid 253 | At first it was effecting our sex life : At first it was affecting our sex life 254 | #pointers to the UINT type will through away the significant bits : pointers to the UINT type will throw away the significant bits 255 | #I think they call that a sentence now days. : I think they call that a sentence nowadays. 256 | 257 | 258 | # phonetic errors 259 | oic : oh i see 260 | f u c k : fuck 261 | hell-o : hello 262 | o i c : oh i see 263 | orly : oh really 264 | faux king hill : fucking hell 265 | in the sample place : in the same place 266 | hungreh. wants soo shee : hungry. want sushi 267 | goan jump off a bridge : go and jump off a bridge 268 | 269 | # mixed 270 | #did he steel you ice cream? : did he steal your ice cream? 271 | you are backpaddling from a smartass slapdown :-) : you are backpedaling from a smartass slapdown :-) 272 | 273 | # intentional typos 274 | #concise unlike the verbosity of Java and Erlong. : concise unlike the verbosity of Java and Erlang. 275 | 276 | 277 | # grammatical errors 278 | That it. : That's it. 279 | #You have less followers then him : You have fewer followers than him 280 | 281 | # missing words 282 | #I doubt we'll this any time soon. : I doubt we'll do this any time soon. 
283 | #production on hold across the country to allow to watch the match. : put production on hold across the country to allow employees to watch the match. 284 | #Microsoft is obsessed Websockets : Microsoft is obsessed with Websockets 285 | 286 | ### OTHERS 287 | # splits 288 | # we *can* tease this out, but the cost of doing so is just too high right now. in the future perhps we can fall back to more expensive methods when appropriate 289 | ifit'snotpurethecompilercan'toptimizeitlikeyouwant : if it's not pure the compiler can't optimize it like you want 290 | 291 | I've been doing this for a very long time and I think I have have encountered each of the bugs listed in this list. : I've been doing this for a very long time and I think I have encountered each of the bugs listed in this list. 292 | 293 | Software's inherit ability to adapt is part of what drives this differentiating factor. : Software's inherent ability to adapt is part of what drives this differentiating factor. 294 | 295 | Now is the time for all good people to come to the aid of there country : Now is the time for all good people to come to the aid of their country 296 | 297 | # real world examples that should be easily fixable 298 | hat are some example of public datasets that have randomized instruments? : what are some example of public datasets that have randomized instruments? 299 | I invite women over so that I have the motivation to stop being such a fucking slob for 12 seconds in the vein attempt at getting laid. : I invite women over so that I have the motivation to stop being such a fucking slob for 12 seconds in the vain attempt at getting laid. 300 | 301 | # phonetic numbers 302 | Thanks a lot m8. : Thanks a lot mate. 303 | I h8 it! : I hate it! 304 | I 8 it! : I ate it! 305 | 306 | # transpositions resulting in logical impossiblities 307 | 32rd of August : 23rd of August 308 | 309 | # unclassified real-world 310 | #A group of 21 volunteers from Tokyo and Saitama brought Sunday 2,000 meals for about over 500 evacuees at the shelter. 311 | #Use Reddit to decide what to tool use. : Use Reddit to decide what tool to use. # token swap x,y -> y,x 312 | # Math is not sexy. Statistics are not sext. : Math is not sexy. Statistics are not sexy. 313 | # best font for coding : best font for coding 314 | It it did, the future would undoubtedly be changed in some minor respects. : If it did, the future would undoubtedly be changed in some minor respects. # Foundation, Isaac Asimov p. 33 315 | not weather you win or loose it's how you ply the gale : not whether you win or lose it's how you play the game 316 | primitives are not implement as a direct call : primitives are not implemented as a direct call # Efficient Parallel Programming in Poly/ML and Isabelle/ML 317 | feel apart of something : feel a part of something 318 | I can't bring myself to an android phone : I can't bring myself to get an android phone # comcor 319 | 320 | # not so sure about this one... 321 | Im pretty sure T-Rexs weren't that big. : I'm pretty sure T-Rexs weren't that big. 322 | 323 | # this is a real-world example of an uncommon n-gram transposition. 324 | # the only way we can detect these sorts of errors is to adapt our corpus 325 | # to handle context in a personalized way, by building a corpus out of local documents. 326 | nyc sing company : nyc sign company 327 | 328 | But it is a lot bulkier, and i teats batteries. : But it is a lot bulkier, and it eats batteries. 
# ycombinator on calculators 329 | 330 | # omission: trying to the -> trying to get the 331 | Norvig says that no one is listening to your calls on Google Voice — it is simply their servers trying to the translation right. : Norvig says that no one is listening to your calls on Google Voice — it is simply their servers trying to get the translation right. # slashdot 332 | 333 | # transcription: affectiveness -> effectiveness 334 | Part of their work is checking that server's affectiveness, too. : Part of their work is checking that server's effectiveness, too. # slashdot 335 | 336 | 337 | # transcription: Europe -> Europa 338 | Due to their size, atmospheric drag would slow them down without burning them up, allowing them to study the uppermost atmosphere of wherever they are deployed next: Venus, Titan, Europe, and Jupiter are all possibilities. : Due to their size, atmospheric drag would slow them down without burning them up, allowing them to study the uppermost atmosphere of wherever they are deployed next: Venus, Titan, Europa, and Jupiter are all possibilities. # slashdot 339 | 340 | # a moderate paragraph with absolutely nothing wrong with it 341 | # source: http://www.propublica.org/article/all-the-magnetar-trade-how-one-hedge-fund-helped-keep-the-housing-bubble 342 | In late 2005, the booming U.S. housing market seemed to be slowing. The Federal Reserve had begun raising interest rates. Subprime mortgage company shares were falling. Investors began to balk at buying complex mortgage securities. The housing bubble, which had propelled a historic growth in home prices, seemed poised to deflate. And if it had, the great financial crisis of 2008, which produced the Great Recession of 2008-09, might have come sooner and been less severe. : In late 2005, the booming U.S. housing market seemed to be slowing. The Federal Reserve had begun raising interest rates. Subprime mortgage company shares were falling. Investors began to balk at buying complex mortgage securities. The housing bubble, which had propelled a historic growth in home prices, seemed poised to deflate. And if it had, the great financial crisis of 2008, which produced the Great Recession of 2008-09, might have come sooner and been less severe. 343 | --------------------------------------------------------------------------------