├── .gitignore
├── LICENSE.md
├── Makefile
├── README.md
├── bookworm
    ├── #APIimplementation.py#
    ├── .gitignore
    ├── APIimplementation.py
    ├── MetaWorm.py
    ├── SQLAPI.py
    ├── __init__.py
    ├── general_API.py
    ├── knownHosts.py
    └── logParser.py
├── dbbindings.py
└── testAPI.py


/.gitignore:
--------------------------------------------------------------------------------
 1 | old/*
 2 | *~
 3 | APIkeys
 4 | #*
 5 | .#*
 6 | .DS_Store
 7 | *.cgi
 8 | migration.py
 9 | shipping.py
10 | genderizer*
11 | *.pyc
12 | 


--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
 1 | The MIT License (MIT)
 2 | 
 3 | Copyright (c) 2014 Benjamin Schmidt
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy of
 6 | this software and associated documentation files (the "Software"), to deal in
 7 | the Software without restriction, including without limitation the rights to
 8 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
 9 | the Software, and to permit persons to whom the Software is furnished to do so,
10 | subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
17 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
18 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
19 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
20 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
21 | 


--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
 1 | ubuntu-install:
 2 | 	apt-get install python-numpy python-mysqldb
 3 | 	mkdir -p /var/log/presidio
 4 | 	touch /var/log/presidio/log.txt
 5 | 	chown -R www-data:www-data /var/log/presidio
 6 | 	mv ./*.py /usr/lib/cgi-bin/
 7 | 	chmod -R 755 /usr/lib/cgi-bin
 8 | 
 9 | os-x-install:
10 | 	brew install python-numpy python-mysqldb
11 | 	mkdir -p /var/log/presidio
12 | 	touch /var/log/presidio/log.txt
13 | 	chown -R www /var/log/presidio
14 | 	chmod -R 755 /usr/lib/cgi-bin	
15 | 	mkdir -p /etc/mysql
16 | 	ln -s /etc/my.cnf /etc/mysql/my.cnf 
17 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | ## Bookworm API
 2 | 
 3 | **This entire repo is deprecated: the API is now bundled inside the [BookwormDB](http://github.com/bookworm-project/bookwormDB) repo**
 4 | 
 5 | 
 6 | This is an implementation of the API for Bookworm, written in Python. It primarily implements the API on a MySQL database now, but includes classes for more easily implementing it on top of other platforms (such as Solr).
 7 | 
 8 | It is used with the [Bookworm GUI](https://github.com/Bookworm-project/BookwormGUI) and can also be used as a standalone tool to query data from your database created by [the BookwormDB repo](https://github.com/Bookworm-project/BookwormDB).
 9 | For a more interactive explanation of how the GUI works, see the [D3 bookworm browser](http://benschmidt.org/beta/APISandbox)
10 | 
11 | ### General Description
12 | 
13 | A file, currently at `dbbindings.py`, calls the script `bookworm/general_API.py`; that implements a general purpose API, and then further modules may implement the API on specific backends. Currently, the only backend is the one for the MySQL databases create by [the database repo](http://github.com/bookworm-project/BookwormDB).
14 | 
15 | 
16 | ### Installation
17 | 
18 | Currently, you should just clone this repo into your cgi-bin directory, and make sure that `dbbindings.py` is executable.
19 | 
20 | #### OS X caveat.
21 | 
22 | If using homebrew, the shebang at the beginning of `dbbindings.py` is incorrect. (It will not load your installed python modules). Change it from `#!/usr/bin/env python` to `#!/usr/local/bin/python`, and it should work.
23 | 
24 | ### Usage
25 | 
26 | If the bookworm is located on your server, there is no need to do anything--it should be drag-and-drop. (Although on anything but debian, settings might require a small amount of tweaking.
27 | 
28 | If you want to have the webserver and database server on different machines, that needs to be specified in the configuration file for mysql that this reads: if you want to have multiple mysql servers, you may need to get fancy.
29 | 
30 | This tells the API where to look for the data for a particular bookworm. The benefit of this setup is that you can have your webserver on one server and the database on another server.
31 | 
32 | 


--------------------------------------------------------------------------------
/bookworm/#APIimplementation.py#:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/python
  2 | 
  3 | import sys
  4 | import json
  5 | import cgi
  6 | import re
  7 | import numpy #used for smoothing.
  8 | import copy
  9 | 
 10 | #These are here so we can support multiple databases with different naming schemes from a single API. A bit ugly to have here; could be part of configuration file somewhere else, I guess. there are 'fast' and 'full' tables for books and words;
 11 | #that's so memory tables can be used in certain cases for fast, hashed matching, but longer form data (like book titles)
 12 | #can be stored on disk. Different queries use different types of calls.
 13 | #Also, certain metadata fields are stored separately from the main catalog table; I list them manually here to avoid a database call to find out what they are,
 14 | #although the latter would be more elegant. The way to do that would be a database call
 15 | #of tables with two columns one of which is 'bookid', maybe, or something like that.
 16 | #(Or to add it as error handling when a query failed; only then check for missing files.
 17 | 
 18 | general_prefs = {"presidio":{"HOST":"melville.seas.harvard.edu","database":"presidio","fastcat":"fastcat","fullcat":"open_editions","fastword":"wordsheap","read_default_file":"/etc/mysql/my.cnf","fullword":"words","separateDataTables":["LCSH","gender"],"read_url_head":"http://www.archive.org/stream/"},"arxiv":{"HOST":"10.102.15.45","database":"arxiv","fastcat":"fastcat","fullcat":"catalog","fastword":"wordsheap","fullword":"words","read_default_file":"/etc/mysql/my.cnf","separateDataTables":["genre","fastgenre","archive","subclass"],"read_url_head":"http://www.arxiv.org/abs/"},"jstor":{"HOST":"10.102.15.45","database":"jstor","fastcat":"fastcat","fullcat":"catalog","fastword":"wordsheap","fullword":"words","read_default_file":"/etc/mysql/my.cnf","separateDataTables":["discipline"],"read_url_head":"http://www.arxiv.org/abs/"}, "politweets":{"HOST":"chaucer.fas.harvard.edu","database":"politweets","fastcat":"fastcat","fullcat":"catalog","fastword":"wordsheap","fullword":"words","read_default_file":"/etc/mysql/my.cnf","separateDataTables":[],"read_url_head":"http://www.arxiv.org/abs/"},"LOC":{"HOST":"10.102.15.45","database":"LOC","fastcat":"fastcat","fullcat":"catalog","fastword":"wordsheap","fullword":"words","read_default_file":"/etc/mysql/my.cnf","separateDataTables":[],"read_url_head":"http://www.arxiv.org/abs/"},"ChronAm":{"HOST":"10.102.15.45","database":"LOC","fastcat":"fastcat","fullcat":"catalog","fastword":"wordsheap","fullword":"words","read_default_file":"/etc/mysql/my.cnf","separateDataTables":[],"read_url_head":"http://www.arxiv.org/abs/"}}
 19 | #We define prefs to default to the Open Library set at first; later, it can do other things.
 20 | 
 21 | class dbConnect():
 22 |     #This is a read-only account
 23 |     def __init__(self,prefs = general_prefs['presidio']):
 24 |         import MySQLdb
 25 |         self.db = MySQLdb.connect(host=prefs['HOST'],read_default_file = prefs['read_default_file'],use_unicode='True',charset='utf8',db=prefs['database'])
 26 |         self.cursor = self.db.cursor()
 27 | 
 28 | 
 29 | # The basic object here is a userquery: it takes dictionary as input, as defined in the API, and returns a value 
 30 | # via the 'execute' function whose behavior 
 31 | # depends on the mode that is passed to it.
 32 | # Given the dictionary, it can return a number of objects.
 33 | # The "Search_limits" array in the passed dictionary determines how many elements it returns; this lets multiple queries be bundled together.
 34 | # Most functions describe a subquery that might be combined into one big query in various ways.
 35 | 
 36 | class userqueries():
 37 |     #This is a set of queries that are bound together; each element in search limits is iterated over, and we're done.
 38 |     def __init__(self,outside_dictionary = {"counttype":"Percentage_of_Books","search_limits":[{"word":["polka dot"],"LCSH":["Fiction"]}]},db = None):
 39 |         #coerce one-element dictionaries to an array.
 40 |         self.database = outside_dictionary.setdefault('database','presidio')
 41 |         prefs = general_prefs[self.database]
 42 |         self.prefs = prefs
 43 |         self.wordsheap = prefs['fastword']
 44 |         self.words = prefs['fullword']
 45 |         if 'search_limits' not in outside_dictionary.keys():
 46 |             outside_dictionary['search_limits'] = [{}]
 47 |         if isinstance(outside_dictionary['search_limits'],dict):
 48 |             #(allowing passing of just single dictionaries instead of arrays)
 49 |             outside_dictionary['search_limits'] = [outside_dictionary['search_limits']]
 50 |         self.returnval = []
 51 |         self.queryInstances = []
 52 |         for limits in outside_dictionary['search_limits']:
 53 |             mylimits = outside_dictionary
 54 |             mylimits['search_limits'] = limits
 55 |             localQuery = userquery(mylimits)
 56 |             self.queryInstances.append(localQuery)
 57 |             self.returnval.append(localQuery.execute())
 58 | 
 59 |     def execute(self):
 60 |         return self.returnval
 61 | 
 62 | class userquery():
 63 |     def __init__(self,outside_dictionary = {"counttype":"Percentage_of_Books","search_limits":{"word":["polka dot"],"LCSH":["Fiction"]}}):
 64 |         #Certain constructions require a DB connection already available, so we just start it here, or use the one passed to it.
 65 |         self.outside_dictionary = outside_dictionary
 66 |         self.prefs = general_prefs[outside_dictionary.setdefault('database','presidio')]
 67 |         self.db = dbConnect(self.prefs)
 68 |         self.cursor = self.db.cursor
 69 |         self.wordsheap = self.prefs['fastword']
 70 |         self.words = self.prefs['fullword']
 71 | 
 72 |         #I'm now allowing 'search_limits' to either be a dictionary or an array of dictionaries: 
 73 |         #this makes the syntax cleaner on most queries,
 74 |         #while still allowing some more complicated ones.
 75 |         if isinstance(outside_dictionary['search_limits'],list):
 76 |             outside_dictionary['search_limits'] = outside_dictionary['search_limits'][0]
 77 |         self.defaults(outside_dictionary) #Take some defaults
 78 |         self.derive_variables() #Derive some useful variables that the query will use.
 79 |         
 80 |     def defaults(self,outside_dictionary):
 81 |         #these are default values;these are the only values that can be set in the query
 82 |             #search_limits is an array of dictionaries;
 83 |             #each one contains a set of limits that are mutually independent
 84 |             #The other limitations are universal for all the search limits being set.
 85 | 
 86 |         #Set up a dictionary for the denominator of any fraction if it doesn't already exist:
 87 |         self.search_limits = outside_dictionary.setdefault('search_limits',[{"word":["polka dot"]}])
 88 | 
 89 |         self.words_collation = outside_dictionary.setdefault('words_collation',"Case_Insensitive")
 90 |         lookups = {"Case_Insensitive":'word',"case_insensitive":"word","Case_Sensitive":"casesens","Correct_Medial_s":'ffix',"All_Words_with_Same_Stem":"stem","Flagged":'wflag'}
 91 |         self.word_field = lookups[self.words_collation]
 92 | 
 93 |         self.groups = []
 94 |         try:
 95 |             groups = outside_dictionary['groups']
 96 |         except:
 97 |             groups = [outside_dictionary['time_measure']]
 98 | 
 99 |         if groups == []:
100 |             groups = ["bookid is not null as In_Library"]
101 |         if (len (groups) > 1):
102 |             pass
103 |             #self.groups = credentialCheckandClean(self.groups)
104 |             #Define some sort of limitations here.
105 |         for group in groups:
106 |             group = group
107 |             if group=="unigram" or group=="word":
108 |                 group = "words1." + self.word_field + " as unigram"
109 |             if group=="bigram":
110 |                 group = "CONCAT (words1." + self.word_field + " ,' ' , words2." + self.word_field + ") as bigram"
111 |             self.groups.append(group)
112 | 
113 |         self.selections = ",".join(self.groups)
114 |         self.groupings  = ",".join([re.sub(".* as","",group) for group in self.groups])
115 | 
116 | 
117 |         self.compare_dictionary = copy.deepcopy(self.outside_dictionary)
118 |         if 'compare_limits' in self.outside_dictionary.keys():
119 |             self.compare_dictionary['search_limits'] = outside_dictionary['compare_limits']
120 |             del outside_dictionary['compare_limits']
121 |         else: #if nothing specified, we compare the word to the corpus.
122 |             for key in ['word','word1','word2','word3','word4','word5','unigram','bigram']:
123 |                 try:
124 |                     del self.compare_dictionary['search_limits'][key]
125 |                 except:
126 |                     pass
127 |             for key in self.outside_dictionary['search_limits'].keys():
128 |                 if re.search('words?\d',key):
129 |                     try:
130 |                         del self.compare_dictionary['search_limits'][key]
131 |                     except:
132 |                         pass
133 | 
134 |         comparegroups = []
135 |         #This is a little tricky behavior here--hopefully it works in all cases. It drops out word groupings.
136 |         try:
137 |             compareGroups = self.compare_dictionary['groups']
138 |         except:
139 |             compareGroups = [self.compare_dictionary['time_measure']]
140 |         for group in compareGroups:
141 |             if not re.match("words",group) and not re.match("[u]?[bn]igram",group):
142 |                 comparegroups.append(group)
143 |         self.compare_dictionary['groups'] = comparegroups
144 |         self.time_limits = outside_dictionary.setdefault('time_limits',[0,10000000])
145 |         self.time_measure = outside_dictionary.setdefault('time_measure','year')
146 |         self.counttype = outside_dictionary.setdefault('counttype',"Occurrences_per_Million_Words")
147 | 
148 |         self.index  = outside_dictionary.setdefault('index',0)
149 |         #Ordinarily, the input should be an an array of groups that will both select and group by.
150 |         #The joins may be screwed up by certain names that exist in multiple tables, so there's an option to do something like 
151 |         #SELECT catalog.bookid as myid, because WHERE clauses on myid will work but GROUP BY clauses on catalog.bookid may not 
152 |         #after a sufficiently large number of subqueries.
153 |         #This smoothing code really ought to go somewhere else, since it doesn't quite fit into the whole API mentality and is 
154 |         #more about the webpage.
155 |         self.smoothingType = outside_dictionary.setdefault('smoothingType',"triangle")
156 |         self.smoothingSpan = outside_dictionary.setdefault('smoothingSpan',3)
157 |         self.method = outside_dictionary.setdefault('method',"Nothing")
158 |         self.tablename = outside_dictionary.setdefault('tablename','master'+"_bookcounts as bookcounts")
159 | 
160 |     def derive_variables(self):
161 |         #These are locally useful, and depend on the variables
162 |         self.limits = self.search_limits
163 |         #Treat empty constraints as nothing at all, not as full restrictions.
164 |         for key in self.limits.keys():
165 |             if self.limits[key] == []:
166 |                 del self.limits[key]
167 |         self.create_catalog_table()
168 |         self.make_catwhere()
169 |         self.make_wordwheres()
170 | 
171 |     def create_catalog_table(self):
172 |         self.catalog = self.prefs['fastcat'] #'catalog' #Can be replaced with a more complicated query.
173 | 
174 |         #Rather than just search for "LCSH", this should check query constraints against a list of tables, and join to them.
175 |         #So if you query with a limit on LCSH, it joins the table "LCSH" to catalog; and then that table has one column, ALSO
176 |         #called "LCSH", which is matched against. This allows a bookid to be a member of multiple catalogs.
177 | 
178 |         for limitation in self.prefs['separateDataTables']:
179 |             #That re.sub thing is in here because sometimes I do queries that involve renaming.
180 |             if limitation in [re.sub(" .*","",key) for key in self.limits.keys()] or limitation in [re.sub(" .*","",group) for group in self.groups]:
181 |                 self.catalog = self.catalog + """ JOIN """ + limitation + """ USING (bookid)"""
182 | 
183 |         #Here's a feature that's not yet fully implemented: it doesn't work quickly enough, probably because the joins involve a lot of jumping back and forth
184 |         if 'hasword' in self.limits.keys():
185 |             #This is the sort of code I should have written more of: 
186 |             #it just generates a new API call to fill a small part of the code here:
187 |             #(in this case, it merges the 'catalog' entry with a select query on 
188 |             #the word in the 'haswords' field. Enough of this could really
189 |             #shrink the codebase, I suspect. But for some reason, these joins end up being too slow to run.
190 |             #I think that has to do with the temporary table being created; we need to figure out how
191 |             #to allow direct access to wordsheap here without having the table aliases for the different versions of wordsheap
192 |             #being used overlapping.
193 |             if self.limits['hasword'] == []:
194 |                 del self.limits['hasword']
195 |                 return
196 |             import copy
197 |             #deepcopy lets us get a real copy of the dictionary 
198 |             #that can be changed without affecting the old one.
199 |             mydict = copy.deepcopy(self.outside_dictionary)
200 |             mydict['search_limits'] = copy.deepcopy(self.limits)
201 |             mydict['search_limits']['word'] = copy.deepcopy(mydict['search_limits']['hasword'])
202 |             del mydict['search_limits']['hasword']
203 |             tempquery = userquery(mydict)
204 |             bookids = ''
205 |             bookids = tempquery.counts_query()
206 | 
207 |             #If this is ever going to work, 'catalog' here should be some call to self.prefs['fastcat']
208 |             bookids = re.sub("(?s).*catalog[^\.]?[^\.\n]*\n","\n",bookids)
209 |             bookids = re.sub("(?s)WHERE.*","\n",bookids)
210 |             bookids = re.sub("(words|lookup)([0-9])","has\\1\\2",bookids)
211 |             bookids = re.sub("main","hasTable",bookids)
212 |             self.catalog = self.catalog + bookids
213 |             #del self.limits['hasword']
214 | 
215 |     def make_catwhere(self):
216 |         #Where terms that don't include the words table join. Kept separate so that we can have subqueries only working on one half of the stack.
217 |         catlimits = dict()
218 |         for key in self.limits.keys():
219 |             if key not in ('word','word1','word2','hasword') and not re.search("words\d",key):
220 |                 catlimits[key] = self.limits[key]
221 |         if len(catlimits.keys()) > 0:
222 |             self.catwhere = where_from_hash(catlimits)
223 |         else:
224 |             self.catwhere = "TRUE"
225 | 
226 |     def make_wordwheres(self):
227 |         self.wordswhere = " TRUE "
228 |         self.max_word_length = 0
229 |         limits = []
230 |         
231 |         if 'word' in self.limits.keys():
232 |             """
233 |             This doesn't currently allow mixing of one and two word searches together in a logical way.
234 |             It might be possible to just join on both the tables in MySQL--I'm not completely sure what would happen.
235 |             But the philosophy has been to keep users from doing those searches as far as possible in any case.
236 |             """
237 |             for phrase in self.limits['word']:
238 |                 locallimits = dict()
239 |                 array = phrase.split(" ")
240 |                 n=1
241 |                 for word in array:
242 |                     locallimits['words'+str(n) + "." + self.word_field] = word
243 |                     self.max_word_length = max(self.max_word_length,n)
244 |                     n = n+1
245 |                 limits.append(where_from_hash(locallimits))
246 |                 #XXX for backward compatability
247 |                 self.words_searched = phrase
248 |             #del self.limits['word']
249 |             self.wordswhere = '(' + ' OR '.join(limits) + ')'
250 | 
251 |         wordlimits = dict()
252 | 
253 |         limitlist = copy.deepcopy(self.limits.keys())
254 | 
255 |         for key in limitlist:
256 |             if re.search("words\d",key):
257 |                 wordlimits[key] = self.limits[key]
258 |                 self.max_word_length = max(self.max_word_length,2)
259 |                 del self.limits[key]
260 | 
261 |         if len(wordlimits.keys()) > 0:
262 |             self.wordswhere = where_from_hash(wordlimits)
263 | 
264 | 
265 | #    def return_wordstableOld(self, words = ['polka dot'], pos=1):
266 | #        #This returns an SQL sequence suitable for querying or, probably, joining, that gives a words table only as long as the words that are
267 | #        #listed in the query; it works with different word fields
268 | #        #The pos value specifies a number to go after the table names, so that we can have more than one table in the join. But those numbers
269 | #        #have to be assigned elsewhere, so overlap is a danger if programmed poorly.
270 | #        self.lookupname = "lookup" + str(pos)
271 | #        self.wordsname  = "words" + str(pos)
272 | #        if len(words) > 0:
273 | #            self.wordwhere = where_from_hash({self.lookupname + ".casesens":words})
274 | #            self.wordstable = """
275 | #            %(wordsheap)s as %(wordsname)s JOIN 
276 | #            %(wordsheap)s AS %(lookupname)s 
277 | #            ON ( %(wordsname)s.%(word_field)s=%(lookupname)s.%(word_field)s 
278 | #            AND  %(wordwhere)s   )  """ % self.__dict__ 
279 | #        else:
280 | #            #We want to have some words returned even if _none_ are the query so that they can be selected. Having all the joins doesn't allow that,
281 | #            #because in certain cases (merging by stems, eg) it would have multiple rows returned for a single word.
282 | #            self.wordstable = """
283 | #            %(wordsheap)s as %(wordsname)s """ % self.__dict__
284 | #        return self.wordstable
285 | 
286 |     def build_wordstables(self):
287 |         #Deduce the words tables we're joining against. The iterating on this can be made more general to get 3 or four grams in pretty easily.
288 |         #This relies on a determination already having been made about whether this is a unigram or bigram search; that's reflected in the keys passed.
289 |         if (self.max_word_length == 2 or re.search("words2",self.selections)):
290 |             self.maintable = 'master_bigrams'
291 |             self.main = '''
292 |                  JOIN
293 |                  master_bigrams as main
294 |                  ON ('''+ self.prefs['fastcat'] +'''.bookid=main.bookid)
295 |                  '''
296 |             self.wordstables =  """
297 |             JOIN %(wordsheap)s as words1 ON (main.word1 = words1.wordid) 
298 |             JOIN %(wordsheap)s as words2 ON (main.word2 = words2.wordid) """ % self.__dict__
299 | 
300 |         #I use a regex here to do a blanket search for any sort of word limitations. That has some messy sideffects (make sure the 'hasword'
301 |         #key has already been eliminated, for example!) but generally works.
302 |         elif self.max_word_length == 1 or re.search("word",self.selections):
303 |             self.maintable = 'master_bookcounts'
304 |             self.main = '''
305 |                 JOIN
306 |                  master_bookcounts as main
307 |                  ON (''' + self.prefs['fastcat'] + '''.bookid=main.bookid)'''
308 |             self.tablename = 'master_bookcounts'
309 |             self.wordstables = """
310 |               JOIN ( %(wordsheap)s as words1)  ON (main.wordid = words1.wordid)
311 |              """ % self.__dict__
312 |         #Have _no_ words table if no words searched for or grouped by; instead just use nwords. This 
313 |         #isn't strictly necessary, but means the API can be used for the slug-filling queries, and some others.
314 |         else:
315 |             self.main = " "
316 |             self.operation = self.catoperation[self.counttype] #Why did I do this?    
317 |             self.wordstables = " "
318 |             self.wordswhere  = " TRUE " #Just a dummy thing. Shouldn't take any time, right?
319 | 
320 |     def counts_query(self,countname='count'):
321 |         self.countname=countname
322 |         self.bookoperation = {"Occurrences_per_Million_Words":"sum(main.count)","Raw_Counts":"sum(main.count)","Percentage_of_Books":"count(DISTINCT " + self.prefs['fastcat'] + ".bookid)","Number_of_Books":"count(DISTINCT "+ self.prefs['fastcat'] + ".bookid)"}
323 |         self.catoperation = {"Occurrences_per_Million_Words":"sum(nwords)","Raw_Counts":"sum(nwords)","Percentage_of_Books":"count(nwords)","Number_of_Books":"count(nwords)"}        
324 |         self.operation = self.bookoperation[self.counttype]
325 |         self.build_wordstables()
326 |         countsQuery = """
327 |             SELECT
328 |                 %(selections)s,
329 |                 %(operation)s as %(countname)s
330 |             FROM 
331 |                 %(catalog)s
332 |                 %(main)s
333 |                 %(wordstables)s 
334 |             WHERE
335 |                  %(catwhere)s AND %(wordswhere)s
336 |             GROUP BY 
337 |                 %(groupings)s
338 |         """ % self.__dict__
339 |         return countsQuery
340 |     
341 |     def ratio_query(self):
342 |         finalcountcommands = {"Occurrences_per_Million_Words":"IFNULL(count,0)*1000000/total","Raw_Counts":"IFNULL(count,0)","Percentage_of_Books":"IFNULL(count,0)*100/total","Number_of_Books":"IFNULL(count,0)"}
343 |         self.countcommand = finalcountcommands[self.counttype]
344 |         #if True: #In the case that we're not using a superset of words; this can be changed later
345 |         #    supersetGroups = [group for group in self.groups if not re.match('word',group)]
346 |         #    self.finalgroupings = self.groupings
347 |         #    for key in self.limits.keys():
348 |         #        if re.match('word',key):
349 |         #            del self.limits[key]
350 |         
351 |         self.denominator =  userquery(outside_dictionary = self.compare_dictionary)
352 |         self.supersetquery = self.denominator.counts_query(countname='total')
353 | 
354 |         if re.search("In_Library",self.denominator.selections):
355 |             self.selections = self.selections + ", fastcat.bookid is not null as In_Library"
356 | 
357 |         #See above: In_Library is a dummy variable so that there's always something to join on.            
358 |         self.mainquery    = self.counts_query()
359 |         
360 | 
361 |         """
362 |         We launch a whole new userquery instance here to build the denominator, based on the 'compare_dictionary' option (which in most 
363 |         cases is the search_limits without the keys, see above.
364 |         We then get the counts_query results out of that result.
365 |         """
366 |         
367 | 
368 |         self.totalMergeTerms = "USING (" + self.denominator.groupings + " ) "
369 | 
370 | 
371 |         self.totalselections  = ",".join([re.sub(".* as","",group) for group in self.groups])
372 | 
373 |         query = """
374 |         SELECT
375 |             %(totalselections)s,
376 |             %(countcommand)s as value
377 |         FROM 
378 |             ( %(mainquery)s 
379 |             ) as tmp 
380 |             RIGHT JOIN 
381 |              ( %(supersetquery)s ) as totaller
382 |              %(totalMergeTerms)s
383 |         GROUP BY %(groupings)s;""" % self.__dict__
384 |         return query        
385 | 
386 |     def return_slug_data(self,force=False):
387 |         #Rather than understand this error, I'm just returning 0 if it fails.
388 |         #Probably that's the right thing to do, though it may cause trouble later.
389 |         #It's just a punishment for not later using a decent API call with "Raw_Counts" to extract these counts out, and relying on this ugly method.
390 |         try:
391 |             temp_words = self.return_n_words(force = True)
392 |             temp_counts = self.return_n_books(force = True)
393 |         except:
394 |             temp_words = 0
395 |             temp_counts = 0
396 |         return [temp_counts,temp_words]    
397 | 
398 |     def return_n_books(self,force=False):
399 |         if (not hasattr(self,'nbooks')) or force:
400 |             query = "SELECT count(*) from " + self.catalog + " WHERE " + self.catwhere
401 |             silent = self.cursor.execute(query)
402 |             self.counts = int(self.cursor.fetchall()[0][0])
403 |         return self.counts
404 | 
405 |     def return_n_words(self,force=False):
406 |         if (not hasattr(self,'nwords')) or force:
407 |             query = "SELECT sum(nwords) from " + self.catalog + " WHERE " + self.catwhere
408 |             silent = self.cursor.execute(query)
409 |             self.nwords = int(self.cursor.fetchall()[0][0])
410 |         return self.nwords   
411 | 
412 |     def ranked_query(self,percentile_to_return = 99,addwhere = ""):
413 |         #NOT CURRENTLY IN USE ANYWHERE--DELETE???
414 |         ##This returns a list of bookids in order by how well they match the sort terms.
415 |         ## Using an IDF term will give better search results for case-sensitive searches, but is currently disabled
416 |         ##
417 |         self.LIMIT = int((100-percentile_to_return) * self.return_n_books()/100)
418 |         countQuery = """
419 |          SELECT
420 |          bookid,
421 |          sum(main.count*1000/nwords%(idfterm)s) as score
422 |          FROM %(catalog)s LEFT JOIN %(tablename)s
423 |          USING (bookid)
424 |          WHERE %(catwhere)s AND %(wordswhere)s
425 |          GROUP BY bookid
426 |  	 ORDER BY score DESC
427 |          LIMIT %(LIMIT)s
428 |          """ % self.__dict__
429 |         return countQuery
430 |     
431 |     def bibliography_query(self,limit = "100"):
432 |         #I'd like to redo this at some point so it could work as an API call.
433 |         self.limit = limit
434 |         self.ordertype = "sum(main.count*10000/nwords)"
435 |         try:
436 |             if self.outside_dictionary['ordertype'] == "random":
437 |                 if self.counttype=="Raw_Counts" or self.counttype=="Number_of_Books":
438 |                     self.ordertype = "RAND()"
439 |                 else:
440 |                     self.ordertype = "LOG(1-RAND())/sum(main.count)"
441 |         except KeyError:
442 |             pass
443 | 
444 |         #If IDF searching is enabled, we could add a term like '*IDF' here to overweight better selecting words
445 |         #in the event of a multiple search.
446 |         self.idfterm = ""
447 |         prep = self.counts_query()
448 | 
449 |         bibQuery = """
450 |         SELECT searchstring 
451 |         FROM """ % self.__dict__ + self.prefs['fullcat'] + """ RIGHT JOIN (
452 |         SELECT                                                                                                       
453 |         """+ self.prefs['fastcat'] + """.bookid, %(ordertype)s as ordering
454 |             FROM                                                                                                     
455 |                 %(catalog)s                                                                                          
456 |                 %(main)s                                                                                             
457 |                 %(wordstables)s                                                                                      
458 |             WHERE                                                                                                    
459 |                  %(catwhere)s AND %(wordswhere)s                                                                                        
460 |         GROUP BY bookid ORDER BY %(ordertype)s DESC LIMIT %(limit)s                              
461 |         ) as tmp USING(bookid) ORDER BY ordering DESC;
462 |         """ % self.__dict__
463 |         return bibQuery
464 | 
465 |     def disk_query(self,limit="100"):
466 |         pass
467 | 
468 |     def return_books(self):
469 |         #This preps up the display elements for a search.
470 |         #All this needs to be rewritten.
471 |         silent = self.cursor.execute(self.bibliography_query())
472 |         returnarray = []
473 |         for line in self.cursor.fetchall():
474 |             returnarray.append(line[0])
475 |         if not returnarray:
476 |             returnarray.append("No results for this particular point: try again without smoothing")
477 |         newerarray = self.custom_SearchString_additions(returnarray)
478 |         return json.dumps(newerarray)
479 | 
480 |     def getActualSearchedWords(self):
481 |         if len(self.wordswhere) > 7:
482 |             words = self.outside_dictionary['search_limits']['word']
483 |             #Break bigrams into single words.
484 |             words = ' '.join(words).split(' ')
485 |             self.cursor.execute("""SELECT word FROM wordsheap WHERE """ + where_from_hash({self.word_field:words}))
486 |             self.actualWords =[item[0] for item in self.cursor.fetchall()]
487 |         else:
488 |             self.actualWords = ["tasty","mistake","happened","here"]
489 | 
490 |     def custom_SearchString_additions(self,returnarray):
491 |         db = self.outside_dictionary['database']
492 |         if db in ('jstor','presidio','ChronAm','LOC'):
493 |             self.getActualSearchedWords()
494 |             if db=='jstor':
495 |                 joiner = "&searchText="
496 |                 preface = "?Search=yes&searchText="
497 |                 urlRegEx = "http://www.jstor.org/stable/\d+"
498 |             if db=='presidio':
499 |                 joiner = "+"
500 |                 preface =  "#page/1/mode/2up/search/"
501 |                 urlRegEx = 'http://archive.org/stream/[^"# ><]*'
502 |             if db in ('ChronAm','LOC'):
503 |                 preface = "/;words="
504 |                 joiner = "+"
505 |                 urlRegEx = 'http://chroniclingamerica.loc.gov[^\"><]*/seq-\d'
506 |             newarray = []
507 |             for string in returnarray:
508 |                 base = re.findall(urlRegEx,string)[0]
509 |                 newcore = ' <a href = "' +  base  + preface + joiner.join(self.actualWords) + '"> search inside </a>'
510 |                 string = re.sub("^<td>","",string)
511 |                 string = re.sub("</td>$","",string)
512 |                 string = string+newcore
513 |                 newarray.append(string)
514 |         #Arxiv is messier, requiring a whole different URL interface: http://search.arxiv.org:8081/paper.jsp?r=1204.3352&qs=network
515 |         return newarray
516 | 
517 |     def return_query_values(self,query = "ratio_query"):
518 |         #The API returns a dictionary with years pointing to values.
519 |         values = []
520 |         querytext = getattr(self,query)()
521 |         silent = self.cursor.execute(querytext)
522 |             #Gets the results
523 |         mydict = dict(self.cursor.fetchall())
524 |         try:
525 |             for key in mydict.keys():
526 |                 #Only return results inside the time limits
527 |                 if key >= self.time_limits[0] and key <= self.time_limits[1]:
528 |                     mydict[key] = str(mydict[key])
529 |                 else:
530 |                     del mydict[key]
531 |             mydict = smooth_function(mydict,smooth_method = self.smoothingType,span = self.smoothingSpan)
532 | 
533 |         except:
534 |             mydict = {0:"0"}
535 | 
536 |         #This is a good place to change some values.
537 |         try:
538 |             return {'index':self.index, 'Name':self.words_searched,"values":mydict,'words_searched':""}
539 |         except:
540 |             return{'values':mydict}
541 | 
542 |     def arrayNest(self,array,returnt):
543 |         #A recursive function to transform a list into a nested array
544 |         if len(array)==2:
545 |             try:
546 |                 returnt[array[0]] = float(array[1])
547 |             except:
548 |                 returnt[array[0]] = array[1]
549 |         else:
550 |             try:
551 |                 returnt[array[0]] = self.arrayNest(array[1:len(array)],returnt[array[0]])
552 |             except KeyError:
553 |                 returnt[array[0]] = self.arrayNest(array[1:len(array)],dict())
554 |         return returnt
555 | 
556 |     def return_json(self,query='ratio_query'):
557 |         if self.counttype=="Raw_Counts" or self.counttype=="Number_of_Books":
558 |             query="counts_query"
559 |         querytext = getattr(self,query)()
560 |         silent = self.cursor.execute(querytext)
561 |         names = [to_unicode(item[0]) for item in self.cursor.description]
562 |         returnt = dict()
563 |         lines = self.cursor.fetchall()
564 |         for line in lines:
565 |             returnt = self.arrayNest(line,returnt)
566 |         return returnt
567 | 
568 |     def return_tsv(self,query = "ratio_query"):
569 |         if self.counttype=="Raw_Counts" or self.counttype=="Number_of_Books":
570 |             query="counts_query"
571 |         querytext = getattr(self,query)()
572 |         silent = self.cursor.execute(querytext)
573 |         results = ["\t".join([to_unicode(item[0]) for item in self.cursor.description])]
574 |         lines = self.cursor.fetchall()
575 |         for line in lines:
576 |             items = []
577 |             for item in line:
578 |                 item = to_unicode(item)
579 |                 item = re.sub("\t","<tab>",item)
580 |                 items.append(item)
581 |             results.append("\t".join(items))
582 |         return "\n".join(results)
583 | 
584 |     def export_data(self,query1="ratio_query"):
585 |         self.smoothing=0
586 |         return self.return_query_values(query=query1)
587 | 
588 |     def execute(self):
589 |         #This performs the query using the method specified in the passed parameters.
590 |         if self.method=="Nothing":
591 |             pass
592 |         else:
593 |             return getattr(self,self.method)()
594 | 
595 | 
596 |     #############
597 |     ##GENERAL#### #These are general purpose functional types of things not implemented in the class.
598 |     #############
599 |     
600 | def to_unicode(obj, encoding='utf-8'):
601 |     if isinstance(obj, basestring):
602 |         if not isinstance(obj, unicode):
603 |             obj = unicode(obj, encoding)
604 |     elif isinstance(obj,int):
605 |         obj=unicode(str(obj),encoding)
606 |     else:
607 |         obj = unicode(str(obj),encoding)
608 |     return obj
609 | 
610 | def where_from_hash(myhash,joiner=" AND ",comp = " = "):
611 |     whereterm = []
612 |     #The general idea here is that we try to break everything in search_limits down to a list, and then create a whereterm on that joined by whatever the 'joiner' is ("AND" or "OR"), with the comparison as whatever comp is ("=",">=",etc.).
613 |     #For more complicated bits, it gets all recursive until the bits are in terms of list.
614 |     for key in myhash.keys():
615 |         values = myhash[key]
616 |         if isinstance(values,basestring) or isinstance(values,int) or isinstance(values,float):
617 |             #This is just error handling. You can pass a single value instead of a list if you like, and it will just convert it 
618 |             #to a list for you.
619 |             values = [values]
620 |         #Or queries are special, since the default is "AND". This toggles that around for a subportion.
621 |         if key=='$or' or key=="$OR":
622 |             for comparison in values:
623 |                 whereterm.append(where_from_hash(comparison,joiner=" OR ",comp=comp))
624 |                 #The or doesn't get populated any farther down.
625 |         elif isinstance(values,dict):
626 |             #Certain function operators can use MySQL terms. These are the only cases that a dict can be passed as a limitations
627 |             operations = {"$gt":">","$lt":"<","$grep":" REGEXP ","$gte":">=","$lte":"<=","$eq":"="}
628 |             for operation in values.keys():
629 |                 whereterm.append(where_from_hash({key:values[operation]},comp=operations[operation],joiner=joiner))
630 |         elif isinstance(values,list):
631 |             #and this is where the magic actually happens
632 |             if isinstance(values[0],dict):
633 |                 for entry in values:
634 |                     whereterm.append(where_from_hash(entry))
635 |             else:
636 |                 if isinstance(values[0],basestring):
637 |                     quotesep="'"
638 |                 else:
639 |                     quotesep = ""
640 |                 #Note the "OR" here. There's no way to pass in a query like "year=1876 AND year=1898" as currently set up.
641 |                 #Obviously that's no great loss, but there might be something I'm missing that would be.
642 |                 whereterm.append(" (" + " OR ".join([" (" + key+comp+quotesep+str(value)+quotesep+") " for value in values])+ ") ")
643 |     return "(" + joiner.join(whereterm) + ")"
644 |     #This works pretty well, except that it requires very specific sorts of terms going in, I think.
645 | 
646 | 
647 | 
648 | #I'd rather have all this smoothing stuff done at the client side, but currently it happens here.
649 | def smooth_function(zinput,smooth_method = 'lowess',span = .05):
650 |     if smooth_method not in ['lowess','triangle','rectangle']:
651 |         return zinput
652 |     xarray = []
653 |     yarray = []
654 |     years = zinput.keys()
655 |     years.sort()
656 |     for key in years:
657 |         if zinput[key]!='None':
658 |             xarray.append(float(key))
659 |             yarray.append(float(zinput[key]))
660 |     from numpy import array
661 |     x = array(xarray)
662 |     y = array(yarray)
663 |     if smooth_method == 'lowess':
664 |         #print "starting lowess smoothing<br>"
665 |         from Bio.Statistics.lowess import lowess
666 |         smoothed = lowess(x,y,float(span)/100,3)
667 |         x = [int(p) for p in x]
668 |         returnval = dict(zip(x,smoothed))
669 |         return returnval
670 |     if smooth_method == 'rectangle':
671 |         from math import log
672 |         #print "starting triangle smoothing<br>"
673 |         span = int(span) #Takes the floor--so no smoothing on a span < 1.
674 |         returnval = zinput
675 |         windowsize = span*2 + 1
676 |         from numpy import average
677 |         for i in range(len(xarray)):
678 |             surrounding = array(range(windowsize),dtype=float)
679 |             weights = array(range(windowsize),dtype=float)
680 |             for j in range(windowsize):
681 |                 key_dist = j - span #if span is 2, the zeroeth element is -2, the second element is 0 off, etc.
682 |                 workingon = i + key_dist
683 |                 if workingon >= 0 and workingon < len(xarray):
684 |                     surrounding[j] = float(yarray[workingon])
685 |                     weights[j] = 1
686 |                 else:
687 |                     surrounding[j] = 0
688 |                     weights[j] = 0
689 |             returnval[xarray[i]] = round(average(surrounding,weights=weights),3)
690 |         return returnval
691 |     if smooth_method == 'triangle':
692 |         from math import log
693 |         #print "starting triangle smoothing<br>"
694 |         span = int(span) #Takes the floor--so no smoothing on a span < 1.
695 |         returnval = zinput
696 |         windowsize = span*2 + 1
697 |         from numpy import average
698 |         for i in range(len(xarray)):
699 |             surrounding = array(range(windowsize),dtype=float)
700 |             weights = array(range(windowsize),dtype=float)
701 |             for j in range(windowsize):
702 |                 key_dist = j - span #if span is 2, the zeroeth element is -2, the second element is 0 off, etc.
703 |                 workingon = i + key_dist
704 |                 if workingon >= 0 and workingon < len(xarray):
705 |                     surrounding[j] = float(yarray[workingon])
706 |                     #This isn't actually triangular smoothing: I dampen it by the logs, to keep the peaks from being too too big.
707 |                     #The minimum is '2', since log(1) == 0, which is a nonesense weight.
708 |                     weights[j] = log(span + 2 - abs(key_dist))
709 |                 else:
710 |                     surrounding[j] = 0
711 |                     weights[j] = 0
712 |             
713 |             returnval[xarray[i]] = round(average(surrounding,weights=weights),3)
714 |         return returnval
715 |     
716 | #The idea is: this works by default by slurping up from the command line, but you could also load the functions in and run results on your own queries.
717 | try:
718 |     command = str(sys.argv[1])
719 |     command = json.loads(command)
720 | #Got to go before we let anything else happen.
721 |     print command
722 |     p = userqueries(command)
723 |     result = p.execute()
724 |     print json.dumps(result)
725 | except:
726 |     pass
727 | 
728 | 


--------------------------------------------------------------------------------
/bookworm/.gitignore:
--------------------------------------------------------------------------------
1 | old/*
2 | *~
3 | APIkeys
4 | #*
5 | .#*


--------------------------------------------------------------------------------
/bookworm/APIimplementation.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/python
  2 | 
  3 | import sys
  4 | import json
  5 | import cgi
  6 | import re
  7 | import numpy #used for smoothing.
  8 | import copy
  9 | import decimal
 10 | """
 11 | #These are here so we can support multiple databases with different naming schemes from a single API.
 12 | #A bit ugly to have here; could be part of configuration file somewhere else, I guess. there are 'fast' and 'full' tables for books and words;
 13 | #that's so memory tables can be used in certain cases for fast, hashed matching, but longer form data (like book titles)
 14 | #can be stored on disk. Different queries use different types of calls.
 15 | #Also, certain metadata fields are stored separately from the main catalog table;
 16 | #I list them manually here to avoid a database call to find out what they are,
 17 | #although the latter would be more elegant. The way to do that would be a database call
 18 | #of tables with two columns one of which is 'bookid', maybe, or something like that.
 19 | #(Or to add it as error handling when a query failed; only then check for missing files.
 20 | """
 21 | 
 22 | general_prefs = {"presidio":{"HOST":"melville.seas.harvard.edu","database":"presidio","fastcat":"fastcat","fullcat":"open_editions","fastword":"wordsheap","read_default_file":"/etc/mysql/my.cnf","fullword":"words","separateDataTables":["LCSH","gender"],"read_url_head":"http://www.archive.org/stream/"},"arxiv":{"HOST":"10.102.15.45","database":"arxiv","fastcat":"fastcat","fullcat":"catalog","fastword":"wordsheap","fullword":"words","read_default_file":"/etc/mysql/my.cnf","separateDataTables":["genre","fastgenre","archive","subclass"],"read_url_head":"http://www.arxiv.org/abs/"},"jstor":{"HOST":"10.102.15.45","database":"jstor","fastcat":"fastcat","fullcat":"catalog","fastword":"wordsheap","fullword":"words","read_default_file":"/etc/mysql/my.cnf","separateDataTables":["discipline"],"read_url_head":"http://www.arxiv.org/abs/"}, "politweets":{"HOST":"chaucer.fas.harvard.edu","database":"politweets","fastcat":"fastcat","fullcat":"catalog","fastword":"wordsheap","fullword":"words","read_default_file":"/etc/mysql/my.cnf","separateDataTables":[],"read_url_head":"http://www.arxiv.org/abs/"},"LOC":{"HOST":"10.102.15.45","database":"LOC","fastcat":"fastcat","fullcat":"catalog","fastword":"wordsheap","fullword":"words","read_default_file":"/etc/mysql/my.cnf","separateDataTables":[],"read_url_head":"http://www.arxiv.org/abs/"},"ChronAm":{"HOST":"10.102.15.45","database":"ChronAm","fastcat":"fastcat","fullcat":"catalog","fastword":"wordsheap","fullword":"words","read_default_file":"/etc/mysql/my.cnf","separateDataTables":["subjects"],"read_url_head":"http://www.arxiv.org/abs/"},"ngrams":{"fastcat": "fastcat", "HOST": "10.102.15.45", "separateDataTables": [], "fastword": "wordsheap", "database": "ngrams", "read_url_head": "arxiv.culturomics.org", "fullcat": "catalog", "fullword": "words", "read_default_file": "/etc/mysql/my.cnf"},"OL":{"HOST":"10.102.15.45","database":"OL","fastcat":"fastcat","fullcat":"catalog","fastword":"wordsheap","fullword":"words","read_default_file":"/etc/mysql/my.cnf","separateDataTables":["subjects"],"read_url_head":"http://www.arxiv.org/abs/"}}
 23 | 
 24 | general_prefs['OL'] = {"fastcat": "fastcat", "HOST": "10.102.15.45", "separateDataTables": ["authors", "publishers", "authors", "subjects"], "fastword": "wordsheap", "database": "OL", "read_url_head": "arxiv.culturomics.org", "fullcat":"catalog", "fullword": "words", "read_default_file": "/etc/mysql/my.cnf"}
 25 | 
 26 | #We define prefs to default to the Open Library set at first; later, it can do other things.
 27 | 
 28 | class dbConnect():
 29 |     #This is a read-only account
 30 |     def __init__(self,prefs = general_prefs['presidio']):
 31 |         import MySQLdb
 32 |         self.db = MySQLdb.connect(host=prefs['HOST'],read_default_file = prefs['read_default_file'],use_unicode='True',charset='utf8',db=prefs['database'])
 33 |         self.cursor = self.db.cursor()
 34 | 
 35 | 
 36 | # The basic object here is a userquery: it takes dictionary as input, as defined in the API, and returns a value 
 37 | # via the 'execute' function whose behavior 
 38 | # depends on the mode that is passed to it.
 39 | # Given the dictionary, it can return a number of objects.
 40 | # The "Search_limits" array in the passed dictionary determines how many elements it returns; this lets multiple queries be bundled together.
 41 | # Most functions describe a subquery that might be combined into one big query in various ways.
 42 | 
 43 | class userqueries():
 44 |     #This is a set of queries that are bound together; each element in search limits is iterated over, and we're done.
 45 |     def __init__(self,outside_dictionary = {"counttype":["Percentage_of_Books"],"search_limits":[{"word":["polka dot"],"LCSH":["Fiction"]}]},db = None):
 46 |         self.database = outside_dictionary.setdefault('database','presidio')
 47 |         prefs = general_prefs[self.database]
 48 |         self.prefs = prefs
 49 |         self.wordsheap = prefs['fastword']
 50 |         self.words = prefs['fullword']
 51 |         if 'search_limits' not in outside_dictionary.keys():
 52 |             outside_dictionary['search_limits'] = [{}]
 53 |         #coerce one-element dictionaries to an array.
 54 |         if isinstance(outside_dictionary['search_limits'],dict):
 55 |             #(allowing passing of just single dictionaries instead of arrays)
 56 |             outside_dictionary['search_limits'] = [outside_dictionary['search_limits']]
 57 |         self.returnval = []
 58 |         self.queryInstances = []
 59 |         for limits in outside_dictionary['search_limits']:
 60 |             mylimits = outside_dictionary
 61 |             mylimits['search_limits'] = limits
 62 |             localQuery = userquery(mylimits)
 63 |             self.queryInstances.append(localQuery)
 64 |             self.returnval.append(localQuery.execute())
 65 | 
 66 |     def execute(self):
 67 |         return self.returnval
 68 | 
 69 | class userquery():
 70 |     def __init__(self,outside_dictionary = {"counttype":["Percentage_of_Books"],"search_limits":{"word":["polka dot"],"LCSH":["Fiction"]}}):
 71 |         #Certain constructions require a DB connection already available, so we just start it here, or use the one passed to it.
 72 |         self.outside_dictionary = outside_dictionary
 73 |         self.prefs = general_prefs[outside_dictionary.setdefault('database','presidio')]
 74 |         self.db = dbConnect(self.prefs)
 75 |         self.cursor = self.db.cursor
 76 |         self.wordsheap = self.prefs['fastword']
 77 |         self.words = self.prefs['fullword']
 78 | 
 79 |         #I'm now allowing 'search_limits' to either be a dictionary or an array of dictionaries: 
 80 |         #this makes the syntax cleaner on most queries,
 81 |         #while still allowing some long ones from the Bookworm website.
 82 |         if isinstance(outside_dictionary['search_limits'],list):
 83 |             outside_dictionary['search_limits'] = outside_dictionary['search_limits'][0]
 84 |         self.defaults(outside_dictionary) #Take some defaults
 85 |         self.derive_variables() #Derive some useful variables that the query will use.
 86 |         
 87 |     def defaults(self,outside_dictionary):
 88 |         #these are default values;these are the only values that can be set in the query
 89 |             #search_limits is an array of dictionaries;
 90 |             #each one contains a set of limits that are mutually independent
 91 |             #The other limitations are universal for all the search limits being set.
 92 | 
 93 |         #Set up a dictionary for the denominator of any fraction if it doesn't already exist:
 94 |         self.search_limits = outside_dictionary.setdefault('search_limits',[{"word":["polka dot"]}])
 95 | 
 96 |         self.words_collation = outside_dictionary.setdefault('words_collation',"Case_Insensitive")
 97 | 
 98 |         lookups = {"Case_Insensitive":'word',"case_insensitive":"word","Case_Sensitive":"casesens","Correct_Medial_s":'ffix',"All_Words_with_Same_Stem":"stem","Flagged":'wflag'}
 99 | 
100 |         self.word_field = lookups[self.words_collation]
101 | 
102 |         self.groups = []
103 |         try:
104 |             groups = outside_dictionary['groups']
105 |         except:
106 |             groups = [outside_dictionary['time_measure']]
107 | 
108 |         if groups == []:
109 |             #Set an arbitrary column name if nothing else is set.
110 |             groups = ["bookid is not null as In_Library"]
111 | 
112 |         if (len (groups) > 1):
113 |             pass
114 |             #self.groups = credentialCheckandClean(self.groups)
115 |             #Define some sort of limitations here, if not done in dbbindings.py
116 | 
117 |         for group in groups:
118 |             group = group
119 |             if group=="unigram" or group=="word":
120 |                 group = "words1." + self.word_field + " as unigram"
121 |             if group=="bigram":
122 |                 group = "CONCAT (words1." + self.word_field + " ,' ' , words2." + self.word_field + ") as bigram"
123 |             self.groups.append(group)
124 | 
125 |         self.selections = ",".join(self.groups)
126 |         self.groupings  = ",".join([re.sub(".* as","",group) for group in self.groups])
127 | 
128 | 
129 |         self.compare_dictionary = copy.deepcopy(self.outside_dictionary)
130 |         if 'compare_limits' in self.outside_dictionary.keys():
131 |             self.compare_dictionary['search_limits'] = outside_dictionary['compare_limits']
132 |             del outside_dictionary['compare_limits']
133 |         else: #if nothing specified, we compare the word to the corpus.
134 |             for key in ['word','word1','word2','word3','word4','word5','unigram','bigram']:
135 |                 try:
136 |                     del self.compare_dictionary['search_limits'][key]
137 |                 except:
138 |                     pass
139 |             for key in self.outside_dictionary['search_limits'].keys():
140 |                 if re.search('words?\d',key):
141 |                     try:
142 |                         del self.compare_dictionary['search_limits'][key]
143 |                     except:
144 |                         pass
145 | 
146 |         comparegroups = []
147 |         #This is a little tricky behavior here--hopefully it works in all cases. It drops out word groupings.
148 |         try:
149 |             compareGroups = self.compare_dictionary['groups']
150 |         except:
151 |             compareGroups = [self.compare_dictionary['time_measure']]
152 |         for group in compareGroups:
153 |             if not re.match("words",group) and not re.match("[u]?[bn]igram",group):
154 |                 comparegroups.append(group)
155 |         self.compare_dictionary['groups'] = comparegroups
156 |         self.time_limits = outside_dictionary.setdefault('time_limits',[0,10000000])
157 |         self.time_measure = outside_dictionary.setdefault('time_measure','year')
158 |         self.counttype = outside_dictionary.setdefault('counttype',["Occurrences_per_Million_Words"])
159 |         if isinstance(self.counttype,basestring):
160 |             self.counttype = [self.counttype]
161 |         self.index  = outside_dictionary.setdefault('index',0)
162 |         #Ordinarily, the input should be an an array of groups that will both select and group by.
163 |         #The joins may be screwed up by certain names that exist in multiple tables, so there's an option to do something like 
164 |         #SELECT catalog.bookid as myid, because WHERE clauses on myid will work but GROUP BY clauses on catalog.bookid may not 
165 |         #after a sufficiently large number of subqueries.
166 |         #This smoothing code really ought to go somewhere else, since it doesn't quite fit into the whole API mentality and is 
167 |         #more about the webpage.
168 |         self.smoothingType = outside_dictionary.setdefault('smoothingType',"triangle")
169 |         self.smoothingSpan = outside_dictionary.setdefault('smoothingSpan',3)
170 |         self.method = outside_dictionary.setdefault('method',"Nothing")
171 |         self.tablename = outside_dictionary.setdefault('tablename','master'+"_bookcounts as bookcounts")
172 | 
173 |     def derive_variables(self):
174 |         #These are locally useful, and depend on the variables
175 |         self.limits = self.search_limits
176 |         #Treat empty constraints as nothing at all, not as full restrictions.
177 |         for key in self.limits.keys():
178 |             if self.limits[key] == []:
179 |                 del self.limits[key]
180 |         self.set_operations()
181 |         self.create_catalog_table()
182 |         self.make_catwhere()
183 |         self.make_wordwheres()
184 | 
185 |     def create_catalog_table(self):
186 |         self.catalog = self.prefs['fastcat'] #'catalog' #Can be replaced with a more complicated query in the event of longer joins.
187 |         """
188 |         This should check query constraints against a list of tables, and join to them.
189 |         So if you query with a limit on LCSH, and LCSH is listed as being in a separate table,
190 |         it joins the table "LCSH" to catalog; and then that table has one column, ALSO
191 |         called "LCSH", which is matched against. This allows a bookid to be a member of multiple catalogs.
192 |         """
193 |         
194 | 
195 | 
196 |         for limitation in self.prefs['separateDataTables']:
197 |             #That re.sub thing is in here because sometimes I do queries that involve renaming.
198 |             if limitation in [re.sub(" .*","",key) for key in self.limits.keys()] or limitation in [re.sub(" .*","",group) for group in self.groups]:
199 |                 self.catalog = self.catalog + """ JOIN """ + limitation + """ USING (bookid)"""
200 |                 
201 |         """
202 |         Here it just pulls every variable and where to look for it. 
203 |         """
204 |         
205 |         tableToLookIn = {}
206 |         #This is sorted by engine DESC so that memory table locations will overwrite disk table in the hash.
207 |         self.cursor.execute("SELECT ENGINE,TABLE_NAME,COLUMN_NAME,COLUMN_KEY FROM information_schema.COLUMNS JOIN INFORMATION_SCHEMA.TABLES USING (TABLE_NAME,TABLE_SCHEMA) WHERE TABLE_SCHEMA='" + self.outside_dictionary['database']+ "' ORDER BY ENGINE DESC,TABLE_NAME;");
208 |         columnNames = self.cursor.fetchall()
209 | 
210 |         for databaseColumn in columnNames:
211 |             tableToLookIn[databaseColumn[2]] = databaseColumn[1]
212 | 
213 |         self.relevantTables = set()
214 | 
215 |         for columnInQuery in [re.sub(" .*","",key) for key in self.limits.keys()] + [re.sub(" .*","",group) for group in self.groups]:
216 |             if not re.search('\.',columnInQuery): #Lets me keep a little bit of SQL sauce for my own queries
217 |                 try:
218 |                     self.relevantTables.add(tableToLookIn[columnInQuery])
219 |                 except KeyError:
220 |                     pass
221 |                     #Could warn as well, but this helps back-compatability.
222 | 
223 |         self.catalog = "fastcat"
224 |         for table in self.relevantTables:
225 |             if table!="fastcat" and table!="words" and table!="wordsheap":
226 |                 self.catalog = self.catalog + """ NATURAL JOIN """ + table + " "
227 | 
228 |         #Here's a feature that's not yet fully implemented: it doesn't work quickly enough, probably because the joins involve a lot of jumping back and forth. 
229 |         if 'hasword' in self.limits.keys():
230 |             """
231 |             This is the sort of code I'm trying to move towards
232 |             it just generates a new API call to fill a small part of the code here:
233 |             (in this case, it merges the 'catalog' entry with a select query on 
234 |             the word in the 'haswords' field. Enough of this could really
235 |             shrink the codebase, I suspect. It should be possible in MySQL 6.0, from what I've read, where subqueried tables will have indexes written for them by the query optimizer.
236 |             """
237 | 
238 |             if self.limits['hasword'] == []:
239 |                 del self.limits['hasword']
240 |                 return
241 | 
242 |             #deepcopy lets us get a real copy of the dictionary 
243 |             #that can be changed without affecting the old one.
244 |             mydict = copy.deepcopy(self.outside_dictionary)
245 |             mydict['search_limits'] = copy.deepcopy(self.limits)
246 |             mydict['search_limits']['word'] = copy.deepcopy(mydict['search_limits']['hasword'])
247 |             del mydict['search_limits']['hasword']
248 |             tempquery = userquery(mydict)
249 |             bookids = ''
250 |             bookids = tempquery.counts_query()
251 | 
252 |             #If this is ever going to work, 'catalog' here should be some call to self.prefs['fastcat']
253 |             bookids = re.sub("(?s).*catalog[^\.]?[^\.\n]*\n","\n",bookids)
254 |             bookids = re.sub("(?s)WHERE.*","\n",bookids)
255 |             bookids = re.sub("(words|lookup)([0-9])","has\\1\\2",bookids)
256 |             bookids = re.sub("main","hasTable",bookids)
257 |             self.catalog = self.catalog + bookids
258 |             #del self.limits['hasword']
259 | 
260 |     def make_catwhere(self):
261 |         #Where terms that don't include the words table join. Kept separate so that we can have subqueries only working on one half of the stack.
262 |         catlimits = dict()
263 |         for key in self.limits.keys():
264 |             if key not in ('word','word1','word2','hasword') and not re.search("words\d",key):
265 |                 catlimits[key] = self.limits[key]
266 |         if len(catlimits.keys()) > 0:
267 |             self.catwhere = where_from_hash(catlimits)
268 |         else:
269 |             self.catwhere = "TRUE"
270 | 
271 |     def make_wordwheres(self):
272 |         self.wordswhere = " TRUE "
273 |         self.max_word_length = 0
274 |         limits = []
275 |         
276 |         if 'word' in self.limits.keys():
277 |             """
278 |             This doesn't currently allow mixing of one and two word searches together in a logical way.
279 |             It might be possible to just join on both the tables in MySQL--I'm not completely sure what would happen.
280 |             But the philosophy has been to keep users from doing those searches as far as possible in any case.
281 |             """
282 |             for phrase in self.limits['word']:
283 |                 locallimits = dict()
284 |                 array = phrase.split(" ")
285 |                 n=1
286 |                 for word in array:
287 |                     selectString =  "(SELECT " + self.word_field + " FROM wordsheap WHERE casesens='" + word + "')"
288 |                     locallimits['words'+str(n) + "." + self.word_field] = selectString
289 |                     self.max_word_length = max(self.max_word_length,n)
290 |                     n = n+1
291 |                 limits.append(where_from_hash(locallimits,quotesep=""))
292 |                 #XXX for backward compatability
293 |                 self.words_searched = phrase
294 |             self.wordswhere = '(' + ' OR '.join(limits) + ')'
295 | 
296 |         wordlimits = dict()
297 | 
298 |         limitlist = copy.deepcopy(self.limits.keys())
299 | 
300 |         for key in limitlist:
301 |             if re.search("words\d",key):
302 |                 wordlimits[key] = self.limits[key]
303 |                 self.max_word_length = max(self.max_word_length,2)
304 |                 del self.limits[key]
305 | 
306 |         if len(wordlimits.keys()) > 0:
307 |             self.wordswhere = where_from_hash(wordlimits)
308 | 
309 | 
310 |     def build_wordstables(self):
311 |         #Deduce the words tables we're joining against. The iterating on this can be made more general to get 3 or four grams in pretty easily.
312 |         #This relies on a determination already having been made about whether this is a unigram or bigram search; that's reflected in the keys passed.
313 |         if (self.max_word_length == 2 or re.search("words2",self.selections)):
314 | 
315 |             self.maintable = 'master_bigrams'
316 | 
317 |             self.main = '''
318 |                  JOIN
319 |                  master_bigrams as main
320 |                  ON ('''+ self.prefs['fastcat'] +'''.bookid=main.bookid)
321 |                  '''
322 | 
323 |             self.wordstables =  """
324 |             JOIN %(wordsheap)s as words1 ON (main.word1 = words1.wordid) 
325 |             JOIN %(wordsheap)s as words2 ON (main.word2 = words2.wordid) """ % self.__dict__
326 | 
327 |         #I use a regex here to do a blanket search for any sort of word limitations. That has some messy sideffects (make sure the 'hasword'
328 |         #key has already been eliminated, for example!) but generally works.
329 | 
330 |         elif self.max_word_length == 1 or re.search("[^h][^a][^s]word",self.selections):
331 |             self.maintable = 'master_bookcounts'
332 |             self.main = '''
333 |                 JOIN
334 |                  master_bookcounts as main
335 |                  ON (''' + self.prefs['fastcat'] + '''.bookid=main.bookid)'''
336 |             self.tablename = 'master_bookcounts'
337 |             self.wordstables = """
338 |               JOIN ( %(wordsheap)s as words1)  ON (main.wordid = words1.wordid)
339 |              """ % self.__dict__
340 | 
341 |         else:
342 |             """
343 |             Have _no_ words table if no words searched for or grouped by; instead just use nwords. This 
344 |             means that we can use the same basic functions both to build the counts for word searches and 
345 |             for metadata searches, which is valuable because there is a metadata-only search built in to every single ratio
346 |             query. (To get the denominator values).
347 |             """
348 |             self.main = " "
349 |             self.operation = ','.join(self.catoperations)
350 |             """
351 |             This, above is super important: the operation used is relative to the counttype, and changes to use 'catoperation' instead of 'bookoperation'
352 |             That's the place that the denominator queries avoid having to do a table scan on full bookcounts that would take hours, and instead takes
353 |             milliseconds.
354 |             """
355 |             self.wordstables = " "
356 |             self.wordswhere  = " TRUE " #Just a dummy thing to make the SQL writing easier. Shouldn't take any time.
357 | 
358 |     def set_operations(self):
359 | 
360 |         """
361 |         This is the code that allows multiple values to be selected.
362 |         """
363 | 
364 |         backCompatability = {"Occurrences_per_Million_Words":"WordsPerMillion","Raw_Counts":"WordCount","Percentage_of_Books":"TextPercent","Number_of_Books":"TextCount"}
365 |             
366 |         for oldKey in backCompatability.keys():
367 |             self.counttype = [re.sub(oldKey,backCompatability[oldKey],entry) for entry in self.counttype]
368 |             
369 |         self.bookoperation = {}
370 |         self.catoperation = {}
371 |         self.finaloperation = {}
372 | 
373 |         #Text statistics
374 |         self.bookoperation['TextPercent'] = "count(DISTINCT " + self.prefs['fastcat'] + ".bookid) as TextCount"
375 |         self.bookoperation['TextRatio'] = "count(DISTINCT " + self.prefs['fastcat'] + ".bookid) as TextCount"
376 |         self.bookoperation['TextCount'] = "count(DISTINCT " + self.prefs['fastcat'] + ".bookid) as TextCount"
377 |         #Word Statistics
378 |         self.bookoperation['WordCount'] = "sum(main.count) as WordCount"
379 |         self.bookoperation['WordsPerMillion'] = "sum(main.count) as WordCount"
380 |         self.bookoperation['WordsRatio'] = "sum(main.count) as WordCount"
381 |         """
382 |         +Total Numbers for comparisons/significance assessments
383 |         This is a little tricky. The total words is EITHER the denominator (as in a query against words per Million) or the numerator+denominator (if you're comparing 
384 |         Pittsburg and Pittsburgh, say, and want to know the total number of uses of the lemma. For now, "TotalWords" means the former and "SumWords" the latter,
385 |         On the theory that 'TotalWords' is more intuitive and only I (Ben) will be using SumWords all that much.
386 |         """
387 |         self.bookoperation['TotalWords'] = self.bookoperation['WordsPerMillion']
388 |         self.bookoperation['SumWords'] = self.bookoperation['WordsPerMillion']
389 |         self.bookoperation['TotalTexts'] = self.bookoperation['TextCount']
390 |         self.bookoperation['SumTexts'] = self.bookoperation['TextCount']
391 | 
392 |         for stattype in self.bookoperation.keys():
393 |             if re.search("Word",stattype):
394 |                 self.catoperation[stattype] = "sum(nwords) as WordCount"
395 |             if re.search("Text",stattype):
396 |                 self.catoperation[stattype] = "count(nwords) as TextCount"
397 | 
398 |         self.finaloperation['TextPercent'] = "IFNULL(numerator.TextCount,0)/IFNULL(denominator.TextCount,0)*100 as TextPercent"
399 |         self.finaloperation['TextRatio'] = "IFNULL(numerator.TextRatio,0)/IFNULL(denominator.TextCount,0) as TextRatio"
400 |         self.finaloperation['TextCount'] = "IFNULL(numerator.TextCount,0) as TextCount"
401 | 
402 |         self.finaloperation['WordsPerMillion'] = "IFNULL(numerator.WordCount,0)*100000000/IFNULL(denominator.WordCount,0)/100 as WordsPerMillion"
403 |         self.finaloperation['WordsRatio'] = "IFNULL(numerator.WordCount,0)/IFNULL(denominator.WordCount,0) as WordsRatio"
404 |         self.finaloperation['WordCount'] = "IFNULL(numerator.WordCount,0) as WordCount"
405 |         
406 |         self.finaloperation['TotalWords'] = "IFNULL(denominator.WordCount,0) as TotalWords"
407 |         self.finaloperation['SumWords']   = "IFNULL(denominator.WordCount,0) + IFNULL(numerator.WordCount,0) as SumWords"
408 |         self.finaloperation['TotalTexts'] = "IFNULL(denominator.TextCount,0) as TotalTexts"
409 |         self.finaloperation['SumTexts'] = "IFNULL(denominator.TextCount,0) + IFNULL(numerator.TextCount,0) as SumTexts"
410 | 
411 |         """
412 |         The values here will be chosen in build_wordstables; that's what decides if it uses the 'bookoperation' or 'catoperation' dictionary to build out.
413 |         """
414 | 
415 |         self.finaloperations = list()
416 |         self.bookoperations = set()
417 |         self.catoperations = set()
418 | 
419 |         for summaryStat in self.counttype:
420 |             self.catoperations.add(self.catoperation[summaryStat])
421 |             self.bookoperations.add(self.bookoperation[summaryStat])
422 |             self.finaloperations.append(self.finaloperation[summaryStat])
423 | 
424 |         #self.catoperation
425 | 
426 |     def counts_query(self):
427 |         #self.bookoperation = {"Occurrences_per_Million_Words":"sum(main.count)","Raw_Counts":"sum(main.count)","Percentage_of_Books":"count(DISTINCT " + self.prefs['fastcat'] + ".bookid)","Number_of_Books":"count(DISTINCT "+ self.prefs['fastcat'] + ".bookid)"}
428 |         #self.catoperation = {"Occurrences_per_Million_Words":"sum(nwords)","Raw_Counts":"sum(nwords)","Percentage_of_Books":"count(nwords)","Number_of_Books":"count(nwords)"}        
429 | 
430 |         self.operation = ','.join(self.bookoperations)
431 | 
432 |         self.build_wordstables()
433 |         countsQuery = """
434 |             SELECT
435 |                 %(selections)s,
436 |                 %(operation)s
437 |             FROM 
438 |                 %(catalog)s
439 |                 %(main)s
440 |                 %(wordstables)s 
441 |             WHERE
442 |                  %(catwhere)s AND %(wordswhere)s
443 |             GROUP BY 
444 |                 %(groupings)s
445 |         """ % self.__dict__
446 |         return countsQuery
447 |     
448 |     def ratio_query(self):
449 |         #if True: #In the case that we're not using a superset of words; this can be changed later
450 |         #    supersetGroups = [group for group in self.groups if not re.match('word',group)]
451 |         #    self.finalgroupings = self.groupings
452 |         #    for key in self.limits.keys():
453 |         #        if re.match('word',key):
454 |         #            del self.limits[key]
455 |         
456 |         self.denominator =  userquery(outside_dictionary = self.compare_dictionary)
457 |         self.supersetquery = self.denominator.counts_query()
458 | 
459 |         if re.search("In_Library",self.denominator.selections):
460 |             self.selections = self.selections + ", fastcat.bookid is not null as In_Library"
461 | 
462 |         #See above: In_Library is a dummy variable so that there's always something to join on.            
463 |         self.mainquery    = self.counts_query()
464 |         
465 |         self.countcommand = ','.join(self.finaloperations)
466 | 
467 |         """
468 |         We launch a whole new userquery instance here to build the denominator, based on the 'compare_dictionary' option (which in most 
469 |         cases is the search_limits without the keys, see above.
470 |         We then get the counts_query results out of that result.
471 |         """
472 | 
473 |         self.totalMergeTerms = "USING (" + self.denominator.groupings + " ) "
474 |         self.totalselections  = ",".join([re.sub(".* as","",group) for group in self.groups])
475 | 
476 |         query = """
477 |         SELECT
478 |             %(totalselections)s,
479 |             %(countcommand)s
480 |         FROM 
481 |             ( %(mainquery)s 
482 |             ) as numerator
483 |             RIGHT OUTER JOIN 
484 |              ( %(supersetquery)s ) as denominator
485 |              %(totalMergeTerms)s
486 |         GROUP BY %(groupings)s;""" % self.__dict__
487 |         return query        
488 | 
489 | 
490 |     def return_slug_data(self,force=False):
491 |         #Rather than understand this error, I'm just returning 0 if it fails.
492 |         #Probably that's the right thing to do, though it may cause trouble later.
493 |         #It's just a punishment for not later using a decent API call with "Raw_Counts" to extract these counts out, and relying on this ugly method.
494 |         #Please, citizens of the future, NEVER USE THIS METHOD.
495 |         try:
496 |             temp_words = self.return_n_words(force = True)
497 |             temp_counts = self.return_n_books(force = True)
498 |         except:
499 |             temp_words = 0
500 |             temp_counts = 0
501 |         return [temp_counts,temp_words]    
502 | 
503 |     def return_n_books(self,force=False): #deprecated
504 |         if (not hasattr(self,'nbooks')) or force:
505 |             query = "SELECT count(*) from " + self.catalog + " WHERE " + self.catwhere
506 |             silent = self.cursor.execute(query)
507 |             self.counts = int(self.cursor.fetchall()[0][0])
508 |         return self.counts
509 | 
510 |     def return_n_words(self,force=False): #deprecated
511 |         if (not hasattr(self,'nwords')) or force:
512 |             query = "SELECT sum(nwords) from " + self.catalog + " WHERE " + self.catwhere
513 |             silent = self.cursor.execute(query)
514 |             self.nwords = int(self.cursor.fetchall()[0][0])
515 |         return self.nwords   
516 | 
517 |     def ranked_query(self,percentile_to_return = 99,addwhere = ""):
518 |         #NOT CURRENTLY IN USE ANYWHERE--DELETE???
519 |         ##This returns a list of bookids in order by how well they match the sort terms.
520 |         ## Using an IDF term will give better search results for case-sensitive searches, but is currently disabled
521 |         ##
522 |         self.LIMIT = int((100-percentile_to_return) * self.return_n_books()/100)
523 |         countQuery = """
524 |          SELECT
525 |          bookid,
526 |          sum(main.count*1000/nwords%(idfterm)s) as score
527 |          FROM %(catalog)s LEFT JOIN %(tablename)s
528 |          USING (bookid)
529 |          WHERE %(catwhere)s AND %(wordswhere)s
530 |          GROUP BY bookid
531 |  	 ORDER BY score DESC
532 |          LIMIT %(LIMIT)s
533 |          """ % self.__dict__
534 |         return countQuery
535 |     
536 |     def bibliography_query(self,limit = "100"):
537 |         #I'd like to redo this at some point so it could work as an API call.
538 |         self.limit = limit
539 |         self.ordertype = "sum(main.count*10000/nwords)"
540 |         try:
541 |             if self.outside_dictionary['ordertype'] == "random":
542 |                 if self.counttype==["Raw_Counts"] or self.counttype==["Number_of_Books"] or self.counttype==['WordCount'] or self.counttype==['BookCount']:
543 |                     self.ordertype = "RAND()"
544 |                 else:
545 |                     self.ordertype = "LOG(1-RAND())/sum(main.count)"
546 |         except KeyError:
547 |             pass
548 | 
549 |         #If IDF searching is enabled, we could add a term like '*IDF' here to overweight better selecting words
550 |         #in the event of a multiple search.
551 |         self.idfterm = ""
552 |         prep = self.counts_query()
553 | 
554 |         bibQuery = """
555 |         SELECT searchstring 
556 |         FROM """ % self.__dict__ + self.prefs['fullcat'] + """ RIGHT JOIN (
557 |         SELECT                                                                                                       
558 |         """+ self.prefs['fastcat'] + """.bookid, %(ordertype)s as ordering
559 |             FROM                                                                                                     
560 |                 %(catalog)s                                                                                          
561 |                 %(main)s                                                                                             
562 |                 %(wordstables)s                                                                                      
563 |             WHERE                                                                                                    
564 |                  %(catwhere)s AND %(wordswhere)s                                                                                        
565 |         GROUP BY bookid ORDER BY %(ordertype)s DESC LIMIT %(limit)s                              
566 |         ) as tmp USING(bookid) ORDER BY ordering DESC;
567 |         """ % self.__dict__
568 |         return bibQuery
569 | 
570 |     def disk_query(self,limit="100"):
571 |         pass
572 | 
573 |     def return_books(self):
574 |         #This preps up the display elements for a search: it returns an array with a single string for each book, sorted in the best possible way
575 |         silent = self.cursor.execute(self.bibliography_query())
576 |         returnarray = []
577 |         for line in self.cursor.fetchall():
578 |             returnarray.append(line[0])
579 |         if not returnarray:
580 |             #why would someone request a search with no locations? Turns out (usually) because the smoothing tricked them.
581 |             returnarray.append("No results for this particular point: try again without smoothing")
582 |         newerarray = self.custom_SearchString_additions(returnarray)
583 |         return json.dumps(newerarray)
584 | 
585 |     def search_results(self):
586 |         #This is an alias that is handled slightly differently in APIimplementation (no "RESULTS" bit in front). Once
587 |         #that legacy code is cleared out, they can be one and the same.
588 |         return json.loads(self.return_books())
589 | 
590 |     def getActualSearchedWords(self):
591 |         if len(self.wordswhere) > 7:
592 |             words = self.outside_dictionary['search_limits']['word']
593 |             #Break bigrams into single words.
594 |             words = ' '.join(words).split(' ')
595 |             self.cursor.execute("""SELECT word FROM wordsheap WHERE """ + where_from_hash({self.word_field:words}))
596 |             self.actualWords =[item[0] for item in self.cursor.fetchall()]
597 |         else:
598 |             self.actualWords = ["tasty","mistake","happened","here"]
599 | 
600 |     def custom_SearchString_additions(self,returnarray):
601 |         db = self.outside_dictionary['database']
602 |         if db in ('jstor','presidio','ChronAm','LOC'):
603 |             self.getActualSearchedWords()
604 |             if db=='jstor':
605 |                 joiner = "&searchText="
606 |                 preface = "?Search=yes&searchText="
607 |                 urlRegEx = "http://www.jstor.org/stable/\d+"
608 |             if db=='presidio':
609 |                 joiner = "+"
610 |                 preface =  "#page/1/mode/2up/search/"
611 |                 urlRegEx = 'http://archive.org/stream/[^"# ><]*'
612 |             if db in ('ChronAm','LOC'):
613 |                 preface = "/;words="
614 |                 joiner = "+"
615 |                 urlRegEx = 'http://chroniclingamerica.loc.gov[^\"><]*/seq-\d+'
616 |             newarray = []
617 |             for string in returnarray:
618 |                 base = re.findall(urlRegEx,string)[0]
619 |                 newcore = ' <a href = "' +  base  + preface + joiner.join(self.actualWords) + '"> search inside </a>'
620 |                 string = re.sub("^<td>","",string)
621 |                 string = re.sub("</td>$","",string)
622 |                 string = string+newcore
623 |                 newarray.append(string)
624 |         #Arxiv is messier, requiring a whole different URL interface: http://search.arxiv.org:8081/paper.jsp?r=1204.3352&qs=network
625 |         else:
626 |             newarray = returnarray
627 |         return newarray
628 | 
629 |     def return_query_values(self,query = "ratio_query"):
630 |         #The API returns a dictionary with years pointing to values.
631 |         values = []
632 |         querytext = getattr(self,query)()
633 |         silent = self.cursor.execute(querytext)
634 |             #Gets the results
635 |         mydict = dict(self.cursor.fetchall())
636 |         try:
637 |             for key in mydict.keys():
638 |                 #Only return results inside the time limits
639 |                 if key >= self.time_limits[0] and key <= self.time_limits[1]:
640 |                     mydict[key] = str(mydict[key])
641 |                 else:
642 |                     del mydict[key]
643 |             mydict = smooth_function(mydict,smooth_method = self.smoothingType,span = self.smoothingSpan)
644 | 
645 |         except:
646 |             mydict = {0:"0"}
647 | 
648 |         #This is a good place to change some values.
649 |         try:
650 |             return {'index':self.index, 'Name':self.words_searched,"values":mydict,'words_searched':""}
651 |         except:
652 |             return{'values':mydict}
653 | 
654 |     def arrayNest(self,array,returnt,endLength=1):
655 |         #A recursive function to transform a list into a nested array
656 |         key = array[0]
657 |         key = to_unicode(key)
658 |         if len(array)==endLength+1:
659 |             #This is the condition where we have the last two, which is where we no longer need to nest anymore:
660 |             #it's just the last value[key] = value
661 |             value = list(array[1:])
662 |             for i in range(len(value)):
663 |                 try:
664 |                     value[i] = float(value[i])
665 |                 except:
666 |                     pass
667 |             returnt[key] = value
668 |         else:
669 |             try:
670 |                 returnt[key] = self.arrayNest(array[1:len(array)],returnt[key],endLength=endLength)
671 |             except KeyError:
672 |                 returnt[key] = self.arrayNest(array[1:len(array)],dict(),endLength=endLength)
673 |         return returnt
674 | 
675 |     def return_json(self,query='ratio_query'):
676 |         querytext = getattr(self,query)()
677 |         silent = self.cursor.execute(querytext)
678 |         names = [to_unicode(item[0]) for item in self.cursor.description]
679 |         returnt = dict()
680 |         lines = self.cursor.fetchall()
681 |         for line in lines:
682 |             returnt = self.arrayNest(line,returnt,endLength = len(self.counttype))
683 |         return returnt
684 | 
685 |     def return_tsv(self,query = "ratio_query"):
686 |         if self.outside_dictionary['counttype']=="Raw_Counts" or self.outside_dictionary['counttype']==["Raw_Counts"]:
687 |             query="counts_query"
688 |             #This allows much speedier access to counts data if you're willing not to know about all the zeroes.
689 |         querytext = getattr(self,query)()
690 |         silent = self.cursor.execute(querytext)
691 |         results = ["\t".join([to_unicode(item[0]) for item in self.cursor.description])]
692 |         lines = self.cursor.fetchall()
693 |         for line in lines:
694 |             items = []
695 |             for item in line:
696 |                 item = to_unicode(item)
697 |                 item = re.sub("\t","<tab>",item)
698 |                 items.append(item)
699 |             results.append("\t".join(items))
700 |         return "\n".join(results)
701 | 
702 |     def export_data(self,query1="ratio_query"):
703 |         self.smoothing=0
704 |         return self.return_query_values(query=query1)
705 | 
706 |     def execute(self):
707 |         #This performs the query using the method specified in the passed parameters.
708 |         if self.method=="Nothing":
709 |             pass
710 |         else:
711 |             return getattr(self,self.method)()
712 | 
713 | 
714 |     #############
715 |     ##GENERAL#### #These are general purpose functional types of things not implemented in the class.
716 |     #############
717 |     
718 | def to_unicode(obj, encoding='utf-8'):
719 |     if isinstance(obj, basestring):
720 |         if not isinstance(obj, unicode):
721 |             obj = unicode(obj, encoding)
722 |     elif isinstance(obj,int):
723 |         obj=unicode(str(obj),encoding)
724 |     else:
725 |         obj = unicode(str(obj),encoding)
726 |     return obj
727 | 
728 | def where_from_hash(myhash,joiner=" AND ",comp = " = ",quotesep=None):
729 |     whereterm = []
730 |     #The general idea here is that we try to break everything in search_limits down to a list, and then create a whereterm on that joined by whatever the 'joiner' is ("AND" or "OR"), with the comparison as whatever comp is ("=",">=",etc.).
731 |     #For more complicated bits, it gets all recursive until the bits are in terms of list.
732 |     for key in myhash.keys():
733 |         values = myhash[key]
734 |         if isinstance(values,basestring) or isinstance(values,int) or isinstance(values,float):
735 |             #This is just error handling. You can pass a single value instead of a list if you like, and it will just convert it 
736 |             #to a list for you.
737 |             values = [values]
738 |         #Or queries are special, since the default is "AND". This toggles that around for a subportion.
739 |         if key=='$or' or key=="$OR":
740 |             for comparison in values:
741 |                 whereterm.append(where_from_hash(comparison,joiner=" OR ",comp=comp))
742 |                 #The or doesn't get populated any farther down.
743 |         elif isinstance(values,dict):
744 |             #Certain function operators can use MySQL terms. These are the only cases that a dict can be passed as a limitations
745 |             operations = {"$gt":">","$ne":"!=","$lt":"<","$grep":" REGEXP ","$gte":">=","$lte":"<=","$eq":"="}
746 |             for operation in values.keys():
747 |                 whereterm.append(where_from_hash({key:values[operation]},comp=operations[operation],joiner=joiner))
748 |         elif isinstance(values,list):
749 |             #and this is where the magic actually happens
750 |             if isinstance(values[0],dict):
751 |                 for entry in values:
752 |                     whereterm.append(where_from_hash(entry))
753 |             else:
754 |                 if quotesep is None:
755 |                     if isinstance(values[0],basestring):
756 |                         quotesep="'"
757 |                     else:
758 |                         quotesep = ""
759 |                 #Note the "OR" here. There's no way to pass in a query like "year=1876 AND year=1898" as currently set up.
760 |                 #Obviously that's no great loss, but there might be something I'm missing that would be.
761 |                 whereterm.append(" (" + " OR ".join([" (" + key+comp+quotesep+to_unicode(value)+quotesep+") " for value in values])+ ") ")
762 |     return "(" + joiner.join(whereterm) + ")"
763 |     #This works pretty well, except that it requires very specific sorts of terms going in, I think.
764 | 
765 | 
766 | 
767 | #I'd rather have all this smoothing stuff done at the client side, but currently it happens here.
768 | def smooth_function(zinput,smooth_method = 'lowess',span = .05):
769 |     if smooth_method not in ['lowess','triangle','rectangle']:
770 |         return zinput
771 |     xarray = []
772 |     yarray = []
773 |     years = zinput.keys()
774 |     years.sort()
775 |     for key in years:
776 |         if zinput[key]!='None':
777 |             xarray.append(float(key))
778 |             yarray.append(float(zinput[key]))
779 |     from numpy import array
780 |     x = array(xarray)
781 |     y = array(yarray)
782 |     if smooth_method == 'lowess':
783 |         #print "starting lowess smoothing<br>"
784 |         from Bio.Statistics.lowess import lowess
785 |         smoothed = lowess(x,y,float(span)/100,3)
786 |         x = [int(p) for p in x]
787 |         returnval = dict(zip(x,smoothed))
788 |         return returnval
789 |     if smooth_method == 'rectangle':
790 |         from math import log
791 |         #print "starting triangle smoothing<br>"
792 |         span = int(span) #Takes the floor--so no smoothing on a span < 1.
793 |         returnval = zinput
794 |         windowsize = span*2 + 1
795 |         from numpy import average
796 |         for i in range(len(xarray)):
797 |             surrounding = array(range(windowsize),dtype=float)
798 |             weights = array(range(windowsize),dtype=float)
799 |             for j in range(windowsize):
800 |                 key_dist = j - span #if span is 2, the zeroeth element is -2, the second element is 0 off, etc.
801 |                 workingon = i + key_dist
802 |                 if workingon >= 0 and workingon < len(xarray):
803 |                     surrounding[j] = float(yarray[workingon])
804 |                     weights[j] = 1
805 |                 else:
806 |                     surrounding[j] = 0
807 |                     weights[j] = 0
808 |             returnval[xarray[i]] = round(average(surrounding,weights=weights),3)
809 |         return returnval
810 |     if smooth_method == 'triangle':
811 |         from math import log
812 |         #print "starting triangle smoothing<br>"
813 |         span = int(span) #Takes the floor--so no smoothing on a span < 1.
814 |         returnval = zinput
815 |         windowsize = span*2 + 1
816 |         from numpy import average
817 |         for i in range(len(xarray)):
818 |             surrounding = array(range(windowsize),dtype=float)
819 |             weights = array(range(windowsize),dtype=float)
820 |             for j in range(windowsize):
821 |                 key_dist = j - span #if span is 2, the zeroeth element is -2, the second element is 0 off, etc.
822 |                 workingon = i + key_dist
823 |                 if workingon >= 0 and workingon < len(xarray):
824 |                     surrounding[j] = float(yarray[workingon])
825 |                     #This isn't actually triangular smoothing: I dampen it by the logs, to keep the peaks from being too too big.
826 |                     #The minimum is '2', since log(1) == 0, which is a nonesense weight.
827 |                     weights[j] = log(span + 2 - abs(key_dist))
828 |                 else:
829 |                     surrounding[j] = 0
830 |                     weights[j] = 0
831 |             
832 |             returnval[xarray[i]] = round(average(surrounding,weights=weights),3)
833 |         return returnval
834 |     
835 | #The idea is: this works by default by slurping up from the command line, but you could also load the functions in and run results on your own queries.
836 | try:
837 |     command = str(sys.argv[1])
838 |     command = json.loads(command)
839 | #Got to go before we let anything else happen.
840 |     print command
841 |     p = userqueries(command)
842 |     result = p.execute()
843 |     print json.dumps(result)
844 | except:
845 |     pass
846 | 
847 | 


--------------------------------------------------------------------------------
/bookworm/MetaWorm.py:
--------------------------------------------------------------------------------
  1 | import pandas
  2 | import json
  3 | import copy
  4 | import threading
  5 | import time
  6 | from collections import defaultdict
  7 | 
  8 | def hostlist(dblist):
  9 |     #This could do something fancier, but for now we look by default only on localhost.
 10 |     return ["localhost"]*len(dblist)
 11 | 
 12 | class childQuery(threading.Thread):
 13 |     def __init__(self,dictJSON,host):
 14 |         super(SummingThread, self).__init__()
 15 |         self.dict = json.dumps(dict)
 16 |         self.host = host
 17 |         
 18 |     def runQuery(self):
 19 |         #make a webquery, assign it to self.data
 20 |         url = self.host + "/cgi-bin/bookwormAPI?query=" + self.dict
 21 | 
 22 |     def parseResults(self):
 23 |         pass
 24 |         #return json.loads(self.data)
 25 | 
 26 |     def run(self):
 27 |         self.runQuery()
 28 | 
 29 | def flatten(dictOfdicts):
 30 |     """
 31 |     Recursive function: transforms a dict with nested entries like
 32 |     foo["a"]["b"]["c"] = 3
 33 |     to one with tuple entries like
 34 |     fooPrime[("a","b","c")] = 3
 35 |     """
 36 |     output = []
 37 |     for (key,value) in dictOfdicts.iteritems():
 38 |         if isinstance(value,dict):
 39 |             output.append([(key),value])
 40 |         else:
 41 |             children = flatten(value)
 42 |             for child in children:
 43 |                 output.append([(key,) + child[0],child[1]])
 44 |     return output
 45 | 
 46 | def animate(dictOfTuples):
 47 |     """
 48 |     opposite of flatten
 49 |     """
 50 | 
 51 |     def tree():
 52 |         return defaultdict(tree)
 53 | 
 54 |     output = defaultdict(tree)
 55 | 
 56 |     
 57 | 
 58 | def combineDicts(master,new):
 59 |     """
 60 |     instead of a dict of dicts of arbitrary depth, use a dict of tuples to store.
 61 |     """
 62 | 
 63 |     for (keysequence, valuesequence) in flatten(new):
 64 |         try:
 65 |             master[keysequence] = map(sum,zip(master[keysequence],valuesequence))
 66 |         except KeyError:
 67 |             master[keysequence] = valuesequence
 68 |     return dict1
 69 |         
 70 | class MetaQuery(object):
 71 |     def __init__(self,dictJSON):
 72 |         self.outside_outdictionary = json.dumps(dictJSON)
 73 |         
 74 |     def setDefaults(self):
 75 |         for specialKey in ["database","host"]:
 76 |             try:
 77 |                 if isinstance(self.outside_dictionary[specialKey],basestring):
 78 |                     #coerce strings to list:
 79 |                     self.outside_dictionary[specialKey] = [self.outside_dictionary[specialKey]]
 80 |             except KeyError:
 81 |                 #It's OK not to define host.
 82 |                 if specialKey=="host":
 83 |                     pass
 84 |             
 85 |         if 'host' not in self.outside_dictionary:
 86 |             #Build a hostlist: usually just localhost a bunch of times.
 87 |             self.outside_dictionary['host']  = hostlist(self.outside_dictionary['database'])
 88 | 
 89 |         for (target, dest) in [("database","host"),("host","database")]:
 90 |             #Expand out so you can search for the same database on multiple databases, or multiple databases on the same host.
 91 |             if len(self.outside_dictionary[target])==1 and len(self.outside_dictionary[dest]) != 1:
 92 |                 self.outside_dictionary[target] = self.outside_dictionary[target] * len(self.outside_dictionary[dest])
 93 |             
 94 | 
 95 |     def buildChildren(self):
 96 |         desiredCounts = []
 97 |         for (host,dbname) in zip(self.outside_dictionary["host"],self.outside_dictionary["database"]):
 98 |             query = copy.deepcopy(self.outside_dictionary)
 99 |             del(query['host'])
100 |             query['database'] = dbname
101 |             
102 |             desiredCounts.append(childQuery(query,host))
103 |         self.children = desiredCounts
104 | 
105 |     def runChildren(self):
106 |         for child in self.children:
107 |             child.start()
108 | 
109 |     def combineChildren(self):
110 |         complete = dict()
111 |         while (threading.enumerate()):
112 |             for child in self.children:
113 |                 if not child.is_alive():
114 |                     complete=combineDicts(complete,child.parseResult())
115 |             time.sleep(.05)
116 | 
117 |     def return_json(self):
118 |         pass
119 | 
120 | 
121 |     
122 | 


--------------------------------------------------------------------------------
/bookworm/SQLAPI.py:
--------------------------------------------------------------------------------
   1 | #!/usr/local/bin/python
   2 | 
   3 | import sys
   4 | import json
   5 | import cgi
   6 | import re
   7 | import numpy #used for smoothing.
   8 | import copy
   9 | import decimal
  10 | import MySQLdb
  11 | import warnings
  12 | import hashlib
  13 | 
  14 | """
  15 | #There are 'fast' and 'full' tables for books and words;
  16 | #that's so memory tables can be used in certain cases for fast, hashed matching, but longer form data (like book titles)
  17 | can be stored on disk. Different queries use different types of calls.
  18 | #Also, certain metadata fields are stored separately from the main catalog table;
  19 | """
  20 | 
  21 | from knownHosts import *
  22 | 
  23 | class dbConnect(object):
  24 |     #This is a read-only account
  25 |     def __init__(self,prefs):
  26 |         self.dbname = prefs['database']
  27 |         self.db = MySQLdb.connect(host=prefs['HOST'],read_default_file = prefs['read_default_file'],use_unicode='True',charset='utf8',db=prefs['database'])
  28 |         self.cursor = self.db.cursor()
  29 | 
  30 | # The basic object here is a 'userquery:' it takes dictionary as input, as defined in the API, and returns a value
  31 | # via the 'execute' function whose behavior
  32 | # depends on the mode that is passed to it.
  33 | # Given the dictionary, it can return a number of objects.
  34 | # The "Search_limits" array in the passed dictionary determines how many elements it returns; this lets multiple queries be bundled together.
  35 | # Most functions describe a subquery that might be combined into one big query in various ways.
  36 | 
  37 | class userqueries:
  38 |     #This is a set of userqueries that are bound together; each element in search limits is iterated over, and we're done.
  39 |     #currently used for various different groups sent in a bundle (multiple lines on a Bookworm chart).
  40 |     #A sufficiently sophisticated 'group by' search might make this unnecessary.
  41 |     #But until that day, it's useful to be able to return lists of elements, which happens in here.
  42 | 
  43 |     def __init__(self,outside_dictionary = {"counttype":["Percentage_of_Books"],"search_limits":[{"word":["polka dot"],"LCSH":["Fiction"]}]},db = None):
  44 |         try:
  45 |             self.database = outside_dictionary.setdefault('database', 'default')
  46 |             prefs = general_prefs[self.database]
  47 |         except KeyError: #If it's not in the option, use some default preferences and search on localhost. This will work in most cases here on out.
  48 |             prefs = general_prefs['default']
  49 |             prefs['database'] = self.database
  50 |         self.prefs = prefs
  51 | 
  52 |         self.wordsheap = prefs['fastword']
  53 |         self.words = prefs['fullword']
  54 |         if 'search_limits' not in outside_dictionary.keys():
  55 |             outside_dictionary['search_limits'] = [{}]
  56 |         #coerce one-element dictionaries to an array.
  57 |         if isinstance(outside_dictionary['search_limits'],dict):
  58 |             #(allowing passing of just single dictionaries instead of arrays)
  59 |             outside_dictionary['search_limits'] = [outside_dictionary['search_limits']]
  60 |         self.returnval = []
  61 |         self.queryInstances = []
  62 |         db = dbConnect(prefs)
  63 |         databaseScheme = databaseSchema(db)
  64 |         for limits in outside_dictionary['search_limits']:
  65 |             mylimits = copy.deepcopy(outside_dictionary)
  66 |             mylimits['search_limits'] = limits
  67 |             localQuery = userquery(mylimits,db=db,databaseScheme=databaseScheme)
  68 |             self.queryInstances.append(localQuery)
  69 |             self.returnval.append(localQuery.execute())
  70 | 
  71 |     def execute(self):
  72 |         
  73 |         return self.returnval
  74 | 
  75 | 
  76 | class userquery:
  77 |     def __init__(self,outside_dictionary = {"counttype":["Percentage_of_Books"],"search_limits":{"word":["polka dot"],"LCSH":["Fiction"]}},db=None,databaseScheme=None):
  78 |         #Certain constructions require a DB connection already available, so we just start it here, or use the one passed to it.
  79 |         try:
  80 |             self.prefs = general_prefs[outside_dictionary['database']]
  81 |         except KeyError:
  82 |             #If it's not in the option, use some default preferences and search on localhost. This will work in most cases here on out.
  83 |             self.prefs = general_prefs['default']
  84 |             self.prefs['database'] = outside_dictionary['database']
  85 |         self.outside_dictionary = outside_dictionary
  86 |         #self.prefs = general_prefs[outside_dictionary.setdefault('database','presidio')]
  87 |         self.db = db
  88 |         if db is None:
  89 |             self.db = dbConnect(self.prefs)
  90 |         self.databaseScheme = databaseScheme
  91 |         if databaseScheme is None:
  92 |             self.databaseScheme = databaseSchema(self.db)
  93 | 
  94 |         self.cursor = self.db.cursor
  95 |         self.wordsheap = self.prefs['fastword']
  96 |         self.words = self.prefs['fullword']
  97 |         """
  98 |         I'm now allowing 'search_limits' to either be a dictionary or an array of dictionaries:
  99 |         this makes the syntax cleaner on most queries,
 100 |         while still allowing some long ones from the Bookworm website.
 101 |         """
 102 |         try:
 103 |             if isinstance(outside_dictionary['search_limits'],list):
 104 |                 outside_dictionary['search_limits'] = outside_dictionary['search_limits'][0]
 105 |         except:
 106 |             outside_dictionary['search_limits'] = dict()
 107 |         #outside_dictionary = self.limitCategoricalQueries(outside_dictionary)
 108 |         self.defaults(outside_dictionary) #Take some defaults
 109 |         self.derive_variables() #Derive some useful variables that the query will use.
 110 | 
 111 |     def defaults(self,outside_dictionary):
 112 |         #these are default values;these are the only values that can be set in the query
 113 |         #search_limits is an array of dictionaries;
 114 |         #each one contains a set of limits that are mutually independent
 115 |         #The other limitations are universal for all the search limits being set.
 116 | 
 117 |         #Set up a dictionary for the denominator of any fraction if it doesn't already exist:
 118 |         self.search_limits = outside_dictionary.setdefault('search_limits',[{"word":["polka dot"]}])
 119 |         self.words_collation = outside_dictionary.setdefault('words_collation',"Case_Insensitive")
 120 | 
 121 |         lookups = {"Case_Insensitive":'word','lowercase':'lowercase','casesens':'casesens',"case_insensitive":"word","Case_Sensitive":"casesens","All_Words_with_Same_Stem":"stem",'stem':'stem'}
 122 |         self.word_field = str(MySQLdb.escape_string(lookups[self.words_collation]))
 123 | 
 124 |         self.time_limits = outside_dictionary.setdefault('time_limits',[0,10000000])
 125 |         self.time_measure = outside_dictionary.setdefault('time_measure','year')
 126 | 
 127 |         self.groups = set()
 128 |         self.outerGroups = [] #[] #Only used on the final join; directionality matters, unlike for the other ones.
 129 |         self.finalMergeTables=set()
 130 |         try:
 131 |             groups = outside_dictionary['groups']
 132 |         except:
 133 |             groups = [outside_dictionary['time_measure']]
 134 | 
 135 |         if groups == [] or groups == ["unigram"]:
 136 |             #Set an arbitrary column name that will always be true if nothing else is set.
 137 |             groups.insert(0,"1 as In_Library")
 138 | 
 139 |         if (len (groups) > 1):
 140 |             pass
 141 |             #self.groups = credentialCheckandClean(self.groups)
 142 |             #Define some sort of limitations here, if not done in dbbindings.py
 143 | 
 144 |         for group in groups:
 145 | 
 146 |             #There's a special set of rules for how to handle unigram and bigrams
 147 |             multigramSearch = re.match("(unigram|bigram|trigram)(\d)?",group)
 148 | 
 149 |             if multigramSearch:
 150 |                 if group=="unigram":
 151 |                     gramPos = "1"
 152 |                     gramType = "unigram"
 153 | 
 154 |                 else:
 155 |                     gramType = multigramSearch.groups()[0]
 156 |                     try:
 157 |                         gramPos = multigramSearch.groups()[1]
 158 |                     except:
 159 |                         print "currently you must specify which bigram element you want (eg, 'bigram1')"
 160 |                         raise
 161 | 
 162 |                 lookupTableName = "%sLookup%s" %(gramType,gramPos)
 163 |                 self.outerGroups.append("%s.%s as %s" %(lookupTableName,self.word_field,group))
 164 |                 self.finalMergeTables.add(" JOIN wordsheap as %s ON %s.wordid=w%s" %(lookupTableName,lookupTableName,gramPos))
 165 |                 self.groups.add("words%s.wordid as w%s" %(gramPos,gramPos))
 166 | 
 167 |             else:
 168 |                 self.outerGroups.append(group)
 169 |                 try:
 170 |                     if self.databaseScheme.aliases[group] != group:
 171 |                         #Search on the ID field, not the basic field.
 172 |                         #debug(self.databaseScheme.aliases.keys())
 173 |                         self.groups.add(self.databaseScheme.aliases[group])
 174 |                         table = self.databaseScheme.tableToLookIn[group]
 175 | 
 176 |                         joinfield = self.databaseScheme.aliases[group]
 177 |                         self.finalMergeTables.add(" JOIN " + table + " USING (" + joinfield + ") ")
 178 |                     else:
 179 |                         self.groups.add(group)
 180 |                 except KeyError:
 181 |                     self.groups.add(group)                
 182 | 
 183 |         """
 184 |         There are the selections which can include table refs, and the groupings, which may not:
 185 |         and the final suffix to enable fast lookup
 186 |         """
 187 | 
 188 |         self.selections = ",".join(self.groups)
 189 |         self.groupings  = ",".join([re.sub(".* as","",group) for group in self.groups])
 190 | 
 191 |         self.joinSuffix = "" + " ".join(self.finalMergeTables)
 192 | 
 193 |         """
 194 |         Define the comparison set if a comparison is being done.
 195 |         """
 196 |         #Deprecated--tagged for deletion
 197 |         #self.determineOutsideDictionary()
 198 | 
 199 |         #This is a little tricky behavior here--hopefully it works in all cases. It drops out word groupings.
 200 | 
 201 |         self.counttype = outside_dictionary.setdefault('counttype',["WordCount"])
 202 | 
 203 |         if isinstance(self.counttype,basestring):
 204 |             self.counttype = [self.counttype]
 205 | 
 206 |         #index is deprecated,but the old version uses it.
 207 |         self.index  = outside_dictionary.setdefault('index',0)
 208 |         """
 209 |         #Ordinarily, the input should be an an array of groups that will both select and group by.
 210 |         #The joins may be screwed up by certain names that exist in multiple tables, so there's an option to do something like
 211 |         #SELECT catalog.bookid as myid, because WHERE clauses on myid will work but GROUP BY clauses on catalog.bookid may not
 212 |         #after a sufficiently large number of subqueries.
 213 |         #This smoothing code really ought to go somewhere else, since it doesn't quite fit into the whole API mentality and is
 214 |         #more about the webpage. It is only included here as a stopgap: NO FURTHER APPLICATIONS USING IT SHOULD BE BUILT.
 215 |         """
 216 | 
 217 |         self.smoothingType = outside_dictionary.setdefault('smoothingType',"triangle")
 218 |         self.smoothingSpan = outside_dictionary.setdefault('smoothingSpan',3)
 219 |         self.method = outside_dictionary.setdefault('method',"Nothing")
 220 | 
 221 |     def determineOutsideDictionary(self):
 222 |         """
 223 |         deprecated--tagged for deletion.
 224 |         """
 225 |         self.compare_dictionary = copy.deepcopy(self.outside_dictionary)
 226 |         if 'compare_limits' in self.outside_dictionary.keys():
 227 |             self.compare_dictionary['search_limits'] = self.outside_dictionary['compare_limits']
 228 |             del self.outside_dictionary['compare_limits']
 229 |         elif sum([bool(re.search(r'\*',string)) for string in self.outside_dictionary['search_limits'].keys()]) > 0:
 230 |             #If any keys have stars at the end, drop them from the compare set
 231 |             #This is often a _very_ helpful definition for succinct comparison queries of many types.
 232 |             #The cost is that an asterisk doesn't allow you
 233 | 
 234 |             for key in self.outside_dictionary['search_limits'].keys():
 235 |                 if re.search(r'\*',key):
 236 |                     #rename the main one to not have a star
 237 |                     self.outside_dictionary['search_limits'][re.sub(r'\*','',key)] = self.outside_dictionary['search_limits'][key]
 238 |                     #drop it from the compare_limits and delete the version in the search_limits with a star
 239 |                     del self.outside_dictionary['search_limits'][key]
 240 |                     del self.compare_dictionary['search_limits'][key]
 241 |         else: #if nothing specified, we compare the word to the corpus.
 242 |             deleted = False
 243 |             for key in self.outside_dictionary['search_limits'].keys():
 244 |                 if re.search('words?\d',key) or re.search('gram$',key) or re.match(r'word',key):
 245 |                     del self.compare_dictionary['search_limits'][key]
 246 |                     deleted = True
 247 |             if not deleted:
 248 |                 #If there are no words keys, just delete the first key of any type.
 249 |                 #Sort order can't be assumed, but this is a useful failure mechanism of last resort. Maybe.
 250 |                 try:
 251 |                     del self.compare_dictionary['search_limits'][self.outside_dictionary['search_limits'].keys()[0]]
 252 |                 except:
 253 |                     pass
 254 |         """
 255 |         The grouping behavior here is not desirable, but I'm not quite sure how yet.
 256 |         Aha--one way is that it accidentally drops out a bunch of options. I'm just disabling it: let's see what goes wrong now.
 257 |         """
 258 |         try:
 259 |             pass#self.compare_dictionary['groups'] = [group for group in self.compare_dictionary['groups'] if not re.match('word',group) and not re.match("[u]?[bn]igram",group)]# topicfix? and not re.match("topic",group)]
 260 |         except:
 261 |             self.compare_dictionary['groups'] = [self.compare_dictionary['time_measure']]
 262 | 
 263 | 
 264 |     def derive_variables(self):
 265 |         #These are locally useful, and depend on the search limits put in.
 266 |         self.limits = self.search_limits
 267 |         #Treat empty constraints as nothing at all, not as full restrictions.
 268 |         for key in self.limits.keys():
 269 |             if self.limits[key] == []:
 270 |                 del self.limits[key]
 271 |         self.set_operations()
 272 |         self.create_catalog_table()
 273 |         self.make_catwhere()
 274 |         self.make_wordwheres()
 275 | 
 276 |     def tablesNeededForQuery(self,fieldNames=[]):
 277 |         db = self.db
 278 |         neededTables = set()
 279 |         tablenames = dict()
 280 |         tableDepends = dict()
 281 |         db.cursor.execute("SELECT dbname,alias,tablename,dependsOn FROM masterVariableTable JOIN masterTableTable USING (tablename);")
 282 |         for row in db.cursor.fetchall():
 283 |             tablenames[row[0]] = row[2]
 284 |             tableDepends[row[2]] = row[3]
 285 | 
 286 |         for fieldname in fieldNames:
 287 |             parent = ""
 288 |             try:
 289 |                 current = tablenames[fieldname]
 290 |                 neededTables.add(current)
 291 |                 n = 1
 292 |                 while parent not in ['fastcat','wordsheap']:
 293 |                     parent = tableDepends[current]
 294 |                     neededTables.add(parent)
 295 |                     current = parent;
 296 |                     n+=1
 297 |                     if n > 100:
 298 |                         raise TypeError("Unable to handle this; seems like a recursion loop in the table definitions.")
 299 |                     #This will add 'fastcat' or 'wordsheap' exactly once per entry
 300 |             except KeyError:
 301 |                 pass
 302 |             
 303 |         return neededTables
 304 | 
 305 |     def create_catalog_table(self):
 306 |         self.catalog = self.prefs['fastcat'] #'catalog' #Can be replaced with a more complicated query in the event of longer joins.
 307 | 
 308 |         """
 309 |         This should check query constraints against a list of tables, and join to them.
 310 |         So if you query with a limit on LCSH, and LCSH is listed as being in a separate table,
 311 |         it joins the table "LCSH" to catalog; and then that table has one column, ALSO
 312 |         called "LCSH", which is matched against. This allows a bookid to be a member of multiple catalogs.
 313 |         """
 314 | 
 315 |         #for limitation in self.prefs['separateDataTables']:
 316 |         #    #That re.sub thing is in here because sometimes I do queries that involve renaming.
 317 |         #    if limitation in [re.sub(" .*","",key) for key in self.limits.keys()] or limitation in [re.sub(" .*","",group) for group in self.groups]:
 318 |         #        self.catalog = self.catalog + """ JOIN """ + limitation + """ USING (bookid)"""
 319 | 
 320 |         """
 321 |         Here it just pulls every variable and where to look for it.
 322 |         """
 323 | 
 324 | 
 325 |         self.relevantTables = set()
 326 | 
 327 |         databaseScheme = self.databaseScheme
 328 |         columns = []
 329 |         for columnInQuery in [re.sub(" .*","",key) for key in self.limits.keys()] + [re.sub(" .*","",group) for group in self.groups]:
 330 |             columns.append(columnInQuery)
 331 |             try:
 332 |                 self.relevantTables.add(databaseScheme.tableToLookIn[columnInQuery])
 333 |                 try:
 334 |                     self.relevantTables.add(databaseScheme.tableToLookIn[databaseScheme.anchorFields[columnInQuery]])
 335 |                     try:
 336 |                         self.relevantTables.add(databaseScheme.tableToLookIn[databaseScheme.anchorFields[databaseScheme.anchorFields[columnInQuery]]])
 337 |                     except KeyError:
 338 |                         pass
 339 |                 except KeyError:
 340 |                     pass
 341 |             except KeyError:
 342 |                 pass
 343 |                 #Could raise as well--shouldn't be errors--but this helps back-compatability.
 344 | 
 345 | #        if "catalog" in self.relevantTables and self.method != "bibliography_query":
 346 | #            self.relevantTables.remove('catalog')
 347 |         try:
 348 |             moreTables = self.tablesNeededForQuery(columns)
 349 |         except MySQLdb.ProgrammingError:
 350 |             #What happens on old-style Bookworm constructions.
 351 |             moreTables = set()
 352 |         self.relevantTables = self.relevantTables.union(moreTables)
 353 |         self.catalog = "fastcat"
 354 |         for table in self.relevantTables:
 355 |             if table!="fastcat" and table!="words" and table!="wordsheap" and table!="master_bookcounts" and table!="master_bigrams":
 356 |                 self.catalog = self.catalog + """ NATURAL JOIN """ + table + " "
 357 | 
 358 |     def make_catwhere(self):
 359 |         #Where terms that don't include the words table join. Kept separate so that we can have subqueries only working on one half of the stack.
 360 |         catlimits = dict()
 361 |         for key in self.limits.keys():
 362 |             ###Warning--none of these phrases can be used ina  bookworm as a custom table names.
 363 |             if key not in ('word','word1','word2','hasword') and not re.search("words\d",key):
 364 |                 catlimits[key] = self.limits[key]
 365 |         if len(catlimits.keys()) > 0:
 366 |             self.catwhere = where_from_hash(catlimits)
 367 |         else:
 368 |             self.catwhere = "TRUE"
 369 |         if 'hasword' in self.limits.keys():
 370 |             """
 371 |             Because derived tables don't carry indexes, we're just making the new tables
 372 |             with indexes on the fly to be stored in a temporary database, "bookworm_scratch"
 373 |             Each time a hasword query is performed, the results of that query are permanently cached;
 374 |             they're stored as a table that can be used in the future.
 375 | 
 376 |             This will create problems if database contents are changed; there needs to be some mechanism for 
 377 |             clearing out the cache periodically.
 378 |             """
 379 | 
 380 |             if self.limits['hasword'] == []:
 381 |                 del self.limits['hasword']
 382 |                 return
 383 | 
 384 |             #deepcopy lets us get a real copy of the dictionary
 385 |             #that can be changed without affecting the old one.
 386 |             mydict = copy.deepcopy(self.outside_dictionary)
 387 |             # This may make it take longer than it should; we might want the list to
 388 |             # just be every bookid with the given word rather than
 389 |             # filtering by the limits as well.
 390 |             # It's not obvious to me which will be faster.
 391 |             mydict['search_limits'] = copy.deepcopy(self.limits)
 392 |             if isinstance(mydict['search_limits']['hasword'],basestring):
 393 |                 #Make sure it's an array
 394 |                 mydict['search_limits']['hasword'] =  [mydict['search_limits']['hasword']] 
 395 |             """
 396 |             #Ideally, this would shuffle into an order ensuring that the
 397 |             rarest words were nested deepest.
 398 |             #That would speed up query execution by ensuring there
 399 |             wasn't some massive search for 'the' being
 400 |             #done at the end.
 401 | 
 402 |             Instead, it just pops off the last element and sets up a
 403 |             recursive nested join. for every element in the 
 404 |             array.
 405 |             """
 406 |             mydict['search_limits']['word'] = [mydict['search_limits']['hasword'].pop()]
 407 |             if len(mydict['search_limits']['hasword'])==0:
 408 |                 del mydict['search_limits']['hasword']
 409 |             tempquery = userquery(mydict,databaseScheme=self.databaseScheme)
 410 |             listofBookids = tempquery.bookid_query()
 411 | 
 412 |             #Unique identifier for the query that persists across the 
 413 |             #various subqueries.
 414 |             queryID  = hashlib.sha1(listofBookids).hexdigest()[:20]
 415 | 
 416 |             tmpcatalog = "bookworm_scratch.tmp" + re.sub("-","",queryID)
 417 | 
 418 |             try:
 419 |                 self.cursor.execute("CREATE TABLE %s (bookid MEDIUMINT, PRIMARY KEY (bookid)) ENGINE=MYISAM;" %tmpcatalog)
 420 |                 self.cursor.execute("INSERT IGNORE INTO %s %s;" %(tmpcatalog,listofBookids))
 421 | 
 422 |             except MySQLdb.OperationalError,e:
 423 |                 #Usually the error will be 1050, which is a good thing: it means we don't need to
 424 |                 #create the table.
 425 |                 #If it's not, something bad is happening.
 426 |                 if not re.search("1050.*already exists",str(e)):
 427 |                     raise
 428 |             self.catalog += " NATURAL JOIN %s "%(tmpcatalog)
 429 | 
 430 | 
 431 |     def make_wordwheres(self):
 432 |         self.wordswhere = " TRUE "
 433 |         self.max_word_length = 0
 434 |         limits = []
 435 |         """
 436 |         "unigram" or "bigram" can be used as an alias for "word" in the search_limits field.
 437 |         """
 438 | 
 439 |         for gramterm in ['unigram','bigram']:
 440 |             if gramterm in self.limits.keys() and not "word" in self.limits.keys():
 441 |                 self.limits['word'] = self.limits[gramterm]
 442 |                 del self.limits[gramterm]
 443 | 
 444 |         if 'word' in self.limits.keys():
 445 |             """
 446 |             This doesn't currently allow mixing of one and two word searches together in a logical way.
 447 |             It might be possible to just join on both the tables in MySQL--I'm not completely sure what would happen.
 448 |             But the philosophy has been to keep users from doing those searches as far as possible in any case.
 449 |             """
 450 |             for phrase in self.limits['word']:
 451 |                 locallimits = dict()
 452 |                 array = phrase.split(" ")
 453 |                 n = 0
 454 |                 for word in array:
 455 |                     n += 1
 456 |                     searchingFor = word
 457 |                     if self.word_field=="stem":
 458 |                         from nltk import PorterStemmer
 459 |                         searchingFor = PorterStemmer().stem_word(searchingFor)
 460 |                     if self.word_field=="case_insensitive" or self.word_field=="Case_Insensitive":
 461 |                         searchingFor = searchingFor.lower()
 462 | 
 463 |                     selectString =  "SELECT wordid FROM wordsheap WHERE %s = %%s" % self.word_field
 464 |                     cursor = self.db.cursor
 465 |                     try:
 466 |                         cursor.execute(selectString, (searchingFor))
 467 |                     except MySQLdb.Error, e:
 468 |                         # Return HTML error code and log the following
 469 |                         # print e
 470 |                         # print cursor._last_executed
 471 |                         print ''
 472 |                     for row in cursor.fetchall():
 473 |                         wordid = row[0]
 474 |                         try:
 475 |                             locallimits['words'+str(n) + ".wordid"] += [wordid]
 476 |                         except KeyError:
 477 |                             locallimits['words'+str(n) + ".wordid"] = [wordid]
 478 |                     self.max_word_length = max(self.max_word_length,n)
 479 | 
 480 |                 #Strings have already been escaped, so don't need to be escaped again.
 481 |                 if len(locallimits.keys()) > 0:
 482 |                     limits.append(where_from_hash(locallimits,comp = " = ",escapeStrings=False))
 483 |                 #XXX for backward compatability
 484 |                 self.words_searched = phrase
 485 |                 #XXX end deprecated block
 486 |             self.wordswhere = "(" + ' OR '.join(limits) + ")"
 487 |             if limits == []:
 488 |                 #In the case that nothing has been found, tell it explicitly to search for
 489 |                 #a condition when nothing will be found.
 490 |                 self.wordswhere = "words1.wordid=-1"
 491 | 
 492 | 
 493 |         wordlimits = dict()
 494 | 
 495 |         limitlist = copy.deepcopy(self.limits.keys())
 496 | 
 497 |         for key in limitlist:
 498 |             if re.search("words\d",key):
 499 |                 wordlimits[key] = self.limits[key]
 500 |                 self.max_word_length = max(self.max_word_length,2)
 501 |                 del self.limits[key]
 502 | 
 503 |         if len(wordlimits.keys()) > 0:
 504 |             self.wordswhere = where_from_hash(wordlimits)
 505 | 
 506 |         return self.wordswhere
 507 | 
 508 |     def build_wordstables(self):
 509 |         #Deduce the words tables we're joining against. The iterating on this can be made more general to get 3 or four grams in pretty easily.
 510 |         #This relies on a determination already having been made about whether this is a unigram or bigram search; that's reflected in the self.selections
 511 |         #variable.
 512 | 
 513 | 
 514 |         """
 515 |         We also now check for whether it needs the topic assignments: this could be generalized, with difficulty, for any other kind of plugin.
 516 |         """
 517 | 
 518 |         needsBigrams = (self.max_word_length == 2 or re.search("words2",self.selections))
 519 |         needsUnigrams = self.max_word_length == 1 or re.search("[^h][^a][^s]word",self.selections)
 520 |         needsTopics   = bool(re.search("topic",self.selections)) or ("topic" in self.limits.keys())
 521 | 
 522 |         if needsBigrams:
 523 | 
 524 |             self.maintable = 'master_bigrams'
 525 | 
 526 |             self.main = '''
 527 |                  JOIN
 528 |                  master_bigrams as main
 529 |                  ON ('''+ self.prefs['fastcat'] +'''.bookid=main.bookid)
 530 |                  '''
 531 | 
 532 |             self.wordstables =  """
 533 |             JOIN %(wordsheap)s as words1 ON (main.word1 = words1.wordid)
 534 |             JOIN %(wordsheap)s as words2 ON (main.word2 = words2.wordid) """ % self.__dict__
 535 | 
 536 |         #I use a regex here to do a blanket search for any sort of word limitations. That has some messy sideffects (make sure the 'hasword'
 537 |         #key has already been eliminated, for example!) but generally works.
 538 | 
 539 |         elif needsTopics and needsUnigrams:
 540 |             self.maintable = 'master_topicWords'
 541 |             self.main = '''
 542 |                 NATURAL JOIN 
 543 |             master_topicWords as main
 544 |             '''
 545 |             self.wordstables = """
 546 |               JOIN ( %(wordsheap)s as words1)  ON (main.wordid = words1.wordid)
 547 |              """ % self.__dict__
 548 |             
 549 |         elif needsUnigrams:
 550 |             self.maintable = 'master_bookcounts'
 551 |             self.main = '''
 552 |                 NATURAL JOIN
 553 |                  master_bookcounts as main
 554 |             '''
 555 |                  #ON (''' + self.prefs['fastcat'] + '''.bookid=main.bookid)'''
 556 |             self.wordstables = """
 557 |               JOIN ( %(wordsheap)s as words1)  ON (main.wordid = words1.wordid)
 558 |              """ % self.__dict__
 559 | 
 560 |         elif needsTopics:
 561 |             self.maintable = 'master_topicCounts'
 562 |             self.main = '''
 563 |                 NATURAL JOIN
 564 |                  master_topicCounts as main '''
 565 |             self.wordstables = " "
 566 |             self.wordswhere = " TRUE "
 567 | 
 568 |         else:
 569 |             """
 570 |             Have _no_ words table if no words searched for or grouped by;
 571 |             instead just use nwords. This
 572 |             means that we can use the same basic functions both to build the
 573 |             counts for word searches and
 574 |             for metadata searches, which is valuable because there is a
 575 |             metadata-only search built in to every single ratio
 576 |             query. (To get the denominator values).
 577 | 
 578 |             Call this OLAP, if you like.
 579 |             """
 580 |             self.main = " "
 581 |             self.operation = ','.join(self.catoperations)
 582 |             """
 583 |             This, above is super important: the operation used is relative to the counttype, and changes to use 'catoperation' instead of 'bookoperation'
 584 |             That's the place that the denominator queries avoid having to do a table scan on full bookcounts that would take hours, and instead takes
 585 |             milliseconds.
 586 |             """
 587 |             self.wordstables = " "
 588 |             self.wordswhere  = " TRUE "
 589 |             #Just a dummy thing to make the SQL writing easier. Shouldn't take any time. Will usually be extended with actual conditions.
 590 | 
 591 |     def set_operations(self):
 592 |         """
 593 |         This is the code that allows multiple values to be selected.
 594 | 
 595 |         All can be removed when we kill back compatibility ! It's all handled now by the general_API, not the SQL_API.
 596 |         """
 597 | 
 598 | 
 599 |         backCompatability = {"Occurrences_per_Million_Words":"WordsPerMillion","Raw_Counts":"WordCount","Percentage_of_Books":"TextPercent","Number_of_Books":"TextCount"}
 600 | 
 601 |         for oldKey in backCompatability.keys():
 602 |             self.counttype = [re.sub(oldKey,backCompatability[oldKey],entry) for entry in self.counttype]
 603 | 
 604 |         self.bookoperation = {}
 605 |         self.catoperation = {}
 606 |         self.finaloperation = {}
 607 | 
 608 |         #Text statistics
 609 |         self.bookoperation['TextPercent'] = "count(DISTINCT " + self.prefs['fastcat'] + ".bookid) as TextCount"
 610 |         self.bookoperation['TextRatio'] = "count(DISTINCT " + self.prefs['fastcat'] + ".bookid) as TextCount"
 611 |         self.bookoperation['TextCount'] = "count(DISTINCT " + self.prefs['fastcat'] + ".bookid) as TextCount"
 612 | 
 613 |         #Word Statistics
 614 |         self.bookoperation['WordCount'] = "sum(main.count) as WordCount"
 615 |         self.bookoperation['WordsPerMillion'] = "sum(main.count) as WordCount"
 616 |         self.bookoperation['WordsRatio'] = "sum(main.count) as WordCount"
 617 | 
 618 | 
 619 |         """
 620 |         +Total Numbers for comparisons/significance assessments
 621 |         This is a little tricky. The total words is EITHER the denominator (as in a query against words per Million) or the numerator+denominator (if you're comparing
 622 |         Pittsburg and Pittsburgh, say, and want to know the total number of uses of the lemma. For now, "TotalWords" means the former and "SumWords" the latter,
 623 |         On the theory that 'TotalWords' is more intuitive and only I (Ben) will be using SumWords all that much.
 624 |         """
 625 |         self.bookoperation['TotalWords'] = self.bookoperation['WordsPerMillion']
 626 |         self.bookoperation['SumWords'] = self.bookoperation['WordsPerMillion']
 627 |         self.bookoperation['TotalTexts'] = self.bookoperation['TextCount']
 628 |         self.bookoperation['SumTexts'] = self.bookoperation['TextCount']
 629 | 
 630 |         for stattype in self.bookoperation.keys():
 631 |             if re.search("Word",stattype):
 632 |                 self.catoperation[stattype] = "sum(nwords) as WordCount"
 633 |             if re.search("Text",stattype):
 634 |                 self.catoperation[stattype] = "count(nwords) as TextCount"
 635 | 
 636 |         self.finaloperation['TextPercent'] = "IFNULL(numerator.TextCount,0)/IFNULL(denominator.TextCount,0)*100 as TextPercent"
 637 |         self.finaloperation['TextRatio'] = "IFNULL(numerator.TextCount,0)/IFNULL(denominator.TextCount,0) as TextRatio"
 638 |         self.finaloperation['TextCount'] = "IFNULL(numerator.TextCount,0) as TextCount"
 639 | 
 640 |         self.finaloperation['WordsPerMillion'] = "IFNULL(numerator.WordCount,0)*100000000/IFNULL(denominator.WordCount,0)/100 as WordsPerMillion"
 641 |         self.finaloperation['WordsRatio'] = "IFNULL(numerator.WordCount,0)/IFNULL(denominator.WordCount,0) as WordsRatio"
 642 |         self.finaloperation['WordCount'] = "IFNULL(numerator.WordCount,0) as WordCount"
 643 | 
 644 |         self.finaloperation['TotalWords'] = "IFNULL(denominator.WordCount,0) as TotalWords"
 645 |         self.finaloperation['SumWords']   = "IFNULL(denominator.WordCount,0) + IFNULL(numerator.WordCount,0) as SumWords"
 646 |         self.finaloperation['TotalTexts'] = "IFNULL(denominator.TextCount,0) as TotalTexts"
 647 |         self.finaloperation['SumTexts'] = "IFNULL(denominator.TextCount,0) + IFNULL(numerator.TextCount,0) as SumTexts"
 648 | 
 649 |         """
 650 |         The values here will be chosen in build_wordstables; that's what decides if it uses the 'bookoperation' or 'catoperation' dictionary to build out.
 651 |         """
 652 | 
 653 |         self.finaloperations = list()
 654 |         self.bookoperations = set()
 655 |         self.catoperations = set()
 656 | 
 657 |         for summaryStat in self.counttype:
 658 |             self.catoperations.add(self.catoperation[summaryStat])
 659 |             self.bookoperations.add(self.bookoperation[summaryStat])
 660 |             self.finaloperations.append(self.finaloperation[summaryStat])
 661 | 
 662 |     def counts_query(self):
 663 | 
 664 |         self.operation = ','.join(self.bookoperations)
 665 |         self.build_wordstables()
 666 | 
 667 |         countsQuery = """
 668 |             SELECT
 669 |                 %(selections)s,
 670 |                 %(operation)s
 671 |             FROM
 672 |                 %(catalog)s
 673 |                 %(main)s
 674 |                 %(wordstables)s
 675 |             WHERE
 676 |                  %(catwhere)s AND %(wordswhere)s
 677 |             GROUP BY
 678 |                 %(groupings)s
 679 |         """ % self.__dict__
 680 |         return countsQuery
 681 | 
 682 |     def bookid_query(self):
 683 |         #A temporary method to setup the hasword query.
 684 |         self.operation = ','.join(self.bookoperations)
 685 |         self.build_wordstables()
 686 | 
 687 |         countsQuery = """
 688 |             SELECT
 689 |                 main.bookid as bookid
 690 |             FROM
 691 |                 %(catalog)s
 692 |                 %(main)s
 693 |                 %(wordstables)s
 694 |             WHERE
 695 |                  %(catwhere)s AND %(wordswhere)s
 696 |         """ % self.__dict__
 697 |         return countsQuery
 698 | 
 699 |     def debug_query(self):
 700 |         query = self.ratio_query(materialize = False)
 701 |         return json.dumps(self.denominator.groupings.split(",")) + query 
 702 |     
 703 |     def query(self,materialize=False):
 704 |         """
 705 |         We launch a whole new userquery instance here to build the denominator, based on the 'compare_dictionary' option (which in most
 706 |         cases is the search_limits without the keys, see above; it can also be specially defined using asterisks as a shorthand to identify other fields to drop.
 707 |         We then get the counts_query results out of that result.
 708 |         """
 709 | 
 710 |         """
 711 |         self.denominator =  userquery(outside_dictionary = self.compare_dictionary,db=self.db,databaseScheme=self.databaseScheme)
 712 |         self.supersetquery = self.denominator.counts_query()
 713 |         supersetIndices = self.denominator.groupings.split(",")
 714 |         if materialize:
 715 |             self.supersetquery = derived_table(self.supersetquery,self.db,indices=supersetIndices).materialize()
 716 |         """
 717 |         self.mainquery    = self.counts_query()
 718 |         self.countcommand = ','.join(self.finaloperations)
 719 |         self.totalselections  = ",".join([group for group in self.outerGroups if group!="1 as In_Library" and group != ""])
 720 |         if self.totalselections != "": self.totalselections += ", "
 721 |        
 722 |         query = """
 723 |         SELECT
 724 |             %(totalselections)s
 725 |             %(countcommand)s
 726 |         FROM
 727 |             (%(mainquery)s) as numerator
 728 |         %(joinSuffix)s
 729 |         GROUP BY %(groupings)s;""" % self.__dict__
 730 | 
 731 |         return query
 732 | 
 733 | 
 734 |     def returnPossibleFields(self):
 735 |         try:
 736 |             self.cursor.execute("SELECT name,type,description,tablename,dbname,anchor FROM masterVariableTable WHERE status='public'")
 737 |             colnames = [line[0] for line in self.cursor.description]
 738 |             returnset = []
 739 |             for line in self.cursor.fetchall():
 740 |                 thisEntry = {}
 741 |                 for i in range(len(line)):
 742 |                     thisEntry[colnames[i]] = line[i]
 743 |                 returnset.append(thisEntry)
 744 |         except:
 745 |             returnset=[]
 746 |         return returnset
 747 | 
 748 |     def return_slug_data(self,force=False):
 749 |         #Rather than understand this error, I'm just returning 0 if it fails.
 750 |         #Probably that's the right thing to do, though it may cause trouble later.
 751 |         #It's just a punishment for not later using a decent API call with "Raw_Counts" to extract these counts out, and relying on this ugly method.
 752 |         #Please, citizens of the future, NEVER USE THIS METHOD.
 753 |         try:
 754 |             temp_words = self.return_n_words(force = True)
 755 |             temp_counts = self.return_n_books(force = True)
 756 |         except:
 757 |             temp_words = 0
 758 |             temp_counts = 0
 759 |         return [temp_counts,temp_words]
 760 | 
 761 |     def return_n_books(self,force=False): #deprecated
 762 |         if (not hasattr(self,'nbooks')) or force:
 763 |             query = "SELECT count(*) from " + self.catalog + " WHERE " + self.catwhere
 764 |             silent = self.cursor.execute(query)
 765 |             self.counts = int(self.cursor.fetchall()[0][0])
 766 |         return self.counts
 767 | 
 768 |     def return_n_words(self,force=False): #deprecated
 769 |         if (not hasattr(self,'nwords')) or force:
 770 |             query = "SELECT sum(nwords) from " + self.catalog + " WHERE " + self.catwhere
 771 |             silent = self.cursor.execute(query)
 772 |             self.nwords = int(self.cursor.fetchall()[0][0])
 773 |         return self.nwords
 774 | 
 775 |     def bibliography_query(self,limit = "100"):
 776 |         #I'd like to redo this at some point so it could work as an API call more naturally.
 777 |         self.limit = limit
 778 |         self.ordertype = "sum(main.count*10000/nwords)"
 779 |         try:
 780 |             if self.outside_dictionary['ordertype'] == "random":
 781 |                 if self.counttype==["Raw_Counts"] or self.counttype==["Number_of_Books"] or self.counttype==['WordCount'] or self.counttype==['BookCount'] or self.counttype==['TextCount']:
 782 |                     self.ordertype = "RAND()"
 783 |                 else:
 784 |                     #This is a based on an attempt to match various different distributions I found on the web somewhere to give
 785 |                     #weighted results based on the counts. It's not perfect, but might be good enough. Actually doing a weighted random search is not easy without
 786 |                     #massive memory usage inside sql.
 787 |                     self.ordertype = "LOG(1-RAND())/sum(main.count)"
 788 |         except KeyError:
 789 |             pass
 790 | 
 791 |         #If IDF searching is enabled, we could add a term like '*IDF' here to overweight better selecting words
 792 |         #in the event of a multiple search.
 793 |         self.idfterm = ""
 794 |         prep = self.counts_query()
 795 | 
 796 | 
 797 |         if self.main == " ":
 798 |             self.ordertype="RAND()"
 799 | 
 800 |         bibQuery = """
 801 |         SELECT searchstring
 802 |         FROM """ % self.__dict__ + self.prefs['fullcat'] + """ RIGHT JOIN (
 803 |         SELECT
 804 |         """+ self.prefs['fastcat'] + """.bookid, %(ordertype)s as ordering
 805 |             FROM
 806 |                 %(catalog)s
 807 |                 %(main)s
 808 |                 %(wordstables)s
 809 |             WHERE
 810 |                  %(catwhere)s AND %(wordswhere)s
 811 |         GROUP BY bookid ORDER BY %(ordertype)s DESC LIMIT %(limit)s
 812 |         ) as tmp USING(bookid) ORDER BY ordering DESC;
 813 |         """ % self.__dict__
 814 |         return bibQuery
 815 | 
 816 |     def disk_query(self,limit="100"):
 817 |         pass
 818 | 
 819 |     def return_books(self):
 820 |         #This preps up the display elements for a search: it returns an array with a single string for each book, sorted in the best possible way
 821 |         silent = self.cursor.execute(self.bibliography_query())
 822 |         returnarray = []
 823 |         for line in self.cursor.fetchall():
 824 |             returnarray.append(line[0])
 825 |         if not returnarray:
 826 |             #why would someone request a search with no locations? Turns out (usually) because the smoothing tricked them.
 827 |             returnarray.append("No results for this particular point: try again without smoothing")
 828 |         newerarray = self.custom_SearchString_additions(returnarray)
 829 |         return json.dumps(newerarray)
 830 | 
 831 |     def search_results(self):
 832 |         #This is an alias that is handled slightly differently in APIimplementation (no "RESULTS" bit in front). Once
 833 |         #that legacy code is cleared out, they can be one and the same.
 834 |         return json.loads(self.return_books())
 835 | 
 836 |     def getActualSearchedWords(self):
 837 |         if len(self.wordswhere) > 7:
 838 |             words = self.outside_dictionary['search_limits']['word']
 839 |             #Break bigrams into single words.
 840 |             words = ' '.join(words).split(' ')
 841 |             self.cursor.execute("""SELECT word FROM wordsheap WHERE """ + where_from_hash({self.word_field:words}))
 842 |             self.actualWords =[item[0] for item in self.cursor.fetchall()]
 843 |         else:
 844 |             self.actualWords = ["tasty","mistake","happened","here"]
 845 | 
 846 |     def custom_SearchString_additions(self,returnarray):
 847 |         """
 848 |         It's nice to highlight the words searched for. This will be on partner web sites, so requires custom code for different databases
 849 |         """
 850 |         db = self.outside_dictionary['database']
 851 |         if db in ('jstor','presidio','ChronAm','LOC','OL'):
 852 |             self.getActualSearchedWords()
 853 |             if db=='jstor':
 854 |                 joiner = "&searchText="
 855 |                 preface = "?Search=yes&searchText="
 856 |                 urlRegEx = "http://www.jstor.org/stable/\d+"
 857 |             if db=='presidio' or db=='OL':
 858 |                 joiner = "+"
 859 |                 preface =  "#page/1/mode/2up/search/"
 860 |                 urlRegEx = 'http://archive.org/stream/[^"# ><]*'
 861 |             if db in ('ChronAm','LOC'):
 862 |                 preface = "/;words="
 863 |                 joiner = "+"
 864 |                 urlRegEx = 'http://chroniclingamerica.loc.gov[^\"><]*/seq-\d+'
 865 |             newarray = []
 866 |             for string in returnarray:
 867 |                 try:
 868 |                     base = re.findall(urlRegEx,string)[0]
 869 |                     newcore = ' <a href = "' +  base  + preface + joiner.join(self.actualWords) + '"> search inside </a>'
 870 |                     string = re.sub("^<td>","",string)
 871 |                     string = re.sub("</td>$","",string)
 872 |                     string = string+newcore
 873 |                 except IndexError:
 874 |                     pass
 875 |                 newarray.append(string)
 876 |         #Arxiv is messier, requiring a whole different URL interface: http://search.arxiv.org:8081/paper.jsp?r=1204.3352&qs=netwokr
 877 |         else:
 878 |             newarray = returnarray
 879 |         return newarray
 880 | 
 881 |     def return_query_values(self,query = "ratio_query"):
 882 |         #The API returns a dictionary with years pointing to values.
 883 |         """
 884 |         DEPRECATED: use 'return_json' or 'return_tsv' (the latter only works with single 'search_limits' options) instead
 885 |         """
 886 |         values = []
 887 |         querytext = getattr(self,query)()
 888 |         silent = self.cursor.execute(querytext)
 889 |             #Gets the results
 890 |         mydict = dict(self.cursor.fetchall())
 891 |         try:
 892 |             for key in mydict.keys():
 893 |                 #Only return results inside the time limits
 894 |                 if key >= self.time_limits[0] and key <= self.time_limits[1]:
 895 |                     mydict[key] = str(mydict[key])
 896 |                 else:
 897 |                     del mydict[key]
 898 |             mydict = smooth_function(mydict,smooth_method = self.smoothingType,span = self.smoothingSpan)
 899 | 
 900 |         except:
 901 |             mydict = {0:"0"}
 902 | 
 903 |         #This is a good place to change some values.
 904 |         try:
 905 |             return {'index':self.index, 'Name':self.words_searched,"values":mydict,'words_searched':""}
 906 |         except:
 907 |             return{'values':mydict}
 908 | 
 909 | 
 910 |     def return_tsv(self,query = "ratio_query"):
 911 |         if self.outside_dictionary['counttype']=="Raw_Counts" or self.outside_dictionary['counttype']==["Raw_Counts"]:
 912 |             query="counts_query"
 913 |             #This allows much speedier access to counts data if you're willing not to know about all the zeroes.
 914 |             #Will not work as well once the id_fields are in use.
 915 |         querytext = getattr(self,query)()
 916 |         silent = self.cursor.execute(querytext)
 917 |         results = ["\t".join([to_unicode(item[0]) for item in self.cursor.description])]
 918 |         lines = self.cursor.fetchall()
 919 |         for line in lines:
 920 |             items = []
 921 |             for item in line:
 922 |                 item = to_unicode(item)
 923 |                 item = re.sub("\t","<tab>",item)
 924 |                 items.append(item)
 925 |             results.append("\t".join(items))
 926 |         return "\n".join(results)
 927 | 
 928 |     def export_data(self,query1="ratio_query"):
 929 |         self.smoothing=0
 930 |         return self.return_query_values(query=query1)
 931 | 
 932 |     def execute(self):
 933 |         #This performs the query using the method specified in the passed parameters.
 934 |         if self.method=="Nothing":
 935 |             pass
 936 |         else:
 937 |             value = getattr(self,self.method)()
 938 |             return value
 939 | 
 940 | class derived_table(object):
 941 |     """
 942 |     MySQL/MariaDB doesn't have good subquery materialization,
 943 |     so I'm implementing it by hand.
 944 |     """
 945 |     def __init__(self,SQLstring,db,indices = [],dbToPutIn = "bookworm_scratch"):
 946 |         """
 947 |         initialize with the code to create the table; the database it will be in
 948 |         (to prevent conflicts with other identical queries in other dbs);
 949 |         and the list of all tables to be indexed
 950 |         (optional, but which can really speed up joins)
 951 |         """
 952 |         self.query = SQLstring
 953 |         self.db = db
 954 |         #Each query is identified by a unique key hashed
 955 |         #from the query and the dbname.
 956 |         self.queryID  = dbToPutIn + "." + "derived" + hashlib.sha1(self.query + db.dbname).hexdigest()
 957 |         self.indices = "(" + ",".join(["INDEX(%s)" % index  for index in indices]) + ")" if indices != [] else ""
 958 |     
 959 |     def setStorageEngines(self,temp):
 960 |         """
 961 |         Chooses where and how to store tables.
 962 |         """
 963 |         self.tempString = "TEMPORARY" if temp else ""
 964 |         self.engine = "MEMORY" if temp else "MYISAM"
 965 | 
 966 |     def checkCache(self):
 967 |         """
 968 |         Checks what's already been calculated.
 969 |         """
 970 |         try:
 971 |             (self.count,self.created,self.modified,self.createCode,self.data) = self.db.cursor.execute("SELECT count,created,modified,createCode,data FROM bookworm_scratch.cache WHERE fieldname='%s'" %self.queryID)[0]
 972 |             return True
 973 |         except:
 974 |             (self.count,self.created,self.modified,self.createCode,self.data) = [None]*5
 975 |             return False
 976 | 
 977 |     def fillTableWithData(self,data):
 978 |         dataCode = "INSERT INTO %s values ("%self.queryID + ", ".join(["%s"]*len(data[0])) + ")"
 979 |         self.db.cursor.executemany(dataCode,data)
 980 |         self.db.db.commit()
 981 |             
 982 |     def materializeFromCache(self,temp):
 983 |         if self.data is not None:
 984 |             #Datacode should never exist without createCode also.
 985 |             self.db.cursor.execute(self.createCode)
 986 |             self.fillTableWithData(pickle.loads(self.data,protocol=-1))
 987 |             return True
 988 |         else:
 989 |             return False
 990 | 
 991 | 
 992 |     def createFromCacheWithDataFromBookworm(self,temp,postDataToCache=False):
 993 |         """
 994 |         If the create code exists but the data does not.
 995 |         This uses a form of query that MySQL can cache,
 996 |         unlike the normal subqueries OR the CREATE TABLE ... INSERT
 997 |         used by materializeFromBookworm.
 998 | 
 999 |         You can also post the data itself, but that's turned off by default:
1000 |         because why wouldn't it have been posted the first time?
1001 |         Probably it's too large or something, is why.
1002 |         """
1003 |         if self.createCode==None:
1004 |             return False
1005 |         self.db.cursor.execute(self.createCode)
1006 |         self.db.cursor.execute(self.query)
1007 |         data = [row for row in self.db.cursor.fetchall()]
1008 |         self.newdata = pickle.dumps(data,protocol=-1)
1009 |         self.fillTableWithData(data)
1010 |         if postDataToCache:
1011 |             self.updateCache(postQueries=["UPDATE bookworm_scratch.cache SET data='%s' WHERE fieldname='%s'" %(MySQLdb.escape_string(self.newdata),self.queryID)])
1012 |         else:
1013 |             self.updateCache()
1014 |         return True
1015 |             
1016 |     def materializeFromBookworm(self,temp,postDataToCache=True,postCreateToCache=True):
1017 |         import cPickle as pickle
1018 |         self.db.cursor.execute("CREATE %(tempString)s TABLE %(queryID)s %(indices)s ENGINE=%(engine)s %(query)s;" % self.__dict__)
1019 |         self.db.cursor.execute("SHOW CREATE TABLE %s" %self.queryID)
1020 |         self.newCreateCode = self.db.cursor.fetchall()[0][1]
1021 |         self.db.cursor.execute("SELECT * FROM %s" %self.queryID)
1022 |         #coerce the results to a list of tuples, then pickle it.
1023 |         self.newdata = pickle.dumps([row for row in self.db.cursor.fetchall()],protocol=-1)
1024 | 
1025 |         if postDataToCache:
1026 |             self.updateCache(postQueries=["UPDATE bookworm_scratch.cache SET data='%s',createCode='%s' WHERE fieldname='%s'" %(MySQLdb.escape_string(self.newdata),MySQLdb.escape_string(self.newCreateCode),self.queryID)])
1027 | 
1028 |         if postCreateToCache:
1029 |             self.updateCache(postQueries=["UPDATE bookworm_scratch.cache SET createCode='%s' WHERE fieldname='%s'" %(MySQLdb.escape_string(self.newCreateCode),self.queryID)])
1030 |         
1031 |         
1032 |     def updateCache(self,postQueries=[]):
1033 |         q1 = """
1034 |         INSERT INTO bookworm_scratch.cache (fieldname,created,modified,count) VALUES
1035 |         ('%s',NOW(),NOW(),1) ON DUPLICATE KEY UPDATE count = count + 1,modified=NOW();""" %self.queryID
1036 |         result = self.db.cursor.execute(q1)
1037 |         for query in postQueries:
1038 |             self.db.cursor.execute(query)
1039 |         self.db.db.commit()
1040 |         
1041 |     def materialize(self,temp="default"):
1042 |         """
1043 |         materializes the table, by default in memory in the bookworm_scratch
1044 |         database. If temp is false, the table will be stored on disk, available
1045 |         for future users too. This should be used sparingly, because you can't have too many
1046 |         tables on disk.
1047 | 
1048 |         Returns the tableID, which the superquery to this one may need to know.
1049 |         """
1050 |         if temp=="default":
1051 |             temp=True
1052 | 
1053 |         self.checkCache()
1054 |         self.setStorageEngines(temp)
1055 | 
1056 |         try:
1057 |             if not self.materializeFromCache(temp):
1058 |                 if not self.createFromCacheWithDataFromBookworm(temp):
1059 |                     self.materializeFromBookworm(temp)
1060 |     
1061 |         except MySQLdb.OperationalError,e:
1062 |             #Often the error will be 1050, which is a good thing:
1063 |             #It means we don't need to
1064 |             #create the table, because it's there already.
1065 |             #But if it's not, something bad is happening.
1066 |             if not re.search("1050.*already exists",str(e)):
1067 |                 raise
1068 |         
1069 |         return self.queryID
1070 |         
1071 | class databaseSchema:
1072 |     """
1073 |     This class stores information about the database setup that is used to optimize query creation query
1074 |     and so that queries know what tables to include.
1075 |     It's broken off like this because it might be usefully wrapped around some of the backend features,
1076 |     because it shouldn't be run multiple times in a single query (that spawns two instances of itself), as was happening before.
1077 | 
1078 |     It's closely related to some of the classes around variables and variableSets in the Bookworm Creation scripts,
1079 |     but is kept separate for now: that allows a bit more flexibility, but is probaby a Bad Thing in the long run.
1080 |     """
1081 | 
1082 |     def __init__(self,db):
1083 |         self.db = db
1084 |         self.cursor=db.cursor
1085 |         #has of what table each variable is in
1086 |         self.tableToLookIn = {}
1087 |         #hash of what the root variable for each search term is (eg, 'author_birth' might be crosswalked to 'authorid' in the main catalog.)
1088 |         self.anchorFields = {}
1089 |         #aliases: a hash showing internal identifications codes that dramatically speed up query time, but which shouldn't be exposed.
1090 |         #So you can run a search for "state," say, and the database will group on a 50-element integer code instead of a VARCHAR that
1091 |         #has to be long enough to support "Massachusetts" and "North Carolina."
1092 |         #A couple are hard-coded in, but most are derived by looking for fields that end in the suffix "__id" later.
1093 | 
1094 |         if self.db.dbname=="presidio":
1095 |             self.aliases = {"classification":"lc1","lat":"pointid","lng":"pointid"}
1096 |         else:
1097 |             self.aliases = dict()
1098 |             
1099 |         try:
1100 |             #First build using the new streamlined tables; if that fails, 
1101 |             #build using the old version that hits the INFORMATION_SCHEMA,
1102 |             #which is bad practice.
1103 |             self.newStyle(db)
1104 |         except:
1105 |             #The new style will fail on old bookworms: a failure is an easy way to test
1106 |             #for oldness, though of course something else might be causing the failure.
1107 |             self.oldStyle(db)
1108 |         
1109 |         
1110 |     def newStyle(self,db):
1111 |         self.tableToLookIn['bookid'] = 'fastcat'
1112 |         self.anchorFields['bookid'] = 'fastcat'
1113 |         self.anchorFields['wordid'] = 'wordid'
1114 |         self.tableToLookIn['wordid'] = 'wordsheap'
1115 | 
1116 | 
1117 |         tablenames = dict()
1118 |         tableDepends = dict()
1119 |         db.cursor.execute("SELECT dbname,alias,tablename,dependsOn FROM masterVariableTable JOIN masterTableTable USING (tablename);")
1120 |         for row in db.cursor.fetchall():
1121 |             (dbname,alias,tablename,dependsOn) = row
1122 |             self.tableToLookIn[dbname] = tablename
1123 |             self.anchorFields[tablename] = dependsOn
1124 |             self.aliases[dbname] = alias
1125 | 
1126 |     def oldStyle(self,db):
1127 | 
1128 |         #This is sorted by engine DESC so that memory table locations will overwrite disk table in the hash.
1129 | 
1130 |         self.cursor.execute("SELECT ENGINE,TABLE_NAME,COLUMN_NAME,COLUMN_KEY,TABLE_NAME='fastcat' OR TABLE_NAME='wordsheap' AS privileged FROM information_schema.COLUMNS JOIN INFORMATION_SCHEMA.TABLES USING (TABLE_NAME,TABLE_SCHEMA) WHERE TABLE_SCHEMA='%(dbname)s' ORDER BY privileged,ENGINE DESC,TABLE_NAME,COLUMN_KEY DESC;" % self.db.__dict__);
1131 |         columnNames = self.cursor.fetchall()
1132 | 
1133 |         parent = 'bookid'
1134 |         previous = None
1135 |         for databaseColumn in columnNames:
1136 |             if previous != databaseColumn[1]:
1137 |                 if databaseColumn[3]=='PRI' or databaseColumn[3]=='MUL':
1138 |                     parent = databaseColumn[2]
1139 |                     previous = databaseColumn[1]
1140 |                 else:
1141 |                     parent = 'bookid'
1142 |             else:
1143 |                 self.anchorFields[databaseColumn[2]]  = parent
1144 |                 if databaseColumn[3]!='PRI' and databaseColumn[3]!="MUL": #if it's a primary key, this isn't the right place to find it.
1145 |                     self.tableToLookIn[databaseColumn[2]] = databaseColumn[1]
1146 |                 if re.search('__id\*?$',databaseColumn[2]):
1147 |                     self.aliases[re.sub('__id','',databaseColumn[2])]=databaseColumn[2]
1148 | 
1149 |         try:
1150 |             cursor = self.cursor.execute("SELECT dbname,tablename,anchor,alias FROM masterVariableTables")
1151 |             for row in cursor.fetchall():
1152 |                 if row[0] != row[3]:
1153 |                     self.aliases[row[0]] = row[3]
1154 |                 if row[0] != row[2]:
1155 |                     self.anchorFields[row[0]] = row[2]
1156 |                 #Should be uncommented, but some temporary issues with the building script
1157 |                 #self.tableToLookIn[row[0]] = row[1]
1158 |         except:
1159 |             pass
1160 |         self.tableToLookIn['bookid'] = 'fastcat'
1161 |         self.anchorFields['bookid'] = 'fastcat'
1162 |         self.anchorFields['wordid'] = 'wordid'
1163 |         self.tableToLookIn['wordid'] = 'wordsheap'
1164 |     #############
1165 |     ##GENERAL#### #These are general purpose functional types of things not implemented in the class.
1166 |     #############
1167 | 
1168 | def to_unicode(obj, encoding='utf-8'):
1169 |     if isinstance(obj, basestring):
1170 |         if not isinstance(obj, unicode):
1171 |             obj = unicode(obj, encoding)
1172 |     elif isinstance(obj,int):
1173 |         obj=unicode(str(obj),encoding)
1174 |     else:
1175 |         obj = unicode(str(obj),encoding)
1176 |     return obj
1177 | 
1178 | def where_from_hash(myhash,joiner=" AND ",comp = " = ",escapeStrings=True):
1179 |     whereterm = []
1180 |     #The general idea here is that we try to break everything in search_limits down to a list, and then create a whereterm on that joined by whatever the 'joiner' is ("AND" or "OR"), with the comparison as whatever comp is ("=",">=",etc.).
1181 |     #For more complicated bits, it gets all recursive until the bits are all in terms of list.
1182 |     for key in myhash.keys():
1183 |         values = myhash[key]
1184 |         if isinstance(values,basestring) or isinstance(values,int) or isinstance(values,float):
1185 |             #This is just human-being handling. You can pass a single value instead of a list if you like, and it will just convert it
1186 |             #to a list for you.
1187 |             values = [values]
1188 |         #Or queries are special, since the default is "AND". This toggles that around for a subportion.
1189 |         if key=='$or' or key=="$OR":
1190 |             for comparison in values:
1191 |                 whereterm.append(where_from_hash(comparison,joiner=" OR ",comp=comp))
1192 |                 #The or doesn't get populated any farther down.
1193 |         elif isinstance(values,dict):
1194 |             #Certain function operators can use MySQL terms. These are the only cases that a dict can be passed as a limitations
1195 |             operations = {"$gt":">","$ne":"!=","$lt":"<","$grep":" REGEXP ","$gte":">=","$lte":"<=","$eq":"="}
1196 |             for operation in values.keys():
1197 |                 whereterm.append(where_from_hash({key:values[operation]},comp=operations[operation],joiner=joiner))
1198 |         elif isinstance(values,list):
1199 |             #and this is where the magic actually happens: the cases where the key is a string, and the target is a list.
1200 |             if isinstance(values[0],dict):
1201 |                 # If it's a list of dicts, then there's one thing that happens. Currently all types are assumed to be the same:
1202 |                 # you couldn't pass in, say {"year":[{"$gte":1900},1898]} to catch post-1898 years except for 1899. Not that you
1203 |                 # should need to.
1204 |                 for entry in values:
1205 |                     whereterm.append(where_from_hash(entry))
1206 |             else:
1207 |                 #Note that about a third of the code is spent on escaping strings.
1208 |                 if escapeStrings:
1209 |                     if isinstance(values[0],basestring):
1210 |                         quotesep="'"
1211 |                     else:
1212 |                         quotesep = ""
1213 |                     def escape(value): return MySQLdb.escape_string(to_unicode(value))
1214 |                 else:
1215 |                     def escape(value): return to_unicode(value)
1216 |                     quotesep=""
1217 | 
1218 |                 #Note the "OR" here. There's no way to pass in a query like "year=1876 AND year=1898" as currently set up.
1219 |                 #Obviously that's no great loss, but there might be something I'm missing that would be desire a similar format somehow.
1220 |                 #(In cases where the same book could have two different years associated with it)
1221 |                 whereterm.append(" (" + " OR ".join([" (" + key+comp+quotesep+escape(value)+quotesep+") " for value in values])+ ") ")
1222 |     return "(" + joiner.join(whereterm) + ")"
1223 |     #This works pretty well, except that it requires very specific sorts of terms going in, I think.
1224 | 
1225 | 
1226 | 
1227 | #I'd rather have all this smoothing stuff done at the client side, but currently it happens here.
1228 | 
1229 | #The idea is: this works by default by slurping up from the command line, but you could also load the functions in and run results on your own queries.
1230 | try:
1231 |     command = str(sys.argv[1])
1232 |     command = json.loads(command)
1233 | #Got to go before we let anything else happen.
1234 |     p = userqueries(command)
1235 |     result = p.execute()
1236 |     print json.dumps(result)
1237 | except:
1238 |     pass
1239 | 
1240 | 
1241 | 
1242 | def debug(string):
1243 |     """
1244 |     Makes it easier to debug through a web browser by handling the headers.
1245 |     Despite being called a `string`, it can be anything that python can print.
1246 |     """
1247 |     print headers('1')
1248 |     print "<br>"
1249 |     print string
1250 |     print "<br>"
1251 | 


--------------------------------------------------------------------------------
/bookworm/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bookworm-project/BookwormAPI/faac096f74a86ca7a9c8b4e02a3aacfa1f5f7b76/bookworm/__init__.py


--------------------------------------------------------------------------------
/bookworm/general_API.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/python
  2 | 
  3 | import MySQLdb
  4 | from pandas import merge
  5 | from pandas.io.sql import read_sql
  6 | from pandas import set_option
  7 | from SQLAPI import *
  8 | from copy import deepcopy
  9 | from collections import defaultdict
 10 | import ConfigParser
 11 | import os.path
 12 | 
 13 | #Some settings can be overridden here, if no where else.
 14 | prefs = dict()
 15 | 
 16 | def find_my_cnf():
 17 |     """
 18 |     The password will be looked for in these places.
 19 |     """
 20 |     
 21 |     for file in ["etc/bookworm/my.cnf","/etc/my.cnf","/etc/mysql/my.cnf","/root/.my.cnf"]:
 22 |         if os.path.exists(file):
 23 |             return file
 24 | 
 25 | class dbConnect(object):
 26 |     #This is a read-only account
 27 |     def __init__(self,prefs=prefs,database="federalist",host="localhost"):
 28 |         self.dbname = database
 29 | 
 30 |         #For back-compatibility:
 31 |         if "HOST" in prefs:
 32 |             host=prefs['HOST']
 33 | 
 34 |         self.db = MySQLdb.connect(host=host,
 35 |                                   db=database,
 36 |                                   read_default_file = find_my_cnf(),
 37 |                                   use_unicode='True',
 38 |                                   charset='utf8')
 39 | 
 40 |         self.cursor = self.db.cursor()
 41 | 
 42 | def calculateAggregates(df,parameters):
 43 | 
 44 |     """
 45 |     We only collect "WordCoun" and "TextCount" for each query,
 46 |     but there are a lot of cool things you can do with those:
 47 |     basic things like frequency, all the way up to TF-IDF.
 48 |     """
 49 |     parameters = set(parameters)
 50 |     
 51 |     if "WordsPerMillion" in parameters:
 52 |         df["WordsPerMillion"] = df["WordCount_x"].multiply(1000000)/df["WordCount_y"]
 53 |     if "WordCount" in parameters:
 54 |         df["WordCount"] = df["WordCount_x"]
 55 |     if "TotalWords" in parameters:
 56 |         df["TotalWords"] = df["WordCount_y"]
 57 |     if "SumWords" in parameters:
 58 |         df["SumWords"] = df["WordCount_y"] + df["WordCount_x"]
 59 |     if "WordsRatio" in parameters:
 60 |         df["WordsRatio"] = df["WordCount_x"]/df["WordCount_y"]
 61 | 
 62 |     if "TextPercent" in parameters:
 63 |         df["TextPercent"] = 100*df["TextCount_x"].divide(df["TextCount_y"])
 64 |     if "TextCount" in parameters:
 65 |         df["TextCount"] = df["TextCount_x"]
 66 |     if "TotalTexts" in parameters:
 67 |         df["TotalTexts"] = df["TextCount_y"]
 68 | 
 69 |     if "HitsPerBook" in parameters:
 70 |         df["HitsPerMatch"] = df["WordCount_x"]/df["TextCount_x"]
 71 | 
 72 |     if "TextLength" in parameters:
 73 |         df["HitsPerMatch"] = df["WordCount_y"]/df["TextCount_y"]
 74 | 
 75 |     if "TFIDF" in parameters:
 76 |         from numpy import log as log
 77 |         df.eval("TF = WordCount_x/WordCount_y")
 78 |         df["TFIDF"] = (df["WordCount_x"]/df["WordCount_y"])*log(df["TextCount_y"]/df['TextCount_x'])
 79 | 
 80 |     def DunningLog(df=df,a = "WordCount_x",b = "WordCount_y"):
 81 |         from numpy import log as log
 82 |         destination = "Dunning"
 83 |         df[a] = df[a].replace(0,1)
 84 |         df[b] = df[b].replace(0,1)
 85 |         if a=="WordCount_x":
 86 |             # Dunning comparisons should be to the sums if counting:
 87 |             c = sum(df[a])
 88 |             d = sum(df[b])
 89 |         if a=="TextCount_x":
 90 |             # The max count isn't necessarily the total number of books, but it's a decent proxy.
 91 |             c = max(df[a])
 92 |             d = max(df[b])
 93 |         expectedRate = (df[a] + df[b]).divide(c+d)
 94 |         E1 = c*expectedRate
 95 |         E2 = d*expectedRate
 96 |         diff1 = log(df[a].divide(E1))
 97 |         diff2 = log(df[b].divide(E2))
 98 |         df[destination] = 2*(df[a].multiply(diff1) + df[b].multiply(diff2))
 99 |         # A hack, but a useful one: encode the direction of the significance,
100 |         # in the sign, so negative 
101 |         difference = diff1<diff2
102 |         df.ix[difference,destination] = -1*df.ix[difference,destination]
103 |         return df[destination]
104 | 
105 |     if "Dunning" in parameters:
106 |         df["Dunning"] = DunningLog(df,"WordCount_x","WordCount_y")
107 |         
108 |     if "DunningTexts" in parameters:
109 |         df["DunningTexts"] = DunningLog(df,"TextCount_x","TextCount_y")
110 | 
111 |     return df
112 |     
113 | def intersectingNames(p1,p2,full=False):
114 |     """
115 |     The list of intersection column names between two DataFrame objects.
116 | 
117 |     'full' lets you specify that you want to include the count values:
118 |     Otherwise, they're kept separate for convenience in merges.
119 |     """
120 |     exclude = set(['WordCount','TextCount'])
121 |     names1 = set([column for column in p1.columns if column not in exclude])
122 |     names2 = [column for column in p2.columns if column not in exclude]
123 |     if full:
124 |         return list(names1.union(names2))
125 |     return list(names1.intersection(names2))
126 | 
127 | def base_count_types(list_of_final_count_types):
128 |     """
129 |     the final count types are calculated from some base types across both
130 |     the local query and the superquery.
131 |     """
132 |     
133 |     output = set()
134 | 
135 |     for count_name in list_of_final_count_types:
136 |         if count_name in ["WordCount","WordsPerMillion","WordsRatio","TotalWords","SumWords","Dunning"]:
137 |             output.add("WordCount")
138 |         if count_name in ["TextCount","TextPercent","TextRatio","TotalTexts","SumTexts","DunningTexts"]:
139 |             output.add("TextCount")
140 |         if count_name in ["TextLength","HitsPerMatch","TFIDF"]:
141 |             output.add("TextCount")
142 |             output.add("WordCount")
143 |         
144 |             
145 |     return list(output)
146 | 
147 | def is_a_wordcount_field(string):
148 |     if string in ["unigram","bigram","word"]:
149 |         return True
150 |     return False
151 | 
152 | class APIcall(object):
153 |     """
154 |     This is the base class from which more specific classes for actual 
155 |     methods can be dispatched.
156 |     
157 |     Without a "return_pandas_frame" method, it won't run.
158 |     """
159 |     def __init__(self,APIcall):
160 |         """
161 |         Initialized with a dictionary unJSONed from the API defintion.
162 |         """
163 |         self.query = APIcall
164 |         self.idiot_proof_arrays()
165 |         self.set_defaults()
166 | 
167 |     def set_defaults(self):
168 |         query = self.query
169 |         if not "search_limits" in query:
170 |             self.query["search_limits"] = dict()
171 |         if "unigram" in query["search_limits"]:
172 |             #Hack: change somehow. You can't group on "word", just on "unigram"
173 |             query["search_limits"]["word"] = query["search_limits"]["unigram"]
174 |             del query["search_limits"]["unigram"]
175 |             
176 |     def idiot_proof_arrays(self):
177 |         for element in ['counttype','groups']:
178 |             try:
179 |                 if not isinstance(self.query[element],list):
180 |                     self.query[element] = [self.query[element]]
181 |             except KeyError:
182 |                 #It's OK if it's not there.
183 |                 pass
184 | 
185 |     def get_compare_limits(self):
186 |         """
187 |         The compare limits will try to 
188 |         first be the string specified:
189 |         if not that, then drop every term that begins with an asterisk:
190 |         if not that, then drop the words term;
191 |         if not that, then exactly the same as the search limits.
192 |         """
193 | 
194 |         if "compare_limits" in self.query:
195 |             return self.query['compare_limits']
196 | 
197 |         search_limits = self.query['search_limits']
198 |         compare_limits = deepcopy(search_limits)
199 | 
200 |         asterisked = False
201 |         for limit in search_limits.keys():
202 |             if re.search(r'^\*',limit):
203 |                 search_limits[limit.replace('*','')] = search_limits[limit]
204 |                 del search_limits[limit]
205 |                 del compare_limits[limit]
206 |                 asterisked = True
207 |         
208 |         if asterisked:
209 |             return compare_limits
210 | 
211 |         #Next, try deleting the word term.
212 |             
213 |         for word_term in search_limits.keys():
214 |             if word_term in ['word','unigram','bigram']:
215 |                 del compare_limits[word_term]
216 | 
217 |         #Finally, whether it's deleted a word term or not, return it all.
218 |         return compare_limits
219 |         
220 |     def data(self):
221 |         if hasattr(self,"pandas_frame"):
222 |             return self.pandas_frame
223 |         else:
224 |             self.pandas_frame = self.get_data_from_source()
225 |             return self.pandas_frame
226 |         
227 |     def get_data_from_source(self):
228 | 
229 |         """
230 |         This is a 
231 | 
232 |         Note that this method could be easily adapted to run on top of a Solr instance or
233 |         something else, just by changing the bits in the middle where it handles storage_format.
234 |         """
235 | 
236 |         call1 = deepcopy(self.query)
237 | 
238 |         #The individual calls need only the base counts: not "Percentage of Words," but just "WordCount" twice, and so forth
239 |         call1['counttype'] = base_count_types(call1['counttype'])
240 |         call2 = deepcopy(call1)
241 | 
242 |         call2['search_limits'] = self.get_compare_limits()
243 | 
244 |         #Drop out asterisks for that syntactic sugar.
245 |         for limit in call1['search_limits'].keys():
246 |             if re.search(r'^\*',limit):
247 |                 call1['search_limits'][limit.replace('*','')] = call1['search_limits'][limit]
248 |                 del call1['search_limits'][limit]
249 | 
250 |         for n,group in enumerate(self.query['groups']):
251 |             if re.search(r'^\*',group):
252 |                 replacement = group.replace("*","")
253 |                 call1['groups'][n] = replacement
254 |                 self.query['groups'][n] = replacement
255 |                 call2['groups'].remove(group)
256 | 
257 |         #Special case: unigram groupings are dropped if they're not explicitly limited
258 |         #if "unigram" not in call2['search_limits']:
259 |         #    call2['groups'] = filter(lambda x: not x in ["unigram","bigram","word"],call2['groups'])
260 | 
261 |         """
262 |         This could use any method other than pandas_SQL:
263 |         You'd just need to name objects df1 and df2 as pandas dataframes 
264 |         """
265 |         df1 = self.generate_pandas_frame(call1)
266 |         df2 = self.generate_pandas_frame(call2)
267 |          
268 |         intersections = intersectingNames(df1,df2)
269 |         fullLabels = intersectingNames(df1,df2,full=True)
270 |         
271 |         """
272 |         Would this merge be faster with indexes?
273 |         """
274 |         if len(intersections) > 0:
275 |             merged = merge(df1,df2,on=intersections,how='outer')
276 |         else:
277 |             """
278 |             Pandas doesn't seem to have a full, unkeyed merge, so I simulate it with a dummy.
279 |             """
280 |             df1['dummy_merge_variable'] = 1
281 |             df2['dummy_merge_variable'] = 1
282 |             merged = merge(df1,df2,on=["dummy_merge_variable"],how='outer')
283 |             
284 |         merged = merged.fillna(int(0))
285 |         
286 |         calculations = self.query['counttype']
287 |     
288 |         calcced = calculateAggregates(merged,calculations)
289 |         
290 |         calcced = calcced.fillna(int(0))
291 | 
292 |         final_DataFrame = calcced[self.query['groups'] + self.query['counttype']]
293 | 
294 |         return final_DataFrame
295 | 
296 |     def execute(self):
297 |         method = self.query['method']
298 | 
299 |         
300 |         if isinstance(self.query['search_limits'],list):
301 |             if self.query['method'] not in ["json","return_json"]:
302 |                 self.query['search_limits'] = self.query['search_limits'][0]
303 |             else:
304 |                 return self.multi_execute()
305 |         
306 |         if method=="return_json" or method=="json":
307 |             frame = self.data()
308 |             return self.return_json()
309 | 
310 |         if method=="return_tsv" or method=="tsv":
311 |             import csv
312 |             frame = self.data()
313 |             return frame.to_csv(sep="\t",encoding="utf8",index=False,quoting=csv.QUOTE_NONE,escapechar="\\")
314 | 
315 |         if method=="return_pickle" or method=="DataFrame":
316 |             frame = self.data()
317 |             from cPickle import dumps as pickleDumps
318 |             return pickleDumps(frame,protocol=-1)
319 | 
320 |         # Temporary catch-all pushes to the old methods:
321 |         if method in ["returnPossibleFields","search_results","return_books"]:
322 |             query = userquery(self.query)
323 |             if method=="return_books":
324 |                 return query.execute()
325 |             return json.dumps(query.execute())
326 | 
327 | 
328 | 
329 |     def multi_execute(self):
330 |         """
331 |         Queries may define several search limits in an array
332 |         if they use the return_json method.
333 |         """
334 |         returnable = []
335 |         for limits in self.query['search_limits']:
336 |             child = deepcopy(self.query)
337 |             child['search_limits'] = limits
338 |             returnable.append(self.__class__(child).return_json(raw_python_object=True))
339 | 
340 |         return json.dumps(returnable)
341 | 
342 |     def return_json(self,raw_python_object=False):
343 |         query = self.query
344 |         data = self.data()
345 | 
346 | 
347 |         def fixNumpyType(input):
348 |             #This is, weirdly, an occasional problem but not a constant one.
349 |             if str(input.dtype)=="int64":
350 |                 return int(input)
351 |             else:
352 |                 return input
353 |         
354 |         #Define a recursive structure to hold the stuff.
355 |         def tree():
356 |             return defaultdict(tree)
357 |         returnt = tree()
358 | 
359 |         import numpy as np
360 | 
361 |         for row in data.itertuples(index=False):
362 |             row = list(row)
363 |             destination = returnt
364 |             if len(row)==len(query['counttype']):
365 |                 returnt = [fixNumpyType(num) for num in row]
366 |             while len(row) > len(query['counttype']):
367 |                 key = row.pop(0)
368 |                 if len(row) == len(query['counttype']):
369 |                     # Assign the elements.
370 |                     destination[key] = row
371 |                     break
372 |                 # This bit of the loop is where we descend the recursive dictionary.
373 |                 destination = destination[key]
374 |         if raw_python_object:
375 |             return returnt
376 | 
377 |         try:
378 |             return json.dumps(returnt,allow_nan=False)
379 |         except ValueError:
380 |             return json.dumps(returnt)
381 |             kludge = json.dumps(returnt)
382 |             kludge = kludge.replace("Infinity","null")
383 |             print kludge
384 | 
385 | class SQLAPIcall(APIcall):
386 |     """
387 |     To make a new backend for the API, you just need to extend the base API call
388 |     class like this.
389 | 
390 |     This one is comically short because all the real work is done in the userquery object.
391 | 
392 |     But the point is, you need to define a function "generate_pandas_frame"
393 |     that accepts an API call and returns a pandas frame.
394 | 
395 |     But that API call is more limited than the general API; you only need to support "WordCount" and "TextCount"
396 |     methods.
397 |     """
398 |     
399 |     def generate_pandas_frame(self,call):
400 |         """
401 | 
402 |         This is good example of the query that actually fetches the results.
403 |         It creates some SQL, runs it, and returns it as a pandas DataFrame.
404 |         
405 |         The actual SQL production is handled by the userquery class, which uses more
406 |         legacy code.
407 | 
408 |         """
409 |         con=dbConnect(prefs,self.query['database'])
410 |         q = userquery(call).query()
411 |         if self.query['method']=="debug":
412 |             print q
413 |         df = read_sql(q, con.db)
414 |         return df
415 | 
416 | 
417 | 


--------------------------------------------------------------------------------
/bookworm/knownHosts.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python
 2 | 
 3 | """
 4 | Whenever you add a new bookworm to your server, remember to add a line for it in this file.
 5 | 
 6 | However, always keep the 'default' line listed below in this file.
 7 | """
 8 | 
 9 | general_prefs = dict()
10 | general_prefs["default"] = {"fastcat": "fastcat", "HOST": "localhost", "separateDataTables": [], "fastword": "wordsheap", "database": "YourDatabaseNameHere", "read_url_head": "THIS_CAN_BE_ANYTHING...ITS_NOT_USED_ANYMORE", "fullcat": "catalog", "fullword": "words", "read_default_file": "/etc/mysql/my.cnf"}
11 | 


--------------------------------------------------------------------------------
/bookworm/logParser.py:
--------------------------------------------------------------------------------
 1 | import urllib
 2 | import os
 3 | import re
 4 | import gzip
 5 | import json
 6 | import sys
 7 | 
 8 | files = os.listdir("/var/log/apache2")
 9 | 
10 | words = [] 
11 | 
12 | for file in files:
13 |     reading = None
14 |     if re.search("^access.log..*.gz",file):
15 |         reading = gzip.open("/var/log/apache2/" + file)
16 |     elif re.search("^access.log.*",file):
17 |         reading = open("/var/log/apache2/" + file)
18 |     else:
19 |         continue
20 |     sys.stderr.write(file + "\n")
21 | 
22 |     for line in reading:
23 |         matches = re.findall(r"([0-9\.]+).*\[(.*)].*cgi-bin/dbbindings.py/?.query=([^ ]+)",line)
24 |         for fullmatch in matches:
25 |             t = dict()
26 |             t['ip'] = fullmatch[0]
27 |             match = fullmatch[2]
28 |             try:
29 |                 data = json.loads(urllib.unquote(match).decode('utf8'))
30 |             except ValueError:
31 |                 continue
32 |             try:
33 |                 if isinstance(data['search_limits'],dict):
34 |                     data['search_limits'] = [data['search_limits']]
35 |                 for setting in ['words_collation','database']:
36 |                     try:
37 |                         t[setting] = data[setting]
38 |                     except KeyError:
39 |                         t[setting] = ""
40 |                 for limit in data['search_limits']:
41 |                     p = dict()
42 |                     for constraint in ["word","TV_show","director"]:
43 |                         try:
44 |                             p[constraint] = p[constraint] + "," + (",".join(limit[constraint]))
45 |                         except KeyError:
46 |                             try:
47 |                                 p[constraint] = (",".join(limit[constraint]))
48 |                             except KeyError:
49 |                                 p[constraint] = ""
50 |                     for key in p.keys():
51 |                         t[key] = p[key]
52 |                     vals = [t[key] for key in ('ip','database','words_collation','word','TV_show','director')]
53 |                     print "\t".join(vals).encode("utf-8")
54 | 
55 |                     
56 |             except KeyError:
57 |                 raise
58 | 
59 | print len(words)
60 | 


--------------------------------------------------------------------------------
/dbbindings.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python
 2 | 
 3 | 
 4 | #So we load in the terms that allow the API implementation to happen for now.
 5 | from datetime import datetime
 6 | from bookworm.general_API import *
 7 | import os
 8 | import cgitb
 9 | #import MySQLdb
10 | cgitb.enable()
11 | 
12 | def headers(method):
13 |     if method!="return_tsv":
14 |         print "Content-type: text/html\n"
15 | 
16 |     elif method=="return_tsv":
17 |         print "Content-type: text; charset=utf-8"
18 |         print "Content-Disposition: filename=Bookworm-data.txt"
19 |         print "Pragma: no-cache"
20 |         print "Expires: 0\n"
21 | 
22 | def debug(string):
23 |     """
24 |     Makes it easier to debug through a web browser by handling the headers
25 |     No calls should be permanently left in the code ever, or they will break things badly.
26 |     """
27 |     print headers('1')
28 |     print "<br>"
29 |     print string
30 |     print "<br>"
31 | 
32 | 
33 | def main(JSONinput):
34 | 
35 |     query = JSONinput
36 | 
37 |     try:
38 |         #Whether there are multiple search terms, as in the highcharts method.
39 |         usingSuccinctStyle = isinstance(query['search_limits'],dict)
40 |     except:
41 |         #If there are no search limits, it might be a returnPossibleFields query
42 |         usingSuccinctStyle = True
43 | 
44 |     headers(query['method'])
45 | 
46 |     p = SQLAPIcall(query)
47 | 
48 |     result = p.execute()
49 |     print result
50 | 
51 |     return True
52 | 
53 | 
54 | if __name__=="__main__":
55 |     form = cgi.FieldStorage()
56 | 
57 |     #Still supporting two names for the passed parameter.
58 |     try:
59 |         JSONinput = form["queryTerms"].value
60 |     except KeyError:
61 |         JSONinput = form["query"].value
62 | 
63 |     main(json.loads(JSONinput))
64 | 
65 | 
66 | 


--------------------------------------------------------------------------------
/testAPI.py:
--------------------------------------------------------------------------------
 1 | import dbbindings
 2 | import unittest
 3 | import bookworm.general_API as general_API
 4 | import bookworm.SQLAPI as SQLAPI
 5 | 
 6 | class SQLfunction(unittest.TestCase):
 7 | 
 8 |     def test1(self):
 9 | 
10 |         query = {
11 |             "database": "movies",
12 |             "method": "return_json",
13 |             "search_limits": {"MovieYear":1900},
14 |             "counttype": "WordCount",
15 |             "groups": ["TV_show"]
16 |         }
17 | 
18 |     
19 |         f = SQLAPI.userquery(query).query()
20 |         print f
21 |     
22 | 
23 | class SQLConnections(unittest.TestCase):
24 |     def dbConnectorsWork(self):
25 |         from general_API import prefs as prefs
26 |         connection = general_API.dbConnect(prefs,"federalist")
27 |         tables = connection.cursor.execute("SHOW TABLES")
28 |         self.assertTrue(connection.dbname=="federalist")
29 | 
30 |     def test1(self):
31 |         query = {
32 |                 "database":"federalist",
33 |                 "search_limits":{},
34 |                 "counttype":"TextPercent",
35 |                 "groups":["author"],
36 |                 "method":"return_json"
37 |         }
38 |         
39 |         try:
40 |             dbbindings.main(query)
41 |             worked = True
42 |         except:
43 |             worked = False
44 | 
45 |         self.assertTrue(worked)
46 | 
47 |     def test2(self):
48 |         query = {
49 |             "database":"federalist",
50 |             "search_limits":{"author":"Hamilton"},
51 |             "compare_limits":{"author":"Madison"},
52 |             "counttype":"Dunning",
53 |             "groups":["unigram"],
54 |             "method":"return_json"
55 |         }
56 |         
57 | 
58 |         try:
59 |             #dbbindings.main(query)
60 |             worked = True
61 |         except:
62 |             worked = False
63 | 
64 |         self.assertTrue(worked)
65 | 
66 | if __name__=="__main__":
67 |     unittest.main()
68 | 


--------------------------------------------------------------------------------