├── .gitignore ├── LICENSE.md ├── Makefile ├── README.md ├── bookworm ├── #APIimplementation.py# ├── .gitignore ├── APIimplementation.py ├── MetaWorm.py ├── SQLAPI.py ├── __init__.py ├── general_API.py ├── knownHosts.py └── logParser.py ├── dbbindings.py └── testAPI.py /.gitignore: -------------------------------------------------------------------------------- 1 | old/* 2 | *~ 3 | APIkeys 4 | #* 5 | .#* 6 | .DS_Store 7 | *.cgi 8 | migration.py 9 | shipping.py 10 | genderizer* 11 | *.pyc 12 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2014 Benjamin Schmidt 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy of 6 | this software and associated documentation files (the "Software"), to deal in 7 | the Software without restriction, including without limitation the rights to 8 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 9 | the Software, and to permit persons to whom the Software is furnished to do so, 10 | subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 17 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 18 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 19 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 20 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 21 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | ubuntu-install: 2 | apt-get install python-numpy python-mysqldb 3 | mkdir -p /var/log/presidio 4 | touch /var/log/presidio/log.txt 5 | chown -R www-data:www-data /var/log/presidio 6 | mv ./*.py /usr/lib/cgi-bin/ 7 | chmod -R 755 /usr/lib/cgi-bin 8 | 9 | os-x-install: 10 | brew install python-numpy python-mysqldb 11 | mkdir -p /var/log/presidio 12 | touch /var/log/presidio/log.txt 13 | chown -R www /var/log/presidio 14 | chmod -R 755 /usr/lib/cgi-bin 15 | mkdir -p /etc/mysql 16 | ln -s /etc/my.cnf /etc/mysql/my.cnf 17 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Bookworm API 2 | 3 | **This entire repo is deprecated: the API is now bundled inside the [BookwormDB](http://github.com/bookworm-project/bookwormDB) repo** 4 | 5 | 6 | This is an implementation of the API for Bookworm, written in Python. It primarily implements the API on a MySQL database now, but includes classes for more easily implementing it on top of other platforms (such as Solr). 7 | 8 | It is used with the [Bookworm GUI](https://github.com/Bookworm-project/BookwormGUI) and can also be used as a standalone tool to query data from your database created by [the BookwormDB repo](https://github.com/Bookworm-project/BookwormDB). 
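A minimal sketch of such a standalone query (illustrative only: the import path, the bookworm name, and the field values are assumptions based on the classes in `bookworm/APIimplementation.py`, not a documented interface):

```python
# Illustrative sketch: assumes a built bookworm named "presidio" and MySQL
# credentials readable from /etc/mysql/my.cnf, as in the defaults in this repo.
from bookworm.APIimplementation import userqueries

query = {"database": "presidio",
         "search_limits": [{"word": ["polka dot"], "LCSH": ["Fiction"]}],
         "counttype": "Occurrences_per_Million_Words",
         "groups": ["year"],
         "method": "return_json"}

print userqueries(query).execute()  # one result per entry in search_limits
```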
9 | For a more interactive explanation of how the GUI works, see the [D3 bookworm browser](http://benschmidt.org/beta/APISandbox). 10 | 11 | ### General Description 12 | 13 | A file, currently at `dbbindings.py`, calls the script `bookworm/general_API.py`; that implements a general purpose API, and then further modules may implement the API on specific backends. Currently, the only backend is the one for the MySQL databases created by [the database repo](http://github.com/bookworm-project/BookwormDB). 14 | 15 | 16 | ### Installation 17 | 18 | Currently, you should just clone this repo into your cgi-bin directory, and make sure that `dbbindings.py` is executable. 19 | 20 | #### OS X caveat. 21 | 22 | If using homebrew, the shebang at the beginning of `dbbindings.py` is incorrect. (It will not load your installed python modules.) Change it from `#!/usr/bin/env python` to `#!/usr/local/bin/python`, and it should work. 23 | 24 | ### Usage 25 | 26 | If the bookworm is located on your server, there is no need to do anything--it should be drag-and-drop. (Although on anything but Debian, settings might require a small amount of tweaking.) 27 | 28 | If you want to have the webserver and database server on different machines, that needs to be specified in the MySQL configuration file that this reads (by default, `/etc/mysql/my.cnf`); if you want to have multiple MySQL servers, you may need to get fancy. 29 | 30 | This tells the API where to look for the data for a particular bookworm. The benefit of this setup is that you can have your webserver on one server and the database on another server. 31 | 32 | -------------------------------------------------------------------------------- /bookworm/#APIimplementation.py#: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | import sys 4 | import json 5 | import cgi 6 | import re 7 | import numpy #used for smoothing. 8 | import copy 9 | 10 | #These are here so we can support multiple databases with different naming schemes from a single API. A bit ugly to have here; could be part of a configuration file somewhere else, I guess. There are 'fast' and 'full' tables for books and words; 11 | #that's so memory tables can be used in certain cases for fast, hashed matching, but longer form data (like book titles) 12 | #can be stored on disk. Different queries use different types of calls. 13 | #Also, certain metadata fields are stored separately from the main catalog table; I list them manually here to avoid a database call to find out what they are, 14 | #although the latter would be more elegant. The way to do that would be a database call 15 | #of tables with two columns, one of which is 'bookid', maybe, or something like that. 16 | #(Or to add it as error handling when a query failed; only then check for missing files.)
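#A sketch of that more elegant database call (not used in this module; purely
#illustrative, using the same MySQL connection and prefs names as below):
#
#    cursor.execute("""SELECT TABLE_NAME, COLUMN_NAME
#                      FROM information_schema.COLUMNS
#                      WHERE TABLE_SCHEMA = %s""", (prefs['database'],))
#    #Any table whose columns are just ('bookid', <metadata field>) could then be
#    #treated as a separate metadata table and joined to the catalog USING (bookid).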
17 | 18 | general_prefs = {"presidio":{"HOST":"melville.seas.harvard.edu","database":"presidio","fastcat":"fastcat","fullcat":"open_editions","fastword":"wordsheap","read_default_file":"/etc/mysql/my.cnf","fullword":"words","separateDataTables":["LCSH","gender"],"read_url_head":"http://www.archive.org/stream/"},"arxiv":{"HOST":"10.102.15.45","database":"arxiv","fastcat":"fastcat","fullcat":"catalog","fastword":"wordsheap","fullword":"words","read_default_file":"/etc/mysql/my.cnf","separateDataTables":["genre","fastgenre","archive","subclass"],"read_url_head":"http://www.arxiv.org/abs/"},"jstor":{"HOST":"10.102.15.45","database":"jstor","fastcat":"fastcat","fullcat":"catalog","fastword":"wordsheap","fullword":"words","read_default_file":"/etc/mysql/my.cnf","separateDataTables":["discipline"],"read_url_head":"http://www.arxiv.org/abs/"}, "politweets":{"HOST":"chaucer.fas.harvard.edu","database":"politweets","fastcat":"fastcat","fullcat":"catalog","fastword":"wordsheap","fullword":"words","read_default_file":"/etc/mysql/my.cnf","separateDataTables":[],"read_url_head":"http://www.arxiv.org/abs/"},"LOC":{"HOST":"10.102.15.45","database":"LOC","fastcat":"fastcat","fullcat":"catalog","fastword":"wordsheap","fullword":"words","read_default_file":"/etc/mysql/my.cnf","separateDataTables":[],"read_url_head":"http://www.arxiv.org/abs/"},"ChronAm":{"HOST":"10.102.15.45","database":"LOC","fastcat":"fastcat","fullcat":"catalog","fastword":"wordsheap","fullword":"words","read_default_file":"/etc/mysql/my.cnf","separateDataTables":[],"read_url_head":"http://www.arxiv.org/abs/"}} 19 | #We define prefs to default to the Open Library set at first; later, it can do other things. 20 | 21 | class dbConnect(): 22 | #This is a read-only account 23 | def __init__(self,prefs = general_prefs['presidio']): 24 | import MySQLdb 25 | self.db = MySQLdb.connect(host=prefs['HOST'],read_default_file = prefs['read_default_file'],use_unicode='True',charset='utf8',db=prefs['database']) 26 | self.cursor = self.db.cursor() 27 | 28 | 29 | # The basic object here is a userquery: it takes dictionary as input, as defined in the API, and returns a value 30 | # via the 'execute' function whose behavior 31 | # depends on the mode that is passed to it. 32 | # Given the dictionary, it can return a number of objects. 33 | # The "Search_limits" array in the passed dictionary determines how many elements it returns; this lets multiple queries be bundled together. 34 | # Most functions describe a subquery that might be combined into one big query in various ways. 35 | 36 | class userqueries(): 37 | #This is a set of queries that are bound together; each element in search limits is iterated over, and we're done. 38 | def __init__(self,outside_dictionary = {"counttype":"Percentage_of_Books","search_limits":[{"word":["polka dot"],"LCSH":["Fiction"]}]},db = None): 39 | #coerce one-element dictionaries to an array. 
40 | self.database = outside_dictionary.setdefault('database','presidio') 41 | prefs = general_prefs[self.database] 42 | self.prefs = prefs 43 | self.wordsheap = prefs['fastword'] 44 | self.words = prefs['fullword'] 45 | if 'search_limits' not in outside_dictionary.keys(): 46 | outside_dictionary['search_limits'] = [{}] 47 | if isinstance(outside_dictionary['search_limits'],dict): 48 | #(allowing passing of just single dictionaries instead of arrays) 49 | outside_dictionary['search_limits'] = [outside_dictionary['search_limits']] 50 | self.returnval = [] 51 | self.queryInstances = [] 52 | for limits in outside_dictionary['search_limits']: 53 | mylimits = outside_dictionary 54 | mylimits['search_limits'] = limits 55 | localQuery = userquery(mylimits) 56 | self.queryInstances.append(localQuery) 57 | self.returnval.append(localQuery.execute()) 58 | 59 | def execute(self): 60 | return self.returnval 61 | 62 | class userquery(): 63 | def __init__(self,outside_dictionary = {"counttype":"Percentage_of_Books","search_limits":{"word":["polka dot"],"LCSH":["Fiction"]}}): 64 | #Certain constructions require a DB connection already available, so we just start it here, or use the one passed to it. 65 | self.outside_dictionary = outside_dictionary 66 | self.prefs = general_prefs[outside_dictionary.setdefault('database','presidio')] 67 | self.db = dbConnect(self.prefs) 68 | self.cursor = self.db.cursor 69 | self.wordsheap = self.prefs['fastword'] 70 | self.words = self.prefs['fullword'] 71 | 72 | #I'm now allowing 'search_limits' to either be a dictionary or an array of dictionaries: 73 | #this makes the syntax cleaner on most queries, 74 | #while still allowing some more complicated ones. 75 | if isinstance(outside_dictionary['search_limits'],list): 76 | outside_dictionary['search_limits'] = outside_dictionary['search_limits'][0] 77 | self.defaults(outside_dictionary) #Take some defaults 78 | self.derive_variables() #Derive some useful variables that the query will use. 79 | 80 | def defaults(self,outside_dictionary): 81 | #these are default values;these are the only values that can be set in the query 82 | #search_limits is an array of dictionaries; 83 | #each one contains a set of limits that are mutually independent 84 | #The other limitations are universal for all the search limits being set. 85 | 86 | #Set up a dictionary for the denominator of any fraction if it doesn't already exist: 87 | self.search_limits = outside_dictionary.setdefault('search_limits',[{"word":["polka dot"]}]) 88 | 89 | self.words_collation = outside_dictionary.setdefault('words_collation',"Case_Insensitive") 90 | lookups = {"Case_Insensitive":'word',"case_insensitive":"word","Case_Sensitive":"casesens","Correct_Medial_s":'ffix',"All_Words_with_Same_Stem":"stem","Flagged":'wflag'} 91 | self.word_field = lookups[self.words_collation] 92 | 93 | self.groups = [] 94 | try: 95 | groups = outside_dictionary['groups'] 96 | except: 97 | groups = [outside_dictionary['time_measure']] 98 | 99 | if groups == []: 100 | groups = ["bookid is not null as In_Library"] 101 | if (len (groups) > 1): 102 | pass 103 | #self.groups = credentialCheckandClean(self.groups) 104 | #Define some sort of limitations here. 105 | for group in groups: 106 | group = group 107 | if group=="unigram" or group=="word": 108 | group = "words1." + self.word_field + " as unigram" 109 | if group=="bigram": 110 | group = "CONCAT (words1." + self.word_field + " ,' ' , words2." 
+ self.word_field + ") as bigram" 111 | self.groups.append(group) 112 | 113 | self.selections = ",".join(self.groups) 114 | self.groupings = ",".join([re.sub(".* as","",group) for group in self.groups]) 115 | 116 | 117 | self.compare_dictionary = copy.deepcopy(self.outside_dictionary) 118 | if 'compare_limits' in self.outside_dictionary.keys(): 119 | self.compare_dictionary['search_limits'] = outside_dictionary['compare_limits'] 120 | del outside_dictionary['compare_limits'] 121 | else: #if nothing specified, we compare the word to the corpus. 122 | for key in ['word','word1','word2','word3','word4','word5','unigram','bigram']: 123 | try: 124 | del self.compare_dictionary['search_limits'][key] 125 | except: 126 | pass 127 | for key in self.outside_dictionary['search_limits'].keys(): 128 | if re.search('words?\d',key): 129 | try: 130 | del self.compare_dictionary['search_limits'][key] 131 | except: 132 | pass 133 | 134 | comparegroups = [] 135 | #This is a little tricky behavior here--hopefully it works in all cases. It drops out word groupings. 136 | try: 137 | compareGroups = self.compare_dictionary['groups'] 138 | except: 139 | compareGroups = [self.compare_dictionary['time_measure']] 140 | for group in compareGroups: 141 | if not re.match("words",group) and not re.match("[u]?[bn]igram",group): 142 | comparegroups.append(group) 143 | self.compare_dictionary['groups'] = comparegroups 144 | self.time_limits = outside_dictionary.setdefault('time_limits',[0,10000000]) 145 | self.time_measure = outside_dictionary.setdefault('time_measure','year') 146 | self.counttype = outside_dictionary.setdefault('counttype',"Occurrences_per_Million_Words") 147 | 148 | self.index = outside_dictionary.setdefault('index',0) 149 | #Ordinarily, the input should be an an array of groups that will both select and group by. 150 | #The joins may be screwed up by certain names that exist in multiple tables, so there's an option to do something like 151 | #SELECT catalog.bookid as myid, because WHERE clauses on myid will work but GROUP BY clauses on catalog.bookid may not 152 | #after a sufficiently large number of subqueries. 153 | #This smoothing code really ought to go somewhere else, since it doesn't quite fit into the whole API mentality and is 154 | #more about the webpage. 155 | self.smoothingType = outside_dictionary.setdefault('smoothingType',"triangle") 156 | self.smoothingSpan = outside_dictionary.setdefault('smoothingSpan',3) 157 | self.method = outside_dictionary.setdefault('method',"Nothing") 158 | self.tablename = outside_dictionary.setdefault('tablename','master'+"_bookcounts as bookcounts") 159 | 160 | def derive_variables(self): 161 | #These are locally useful, and depend on the variables 162 | self.limits = self.search_limits 163 | #Treat empty constraints as nothing at all, not as full restrictions. 164 | for key in self.limits.keys(): 165 | if self.limits[key] == []: 166 | del self.limits[key] 167 | self.create_catalog_table() 168 | self.make_catwhere() 169 | self.make_wordwheres() 170 | 171 | def create_catalog_table(self): 172 | self.catalog = self.prefs['fastcat'] #'catalog' #Can be replaced with a more complicated query. 173 | 174 | #Rather than just search for "LCSH", this should check query constraints against a list of tables, and join to them. 175 | #So if you query with a limit on LCSH, it joins the table "LCSH" to catalog; and then that table has one column, ALSO 176 | #called "LCSH", which is matched against. This allows a bookid to be a member of multiple catalogs. 
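#For example, with the 'presidio' preferences above, a query limited on "LCSH"
#expands self.catalog from "fastcat" to (illustrative only):
#    fastcat JOIN LCSH USING (bookid)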
177 | 178 | for limitation in self.prefs['separateDataTables']: 179 | #That re.sub thing is in here because sometimes I do queries that involve renaming. 180 | if limitation in [re.sub(" .*","",key) for key in self.limits.keys()] or limitation in [re.sub(" .*","",group) for group in self.groups]: 181 | self.catalog = self.catalog + """ JOIN """ + limitation + """ USING (bookid)""" 182 | 183 | #Here's a feature that's not yet fully implemented: it doesn't work quickly enough, probably because the joins involve a lot of jumping back and forth 184 | if 'hasword' in self.limits.keys(): 185 | #This is the sort of code I should have written more of: 186 | #it just generates a new API call to fill a small part of the code here: 187 | #(in this case, it merges the 'catalog' entry with a select query on 188 | #the word in the 'haswords' field. Enough of this could really 189 | #shrink the codebase, I suspect. But for some reason, these joins end up being too slow to run. 190 | #I think that has to do with the temporary table being created; we need to figure out how 191 | #to allow direct access to wordsheap here without having the table aliases for the different versions of wordsheap 192 | #being used overlapping. 193 | if self.limits['hasword'] == []: 194 | del self.limits['hasword'] 195 | return 196 | import copy 197 | #deepcopy lets us get a real copy of the dictionary 198 | #that can be changed without affecting the old one. 199 | mydict = copy.deepcopy(self.outside_dictionary) 200 | mydict['search_limits'] = copy.deepcopy(self.limits) 201 | mydict['search_limits']['word'] = copy.deepcopy(mydict['search_limits']['hasword']) 202 | del mydict['search_limits']['hasword'] 203 | tempquery = userquery(mydict) 204 | bookids = '' 205 | bookids = tempquery.counts_query() 206 | 207 | #If this is ever going to work, 'catalog' here should be some call to self.prefs['fastcat'] 208 | bookids = re.sub("(?s).*catalog[^\.]?[^\.\n]*\n","\n",bookids) 209 | bookids = re.sub("(?s)WHERE.*","\n",bookids) 210 | bookids = re.sub("(words|lookup)([0-9])","has\\1\\2",bookids) 211 | bookids = re.sub("main","hasTable",bookids) 212 | self.catalog = self.catalog + bookids 213 | #del self.limits['hasword'] 214 | 215 | def make_catwhere(self): 216 | #Where terms that don't include the words table join. Kept separate so that we can have subqueries only working on one half of the stack. 217 | catlimits = dict() 218 | for key in self.limits.keys(): 219 | if key not in ('word','word1','word2','hasword') and not re.search("words\d",key): 220 | catlimits[key] = self.limits[key] 221 | if len(catlimits.keys()) > 0: 222 | self.catwhere = where_from_hash(catlimits) 223 | else: 224 | self.catwhere = "TRUE" 225 | 226 | def make_wordwheres(self): 227 | self.wordswhere = " TRUE " 228 | self.max_word_length = 0 229 | limits = [] 230 | 231 | if 'word' in self.limits.keys(): 232 | """ 233 | This doesn't currently allow mixing of one and two word searches together in a logical way. 234 | It might be possible to just join on both the tables in MySQL--I'm not completely sure what would happen. 235 | But the philosophy has been to keep users from doing those searches as far as possible in any case. 236 | """ 237 | for phrase in self.limits['word']: 238 | locallimits = dict() 239 | array = phrase.split(" ") 240 | n=1 241 | for word in array: 242 | locallimits['words'+str(n) + "." 
+ self.word_field] = word 243 | self.max_word_length = max(self.max_word_length,n) 244 | n = n+1 245 | limits.append(where_from_hash(locallimits)) 246 | #XXX for backward compatability 247 | self.words_searched = phrase 248 | #del self.limits['word'] 249 | self.wordswhere = '(' + ' OR '.join(limits) + ')' 250 | 251 | wordlimits = dict() 252 | 253 | limitlist = copy.deepcopy(self.limits.keys()) 254 | 255 | for key in limitlist: 256 | if re.search("words\d",key): 257 | wordlimits[key] = self.limits[key] 258 | self.max_word_length = max(self.max_word_length,2) 259 | del self.limits[key] 260 | 261 | if len(wordlimits.keys()) > 0: 262 | self.wordswhere = where_from_hash(wordlimits) 263 | 264 | 265 | # def return_wordstableOld(self, words = ['polka dot'], pos=1): 266 | # #This returns an SQL sequence suitable for querying or, probably, joining, that gives a words table only as long as the words that are 267 | # #listed in the query; it works with different word fields 268 | # #The pos value specifies a number to go after the table names, so that we can have more than one table in the join. But those numbers 269 | # #have to be assigned elsewhere, so overlap is a danger if programmed poorly. 270 | # self.lookupname = "lookup" + str(pos) 271 | # self.wordsname = "words" + str(pos) 272 | # if len(words) > 0: 273 | # self.wordwhere = where_from_hash({self.lookupname + ".casesens":words}) 274 | # self.wordstable = """ 275 | # %(wordsheap)s as %(wordsname)s JOIN 276 | # %(wordsheap)s AS %(lookupname)s 277 | # ON ( %(wordsname)s.%(word_field)s=%(lookupname)s.%(word_field)s 278 | # AND %(wordwhere)s ) """ % self.__dict__ 279 | # else: 280 | # #We want to have some words returned even if _none_ are the query so that they can be selected. Having all the joins doesn't allow that, 281 | # #because in certain cases (merging by stems, eg) it would have multiple rows returned for a single word. 282 | # self.wordstable = """ 283 | # %(wordsheap)s as %(wordsname)s """ % self.__dict__ 284 | # return self.wordstable 285 | 286 | def build_wordstables(self): 287 | #Deduce the words tables we're joining against. The iterating on this can be made more general to get 3 or four grams in pretty easily. 288 | #This relies on a determination already having been made about whether this is a unigram or bigram search; that's reflected in the keys passed. 289 | if (self.max_word_length == 2 or re.search("words2",self.selections)): 290 | self.maintable = 'master_bigrams' 291 | self.main = ''' 292 | JOIN 293 | master_bigrams as main 294 | ON ('''+ self.prefs['fastcat'] +'''.bookid=main.bookid) 295 | ''' 296 | self.wordstables = """ 297 | JOIN %(wordsheap)s as words1 ON (main.word1 = words1.wordid) 298 | JOIN %(wordsheap)s as words2 ON (main.word2 = words2.wordid) """ % self.__dict__ 299 | 300 | #I use a regex here to do a blanket search for any sort of word limitations. That has some messy sideffects (make sure the 'hasword' 301 | #key has already been eliminated, for example!) but generally works. 302 | elif self.max_word_length == 1 or re.search("word",self.selections): 303 | self.maintable = 'master_bookcounts' 304 | self.main = ''' 305 | JOIN 306 | master_bookcounts as main 307 | ON (''' + self.prefs['fastcat'] + '''.bookid=main.bookid)''' 308 | self.tablename = 'master_bookcounts' 309 | self.wordstables = """ 310 | JOIN ( %(wordsheap)s as words1) ON (main.wordid = words1.wordid) 311 | """ % self.__dict__ 312 | #Have _no_ words table if no words searched for or grouped by; instead just use nwords. 
This 313 | #isn't strictly necessary, but means the API can be used for the slug-filling queries, and some others. 314 | else: 315 | self.main = " " 316 | self.operation = self.catoperation[self.counttype] #Why did I do this? 317 | self.wordstables = " " 318 | self.wordswhere = " TRUE " #Just a dummy thing. Shouldn't take any time, right? 319 | 320 | def counts_query(self,countname='count'): 321 | self.countname=countname 322 | self.bookoperation = {"Occurrences_per_Million_Words":"sum(main.count)","Raw_Counts":"sum(main.count)","Percentage_of_Books":"count(DISTINCT " + self.prefs['fastcat'] + ".bookid)","Number_of_Books":"count(DISTINCT "+ self.prefs['fastcat'] + ".bookid)"} 323 | self.catoperation = {"Occurrences_per_Million_Words":"sum(nwords)","Raw_Counts":"sum(nwords)","Percentage_of_Books":"count(nwords)","Number_of_Books":"count(nwords)"} 324 | self.operation = self.bookoperation[self.counttype] 325 | self.build_wordstables() 326 | countsQuery = """ 327 | SELECT 328 | %(selections)s, 329 | %(operation)s as %(countname)s 330 | FROM 331 | %(catalog)s 332 | %(main)s 333 | %(wordstables)s 334 | WHERE 335 | %(catwhere)s AND %(wordswhere)s 336 | GROUP BY 337 | %(groupings)s 338 | """ % self.__dict__ 339 | return countsQuery 340 | 341 | def ratio_query(self): 342 | finalcountcommands = {"Occurrences_per_Million_Words":"IFNULL(count,0)*1000000/total","Raw_Counts":"IFNULL(count,0)","Percentage_of_Books":"IFNULL(count,0)*100/total","Number_of_Books":"IFNULL(count,0)"} 343 | self.countcommand = finalcountcommands[self.counttype] 344 | #if True: #In the case that we're not using a superset of words; this can be changed later 345 | # supersetGroups = [group for group in self.groups if not re.match('word',group)] 346 | # self.finalgroupings = self.groupings 347 | # for key in self.limits.keys(): 348 | # if re.match('word',key): 349 | # del self.limits[key] 350 | 351 | self.denominator = userquery(outside_dictionary = self.compare_dictionary) 352 | self.supersetquery = self.denominator.counts_query(countname='total') 353 | 354 | if re.search("In_Library",self.denominator.selections): 355 | self.selections = self.selections + ", fastcat.bookid is not null as In_Library" 356 | 357 | #See above: In_Library is a dummy variable so that there's always something to join on. 358 | self.mainquery = self.counts_query() 359 | 360 | 361 | """ 362 | We launch a whole new userquery instance here to build the denominator, based on the 'compare_dictionary' option (which in most 363 | cases is the search_limits without the keys, see above. 364 | We then get the counts_query results out of that result. 365 | """ 366 | 367 | 368 | self.totalMergeTerms = "USING (" + self.denominator.groupings + " ) " 369 | 370 | 371 | self.totalselections = ",".join([re.sub(".* as","",group) for group in self.groups]) 372 | 373 | query = """ 374 | SELECT 375 | %(totalselections)s, 376 | %(countcommand)s as value 377 | FROM 378 | ( %(mainquery)s 379 | ) as tmp 380 | RIGHT JOIN 381 | ( %(supersetquery)s ) as totaller 382 | %(totalMergeTerms)s 383 | GROUP BY %(groupings)s;""" % self.__dict__ 384 | return query 385 | 386 | def return_slug_data(self,force=False): 387 | #Rather than understand this error, I'm just returning 0 if it fails. 388 | #Probably that's the right thing to do, though it may cause trouble later. 389 | #It's just a punishment for not later using a decent API call with "Raw_Counts" to extract these counts out, and relying on this ugly method. 
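#(Such a replacement call would look roughly like this, illustratively:
# {"database":"presidio","search_limits":{...},"counttype":"Raw_Counts","groups":[],"method":"return_json"} )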
390 | try: 391 | temp_words = self.return_n_words(force = True) 392 | temp_counts = self.return_n_books(force = True) 393 | except: 394 | temp_words = 0 395 | temp_counts = 0 396 | return [temp_counts,temp_words] 397 | 398 | def return_n_books(self,force=False): 399 | if (not hasattr(self,'nbooks')) or force: 400 | query = "SELECT count(*) from " + self.catalog + " WHERE " + self.catwhere 401 | silent = self.cursor.execute(query) 402 | self.counts = int(self.cursor.fetchall()[0][0]) 403 | return self.counts 404 | 405 | def return_n_words(self,force=False): 406 | if (not hasattr(self,'nwords')) or force: 407 | query = "SELECT sum(nwords) from " + self.catalog + " WHERE " + self.catwhere 408 | silent = self.cursor.execute(query) 409 | self.nwords = int(self.cursor.fetchall()[0][0]) 410 | return self.nwords 411 | 412 | def ranked_query(self,percentile_to_return = 99,addwhere = ""): 413 | #NOT CURRENTLY IN USE ANYWHERE--DELETE??? 414 | ##This returns a list of bookids in order by how well they match the sort terms. 415 | ## Using an IDF term will give better search results for case-sensitive searches, but is currently disabled 416 | ## 417 | self.LIMIT = int((100-percentile_to_return) * self.return_n_books()/100) 418 | countQuery = """ 419 | SELECT 420 | bookid, 421 | sum(main.count*1000/nwords%(idfterm)s) as score 422 | FROM %(catalog)s LEFT JOIN %(tablename)s 423 | USING (bookid) 424 | WHERE %(catwhere)s AND %(wordswhere)s 425 | GROUP BY bookid 426 | ORDER BY score DESC 427 | LIMIT %(LIMIT)s 428 | """ % self.__dict__ 429 | return countQuery 430 | 431 | def bibliography_query(self,limit = "100"): 432 | #I'd like to redo this at some point so it could work as an API call. 433 | self.limit = limit 434 | self.ordertype = "sum(main.count*10000/nwords)" 435 | try: 436 | if self.outside_dictionary['ordertype'] == "random": 437 | if self.counttype=="Raw_Counts" or self.counttype=="Number_of_Books": 438 | self.ordertype = "RAND()" 439 | else: 440 | self.ordertype = "LOG(1-RAND())/sum(main.count)" 441 | except KeyError: 442 | pass 443 | 444 | #If IDF searching is enabled, we could add a term like '*IDF' here to overweight better selecting words 445 | #in the event of a multiple search. 446 | self.idfterm = "" 447 | prep = self.counts_query() 448 | 449 | bibQuery = """ 450 | SELECT searchstring 451 | FROM """ % self.__dict__ + self.prefs['fullcat'] + """ RIGHT JOIN ( 452 | SELECT 453 | """+ self.prefs['fastcat'] + """.bookid, %(ordertype)s as ordering 454 | FROM 455 | %(catalog)s 456 | %(main)s 457 | %(wordstables)s 458 | WHERE 459 | %(catwhere)s AND %(wordswhere)s 460 | GROUP BY bookid ORDER BY %(ordertype)s DESC LIMIT %(limit)s 461 | ) as tmp USING(bookid) ORDER BY ordering DESC; 462 | """ % self.__dict__ 463 | return bibQuery 464 | 465 | def disk_query(self,limit="100"): 466 | pass 467 | 468 | def return_books(self): 469 | #This preps up the display elements for a search. 470 | #All this needs to be rewritten. 471 | silent = self.cursor.execute(self.bibliography_query()) 472 | returnarray = [] 473 | for line in self.cursor.fetchall(): 474 | returnarray.append(line[0]) 475 | if not returnarray: 476 | returnarray.append("No results for this particular point: try again without smoothing") 477 | newerarray = self.custom_SearchString_additions(returnarray) 478 | return json.dumps(newerarray) 479 | 480 | def getActualSearchedWords(self): 481 | if len(self.wordswhere) > 7: 482 | words = self.outside_dictionary['search_limits']['word'] 483 | #Break bigrams into single words. 
484 | words = ' '.join(words).split(' ') 485 | self.cursor.execute("""SELECT word FROM wordsheap WHERE """ + where_from_hash({self.word_field:words})) 486 | self.actualWords =[item[0] for item in self.cursor.fetchall()] 487 | else: 488 | self.actualWords = ["tasty","mistake","happened","here"] 489 | 490 | def custom_SearchString_additions(self,returnarray): 491 | db = self.outside_dictionary['database'] 492 | if db in ('jstor','presidio','ChronAm','LOC'): 493 | self.getActualSearchedWords() 494 | if db=='jstor': 495 | joiner = "&searchText=" 496 | preface = "?Search=yes&searchText=" 497 | urlRegEx = "http://www.jstor.org/stable/\d+" 498 | if db=='presidio': 499 | joiner = "+" 500 | preface = "#page/1/mode/2up/search/" 501 | urlRegEx = 'http://archive.org/stream/[^"# ><]*' 502 | if db in ('ChronAm','LOC'): 503 | preface = "/;words=" 504 | joiner = "+" 505 | urlRegEx = 'http://chroniclingamerica.loc.gov[^\"><]*/seq-\d' 506 | newarray = [] 507 | for string in returnarray: 508 | base = re.findall(urlRegEx,string)[0] 509 | newcore = ' search inside ' 510 | string = re.sub("^","",string) 511 | string = re.sub("$","",string) 512 | string = string+newcore 513 | newarray.append(string) 514 | #Arxiv is messier, requiring a whole different URL interface: http://search.arxiv.org:8081/paper.jsp?r=1204.3352&qs=network 515 | return newarray 516 | 517 | def return_query_values(self,query = "ratio_query"): 518 | #The API returns a dictionary with years pointing to values. 519 | values = [] 520 | querytext = getattr(self,query)() 521 | silent = self.cursor.execute(querytext) 522 | #Gets the results 523 | mydict = dict(self.cursor.fetchall()) 524 | try: 525 | for key in mydict.keys(): 526 | #Only return results inside the time limits 527 | if key >= self.time_limits[0] and key <= self.time_limits[1]: 528 | mydict[key] = str(mydict[key]) 529 | else: 530 | del mydict[key] 531 | mydict = smooth_function(mydict,smooth_method = self.smoothingType,span = self.smoothingSpan) 532 | 533 | except: 534 | mydict = {0:"0"} 535 | 536 | #This is a good place to change some values. 
537 | try: 538 | return {'index':self.index, 'Name':self.words_searched,"values":mydict,'words_searched':""} 539 | except: 540 | return{'values':mydict} 541 | 542 | def arrayNest(self,array,returnt): 543 | #A recursive function to transform a list into a nested array 544 | if len(array)==2: 545 | try: 546 | returnt[array[0]] = float(array[1]) 547 | except: 548 | returnt[array[0]] = array[1] 549 | else: 550 | try: 551 | returnt[array[0]] = self.arrayNest(array[1:len(array)],returnt[array[0]]) 552 | except KeyError: 553 | returnt[array[0]] = self.arrayNest(array[1:len(array)],dict()) 554 | return returnt 555 | 556 | def return_json(self,query='ratio_query'): 557 | if self.counttype=="Raw_Counts" or self.counttype=="Number_of_Books": 558 | query="counts_query" 559 | querytext = getattr(self,query)() 560 | silent = self.cursor.execute(querytext) 561 | names = [to_unicode(item[0]) for item in self.cursor.description] 562 | returnt = dict() 563 | lines = self.cursor.fetchall() 564 | for line in lines: 565 | returnt = self.arrayNest(line,returnt) 566 | return returnt 567 | 568 | def return_tsv(self,query = "ratio_query"): 569 | if self.counttype=="Raw_Counts" or self.counttype=="Number_of_Books": 570 | query="counts_query" 571 | querytext = getattr(self,query)() 572 | silent = self.cursor.execute(querytext) 573 | results = ["\t".join([to_unicode(item[0]) for item in self.cursor.description])] 574 | lines = self.cursor.fetchall() 575 | for line in lines: 576 | items = [] 577 | for item in line: 578 | item = to_unicode(item) 579 | item = re.sub("\t","",item) 580 | items.append(item) 581 | results.append("\t".join(items)) 582 | return "\n".join(results) 583 | 584 | def export_data(self,query1="ratio_query"): 585 | self.smoothing=0 586 | return self.return_query_values(query=query1) 587 | 588 | def execute(self): 589 | #This performs the query using the method specified in the passed parameters. 590 | if self.method=="Nothing": 591 | pass 592 | else: 593 | return getattr(self,self.method)() 594 | 595 | 596 | ############# 597 | ##GENERAL#### #These are general purpose functional types of things not implemented in the class. 598 | ############# 599 | 600 | def to_unicode(obj, encoding='utf-8'): 601 | if isinstance(obj, basestring): 602 | if not isinstance(obj, unicode): 603 | obj = unicode(obj, encoding) 604 | elif isinstance(obj,int): 605 | obj=unicode(str(obj),encoding) 606 | else: 607 | obj = unicode(str(obj),encoding) 608 | return obj 609 | 610 | def where_from_hash(myhash,joiner=" AND ",comp = " = "): 611 | whereterm = [] 612 | #The general idea here is that we try to break everything in search_limits down to a list, and then create a whereterm on that joined by whatever the 'joiner' is ("AND" or "OR"), with the comparison as whatever comp is ("=",">=",etc.). 613 | #For more complicated bits, it gets all recursive until the bits are in terms of list. 614 | for key in myhash.keys(): 615 | values = myhash[key] 616 | if isinstance(values,basestring) or isinstance(values,int) or isinstance(values,float): 617 | #This is just error handling. You can pass a single value instead of a list if you like, and it will just convert it 618 | #to a list for you. 619 | values = [values] 620 | #Or queries are special, since the default is "AND". This toggles that around for a subportion. 621 | if key=='$or' or key=="$OR": 622 | for comparison in values: 623 | whereterm.append(where_from_hash(comparison,joiner=" OR ",comp=comp)) 624 | #The or doesn't get populated any farther down. 
625 | elif isinstance(values,dict): 626 | #Certain function operators can use MySQL terms. These are the only cases that a dict can be passed as a limitations 627 | operations = {"$gt":">","$lt":"<","$grep":" REGEXP ","$gte":">=","$lte":"<=","$eq":"="} 628 | for operation in values.keys(): 629 | whereterm.append(where_from_hash({key:values[operation]},comp=operations[operation],joiner=joiner)) 630 | elif isinstance(values,list): 631 | #and this is where the magic actually happens 632 | if isinstance(values[0],dict): 633 | for entry in values: 634 | whereterm.append(where_from_hash(entry)) 635 | else: 636 | if isinstance(values[0],basestring): 637 | quotesep="'" 638 | else: 639 | quotesep = "" 640 | #Note the "OR" here. There's no way to pass in a query like "year=1876 AND year=1898" as currently set up. 641 | #Obviously that's no great loss, but there might be something I'm missing that would be. 642 | whereterm.append(" (" + " OR ".join([" (" + key+comp+quotesep+str(value)+quotesep+") " for value in values])+ ") ") 643 | return "(" + joiner.join(whereterm) + ")" 644 | #This works pretty well, except that it requires very specific sorts of terms going in, I think. 645 | 646 | 647 | 648 | #I'd rather have all this smoothing stuff done at the client side, but currently it happens here. 649 | def smooth_function(zinput,smooth_method = 'lowess',span = .05): 650 | if smooth_method not in ['lowess','triangle','rectangle']: 651 | return zinput 652 | xarray = [] 653 | yarray = [] 654 | years = zinput.keys() 655 | years.sort() 656 | for key in years: 657 | if zinput[key]!='None': 658 | xarray.append(float(key)) 659 | yarray.append(float(zinput[key])) 660 | from numpy import array 661 | x = array(xarray) 662 | y = array(yarray) 663 | if smooth_method == 'lowess': 664 | #print "starting lowess smoothing
" 665 | from Bio.Statistics.lowess import lowess 666 | smoothed = lowess(x,y,float(span)/100,3) 667 | x = [int(p) for p in x] 668 | returnval = dict(zip(x,smoothed)) 669 | return returnval 670 | if smooth_method == 'rectangle': 671 | from math import log 672 | #print "starting triangle smoothing
" 673 | span = int(span) #Takes the floor--so no smoothing on a span < 1. 674 | returnval = zinput 675 | windowsize = span*2 + 1 676 | from numpy import average 677 | for i in range(len(xarray)): 678 | surrounding = array(range(windowsize),dtype=float) 679 | weights = array(range(windowsize),dtype=float) 680 | for j in range(windowsize): 681 | key_dist = j - span #if span is 2, the zeroeth element is -2, the second element is 0 off, etc. 682 | workingon = i + key_dist 683 | if workingon >= 0 and workingon < len(xarray): 684 | surrounding[j] = float(yarray[workingon]) 685 | weights[j] = 1 686 | else: 687 | surrounding[j] = 0 688 | weights[j] = 0 689 | returnval[xarray[i]] = round(average(surrounding,weights=weights),3) 690 | return returnval 691 | if smooth_method == 'triangle': 692 | from math import log 693 | #print "starting triangle smoothing
" 694 | span = int(span) #Takes the floor--so no smoothing on a span < 1. 695 | returnval = zinput 696 | windowsize = span*2 + 1 697 | from numpy import average 698 | for i in range(len(xarray)): 699 | surrounding = array(range(windowsize),dtype=float) 700 | weights = array(range(windowsize),dtype=float) 701 | for j in range(windowsize): 702 | key_dist = j - span #if span is 2, the zeroeth element is -2, the second element is 0 off, etc. 703 | workingon = i + key_dist 704 | if workingon >= 0 and workingon < len(xarray): 705 | surrounding[j] = float(yarray[workingon]) 706 | #This isn't actually triangular smoothing: I dampen it by the logs, to keep the peaks from being too too big. 707 | #The minimum is '2', since log(1) == 0, which is a nonesense weight. 708 | weights[j] = log(span + 2 - abs(key_dist)) 709 | else: 710 | surrounding[j] = 0 711 | weights[j] = 0 712 | 713 | returnval[xarray[i]] = round(average(surrounding,weights=weights),3) 714 | return returnval 715 | 716 | #The idea is: this works by default by slurping up from the command line, but you could also load the functions in and run results on your own queries. 717 | try: 718 | command = str(sys.argv[1]) 719 | command = json.loads(command) 720 | #Got to go before we let anything else happen. 721 | print command 722 | p = userqueries(command) 723 | result = p.execute() 724 | print json.dumps(result) 725 | except: 726 | pass 727 | 728 | -------------------------------------------------------------------------------- /bookworm/.gitignore: -------------------------------------------------------------------------------- 1 | old/* 2 | *~ 3 | APIkeys 4 | #* 5 | .#* -------------------------------------------------------------------------------- /bookworm/APIimplementation.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | import sys 4 | import json 5 | import cgi 6 | import re 7 | import numpy #used for smoothing. 8 | import copy 9 | import decimal 10 | """ 11 | #These are here so we can support multiple databases with different naming schemes from a single API. 12 | #A bit ugly to have here; could be part of configuration file somewhere else, I guess. there are 'fast' and 'full' tables for books and words; 13 | #that's so memory tables can be used in certain cases for fast, hashed matching, but longer form data (like book titles) 14 | #can be stored on disk. Different queries use different types of calls. 15 | #Also, certain metadata fields are stored separately from the main catalog table; 16 | #I list them manually here to avoid a database call to find out what they are, 17 | #although the latter would be more elegant. The way to do that would be a database call 18 | #of tables with two columns one of which is 'bookid', maybe, or something like that. 19 | #(Or to add it as error handling when a query failed; only then check for missing files. 
20 | """ 21 | 22 | general_prefs = {"presidio":{"HOST":"melville.seas.harvard.edu","database":"presidio","fastcat":"fastcat","fullcat":"open_editions","fastword":"wordsheap","read_default_file":"/etc/mysql/my.cnf","fullword":"words","separateDataTables":["LCSH","gender"],"read_url_head":"http://www.archive.org/stream/"},"arxiv":{"HOST":"10.102.15.45","database":"arxiv","fastcat":"fastcat","fullcat":"catalog","fastword":"wordsheap","fullword":"words","read_default_file":"/etc/mysql/my.cnf","separateDataTables":["genre","fastgenre","archive","subclass"],"read_url_head":"http://www.arxiv.org/abs/"},"jstor":{"HOST":"10.102.15.45","database":"jstor","fastcat":"fastcat","fullcat":"catalog","fastword":"wordsheap","fullword":"words","read_default_file":"/etc/mysql/my.cnf","separateDataTables":["discipline"],"read_url_head":"http://www.arxiv.org/abs/"}, "politweets":{"HOST":"chaucer.fas.harvard.edu","database":"politweets","fastcat":"fastcat","fullcat":"catalog","fastword":"wordsheap","fullword":"words","read_default_file":"/etc/mysql/my.cnf","separateDataTables":[],"read_url_head":"http://www.arxiv.org/abs/"},"LOC":{"HOST":"10.102.15.45","database":"LOC","fastcat":"fastcat","fullcat":"catalog","fastword":"wordsheap","fullword":"words","read_default_file":"/etc/mysql/my.cnf","separateDataTables":[],"read_url_head":"http://www.arxiv.org/abs/"},"ChronAm":{"HOST":"10.102.15.45","database":"ChronAm","fastcat":"fastcat","fullcat":"catalog","fastword":"wordsheap","fullword":"words","read_default_file":"/etc/mysql/my.cnf","separateDataTables":["subjects"],"read_url_head":"http://www.arxiv.org/abs/"},"ngrams":{"fastcat": "fastcat", "HOST": "10.102.15.45", "separateDataTables": [], "fastword": "wordsheap", "database": "ngrams", "read_url_head": "arxiv.culturomics.org", "fullcat": "catalog", "fullword": "words", "read_default_file": "/etc/mysql/my.cnf"},"OL":{"HOST":"10.102.15.45","database":"OL","fastcat":"fastcat","fullcat":"catalog","fastword":"wordsheap","fullword":"words","read_default_file":"/etc/mysql/my.cnf","separateDataTables":["subjects"],"read_url_head":"http://www.arxiv.org/abs/"}} 23 | 24 | general_prefs['OL'] = {"fastcat": "fastcat", "HOST": "10.102.15.45", "separateDataTables": ["authors", "publishers", "authors", "subjects"], "fastword": "wordsheap", "database": "OL", "read_url_head": "arxiv.culturomics.org", "fullcat":"catalog", "fullword": "words", "read_default_file": "/etc/mysql/my.cnf"} 25 | 26 | #We define prefs to default to the Open Library set at first; later, it can do other things. 27 | 28 | class dbConnect(): 29 | #This is a read-only account 30 | def __init__(self,prefs = general_prefs['presidio']): 31 | import MySQLdb 32 | self.db = MySQLdb.connect(host=prefs['HOST'],read_default_file = prefs['read_default_file'],use_unicode='True',charset='utf8',db=prefs['database']) 33 | self.cursor = self.db.cursor() 34 | 35 | 36 | # The basic object here is a userquery: it takes dictionary as input, as defined in the API, and returns a value 37 | # via the 'execute' function whose behavior 38 | # depends on the mode that is passed to it. 39 | # Given the dictionary, it can return a number of objects. 40 | # The "Search_limits" array in the passed dictionary determines how many elements it returns; this lets multiple queries be bundled together. 41 | # Most functions describe a subquery that might be combined into one big query in various ways. 
42 | 43 | class userqueries(): 44 | #This is a set of queries that are bound together; each element in search limits is iterated over, and we're done. 45 | def __init__(self,outside_dictionary = {"counttype":["Percentage_of_Books"],"search_limits":[{"word":["polka dot"],"LCSH":["Fiction"]}]},db = None): 46 | self.database = outside_dictionary.setdefault('database','presidio') 47 | prefs = general_prefs[self.database] 48 | self.prefs = prefs 49 | self.wordsheap = prefs['fastword'] 50 | self.words = prefs['fullword'] 51 | if 'search_limits' not in outside_dictionary.keys(): 52 | outside_dictionary['search_limits'] = [{}] 53 | #coerce one-element dictionaries to an array. 54 | if isinstance(outside_dictionary['search_limits'],dict): 55 | #(allowing passing of just single dictionaries instead of arrays) 56 | outside_dictionary['search_limits'] = [outside_dictionary['search_limits']] 57 | self.returnval = [] 58 | self.queryInstances = [] 59 | for limits in outside_dictionary['search_limits']: 60 | mylimits = outside_dictionary 61 | mylimits['search_limits'] = limits 62 | localQuery = userquery(mylimits) 63 | self.queryInstances.append(localQuery) 64 | self.returnval.append(localQuery.execute()) 65 | 66 | def execute(self): 67 | return self.returnval 68 | 69 | class userquery(): 70 | def __init__(self,outside_dictionary = {"counttype":["Percentage_of_Books"],"search_limits":{"word":["polka dot"],"LCSH":["Fiction"]}}): 71 | #Certain constructions require a DB connection already available, so we just start it here, or use the one passed to it. 72 | self.outside_dictionary = outside_dictionary 73 | self.prefs = general_prefs[outside_dictionary.setdefault('database','presidio')] 74 | self.db = dbConnect(self.prefs) 75 | self.cursor = self.db.cursor 76 | self.wordsheap = self.prefs['fastword'] 77 | self.words = self.prefs['fullword'] 78 | 79 | #I'm now allowing 'search_limits' to either be a dictionary or an array of dictionaries: 80 | #this makes the syntax cleaner on most queries, 81 | #while still allowing some long ones from the Bookworm website. 82 | if isinstance(outside_dictionary['search_limits'],list): 83 | outside_dictionary['search_limits'] = outside_dictionary['search_limits'][0] 84 | self.defaults(outside_dictionary) #Take some defaults 85 | self.derive_variables() #Derive some useful variables that the query will use. 86 | 87 | def defaults(self,outside_dictionary): 88 | #these are default values;these are the only values that can be set in the query 89 | #search_limits is an array of dictionaries; 90 | #each one contains a set of limits that are mutually independent 91 | #The other limitations are universal for all the search limits being set. 92 | 93 | #Set up a dictionary for the denominator of any fraction if it doesn't already exist: 94 | self.search_limits = outside_dictionary.setdefault('search_limits',[{"word":["polka dot"]}]) 95 | 96 | self.words_collation = outside_dictionary.setdefault('words_collation',"Case_Insensitive") 97 | 98 | lookups = {"Case_Insensitive":'word',"case_insensitive":"word","Case_Sensitive":"casesens","Correct_Medial_s":'ffix',"All_Words_with_Same_Stem":"stem","Flagged":'wflag'} 99 | 100 | self.word_field = lookups[self.words_collation] 101 | 102 | self.groups = [] 103 | try: 104 | groups = outside_dictionary['groups'] 105 | except: 106 | groups = [outside_dictionary['time_measure']] 107 | 108 | if groups == []: 109 | #Set an arbitrary column name if nothing else is set. 
110 | groups = ["bookid is not null as In_Library"] 111 | 112 | if (len (groups) > 1): 113 | pass 114 | #self.groups = credentialCheckandClean(self.groups) 115 | #Define some sort of limitations here, if not done in dbbindings.py 116 | 117 | for group in groups: 118 | group = group 119 | if group=="unigram" or group=="word": 120 | group = "words1." + self.word_field + " as unigram" 121 | if group=="bigram": 122 | group = "CONCAT (words1." + self.word_field + " ,' ' , words2." + self.word_field + ") as bigram" 123 | self.groups.append(group) 124 | 125 | self.selections = ",".join(self.groups) 126 | self.groupings = ",".join([re.sub(".* as","",group) for group in self.groups]) 127 | 128 | 129 | self.compare_dictionary = copy.deepcopy(self.outside_dictionary) 130 | if 'compare_limits' in self.outside_dictionary.keys(): 131 | self.compare_dictionary['search_limits'] = outside_dictionary['compare_limits'] 132 | del outside_dictionary['compare_limits'] 133 | else: #if nothing specified, we compare the word to the corpus. 134 | for key in ['word','word1','word2','word3','word4','word5','unigram','bigram']: 135 | try: 136 | del self.compare_dictionary['search_limits'][key] 137 | except: 138 | pass 139 | for key in self.outside_dictionary['search_limits'].keys(): 140 | if re.search('words?\d',key): 141 | try: 142 | del self.compare_dictionary['search_limits'][key] 143 | except: 144 | pass 145 | 146 | comparegroups = [] 147 | #This is a little tricky behavior here--hopefully it works in all cases. It drops out word groupings. 148 | try: 149 | compareGroups = self.compare_dictionary['groups'] 150 | except: 151 | compareGroups = [self.compare_dictionary['time_measure']] 152 | for group in compareGroups: 153 | if not re.match("words",group) and not re.match("[u]?[bn]igram",group): 154 | comparegroups.append(group) 155 | self.compare_dictionary['groups'] = comparegroups 156 | self.time_limits = outside_dictionary.setdefault('time_limits',[0,10000000]) 157 | self.time_measure = outside_dictionary.setdefault('time_measure','year') 158 | self.counttype = outside_dictionary.setdefault('counttype',["Occurrences_per_Million_Words"]) 159 | if isinstance(self.counttype,basestring): 160 | self.counttype = [self.counttype] 161 | self.index = outside_dictionary.setdefault('index',0) 162 | #Ordinarily, the input should be an an array of groups that will both select and group by. 163 | #The joins may be screwed up by certain names that exist in multiple tables, so there's an option to do something like 164 | #SELECT catalog.bookid as myid, because WHERE clauses on myid will work but GROUP BY clauses on catalog.bookid may not 165 | #after a sufficiently large number of subqueries. 166 | #This smoothing code really ought to go somewhere else, since it doesn't quite fit into the whole API mentality and is 167 | #more about the webpage. 168 | self.smoothingType = outside_dictionary.setdefault('smoothingType',"triangle") 169 | self.smoothingSpan = outside_dictionary.setdefault('smoothingSpan',3) 170 | self.method = outside_dictionary.setdefault('method',"Nothing") 171 | self.tablename = outside_dictionary.setdefault('tablename','master'+"_bookcounts as bookcounts") 172 | 173 | def derive_variables(self): 174 | #These are locally useful, and depend on the variables 175 | self.limits = self.search_limits 176 | #Treat empty constraints as nothing at all, not as full restrictions. 
177 | for key in self.limits.keys(): 178 | if self.limits[key] == []: 179 | del self.limits[key] 180 | self.set_operations() 181 | self.create_catalog_table() 182 | self.make_catwhere() 183 | self.make_wordwheres() 184 | 185 | def create_catalog_table(self): 186 | self.catalog = self.prefs['fastcat'] #'catalog' #Can be replaced with a more complicated query in the event of longer joins. 187 | """ 188 | This should check query constraints against a list of tables, and join to them. 189 | So if you query with a limit on LCSH, and LCSH is listed as being in a separate table, 190 | it joins the table "LCSH" to catalog; and then that table has one column, ALSO 191 | called "LCSH", which is matched against. This allows a bookid to be a member of multiple catalogs. 192 | """ 193 | 194 | 195 | 196 | for limitation in self.prefs['separateDataTables']: 197 | #That re.sub thing is in here because sometimes I do queries that involve renaming. 198 | if limitation in [re.sub(" .*","",key) for key in self.limits.keys()] or limitation in [re.sub(" .*","",group) for group in self.groups]: 199 | self.catalog = self.catalog + """ JOIN """ + limitation + """ USING (bookid)""" 200 | 201 | """ 202 | Here it just pulls every variable and where to look for it. 203 | """ 204 | 205 | tableToLookIn = {} 206 | #This is sorted by engine DESC so that memory table locations will overwrite disk table in the hash. 207 | self.cursor.execute("SELECT ENGINE,TABLE_NAME,COLUMN_NAME,COLUMN_KEY FROM information_schema.COLUMNS JOIN INFORMATION_SCHEMA.TABLES USING (TABLE_NAME,TABLE_SCHEMA) WHERE TABLE_SCHEMA='" + self.outside_dictionary['database']+ "' ORDER BY ENGINE DESC,TABLE_NAME;"); 208 | columnNames = self.cursor.fetchall() 209 | 210 | for databaseColumn in columnNames: 211 | tableToLookIn[databaseColumn[2]] = databaseColumn[1] 212 | 213 | self.relevantTables = set() 214 | 215 | for columnInQuery in [re.sub(" .*","",key) for key in self.limits.keys()] + [re.sub(" .*","",group) for group in self.groups]: 216 | if not re.search('\.',columnInQuery): #Lets me keep a little bit of SQL sauce for my own queries 217 | try: 218 | self.relevantTables.add(tableToLookIn[columnInQuery]) 219 | except KeyError: 220 | pass 221 | #Could warn as well, but this helps back-compatability. 222 | 223 | self.catalog = "fastcat" 224 | for table in self.relevantTables: 225 | if table!="fastcat" and table!="words" and table!="wordsheap": 226 | self.catalog = self.catalog + """ NATURAL JOIN """ + table + " " 227 | 228 | #Here's a feature that's not yet fully implemented: it doesn't work quickly enough, probably because the joins involve a lot of jumping back and forth. 229 | if 'hasword' in self.limits.keys(): 230 | """ 231 | This is the sort of code I'm trying to move towards 232 | it just generates a new API call to fill a small part of the code here: 233 | (in this case, it merges the 'catalog' entry with a select query on 234 | the word in the 'haswords' field. Enough of this could really 235 | shrink the codebase, I suspect. It should be possible in MySQL 6.0, from what I've read, where subqueried tables will have indexes written for them by the query optimizer. 236 | """ 237 | 238 | if self.limits['hasword'] == []: 239 | del self.limits['hasword'] 240 | return 241 | 242 | #deepcopy lets us get a real copy of the dictionary 243 | #that can be changed without affecting the old one. 
244 | mydict = copy.deepcopy(self.outside_dictionary) 245 | mydict['search_limits'] = copy.deepcopy(self.limits) 246 | mydict['search_limits']['word'] = copy.deepcopy(mydict['search_limits']['hasword']) 247 | del mydict['search_limits']['hasword'] 248 | tempquery = userquery(mydict) 249 | bookids = '' 250 | bookids = tempquery.counts_query() 251 | 252 | #If this is ever going to work, 'catalog' here should be some call to self.prefs['fastcat'] 253 | bookids = re.sub("(?s).*catalog[^\.]?[^\.\n]*\n","\n",bookids) 254 | bookids = re.sub("(?s)WHERE.*","\n",bookids) 255 | bookids = re.sub("(words|lookup)([0-9])","has\\1\\2",bookids) 256 | bookids = re.sub("main","hasTable",bookids) 257 | self.catalog = self.catalog + bookids 258 | #del self.limits['hasword'] 259 | 260 | def make_catwhere(self): 261 | #Where terms that don't include the words table join. Kept separate so that we can have subqueries only working on one half of the stack. 262 | catlimits = dict() 263 | for key in self.limits.keys(): 264 | if key not in ('word','word1','word2','hasword') and not re.search("words\d",key): 265 | catlimits[key] = self.limits[key] 266 | if len(catlimits.keys()) > 0: 267 | self.catwhere = where_from_hash(catlimits) 268 | else: 269 | self.catwhere = "TRUE" 270 | 271 | def make_wordwheres(self): 272 | self.wordswhere = " TRUE " 273 | self.max_word_length = 0 274 | limits = [] 275 | 276 | if 'word' in self.limits.keys(): 277 | """ 278 | This doesn't currently allow mixing of one and two word searches together in a logical way. 279 | It might be possible to just join on both the tables in MySQL--I'm not completely sure what would happen. 280 | But the philosophy has been to keep users from doing those searches as far as possible in any case. 281 | """ 282 | for phrase in self.limits['word']: 283 | locallimits = dict() 284 | array = phrase.split(" ") 285 | n=1 286 | for word in array: 287 | selectString = "(SELECT " + self.word_field + " FROM wordsheap WHERE casesens='" + word + "')" 288 | locallimits['words'+str(n) + "." + self.word_field] = selectString 289 | self.max_word_length = max(self.max_word_length,n) 290 | n = n+1 291 | limits.append(where_from_hash(locallimits,quotesep="")) 292 | #XXX for backward compatability 293 | self.words_searched = phrase 294 | self.wordswhere = '(' + ' OR '.join(limits) + ')' 295 | 296 | wordlimits = dict() 297 | 298 | limitlist = copy.deepcopy(self.limits.keys()) 299 | 300 | for key in limitlist: 301 | if re.search("words\d",key): 302 | wordlimits[key] = self.limits[key] 303 | self.max_word_length = max(self.max_word_length,2) 304 | del self.limits[key] 305 | 306 | if len(wordlimits.keys()) > 0: 307 | self.wordswhere = where_from_hash(wordlimits) 308 | 309 | 310 | def build_wordstables(self): 311 | #Deduce the words tables we're joining against. The iterating on this can be made more general to get 3 or four grams in pretty easily. 312 | #This relies on a determination already having been made about whether this is a unigram or bigram search; that's reflected in the keys passed. 
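#Sketch of that generalization (not implemented; assumes n-gram tables such as
#master_trigrams exist with columns word1..wordn):
#
#    for k in range(1, n + 1):
#        self.wordstables += (" JOIN %s as words%d ON (main.word%d = words%d.wordid) "
#                             % (self.wordsheap, k, k, k))
#
#with self.main pointed at the corresponding master_<n>grams table.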
313 | if (self.max_word_length == 2 or re.search("words2",self.selections)): 314 | 315 | self.maintable = 'master_bigrams' 316 | 317 | self.main = ''' 318 | JOIN 319 | master_bigrams as main 320 | ON ('''+ self.prefs['fastcat'] +'''.bookid=main.bookid) 321 | ''' 322 | 323 | self.wordstables = """ 324 | JOIN %(wordsheap)s as words1 ON (main.word1 = words1.wordid) 325 | JOIN %(wordsheap)s as words2 ON (main.word2 = words2.wordid) """ % self.__dict__ 326 | 327 | #I use a regex here to do a blanket search for any sort of word limitations. That has some messy sideffects (make sure the 'hasword' 328 | #key has already been eliminated, for example!) but generally works. 329 | 330 | elif self.max_word_length == 1 or re.search("[^h][^a][^s]word",self.selections): 331 | self.maintable = 'master_bookcounts' 332 | self.main = ''' 333 | JOIN 334 | master_bookcounts as main 335 | ON (''' + self.prefs['fastcat'] + '''.bookid=main.bookid)''' 336 | self.tablename = 'master_bookcounts' 337 | self.wordstables = """ 338 | JOIN ( %(wordsheap)s as words1) ON (main.wordid = words1.wordid) 339 | """ % self.__dict__ 340 | 341 | else: 342 | """ 343 | Have _no_ words table if no words searched for or grouped by; instead just use nwords. This 344 | means that we can use the same basic functions both to build the counts for word searches and 345 | for metadata searches, which is valuable because there is a metadata-only search built in to every single ratio 346 | query. (To get the denominator values). 347 | """ 348 | self.main = " " 349 | self.operation = ','.join(self.catoperations) 350 | """ 351 | This, above is super important: the operation used is relative to the counttype, and changes to use 'catoperation' instead of 'bookoperation' 352 | That's the place that the denominator queries avoid having to do a table scan on full bookcounts that would take hours, and instead takes 353 | milliseconds. 354 | """ 355 | self.wordstables = " " 356 | self.wordswhere = " TRUE " #Just a dummy thing to make the SQL writing easier. Shouldn't take any time. 357 | 358 | def set_operations(self): 359 | 360 | """ 361 | This is the code that allows multiple values to be selected. 362 | """ 363 | 364 | backCompatability = {"Occurrences_per_Million_Words":"WordsPerMillion","Raw_Counts":"WordCount","Percentage_of_Books":"TextPercent","Number_of_Books":"TextCount"} 365 | 366 | for oldKey in backCompatability.keys(): 367 | self.counttype = [re.sub(oldKey,backCompatability[oldKey],entry) for entry in self.counttype] 368 | 369 | self.bookoperation = {} 370 | self.catoperation = {} 371 | self.finaloperation = {} 372 | 373 | #Text statistics 374 | self.bookoperation['TextPercent'] = "count(DISTINCT " + self.prefs['fastcat'] + ".bookid) as TextCount" 375 | self.bookoperation['TextRatio'] = "count(DISTINCT " + self.prefs['fastcat'] + ".bookid) as TextCount" 376 | self.bookoperation['TextCount'] = "count(DISTINCT " + self.prefs['fastcat'] + ".bookid) as TextCount" 377 | #Word Statistics 378 | self.bookoperation['WordCount'] = "sum(main.count) as WordCount" 379 | self.bookoperation['WordsPerMillion'] = "sum(main.count) as WordCount" 380 | self.bookoperation['WordsRatio'] = "sum(main.count) as WordCount" 381 | """ 382 | +Total Numbers for comparisons/significance assessments 383 | This is a little tricky. The total words is EITHER the denominator (as in a query against words per Million) or the numerator+denominator (if you're comparing 384 | Pittsburg and Pittsburgh, say, and want to know the total number of uses of the lemma. 
For now, "TotalWords" means the former and "SumWords" the latter, 385 | On the theory that 'TotalWords' is more intuitive and only I (Ben) will be using SumWords all that much. 386 | """ 387 | self.bookoperation['TotalWords'] = self.bookoperation['WordsPerMillion'] 388 | self.bookoperation['SumWords'] = self.bookoperation['WordsPerMillion'] 389 | self.bookoperation['TotalTexts'] = self.bookoperation['TextCount'] 390 | self.bookoperation['SumTexts'] = self.bookoperation['TextCount'] 391 | 392 | for stattype in self.bookoperation.keys(): 393 | if re.search("Word",stattype): 394 | self.catoperation[stattype] = "sum(nwords) as WordCount" 395 | if re.search("Text",stattype): 396 | self.catoperation[stattype] = "count(nwords) as TextCount" 397 | 398 | self.finaloperation['TextPercent'] = "IFNULL(numerator.TextCount,0)/IFNULL(denominator.TextCount,0)*100 as TextPercent" 399 | self.finaloperation['TextRatio'] = "IFNULL(numerator.TextRatio,0)/IFNULL(denominator.TextCount,0) as TextRatio" 400 | self.finaloperation['TextCount'] = "IFNULL(numerator.TextCount,0) as TextCount" 401 | 402 | self.finaloperation['WordsPerMillion'] = "IFNULL(numerator.WordCount,0)*100000000/IFNULL(denominator.WordCount,0)/100 as WordsPerMillion" 403 | self.finaloperation['WordsRatio'] = "IFNULL(numerator.WordCount,0)/IFNULL(denominator.WordCount,0) as WordsRatio" 404 | self.finaloperation['WordCount'] = "IFNULL(numerator.WordCount,0) as WordCount" 405 | 406 | self.finaloperation['TotalWords'] = "IFNULL(denominator.WordCount,0) as TotalWords" 407 | self.finaloperation['SumWords'] = "IFNULL(denominator.WordCount,0) + IFNULL(numerator.WordCount,0) as SumWords" 408 | self.finaloperation['TotalTexts'] = "IFNULL(denominator.TextCount,0) as TotalTexts" 409 | self.finaloperation['SumTexts'] = "IFNULL(denominator.TextCount,0) + IFNULL(numerator.TextCount,0) as SumTexts" 410 | 411 | """ 412 | The values here will be chosen in build_wordstables; that's what decides if it uses the 'bookoperation' or 'catoperation' dictionary to build out. 
413 | """ 414 | 415 | self.finaloperations = list() 416 | self.bookoperations = set() 417 | self.catoperations = set() 418 | 419 | for summaryStat in self.counttype: 420 | self.catoperations.add(self.catoperation[summaryStat]) 421 | self.bookoperations.add(self.bookoperation[summaryStat]) 422 | self.finaloperations.append(self.finaloperation[summaryStat]) 423 | 424 | #self.catoperation 425 | 426 | def counts_query(self): 427 | #self.bookoperation = {"Occurrences_per_Million_Words":"sum(main.count)","Raw_Counts":"sum(main.count)","Percentage_of_Books":"count(DISTINCT " + self.prefs['fastcat'] + ".bookid)","Number_of_Books":"count(DISTINCT "+ self.prefs['fastcat'] + ".bookid)"} 428 | #self.catoperation = {"Occurrences_per_Million_Words":"sum(nwords)","Raw_Counts":"sum(nwords)","Percentage_of_Books":"count(nwords)","Number_of_Books":"count(nwords)"} 429 | 430 | self.operation = ','.join(self.bookoperations) 431 | 432 | self.build_wordstables() 433 | countsQuery = """ 434 | SELECT 435 | %(selections)s, 436 | %(operation)s 437 | FROM 438 | %(catalog)s 439 | %(main)s 440 | %(wordstables)s 441 | WHERE 442 | %(catwhere)s AND %(wordswhere)s 443 | GROUP BY 444 | %(groupings)s 445 | """ % self.__dict__ 446 | return countsQuery 447 | 448 | def ratio_query(self): 449 | #if True: #In the case that we're not using a superset of words; this can be changed later 450 | # supersetGroups = [group for group in self.groups if not re.match('word',group)] 451 | # self.finalgroupings = self.groupings 452 | # for key in self.limits.keys(): 453 | # if re.match('word',key): 454 | # del self.limits[key] 455 | 456 | self.denominator = userquery(outside_dictionary = self.compare_dictionary) 457 | self.supersetquery = self.denominator.counts_query() 458 | 459 | if re.search("In_Library",self.denominator.selections): 460 | self.selections = self.selections + ", fastcat.bookid is not null as In_Library" 461 | 462 | #See above: In_Library is a dummy variable so that there's always something to join on. 463 | self.mainquery = self.counts_query() 464 | 465 | self.countcommand = ','.join(self.finaloperations) 466 | 467 | """ 468 | We launch a whole new userquery instance here to build the denominator, based on the 'compare_dictionary' option (which in most 469 | cases is the search_limits without the keys, see above. 470 | We then get the counts_query results out of that result. 471 | """ 472 | 473 | self.totalMergeTerms = "USING (" + self.denominator.groupings + " ) " 474 | self.totalselections = ",".join([re.sub(".* as","",group) for group in self.groups]) 475 | 476 | query = """ 477 | SELECT 478 | %(totalselections)s, 479 | %(countcommand)s 480 | FROM 481 | ( %(mainquery)s 482 | ) as numerator 483 | RIGHT OUTER JOIN 484 | ( %(supersetquery)s ) as denominator 485 | %(totalMergeTerms)s 486 | GROUP BY %(groupings)s;""" % self.__dict__ 487 | return query 488 | 489 | 490 | def return_slug_data(self,force=False): 491 | #Rather than understand this error, I'm just returning 0 if it fails. 492 | #Probably that's the right thing to do, though it may cause trouble later. 493 | #It's just a punishment for not later using a decent API call with "Raw_Counts" to extract these counts out, and relying on this ugly method. 494 | #Please, citizens of the future, NEVER USE THIS METHOD. 
495 | try: 496 | temp_words = self.return_n_words(force = True) 497 | temp_counts = self.return_n_books(force = True) 498 | except: 499 | temp_words = 0 500 | temp_counts = 0 501 | return [temp_counts,temp_words] 502 | 503 | def return_n_books(self,force=False): #deprecated 504 | if (not hasattr(self,'nbooks')) or force: 505 | query = "SELECT count(*) from " + self.catalog + " WHERE " + self.catwhere 506 | silent = self.cursor.execute(query) 507 | self.counts = int(self.cursor.fetchall()[0][0]) 508 | return self.counts 509 | 510 | def return_n_words(self,force=False): #deprecated 511 | if (not hasattr(self,'nwords')) or force: 512 | query = "SELECT sum(nwords) from " + self.catalog + " WHERE " + self.catwhere 513 | silent = self.cursor.execute(query) 514 | self.nwords = int(self.cursor.fetchall()[0][0]) 515 | return self.nwords 516 | 517 | def ranked_query(self,percentile_to_return = 99,addwhere = ""): 518 | #NOT CURRENTLY IN USE ANYWHERE--DELETE??? 519 | ##This returns a list of bookids in order by how well they match the sort terms. 520 | ## Using an IDF term will give better search results for case-sensitive searches, but is currently disabled 521 | ## 522 | self.LIMIT = int((100-percentile_to_return) * self.return_n_books()/100) 523 | countQuery = """ 524 | SELECT 525 | bookid, 526 | sum(main.count*1000/nwords%(idfterm)s) as score 527 | FROM %(catalog)s LEFT JOIN %(tablename)s 528 | USING (bookid) 529 | WHERE %(catwhere)s AND %(wordswhere)s 530 | GROUP BY bookid 531 | ORDER BY score DESC 532 | LIMIT %(LIMIT)s 533 | """ % self.__dict__ 534 | return countQuery 535 | 536 | def bibliography_query(self,limit = "100"): 537 | #I'd like to redo this at some point so it could work as an API call. 538 | self.limit = limit 539 | self.ordertype = "sum(main.count*10000/nwords)" 540 | try: 541 | if self.outside_dictionary['ordertype'] == "random": 542 | if self.counttype==["Raw_Counts"] or self.counttype==["Number_of_Books"] or self.counttype==['WordCount'] or self.counttype==['BookCount']: 543 | self.ordertype = "RAND()" 544 | else: 545 | self.ordertype = "LOG(1-RAND())/sum(main.count)" 546 | except KeyError: 547 | pass 548 | 549 | #If IDF searching is enabled, we could add a term like '*IDF' here to overweight better selecting words 550 | #in the event of a multiple search. 551 | self.idfterm = "" 552 | prep = self.counts_query() 553 | 554 | bibQuery = """ 555 | SELECT searchstring 556 | FROM """ % self.__dict__ + self.prefs['fullcat'] + """ RIGHT JOIN ( 557 | SELECT 558 | """+ self.prefs['fastcat'] + """.bookid, %(ordertype)s as ordering 559 | FROM 560 | %(catalog)s 561 | %(main)s 562 | %(wordstables)s 563 | WHERE 564 | %(catwhere)s AND %(wordswhere)s 565 | GROUP BY bookid ORDER BY %(ordertype)s DESC LIMIT %(limit)s 566 | ) as tmp USING(bookid) ORDER BY ordering DESC; 567 | """ % self.__dict__ 568 | return bibQuery 569 | 570 | def disk_query(self,limit="100"): 571 | pass 572 | 573 | def return_books(self): 574 | #This preps up the display elements for a search: it returns an array with a single string for each book, sorted in the best possible way 575 | silent = self.cursor.execute(self.bibliography_query()) 576 | returnarray = [] 577 | for line in self.cursor.fetchall(): 578 | returnarray.append(line[0]) 579 | if not returnarray: 580 | #why would someone request a search with no locations? Turns out (usually) because the smoothing tricked them. 
581 | returnarray.append("No results for this particular point: try again without smoothing") 582 | newerarray = self.custom_SearchString_additions(returnarray) 583 | return json.dumps(newerarray) 584 | 585 | def search_results(self): 586 | #This is an alias that is handled slightly differently in APIimplementation (no "RESULTS" bit in front). Once 587 | #that legacy code is cleared out, they can be one and the same. 588 | return json.loads(self.return_books()) 589 | 590 | def getActualSearchedWords(self): 591 | if len(self.wordswhere) > 7: 592 | words = self.outside_dictionary['search_limits']['word'] 593 | #Break bigrams into single words. 594 | words = ' '.join(words).split(' ') 595 | self.cursor.execute("""SELECT word FROM wordsheap WHERE """ + where_from_hash({self.word_field:words})) 596 | self.actualWords =[item[0] for item in self.cursor.fetchall()] 597 | else: 598 | self.actualWords = ["tasty","mistake","happened","here"] 599 | 600 | def custom_SearchString_additions(self,returnarray): 601 | db = self.outside_dictionary['database'] 602 | if db in ('jstor','presidio','ChronAm','LOC'): 603 | self.getActualSearchedWords() 604 | if db=='jstor': 605 | joiner = "&searchText=" 606 | preface = "?Search=yes&searchText=" 607 | urlRegEx = "http://www.jstor.org/stable/\d+" 608 | if db=='presidio': 609 | joiner = "+" 610 | preface = "#page/1/mode/2up/search/" 611 | urlRegEx = 'http://archive.org/stream/[^"# ><]*' 612 | if db in ('ChronAm','LOC'): 613 | preface = "/;words=" 614 | joiner = "+" 615 | urlRegEx = 'http://chroniclingamerica.loc.gov[^\"><]*/seq-\d+' 616 | newarray = [] 617 | for string in returnarray: 618 | base = re.findall(urlRegEx,string)[0] 619 | newcore = ' search inside ' 620 | string = re.sub("^","",string) 621 | string = re.sub("$","",string) 622 | string = string+newcore 623 | newarray.append(string) 624 | #Arxiv is messier, requiring a whole different URL interface: http://search.arxiv.org:8081/paper.jsp?r=1204.3352&qs=network 625 | else: 626 | newarray = returnarray 627 | return newarray 628 | 629 | def return_query_values(self,query = "ratio_query"): 630 | #The API returns a dictionary with years pointing to values. 631 | values = [] 632 | querytext = getattr(self,query)() 633 | silent = self.cursor.execute(querytext) 634 | #Gets the results 635 | mydict = dict(self.cursor.fetchall()) 636 | try: 637 | for key in mydict.keys(): 638 | #Only return results inside the time limits 639 | if key >= self.time_limits[0] and key <= self.time_limits[1]: 640 | mydict[key] = str(mydict[key]) 641 | else: 642 | del mydict[key] 643 | mydict = smooth_function(mydict,smooth_method = self.smoothingType,span = self.smoothingSpan) 644 | 645 | except: 646 | mydict = {0:"0"} 647 | 648 | #This is a good place to change some values. 
649 | try: 650 | return {'index':self.index, 'Name':self.words_searched,"values":mydict,'words_searched':""} 651 | except: 652 | return{'values':mydict} 653 | 654 | def arrayNest(self,array,returnt,endLength=1): 655 | #A recursive function to transform a list into a nested array 656 | key = array[0] 657 | key = to_unicode(key) 658 | if len(array)==endLength+1: 659 | #This is the condition where we have the last two, which is where we no longer need to nest anymore: 660 | #it's just the last value[key] = value 661 | value = list(array[1:]) 662 | for i in range(len(value)): 663 | try: 664 | value[i] = float(value[i]) 665 | except: 666 | pass 667 | returnt[key] = value 668 | else: 669 | try: 670 | returnt[key] = self.arrayNest(array[1:len(array)],returnt[key],endLength=endLength) 671 | except KeyError: 672 | returnt[key] = self.arrayNest(array[1:len(array)],dict(),endLength=endLength) 673 | return returnt 674 | 675 | def return_json(self,query='ratio_query'): 676 | querytext = getattr(self,query)() 677 | silent = self.cursor.execute(querytext) 678 | names = [to_unicode(item[0]) for item in self.cursor.description] 679 | returnt = dict() 680 | lines = self.cursor.fetchall() 681 | for line in lines: 682 | returnt = self.arrayNest(line,returnt,endLength = len(self.counttype)) 683 | return returnt 684 | 685 | def return_tsv(self,query = "ratio_query"): 686 | if self.outside_dictionary['counttype']=="Raw_Counts" or self.outside_dictionary['counttype']==["Raw_Counts"]: 687 | query="counts_query" 688 | #This allows much speedier access to counts data if you're willing not to know about all the zeroes. 689 | querytext = getattr(self,query)() 690 | silent = self.cursor.execute(querytext) 691 | results = ["\t".join([to_unicode(item[0]) for item in self.cursor.description])] 692 | lines = self.cursor.fetchall() 693 | for line in lines: 694 | items = [] 695 | for item in line: 696 | item = to_unicode(item) 697 | item = re.sub("\t","",item) 698 | items.append(item) 699 | results.append("\t".join(items)) 700 | return "\n".join(results) 701 | 702 | def export_data(self,query1="ratio_query"): 703 | self.smoothing=0 704 | return self.return_query_values(query=query1) 705 | 706 | def execute(self): 707 | #This performs the query using the method specified in the passed parameters. 708 | if self.method=="Nothing": 709 | pass 710 | else: 711 | return getattr(self,self.method)() 712 | 713 | 714 | ############# 715 | ##GENERAL#### #These are general purpose functional types of things not implemented in the class. 716 | ############# 717 | 718 | def to_unicode(obj, encoding='utf-8'): 719 | if isinstance(obj, basestring): 720 | if not isinstance(obj, unicode): 721 | obj = unicode(obj, encoding) 722 | elif isinstance(obj,int): 723 | obj=unicode(str(obj),encoding) 724 | else: 725 | obj = unicode(str(obj),encoding) 726 | return obj 727 | 728 | def where_from_hash(myhash,joiner=" AND ",comp = " = ",quotesep=None): 729 | whereterm = [] 730 | #The general idea here is that we try to break everything in search_limits down to a list, and then create a whereterm on that joined by whatever the 'joiner' is ("AND" or "OR"), with the comparison as whatever comp is ("=",">=",etc.). 731 | #For more complicated bits, it gets all recursive until the bits are in terms of list. 732 | for key in myhash.keys(): 733 | values = myhash[key] 734 | if isinstance(values,basestring) or isinstance(values,int) or isinstance(values,float): 735 | #This is just error handling. 
You can pass a single value instead of a list if you like, and it will just convert it 736 | #to a list for you. 737 | values = [values] 738 | #Or queries are special, since the default is "AND". This toggles that around for a subportion. 739 | if key=='$or' or key=="$OR": 740 | for comparison in values: 741 | whereterm.append(where_from_hash(comparison,joiner=" OR ",comp=comp)) 742 | #The or doesn't get populated any farther down. 743 | elif isinstance(values,dict): 744 | #Certain function operators can use MySQL terms. These are the only cases that a dict can be passed as a limitations 745 | operations = {"$gt":">","$ne":"!=","$lt":"<","$grep":" REGEXP ","$gte":">=","$lte":"<=","$eq":"="} 746 | for operation in values.keys(): 747 | whereterm.append(where_from_hash({key:values[operation]},comp=operations[operation],joiner=joiner)) 748 | elif isinstance(values,list): 749 | #and this is where the magic actually happens 750 | if isinstance(values[0],dict): 751 | for entry in values: 752 | whereterm.append(where_from_hash(entry)) 753 | else: 754 | if quotesep is None: 755 | if isinstance(values[0],basestring): 756 | quotesep="'" 757 | else: 758 | quotesep = "" 759 | #Note the "OR" here. There's no way to pass in a query like "year=1876 AND year=1898" as currently set up. 760 | #Obviously that's no great loss, but there might be something I'm missing that would be. 761 | whereterm.append(" (" + " OR ".join([" (" + key+comp+quotesep+to_unicode(value)+quotesep+") " for value in values])+ ") ") 762 | return "(" + joiner.join(whereterm) + ")" 763 | #This works pretty well, except that it requires very specific sorts of terms going in, I think. 764 | 765 | 766 | 767 | #I'd rather have all this smoothing stuff done at the client side, but currently it happens here. 768 | def smooth_function(zinput,smooth_method = 'lowess',span = .05): 769 | if smooth_method not in ['lowess','triangle','rectangle']: 770 | return zinput 771 | xarray = [] 772 | yarray = [] 773 | years = zinput.keys() 774 | years.sort() 775 | for key in years: 776 | if zinput[key]!='None': 777 | xarray.append(float(key)) 778 | yarray.append(float(zinput[key])) 779 | from numpy import array 780 | x = array(xarray) 781 | y = array(yarray) 782 | if smooth_method == 'lowess': 783 | #print "starting lowess smoothing
" 784 | from Bio.Statistics.lowess import lowess 785 | smoothed = lowess(x,y,float(span)/100,3) 786 | x = [int(p) for p in x] 787 | returnval = dict(zip(x,smoothed)) 788 | return returnval 789 | if smooth_method == 'rectangle': 790 | from math import log 791 | #print "starting triangle smoothing
" 792 | span = int(span) #Takes the floor--so no smoothing on a span < 1. 793 | returnval = zinput 794 | windowsize = span*2 + 1 795 | from numpy import average 796 | for i in range(len(xarray)): 797 | surrounding = array(range(windowsize),dtype=float) 798 | weights = array(range(windowsize),dtype=float) 799 | for j in range(windowsize): 800 | key_dist = j - span #if span is 2, the zeroeth element is -2, the second element is 0 off, etc. 801 | workingon = i + key_dist 802 | if workingon >= 0 and workingon < len(xarray): 803 | surrounding[j] = float(yarray[workingon]) 804 | weights[j] = 1 805 | else: 806 | surrounding[j] = 0 807 | weights[j] = 0 808 | returnval[xarray[i]] = round(average(surrounding,weights=weights),3) 809 | return returnval 810 | if smooth_method == 'triangle': 811 | from math import log 812 | #print "starting triangle smoothing
" 813 | span = int(span) #Takes the floor--so no smoothing on a span < 1. 814 | returnval = zinput 815 | windowsize = span*2 + 1 816 | from numpy import average 817 | for i in range(len(xarray)): 818 | surrounding = array(range(windowsize),dtype=float) 819 | weights = array(range(windowsize),dtype=float) 820 | for j in range(windowsize): 821 | key_dist = j - span #if span is 2, the zeroeth element is -2, the second element is 0 off, etc. 822 | workingon = i + key_dist 823 | if workingon >= 0 and workingon < len(xarray): 824 | surrounding[j] = float(yarray[workingon]) 825 | #This isn't actually triangular smoothing: I dampen it by the logs, to keep the peaks from being too too big. 826 | #The minimum is '2', since log(1) == 0, which is a nonesense weight. 827 | weights[j] = log(span + 2 - abs(key_dist)) 828 | else: 829 | surrounding[j] = 0 830 | weights[j] = 0 831 | 832 | returnval[xarray[i]] = round(average(surrounding,weights=weights),3) 833 | return returnval 834 | 835 | #The idea is: this works by default by slurping up from the command line, but you could also load the functions in and run results on your own queries. 836 | try: 837 | command = str(sys.argv[1]) 838 | command = json.loads(command) 839 | #Got to go before we let anything else happen. 840 | print command 841 | p = userqueries(command) 842 | result = p.execute() 843 | print json.dumps(result) 844 | except: 845 | pass 846 | 847 | -------------------------------------------------------------------------------- /bookworm/MetaWorm.py: -------------------------------------------------------------------------------- 1 | import pandas 2 | import json 3 | import copy 4 | import threading 5 | import time 6 | from collections import defaultdict 7 | 8 | def hostlist(dblist): 9 | #This could do something fancier, but for now we look by default only on localhost. 10 | return ["localhost"]*len(dblist) 11 | 12 | class childQuery(threading.Thread): 13 | def __init__(self,dictJSON,host): 14 | super(SummingThread, self).__init__() 15 | self.dict = json.dumps(dict) 16 | self.host = host 17 | 18 | def runQuery(self): 19 | #make a webquery, assign it to self.data 20 | url = self.host + "/cgi-bin/bookwormAPI?query=" + self.dict 21 | 22 | def parseResults(self): 23 | pass 24 | #return json.loads(self.data) 25 | 26 | def run(self): 27 | self.runQuery() 28 | 29 | def flatten(dictOfdicts): 30 | """ 31 | Recursive function: transforms a dict with nested entries like 32 | foo["a"]["b"]["c"] = 3 33 | to one with tuple entries like 34 | fooPrime[("a","b","c")] = 3 35 | """ 36 | output = [] 37 | for (key,value) in dictOfdicts.iteritems(): 38 | if isinstance(value,dict): 39 | output.append([(key),value]) 40 | else: 41 | children = flatten(value) 42 | for child in children: 43 | output.append([(key,) + child[0],child[1]]) 44 | return output 45 | 46 | def animate(dictOfTuples): 47 | """ 48 | opposite of flatten 49 | """ 50 | 51 | def tree(): 52 | return defaultdict(tree) 53 | 54 | output = defaultdict(tree) 55 | 56 | 57 | 58 | def combineDicts(master,new): 59 | """ 60 | instead of a dict of dicts of arbitrary depth, use a dict of tuples to store. 
61 | """ 62 | 63 | for (keysequence, valuesequence) in flatten(new): 64 | try: 65 | master[keysequence] = map(sum,zip(master[keysequence],valuesequence)) 66 | except KeyError: 67 | master[keysequence] = valuesequence 68 | return dict1 69 | 70 | class MetaQuery(object): 71 | def __init__(self,dictJSON): 72 | self.outside_outdictionary = json.dumps(dictJSON) 73 | 74 | def setDefaults(self): 75 | for specialKey in ["database","host"]: 76 | try: 77 | if isinstance(self.outside_dictionary[specialKey],basestring): 78 | #coerce strings to list: 79 | self.outside_dictionary[specialKey] = [self.outside_dictionary[specialKey]] 80 | except KeyError: 81 | #It's OK not to define host. 82 | if specialKey=="host": 83 | pass 84 | 85 | if 'host' not in self.outside_dictionary: 86 | #Build a hostlist: usually just localhost a bunch of times. 87 | self.outside_dictionary['host'] = hostlist(self.outside_dictionary['database']) 88 | 89 | for (target, dest) in [("database","host"),("host","database")]: 90 | #Expand out so you can search for the same database on multiple databases, or multiple databases on the same host. 91 | if len(self.outside_dictionary[target])==1 and len(self.outside_dictionary[dest]) != 1: 92 | self.outside_dictionary[target] = self.outside_dictionary[target] * len(self.outside_dictionary[dest]) 93 | 94 | 95 | def buildChildren(self): 96 | desiredCounts = [] 97 | for (host,dbname) in zip(self.outside_dictionary["host"],self.outside_dictionary["database"]): 98 | query = copy.deepcopy(self.outside_dictionary) 99 | del(query['host']) 100 | query['database'] = dbname 101 | 102 | desiredCounts.append(childQuery(query,host)) 103 | self.children = desiredCounts 104 | 105 | def runChildren(self): 106 | for child in self.children: 107 | child.start() 108 | 109 | def combineChildren(self): 110 | complete = dict() 111 | while (threading.enumerate()): 112 | for child in self.children: 113 | if not child.is_alive(): 114 | complete=combineDicts(complete,child.parseResult()) 115 | time.sleep(.05) 116 | 117 | def return_json(self): 118 | pass 119 | 120 | 121 | 122 | -------------------------------------------------------------------------------- /bookworm/SQLAPI.py: -------------------------------------------------------------------------------- 1 | #!/usr/local/bin/python 2 | 3 | import sys 4 | import json 5 | import cgi 6 | import re 7 | import numpy #used for smoothing. 8 | import copy 9 | import decimal 10 | import MySQLdb 11 | import warnings 12 | import hashlib 13 | 14 | """ 15 | #There are 'fast' and 'full' tables for books and words; 16 | #that's so memory tables can be used in certain cases for fast, hashed matching, but longer form data (like book titles) 17 | can be stored on disk. Different queries use different types of calls. 18 | #Also, certain metadata fields are stored separately from the main catalog table; 19 | """ 20 | 21 | from knownHosts import * 22 | 23 | class dbConnect(object): 24 | #This is a read-only account 25 | def __init__(self,prefs): 26 | self.dbname = prefs['database'] 27 | self.db = MySQLdb.connect(host=prefs['HOST'],read_default_file = prefs['read_default_file'],use_unicode='True',charset='utf8',db=prefs['database']) 28 | self.cursor = self.db.cursor() 29 | 30 | # The basic object here is a 'userquery:' it takes dictionary as input, as defined in the API, and returns a value 31 | # via the 'execute' function whose behavior 32 | # depends on the mode that is passed to it. 33 | # Given the dictionary, it can return a number of objects. 
34 | # The "Search_limits" array in the passed dictionary determines how many elements it returns; this lets multiple queries be bundled together. 35 | # Most functions describe a subquery that might be combined into one big query in various ways. 36 | 37 | class userqueries: 38 | #This is a set of userqueries that are bound together; each element in search limits is iterated over, and we're done. 39 | #currently used for various different groups sent in a bundle (multiple lines on a Bookworm chart). 40 | #A sufficiently sophisticated 'group by' search might make this unnecessary. 41 | #But until that day, it's useful to be able to return lists of elements, which happens in here. 42 | 43 | def __init__(self,outside_dictionary = {"counttype":["Percentage_of_Books"],"search_limits":[{"word":["polka dot"],"LCSH":["Fiction"]}]},db = None): 44 | try: 45 | self.database = outside_dictionary.setdefault('database', 'default') 46 | prefs = general_prefs[self.database] 47 | except KeyError: #If it's not in the option, use some default preferences and search on localhost. This will work in most cases here on out. 48 | prefs = general_prefs['default'] 49 | prefs['database'] = self.database 50 | self.prefs = prefs 51 | 52 | self.wordsheap = prefs['fastword'] 53 | self.words = prefs['fullword'] 54 | if 'search_limits' not in outside_dictionary.keys(): 55 | outside_dictionary['search_limits'] = [{}] 56 | #coerce one-element dictionaries to an array. 57 | if isinstance(outside_dictionary['search_limits'],dict): 58 | #(allowing passing of just single dictionaries instead of arrays) 59 | outside_dictionary['search_limits'] = [outside_dictionary['search_limits']] 60 | self.returnval = [] 61 | self.queryInstances = [] 62 | db = dbConnect(prefs) 63 | databaseScheme = databaseSchema(db) 64 | for limits in outside_dictionary['search_limits']: 65 | mylimits = copy.deepcopy(outside_dictionary) 66 | mylimits['search_limits'] = limits 67 | localQuery = userquery(mylimits,db=db,databaseScheme=databaseScheme) 68 | self.queryInstances.append(localQuery) 69 | self.returnval.append(localQuery.execute()) 70 | 71 | def execute(self): 72 | 73 | return self.returnval 74 | 75 | 76 | class userquery: 77 | def __init__(self,outside_dictionary = {"counttype":["Percentage_of_Books"],"search_limits":{"word":["polka dot"],"LCSH":["Fiction"]}},db=None,databaseScheme=None): 78 | #Certain constructions require a DB connection already available, so we just start it here, or use the one passed to it. 79 | try: 80 | self.prefs = general_prefs[outside_dictionary['database']] 81 | except KeyError: 82 | #If it's not in the option, use some default preferences and search on localhost. This will work in most cases here on out. 83 | self.prefs = general_prefs['default'] 84 | self.prefs['database'] = outside_dictionary['database'] 85 | self.outside_dictionary = outside_dictionary 86 | #self.prefs = general_prefs[outside_dictionary.setdefault('database','presidio')] 87 | self.db = db 88 | if db is None: 89 | self.db = dbConnect(self.prefs) 90 | self.databaseScheme = databaseScheme 91 | if databaseScheme is None: 92 | self.databaseScheme = databaseSchema(self.db) 93 | 94 | self.cursor = self.db.cursor 95 | self.wordsheap = self.prefs['fastword'] 96 | self.words = self.prefs['fullword'] 97 | """ 98 | I'm now allowing 'search_limits' to either be a dictionary or an array of dictionaries: 99 | this makes the syntax cleaner on most queries, 100 | while still allowing some long ones from the Bookworm website. 
101 | """ 102 | try: 103 | if isinstance(outside_dictionary['search_limits'],list): 104 | outside_dictionary['search_limits'] = outside_dictionary['search_limits'][0] 105 | except: 106 | outside_dictionary['search_limits'] = dict() 107 | #outside_dictionary = self.limitCategoricalQueries(outside_dictionary) 108 | self.defaults(outside_dictionary) #Take some defaults 109 | self.derive_variables() #Derive some useful variables that the query will use. 110 | 111 | def defaults(self,outside_dictionary): 112 | #these are default values;these are the only values that can be set in the query 113 | #search_limits is an array of dictionaries; 114 | #each one contains a set of limits that are mutually independent 115 | #The other limitations are universal for all the search limits being set. 116 | 117 | #Set up a dictionary for the denominator of any fraction if it doesn't already exist: 118 | self.search_limits = outside_dictionary.setdefault('search_limits',[{"word":["polka dot"]}]) 119 | self.words_collation = outside_dictionary.setdefault('words_collation',"Case_Insensitive") 120 | 121 | lookups = {"Case_Insensitive":'word','lowercase':'lowercase','casesens':'casesens',"case_insensitive":"word","Case_Sensitive":"casesens","All_Words_with_Same_Stem":"stem",'stem':'stem'} 122 | self.word_field = str(MySQLdb.escape_string(lookups[self.words_collation])) 123 | 124 | self.time_limits = outside_dictionary.setdefault('time_limits',[0,10000000]) 125 | self.time_measure = outside_dictionary.setdefault('time_measure','year') 126 | 127 | self.groups = set() 128 | self.outerGroups = [] #[] #Only used on the final join; directionality matters, unlike for the other ones. 129 | self.finalMergeTables=set() 130 | try: 131 | groups = outside_dictionary['groups'] 132 | except: 133 | groups = [outside_dictionary['time_measure']] 134 | 135 | if groups == [] or groups == ["unigram"]: 136 | #Set an arbitrary column name that will always be true if nothing else is set. 137 | groups.insert(0,"1 as In_Library") 138 | 139 | if (len (groups) > 1): 140 | pass 141 | #self.groups = credentialCheckandClean(self.groups) 142 | #Define some sort of limitations here, if not done in dbbindings.py 143 | 144 | for group in groups: 145 | 146 | #There's a special set of rules for how to handle unigram and bigrams 147 | multigramSearch = re.match("(unigram|bigram|trigram)(\d)?",group) 148 | 149 | if multigramSearch: 150 | if group=="unigram": 151 | gramPos = "1" 152 | gramType = "unigram" 153 | 154 | else: 155 | gramType = multigramSearch.groups()[0] 156 | try: 157 | gramPos = multigramSearch.groups()[1] 158 | except: 159 | print "currently you must specify which bigram element you want (eg, 'bigram1')" 160 | raise 161 | 162 | lookupTableName = "%sLookup%s" %(gramType,gramPos) 163 | self.outerGroups.append("%s.%s as %s" %(lookupTableName,self.word_field,group)) 164 | self.finalMergeTables.add(" JOIN wordsheap as %s ON %s.wordid=w%s" %(lookupTableName,lookupTableName,gramPos)) 165 | self.groups.add("words%s.wordid as w%s" %(gramPos,gramPos)) 166 | 167 | else: 168 | self.outerGroups.append(group) 169 | try: 170 | if self.databaseScheme.aliases[group] != group: 171 | #Search on the ID field, not the basic field. 
172 | #debug(self.databaseScheme.aliases.keys()) 173 | self.groups.add(self.databaseScheme.aliases[group]) 174 | table = self.databaseScheme.tableToLookIn[group] 175 | 176 | joinfield = self.databaseScheme.aliases[group] 177 | self.finalMergeTables.add(" JOIN " + table + " USING (" + joinfield + ") ") 178 | else: 179 | self.groups.add(group) 180 | except KeyError: 181 | self.groups.add(group) 182 | 183 | """ 184 | There are the selections which can include table refs, and the groupings, which may not: 185 | and the final suffix to enable fast lookup 186 | """ 187 | 188 | self.selections = ",".join(self.groups) 189 | self.groupings = ",".join([re.sub(".* as","",group) for group in self.groups]) 190 | 191 | self.joinSuffix = "" + " ".join(self.finalMergeTables) 192 | 193 | """ 194 | Define the comparison set if a comparison is being done. 195 | """ 196 | #Deprecated--tagged for deletion 197 | #self.determineOutsideDictionary() 198 | 199 | #This is a little tricky behavior here--hopefully it works in all cases. It drops out word groupings. 200 | 201 | self.counttype = outside_dictionary.setdefault('counttype',["WordCount"]) 202 | 203 | if isinstance(self.counttype,basestring): 204 | self.counttype = [self.counttype] 205 | 206 | #index is deprecated,but the old version uses it. 207 | self.index = outside_dictionary.setdefault('index',0) 208 | """ 209 | #Ordinarily, the input should be an an array of groups that will both select and group by. 210 | #The joins may be screwed up by certain names that exist in multiple tables, so there's an option to do something like 211 | #SELECT catalog.bookid as myid, because WHERE clauses on myid will work but GROUP BY clauses on catalog.bookid may not 212 | #after a sufficiently large number of subqueries. 213 | #This smoothing code really ought to go somewhere else, since it doesn't quite fit into the whole API mentality and is 214 | #more about the webpage. It is only included here as a stopgap: NO FURTHER APPLICATIONS USING IT SHOULD BE BUILT. 215 | """ 216 | 217 | self.smoothingType = outside_dictionary.setdefault('smoothingType',"triangle") 218 | self.smoothingSpan = outside_dictionary.setdefault('smoothingSpan',3) 219 | self.method = outside_dictionary.setdefault('method',"Nothing") 220 | 221 | def determineOutsideDictionary(self): 222 | """ 223 | deprecated--tagged for deletion. 224 | """ 225 | self.compare_dictionary = copy.deepcopy(self.outside_dictionary) 226 | if 'compare_limits' in self.outside_dictionary.keys(): 227 | self.compare_dictionary['search_limits'] = self.outside_dictionary['compare_limits'] 228 | del self.outside_dictionary['compare_limits'] 229 | elif sum([bool(re.search(r'\*',string)) for string in self.outside_dictionary['search_limits'].keys()]) > 0: 230 | #If any keys have stars at the end, drop them from the compare set 231 | #This is often a _very_ helpful definition for succinct comparison queries of many types. 232 | #The cost is that an asterisk doesn't allow you 233 | 234 | for key in self.outside_dictionary['search_limits'].keys(): 235 | if re.search(r'\*',key): 236 | #rename the main one to not have a star 237 | self.outside_dictionary['search_limits'][re.sub(r'\*','',key)] = self.outside_dictionary['search_limits'][key] 238 | #drop it from the compare_limits and delete the version in the search_limits with a star 239 | del self.outside_dictionary['search_limits'][key] 240 | del self.compare_dictionary['search_limits'][key] 241 | else: #if nothing specified, we compare the word to the corpus. 
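#Sketch of that default (limits invented): search_limits = {"word": ["whale"], "year": [1900]}
#keeps both terms in the numerator, while the loop below drops the word-type key from the
#compare set, leaving {"year": [1900]} as the denominator query.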
242 | deleted = False 243 | for key in self.outside_dictionary['search_limits'].keys(): 244 | if re.search('words?\d',key) or re.search('gram$',key) or re.match(r'word',key): 245 | del self.compare_dictionary['search_limits'][key] 246 | deleted = True 247 | if not deleted: 248 | #If there are no words keys, just delete the first key of any type. 249 | #Sort order can't be assumed, but this is a useful failure mechanism of last resort. Maybe. 250 | try: 251 | del self.compare_dictionary['search_limits'][self.outside_dictionary['search_limits'].keys()[0]] 252 | except: 253 | pass 254 | """ 255 | The grouping behavior here is not desirable, but I'm not quite sure how yet. 256 | Aha--one way is that it accidentally drops out a bunch of options. I'm just disabling it: let's see what goes wrong now. 257 | """ 258 | try: 259 | pass#self.compare_dictionary['groups'] = [group for group in self.compare_dictionary['groups'] if not re.match('word',group) and not re.match("[u]?[bn]igram",group)]# topicfix? and not re.match("topic",group)] 260 | except: 261 | self.compare_dictionary['groups'] = [self.compare_dictionary['time_measure']] 262 | 263 | 264 | def derive_variables(self): 265 | #These are locally useful, and depend on the search limits put in. 266 | self.limits = self.search_limits 267 | #Treat empty constraints as nothing at all, not as full restrictions. 268 | for key in self.limits.keys(): 269 | if self.limits[key] == []: 270 | del self.limits[key] 271 | self.set_operations() 272 | self.create_catalog_table() 273 | self.make_catwhere() 274 | self.make_wordwheres() 275 | 276 | def tablesNeededForQuery(self,fieldNames=[]): 277 | db = self.db 278 | neededTables = set() 279 | tablenames = dict() 280 | tableDepends = dict() 281 | db.cursor.execute("SELECT dbname,alias,tablename,dependsOn FROM masterVariableTable JOIN masterTableTable USING (tablename);") 282 | for row in db.cursor.fetchall(): 283 | tablenames[row[0]] = row[2] 284 | tableDepends[row[2]] = row[3] 285 | 286 | for fieldname in fieldNames: 287 | parent = "" 288 | try: 289 | current = tablenames[fieldname] 290 | neededTables.add(current) 291 | n = 1 292 | while parent not in ['fastcat','wordsheap']: 293 | parent = tableDepends[current] 294 | neededTables.add(parent) 295 | current = parent; 296 | n+=1 297 | if n > 100: 298 | raise TypeError("Unable to handle this; seems like a recursion loop in the table definitions.") 299 | #This will add 'fastcat' or 'wordsheap' exactly once per entry 300 | except KeyError: 301 | pass 302 | 303 | return neededTables 304 | 305 | def create_catalog_table(self): 306 | self.catalog = self.prefs['fastcat'] #'catalog' #Can be replaced with a more complicated query in the event of longer joins. 307 | 308 | """ 309 | This should check query constraints against a list of tables, and join to them. 310 | So if you query with a limit on LCSH, and LCSH is listed as being in a separate table, 311 | it joins the table "LCSH" to catalog; and then that table has one column, ALSO 312 | called "LCSH", which is matched against. This allows a bookid to be a member of multiple catalogs. 313 | """ 314 | 315 | #for limitation in self.prefs['separateDataTables']: 316 | # #That re.sub thing is in here because sometimes I do queries that involve renaming. 
317 | # if limitation in [re.sub(" .*","",key) for key in self.limits.keys()] or limitation in [re.sub(" .*","",group) for group in self.groups]: 318 | # self.catalog = self.catalog + """ JOIN """ + limitation + """ USING (bookid)""" 319 | 320 | """ 321 | Here it just pulls every variable and where to look for it. 322 | """ 323 | 324 | 325 | self.relevantTables = set() 326 | 327 | databaseScheme = self.databaseScheme 328 | columns = [] 329 | for columnInQuery in [re.sub(" .*","",key) for key in self.limits.keys()] + [re.sub(" .*","",group) for group in self.groups]: 330 | columns.append(columnInQuery) 331 | try: 332 | self.relevantTables.add(databaseScheme.tableToLookIn[columnInQuery]) 333 | try: 334 | self.relevantTables.add(databaseScheme.tableToLookIn[databaseScheme.anchorFields[columnInQuery]]) 335 | try: 336 | self.relevantTables.add(databaseScheme.tableToLookIn[databaseScheme.anchorFields[databaseScheme.anchorFields[columnInQuery]]]) 337 | except KeyError: 338 | pass 339 | except KeyError: 340 | pass 341 | except KeyError: 342 | pass 343 | #Could raise as well--shouldn't be errors--but this helps back-compatability. 344 | 345 | # if "catalog" in self.relevantTables and self.method != "bibliography_query": 346 | # self.relevantTables.remove('catalog') 347 | try: 348 | moreTables = self.tablesNeededForQuery(columns) 349 | except MySQLdb.ProgrammingError: 350 | #What happens on old-style Bookworm constructions. 351 | moreTables = set() 352 | self.relevantTables = self.relevantTables.union(moreTables) 353 | self.catalog = "fastcat" 354 | for table in self.relevantTables: 355 | if table!="fastcat" and table!="words" and table!="wordsheap" and table!="master_bookcounts" and table!="master_bigrams": 356 | self.catalog = self.catalog + """ NATURAL JOIN """ + table + " " 357 | 358 | def make_catwhere(self): 359 | #Where terms that don't include the words table join. Kept separate so that we can have subqueries only working on one half of the stack. 360 | catlimits = dict() 361 | for key in self.limits.keys(): 362 | ###Warning--none of these phrases can be used ina bookworm as a custom table names. 363 | if key not in ('word','word1','word2','hasword') and not re.search("words\d",key): 364 | catlimits[key] = self.limits[key] 365 | if len(catlimits.keys()) > 0: 366 | self.catwhere = where_from_hash(catlimits) 367 | else: 368 | self.catwhere = "TRUE" 369 | if 'hasword' in self.limits.keys(): 370 | """ 371 | Because derived tables don't carry indexes, we're just making the new tables 372 | with indexes on the fly to be stored in a temporary database, "bookworm_scratch" 373 | Each time a hasword query is performed, the results of that query are permanently cached; 374 | they're stored as a table that can be used in the future. 375 | 376 | This will create problems if database contents are changed; there needs to be some mechanism for 377 | clearing out the cache periodically. 378 | """ 379 | 380 | if self.limits['hasword'] == []: 381 | del self.limits['hasword'] 382 | return 383 | 384 | #deepcopy lets us get a real copy of the dictionary 385 | #that can be changed without affecting the old one. 386 | mydict = copy.deepcopy(self.outside_dictionary) 387 | # This may make it take longer than it should; we might want the list to 388 | # just be every bookid with the given word rather than 389 | # filtering by the limits as well. 390 | # It's not obvious to me which will be faster. 
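#Rough example (field values invented): search_limits = {"hasword": ["whale", "ship"], "LCSH": ["Fiction"]}
#pops "ship" off as this pass's 'word' term and recurses on {"hasword": ["whale"], ...}, so each
#hasword entry becomes one nested bookid list cached as a table in bookworm_scratch.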
391 | mydict['search_limits'] = copy.deepcopy(self.limits) 392 | if isinstance(mydict['search_limits']['hasword'],basestring): 393 | #Make sure it's an array 394 | mydict['search_limits']['hasword'] = [mydict['search_limits']['hasword']] 395 | """ 396 | #Ideally, this would shuffle into an order ensuring that the 397 | rarest words were nested deepest. 398 | #That would speed up query execution by ensuring there 399 | wasn't some massive search for 'the' being 400 | #done at the end. 401 | 402 | Instead, it just pops off the last element and sets up a 403 | recursive nested join. for every element in the 404 | array. 405 | """ 406 | mydict['search_limits']['word'] = [mydict['search_limits']['hasword'].pop()] 407 | if len(mydict['search_limits']['hasword'])==0: 408 | del mydict['search_limits']['hasword'] 409 | tempquery = userquery(mydict,databaseScheme=self.databaseScheme) 410 | listofBookids = tempquery.bookid_query() 411 | 412 | #Unique identifier for the query that persists across the 413 | #various subqueries. 414 | queryID = hashlib.sha1(listofBookids).hexdigest()[:20] 415 | 416 | tmpcatalog = "bookworm_scratch.tmp" + re.sub("-","",queryID) 417 | 418 | try: 419 | self.cursor.execute("CREATE TABLE %s (bookid MEDIUMINT, PRIMARY KEY (bookid)) ENGINE=MYISAM;" %tmpcatalog) 420 | self.cursor.execute("INSERT IGNORE INTO %s %s;" %(tmpcatalog,listofBookids)) 421 | 422 | except MySQLdb.OperationalError,e: 423 | #Usually the error will be 1050, which is a good thing: it means we don't need to 424 | #create the table. 425 | #If it's not, something bad is happening. 426 | if not re.search("1050.*already exists",str(e)): 427 | raise 428 | self.catalog += " NATURAL JOIN %s "%(tmpcatalog) 429 | 430 | 431 | def make_wordwheres(self): 432 | self.wordswhere = " TRUE " 433 | self.max_word_length = 0 434 | limits = [] 435 | """ 436 | "unigram" or "bigram" can be used as an alias for "word" in the search_limits field. 437 | """ 438 | 439 | for gramterm in ['unigram','bigram']: 440 | if gramterm in self.limits.keys() and not "word" in self.limits.keys(): 441 | self.limits['word'] = self.limits[gramterm] 442 | del self.limits[gramterm] 443 | 444 | if 'word' in self.limits.keys(): 445 | """ 446 | This doesn't currently allow mixing of one and two word searches together in a logical way. 447 | It might be possible to just join on both the tables in MySQL--I'm not completely sure what would happen. 448 | But the philosophy has been to keep users from doing those searches as far as possible in any case. 
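A single multi-word phrase, on the other hand, is straightforward: a (made-up) limit like
{"word": ["white whale"]} is split on spaces, each token is resolved to wordids via wordsheap,
and the resulting clause constrains words1.wordid and words2.wordid.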
449 | """ 450 | for phrase in self.limits['word']: 451 | locallimits = dict() 452 | array = phrase.split(" ") 453 | n = 0 454 | for word in array: 455 | n += 1 456 | searchingFor = word 457 | if self.word_field=="stem": 458 | from nltk import PorterStemmer 459 | searchingFor = PorterStemmer().stem_word(searchingFor) 460 | if self.word_field=="case_insensitive" or self.word_field=="Case_Insensitive": 461 | searchingFor = searchingFor.lower() 462 | 463 | selectString = "SELECT wordid FROM wordsheap WHERE %s = %%s" % self.word_field 464 | cursor = self.db.cursor 465 | try: 466 | cursor.execute(selectString, (searchingFor)) 467 | except MySQLdb.Error, e: 468 | # Return HTML error code and log the following 469 | # print e 470 | # print cursor._last_executed 471 | print '' 472 | for row in cursor.fetchall(): 473 | wordid = row[0] 474 | try: 475 | locallimits['words'+str(n) + ".wordid"] += [wordid] 476 | except KeyError: 477 | locallimits['words'+str(n) + ".wordid"] = [wordid] 478 | self.max_word_length = max(self.max_word_length,n) 479 | 480 | #Strings have already been escaped, so don't need to be escaped again. 481 | if len(locallimits.keys()) > 0: 482 | limits.append(where_from_hash(locallimits,comp = " = ",escapeStrings=False)) 483 | #XXX for backward compatability 484 | self.words_searched = phrase 485 | #XXX end deprecated block 486 | self.wordswhere = "(" + ' OR '.join(limits) + ")" 487 | if limits == []: 488 | #In the case that nothing has been found, tell it explicitly to search for 489 | #a condition when nothing will be found. 490 | self.wordswhere = "words1.wordid=-1" 491 | 492 | 493 | wordlimits = dict() 494 | 495 | limitlist = copy.deepcopy(self.limits.keys()) 496 | 497 | for key in limitlist: 498 | if re.search("words\d",key): 499 | wordlimits[key] = self.limits[key] 500 | self.max_word_length = max(self.max_word_length,2) 501 | del self.limits[key] 502 | 503 | if len(wordlimits.keys()) > 0: 504 | self.wordswhere = where_from_hash(wordlimits) 505 | 506 | return self.wordswhere 507 | 508 | def build_wordstables(self): 509 | #Deduce the words tables we're joining against. The iterating on this can be made more general to get 3 or four grams in pretty easily. 510 | #This relies on a determination already having been made about whether this is a unigram or bigram search; that's reflected in the self.selections 511 | #variable. 512 | 513 | 514 | """ 515 | We also now check for whether it needs the topic assignments: this could be generalized, with difficulty, for any other kind of plugin. 516 | """ 517 | 518 | needsBigrams = (self.max_word_length == 2 or re.search("words2",self.selections)) 519 | needsUnigrams = self.max_word_length == 1 or re.search("[^h][^a][^s]word",self.selections) 520 | needsTopics = bool(re.search("topic",self.selections)) or ("topic" in self.limits.keys()) 521 | 522 | if needsBigrams: 523 | 524 | self.maintable = 'master_bigrams' 525 | 526 | self.main = ''' 527 | JOIN 528 | master_bigrams as main 529 | ON ('''+ self.prefs['fastcat'] +'''.bookid=main.bookid) 530 | ''' 531 | 532 | self.wordstables = """ 533 | JOIN %(wordsheap)s as words1 ON (main.word1 = words1.wordid) 534 | JOIN %(wordsheap)s as words2 ON (main.word2 = words2.wordid) """ % self.__dict__ 535 | 536 | #I use a regex here to do a blanket search for any sort of word limitations. That has some messy sideffects (make sure the 'hasword' 537 | #key has already been eliminated, for example!) but generally works. 
538 | 539 | elif needsTopics and needsUnigrams: 540 | self.maintable = 'master_topicWords' 541 | self.main = ''' 542 | NATURAL JOIN 543 | master_topicWords as main 544 | ''' 545 | self.wordstables = """ 546 | JOIN ( %(wordsheap)s as words1) ON (main.wordid = words1.wordid) 547 | """ % self.__dict__ 548 | 549 | elif needsUnigrams: 550 | self.maintable = 'master_bookcounts' 551 | self.main = ''' 552 | NATURAL JOIN 553 | master_bookcounts as main 554 | ''' 555 | #ON (''' + self.prefs['fastcat'] + '''.bookid=main.bookid)''' 556 | self.wordstables = """ 557 | JOIN ( %(wordsheap)s as words1) ON (main.wordid = words1.wordid) 558 | """ % self.__dict__ 559 | 560 | elif needsTopics: 561 | self.maintable = 'master_topicCounts' 562 | self.main = ''' 563 | NATURAL JOIN 564 | master_topicCounts as main ''' 565 | self.wordstables = " " 566 | self.wordswhere = " TRUE " 567 | 568 | else: 569 | """ 570 | Have _no_ words table if no words searched for or grouped by; 571 | instead just use nwords. This 572 | means that we can use the same basic functions both to build the 573 | counts for word searches and 574 | for metadata searches, which is valuable because there is a 575 | metadata-only search built in to every single ratio 576 | query. (To get the denominator values). 577 | 578 | Call this OLAP, if you like. 579 | """ 580 | self.main = " " 581 | self.operation = ','.join(self.catoperations) 582 | """ 583 | This, above is super important: the operation used is relative to the counttype, and changes to use 'catoperation' instead of 'bookoperation' 584 | That's the place that the denominator queries avoid having to do a table scan on full bookcounts that would take hours, and instead takes 585 | milliseconds. 586 | """ 587 | self.wordstables = " " 588 | self.wordswhere = " TRUE " 589 | #Just a dummy thing to make the SQL writing easier. Shouldn't take any time. Will usually be extended with actual conditions. 590 | 591 | def set_operations(self): 592 | """ 593 | This is the code that allows multiple values to be selected. 594 | 595 | All can be removed when we kill back compatibility ! It's all handled now by the general_API, not the SQL_API. 596 | """ 597 | 598 | 599 | backCompatability = {"Occurrences_per_Million_Words":"WordsPerMillion","Raw_Counts":"WordCount","Percentage_of_Books":"TextPercent","Number_of_Books":"TextCount"} 600 | 601 | for oldKey in backCompatability.keys(): 602 | self.counttype = [re.sub(oldKey,backCompatability[oldKey],entry) for entry in self.counttype] 603 | 604 | self.bookoperation = {} 605 | self.catoperation = {} 606 | self.finaloperation = {} 607 | 608 | #Text statistics 609 | self.bookoperation['TextPercent'] = "count(DISTINCT " + self.prefs['fastcat'] + ".bookid) as TextCount" 610 | self.bookoperation['TextRatio'] = "count(DISTINCT " + self.prefs['fastcat'] + ".bookid) as TextCount" 611 | self.bookoperation['TextCount'] = "count(DISTINCT " + self.prefs['fastcat'] + ".bookid) as TextCount" 612 | 613 | #Word Statistics 614 | self.bookoperation['WordCount'] = "sum(main.count) as WordCount" 615 | self.bookoperation['WordsPerMillion'] = "sum(main.count) as WordCount" 616 | self.bookoperation['WordsRatio'] = "sum(main.count) as WordCount" 617 | 618 | 619 | """ 620 | +Total Numbers for comparisons/significance assessments 621 | This is a little tricky. The total words is EITHER the denominator (as in a query against words per Million) or the numerator+denominator (if you're comparing 622 | Pittsburg and Pittsburgh, say, and want to know the total number of uses of the lemma. 
For now, "TotalWords" means the former and "SumWords" the latter, 623 | On the theory that 'TotalWords' is more intuitive and only I (Ben) will be using SumWords all that much. 624 | """ 625 | self.bookoperation['TotalWords'] = self.bookoperation['WordsPerMillion'] 626 | self.bookoperation['SumWords'] = self.bookoperation['WordsPerMillion'] 627 | self.bookoperation['TotalTexts'] = self.bookoperation['TextCount'] 628 | self.bookoperation['SumTexts'] = self.bookoperation['TextCount'] 629 | 630 | for stattype in self.bookoperation.keys(): 631 | if re.search("Word",stattype): 632 | self.catoperation[stattype] = "sum(nwords) as WordCount" 633 | if re.search("Text",stattype): 634 | self.catoperation[stattype] = "count(nwords) as TextCount" 635 | 636 | self.finaloperation['TextPercent'] = "IFNULL(numerator.TextCount,0)/IFNULL(denominator.TextCount,0)*100 as TextPercent" 637 | self.finaloperation['TextRatio'] = "IFNULL(numerator.TextCount,0)/IFNULL(denominator.TextCount,0) as TextRatio" 638 | self.finaloperation['TextCount'] = "IFNULL(numerator.TextCount,0) as TextCount" 639 | 640 | self.finaloperation['WordsPerMillion'] = "IFNULL(numerator.WordCount,0)*100000000/IFNULL(denominator.WordCount,0)/100 as WordsPerMillion" 641 | self.finaloperation['WordsRatio'] = "IFNULL(numerator.WordCount,0)/IFNULL(denominator.WordCount,0) as WordsRatio" 642 | self.finaloperation['WordCount'] = "IFNULL(numerator.WordCount,0) as WordCount" 643 | 644 | self.finaloperation['TotalWords'] = "IFNULL(denominator.WordCount,0) as TotalWords" 645 | self.finaloperation['SumWords'] = "IFNULL(denominator.WordCount,0) + IFNULL(numerator.WordCount,0) as SumWords" 646 | self.finaloperation['TotalTexts'] = "IFNULL(denominator.TextCount,0) as TotalTexts" 647 | self.finaloperation['SumTexts'] = "IFNULL(denominator.TextCount,0) + IFNULL(numerator.TextCount,0) as SumTexts" 648 | 649 | """ 650 | The values here will be chosen in build_wordstables; that's what decides if it uses the 'bookoperation' or 'catoperation' dictionary to build out. 651 | """ 652 | 653 | self.finaloperations = list() 654 | self.bookoperations = set() 655 | self.catoperations = set() 656 | 657 | for summaryStat in self.counttype: 658 | self.catoperations.add(self.catoperation[summaryStat]) 659 | self.bookoperations.add(self.bookoperation[summaryStat]) 660 | self.finaloperations.append(self.finaloperation[summaryStat]) 661 | 662 | def counts_query(self): 663 | 664 | self.operation = ','.join(self.bookoperations) 665 | self.build_wordstables() 666 | 667 | countsQuery = """ 668 | SELECT 669 | %(selections)s, 670 | %(operation)s 671 | FROM 672 | %(catalog)s 673 | %(main)s 674 | %(wordstables)s 675 | WHERE 676 | %(catwhere)s AND %(wordswhere)s 677 | GROUP BY 678 | %(groupings)s 679 | """ % self.__dict__ 680 | return countsQuery 681 | 682 | def bookid_query(self): 683 | #A temporary method to setup the hasword query. 
684 | self.operation = ','.join(self.bookoperations) 685 | self.build_wordstables() 686 | 687 | countsQuery = """ 688 | SELECT 689 | main.bookid as bookid 690 | FROM 691 | %(catalog)s 692 | %(main)s 693 | %(wordstables)s 694 | WHERE 695 | %(catwhere)s AND %(wordswhere)s 696 | """ % self.__dict__ 697 | return countsQuery 698 | 699 | def debug_query(self): 700 | query = self.ratio_query(materialize = False) 701 | return json.dumps(self.denominator.groupings.split(",")) + query 702 | 703 | def query(self,materialize=False): 704 | """ 705 | We launch a whole new userquery instance here to build the denominator, based on the 'compare_dictionary' option (which in most 706 | cases is the search_limits without the keys, see above; it can also be specially defined using asterisks as a shorthand to identify other fields to drop. 707 | We then get the counts_query results out of that result. 708 | """ 709 | 710 | """ 711 | self.denominator = userquery(outside_dictionary = self.compare_dictionary,db=self.db,databaseScheme=self.databaseScheme) 712 | self.supersetquery = self.denominator.counts_query() 713 | supersetIndices = self.denominator.groupings.split(",") 714 | if materialize: 715 | self.supersetquery = derived_table(self.supersetquery,self.db,indices=supersetIndices).materialize() 716 | """ 717 | self.mainquery = self.counts_query() 718 | self.countcommand = ','.join(self.finaloperations) 719 | self.totalselections = ",".join([group for group in self.outerGroups if group!="1 as In_Library" and group != ""]) 720 | if self.totalselections != "": self.totalselections += ", " 721 | 722 | query = """ 723 | SELECT 724 | %(totalselections)s 725 | %(countcommand)s 726 | FROM 727 | (%(mainquery)s) as numerator 728 | %(joinSuffix)s 729 | GROUP BY %(groupings)s;""" % self.__dict__ 730 | 731 | return query 732 | 733 | 734 | def returnPossibleFields(self): 735 | try: 736 | self.cursor.execute("SELECT name,type,description,tablename,dbname,anchor FROM masterVariableTable WHERE status='public'") 737 | colnames = [line[0] for line in self.cursor.description] 738 | returnset = [] 739 | for line in self.cursor.fetchall(): 740 | thisEntry = {} 741 | for i in range(len(line)): 742 | thisEntry[colnames[i]] = line[i] 743 | returnset.append(thisEntry) 744 | except: 745 | returnset=[] 746 | return returnset 747 | 748 | def return_slug_data(self,force=False): 749 | #Rather than understand this error, I'm just returning 0 if it fails. 750 | #Probably that's the right thing to do, though it may cause trouble later. 751 | #It's just a punishment for not later using a decent API call with "Raw_Counts" to extract these counts out, and relying on this ugly method. 752 | #Please, citizens of the future, NEVER USE THIS METHOD. 
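# Illustrative sketch -- hypothetical numbers, not code from this repository.
# The finaloperations spliced into query() above are plain ratio arithmetic on
# the numerator/denominator counts; e.g. the SQL fragment
#   IFNULL(numerator.WordCount,0)*100000000/IFNULL(denominator.WordCount,0)/100
# is words-per-million with NULL guards. The same computation in Python:
def _words_per_million(numerator_wordcount, denominator_wordcount):
    if not denominator_wordcount:
        return None  # MySQL returns NULL for a zero denominator
    return numerator_wordcount * 1000000.0 / denominator_wordcount

assert _words_per_million(250, 2000000) == 125.0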
753 | try: 754 | temp_words = self.return_n_words(force = True) 755 | temp_counts = self.return_n_books(force = True) 756 | except: 757 | temp_words = 0 758 | temp_counts = 0 759 | return [temp_counts,temp_words] 760 | 761 | def return_n_books(self,force=False): #deprecated 762 | if (not hasattr(self,'nbooks')) or force: 763 | query = "SELECT count(*) from " + self.catalog + " WHERE " + self.catwhere 764 | silent = self.cursor.execute(query) 765 | self.counts = int(self.cursor.fetchall()[0][0]) 766 | return self.counts 767 | 768 | def return_n_words(self,force=False): #deprecated 769 | if (not hasattr(self,'nwords')) or force: 770 | query = "SELECT sum(nwords) from " + self.catalog + " WHERE " + self.catwhere 771 | silent = self.cursor.execute(query) 772 | self.nwords = int(self.cursor.fetchall()[0][0]) 773 | return self.nwords 774 | 775 | def bibliography_query(self,limit = "100"): 776 | #I'd like to redo this at some point so it could work as an API call more naturally. 777 | self.limit = limit 778 | self.ordertype = "sum(main.count*10000/nwords)" 779 | try: 780 | if self.outside_dictionary['ordertype'] == "random": 781 | if self.counttype==["Raw_Counts"] or self.counttype==["Number_of_Books"] or self.counttype==['WordCount'] or self.counttype==['BookCount'] or self.counttype==['TextCount']: 782 | self.ordertype = "RAND()" 783 | else: 784 | #This is a based on an attempt to match various different distributions I found on the web somewhere to give 785 | #weighted results based on the counts. It's not perfect, but might be good enough. Actually doing a weighted random search is not easy without 786 | #massive memory usage inside sql. 787 | self.ordertype = "LOG(1-RAND())/sum(main.count)" 788 | except KeyError: 789 | pass 790 | 791 | #If IDF searching is enabled, we could add a term like '*IDF' here to overweight better selecting words 792 | #in the event of a multiple search. 793 | self.idfterm = "" 794 | prep = self.counts_query() 795 | 796 | 797 | if self.main == " ": 798 | self.ordertype="RAND()" 799 | 800 | bibQuery = """ 801 | SELECT searchstring 802 | FROM """ % self.__dict__ + self.prefs['fullcat'] + """ RIGHT JOIN ( 803 | SELECT 804 | """+ self.prefs['fastcat'] + """.bookid, %(ordertype)s as ordering 805 | FROM 806 | %(catalog)s 807 | %(main)s 808 | %(wordstables)s 809 | WHERE 810 | %(catwhere)s AND %(wordswhere)s 811 | GROUP BY bookid ORDER BY %(ordertype)s DESC LIMIT %(limit)s 812 | ) as tmp USING(bookid) ORDER BY ordering DESC; 813 | """ % self.__dict__ 814 | return bibQuery 815 | 816 | def disk_query(self,limit="100"): 817 | pass 818 | 819 | def return_books(self): 820 | #This preps up the display elements for a search: it returns an array with a single string for each book, sorted in the best possible way 821 | silent = self.cursor.execute(self.bibliography_query()) 822 | returnarray = [] 823 | for line in self.cursor.fetchall(): 824 | returnarray.append(line[0]) 825 | if not returnarray: 826 | #why would someone request a search with no locations? Turns out (usually) because the smoothing tricked them. 827 | returnarray.append("No results for this particular point: try again without smoothing") 828 | newerarray = self.custom_SearchString_additions(returnarray) 829 | return json.dumps(newerarray) 830 | 831 | def search_results(self): 832 | #This is an alias that is handled slightly differently in APIimplementation (no "RESULTS" bit in front). Once 833 | #that legacy code is cleared out, they can be one and the same. 
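# Illustrative sketch -- a toy check, not code from this repository.
# The "LOG(1-RAND())/sum(main.count)" ordering in bibliography_query() above
# appears to be the exponential-clock trick for weight-proportional sampling:
# -log(1-U)/w is an Exponential(w) draw, so ORDER BY log(1-U)/w DESC favors
# heavily-counted books while still being random. A quick empirical check:
import math
import random
from collections import Counter

def _weighted_pick(weights):
    keys = [math.log(1.0 - random.random()) / w for w in weights]
    return max(range(len(weights)), key=lambda i: keys[i])

_tally = Counter(_weighted_pick([1.0, 2.0, 7.0]) for _ in range(10000))
# _tally should split across indices 0, 1, 2 in roughly a 1:2:7 ratio.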
834 | return json.loads(self.return_books()) 835 | 836 | def getActualSearchedWords(self): 837 | if len(self.wordswhere) > 7: 838 | words = self.outside_dictionary['search_limits']['word'] 839 | #Break bigrams into single words. 840 | words = ' '.join(words).split(' ') 841 | self.cursor.execute("""SELECT word FROM wordsheap WHERE """ + where_from_hash({self.word_field:words})) 842 | self.actualWords =[item[0] for item in self.cursor.fetchall()] 843 | else: 844 | self.actualWords = ["tasty","mistake","happened","here"] 845 | 846 | def custom_SearchString_additions(self,returnarray): 847 | """ 848 | It's nice to highlight the words searched for. This will be on partner web sites, so requires custom code for different databases 849 | """ 850 | db = self.outside_dictionary['database'] 851 | if db in ('jstor','presidio','ChronAm','LOC','OL'): 852 | self.getActualSearchedWords() 853 | if db=='jstor': 854 | joiner = "&searchText=" 855 | preface = "?Search=yes&searchText=" 856 | urlRegEx = "http://www.jstor.org/stable/\d+" 857 | if db=='presidio' or db=='OL': 858 | joiner = "+" 859 | preface = "#page/1/mode/2up/search/" 860 | urlRegEx = 'http://archive.org/stream/[^"# ><]*' 861 | if db in ('ChronAm','LOC'): 862 | preface = "/;words=" 863 | joiner = "+" 864 | urlRegEx = 'http://chroniclingamerica.loc.gov[^\"><]*/seq-\d+' 865 | newarray = [] 866 | for string in returnarray: 867 | try: 868 | base = re.findall(urlRegEx,string)[0] 869 | newcore = ' search inside ' 870 | string = re.sub("^","",string) 871 | string = re.sub("$","",string) 872 | string = string+newcore 873 | except IndexError: 874 | pass 875 | newarray.append(string) 876 | #Arxiv is messier, requiring a whole different URL interface: http://search.arxiv.org:8081/paper.jsp?r=1204.3352&qs=netwokr 877 | else: 878 | newarray = returnarray 879 | return newarray 880 | 881 | def return_query_values(self,query = "ratio_query"): 882 | #The API returns a dictionary with years pointing to values. 883 | """ 884 | DEPRECATED: use 'return_json' or 'return_tsv' (the latter only works with single 'search_limits' options) instead 885 | """ 886 | values = [] 887 | querytext = getattr(self,query)() 888 | silent = self.cursor.execute(querytext) 889 | #Gets the results 890 | mydict = dict(self.cursor.fetchall()) 891 | try: 892 | for key in mydict.keys(): 893 | #Only return results inside the time limits 894 | if key >= self.time_limits[0] and key <= self.time_limits[1]: 895 | mydict[key] = str(mydict[key]) 896 | else: 897 | del mydict[key] 898 | mydict = smooth_function(mydict,smooth_method = self.smoothingType,span = self.smoothingSpan) 899 | 900 | except: 901 | mydict = {0:"0"} 902 | 903 | #This is a good place to change some values. 904 | try: 905 | return {'index':self.index, 'Name':self.words_searched,"values":mydict,'words_searched':""} 906 | except: 907 | return{'values':mydict} 908 | 909 | 910 | def return_tsv(self,query = "ratio_query"): 911 | if self.outside_dictionary['counttype']=="Raw_Counts" or self.outside_dictionary['counttype']==["Raw_Counts"]: 912 | query="counts_query" 913 | #This allows much speedier access to counts data if you're willing not to know about all the zeroes. 914 | #Will not work as well once the id_fields are in use. 
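# Illustrative sketch -- a toy class, not code from this repository.
# The getattr(self, query)() call on the next line, like the ones in
# return_query_values() above and execute() below, picks the query to run from
# a method *name*, so the method specified in the passed parameters ends up
# selecting a Python method directly. The pattern in miniature:
class _ToyDispatcher(object):
    def counts_query(self):
        return "SELECT 1"
    def run(self, method_name):
        return getattr(self, method_name)()

assert _ToyDispatcher().run("counts_query") == "SELECT 1"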
915 | querytext = getattr(self,query)() 916 | silent = self.cursor.execute(querytext) 917 | results = ["\t".join([to_unicode(item[0]) for item in self.cursor.description])] 918 | lines = self.cursor.fetchall() 919 | for line in lines: 920 | items = [] 921 | for item in line: 922 | item = to_unicode(item) 923 | item = re.sub("\t","",item) 924 | items.append(item) 925 | results.append("\t".join(items)) 926 | return "\n".join(results) 927 | 928 | def export_data(self,query1="ratio_query"): 929 | self.smoothing=0 930 | return self.return_query_values(query=query1) 931 | 932 | def execute(self): 933 | #This performs the query using the method specified in the passed parameters. 934 | if self.method=="Nothing": 935 | pass 936 | else: 937 | value = getattr(self,self.method)() 938 | return value 939 | 940 | class derived_table(object): 941 | """ 942 | MySQL/MariaDB doesn't have good subquery materialization, 943 | so I'm implementing it by hand. 944 | """ 945 | def __init__(self,SQLstring,db,indices = [],dbToPutIn = "bookworm_scratch"): 946 | """ 947 | initialize with the code to create the table; the database it will be in 948 | (to prevent conflicts with other identical queries in other dbs); 949 | and the list of all tables to be indexed 950 | (optional, but which can really speed up joins) 951 | """ 952 | self.query = SQLstring 953 | self.db = db 954 | #Each query is identified by a unique key hashed 955 | #from the query and the dbname. 956 | self.queryID = dbToPutIn + "." + "derived" + hashlib.sha1(self.query + db.dbname).hexdigest() 957 | self.indices = "(" + ",".join(["INDEX(%s)" % index for index in indices]) + ")" if indices != [] else "" 958 | 959 | def setStorageEngines(self,temp): 960 | """ 961 | Chooses where and how to store tables. 962 | """ 963 | self.tempString = "TEMPORARY" if temp else "" 964 | self.engine = "MEMORY" if temp else "MYISAM" 965 | 966 | def checkCache(self): 967 | """ 968 | Checks what's already been calculated. 969 | """ 970 | try: 971 | (self.count,self.created,self.modified,self.createCode,self.data) = self.db.cursor.execute("SELECT count,created,modified,createCode,data FROM bookworm_scratch.cache WHERE fieldname='%s'" %self.queryID)[0] 972 | return True 973 | except: 974 | (self.count,self.created,self.modified,self.createCode,self.data) = [None]*5 975 | return False 976 | 977 | def fillTableWithData(self,data): 978 | dataCode = "INSERT INTO %s values ("%self.queryID + ", ".join(["%s"]*len(data[0])) + ")" 979 | self.db.cursor.executemany(dataCode,data) 980 | self.db.db.commit() 981 | 982 | def materializeFromCache(self,temp): 983 | if self.data is not None: 984 | #Datacode should never exist without createCode also. 985 | self.db.cursor.execute(self.createCode) 986 | self.fillTableWithData(pickle.loads(self.data,protocol=-1)) 987 | return True 988 | else: 989 | return False 990 | 991 | 992 | def createFromCacheWithDataFromBookworm(self,temp,postDataToCache=False): 993 | """ 994 | If the create code exists but the data does not. 995 | This uses a form of query that MySQL can cache, 996 | unlike the normal subqueries OR the CREATE TABLE ... INSERT 997 | used by materializeFromBookworm. 998 | 999 | You can also post the data itself, but that's turned off by default: 1000 | because why wouldn't it have been posted the first time? 1001 | Probably it's too large or something, is why. 
1002 | """ 1003 | if self.createCode==None: 1004 | return False 1005 | self.db.cursor.execute(self.createCode) 1006 | self.db.cursor.execute(self.query) 1007 | data = [row for row in self.db.cursor.fetchall()] 1008 | self.newdata = pickle.dumps(data,protocol=-1) 1009 | self.fillTableWithData(data) 1010 | if postDataToCache: 1011 | self.updateCache(postQueries=["UPDATE bookworm_scratch.cache SET data='%s' WHERE fieldname='%s'" %(MySQLdb.escape_string(self.newdata),self.queryID)]) 1012 | else: 1013 | self.updateCache() 1014 | return True 1015 | 1016 | def materializeFromBookworm(self,temp,postDataToCache=True,postCreateToCache=True): 1017 | import cPickle as pickle 1018 | self.db.cursor.execute("CREATE %(tempString)s TABLE %(queryID)s %(indices)s ENGINE=%(engine)s %(query)s;" % self.__dict__) 1019 | self.db.cursor.execute("SHOW CREATE TABLE %s" %self.queryID) 1020 | self.newCreateCode = self.db.cursor.fetchall()[0][1] 1021 | self.db.cursor.execute("SELECT * FROM %s" %self.queryID) 1022 | #coerce the results to a list of tuples, then pickle it. 1023 | self.newdata = pickle.dumps([row for row in self.db.cursor.fetchall()],protocol=-1) 1024 | 1025 | if postDataToCache: 1026 | self.updateCache(postQueries=["UPDATE bookworm_scratch.cache SET data='%s',createCode='%s' WHERE fieldname='%s'" %(MySQLdb.escape_string(self.newdata),MySQLdb.escape_string(self.newCreateCode),self.queryID)]) 1027 | 1028 | if postCreateToCache: 1029 | self.updateCache(postQueries=["UPDATE bookworm_scratch.cache SET createCode='%s' WHERE fieldname='%s'" %(MySQLdb.escape_string(self.newCreateCode),self.queryID)]) 1030 | 1031 | 1032 | def updateCache(self,postQueries=[]): 1033 | q1 = """ 1034 | INSERT INTO bookworm_scratch.cache (fieldname,created,modified,count) VALUES 1035 | ('%s',NOW(),NOW(),1) ON DUPLICATE KEY UPDATE count = count + 1,modified=NOW();""" %self.queryID 1036 | result = self.db.cursor.execute(q1) 1037 | for query in postQueries: 1038 | self.db.cursor.execute(query) 1039 | self.db.db.commit() 1040 | 1041 | def materialize(self,temp="default"): 1042 | """ 1043 | materializes the table, by default in memory in the bookworm_scratch 1044 | database. If temp is false, the table will be stored on disk, available 1045 | for future users too. This should be used sparingly, because you can't have too many 1046 | tables on disk. 1047 | 1048 | Returns the tableID, which the superquery to this one may need to know. 1049 | """ 1050 | if temp=="default": 1051 | temp=True 1052 | 1053 | self.checkCache() 1054 | self.setStorageEngines(temp) 1055 | 1056 | try: 1057 | if not self.materializeFromCache(temp): 1058 | if not self.createFromCacheWithDataFromBookworm(temp): 1059 | self.materializeFromBookworm(temp) 1060 | 1061 | except MySQLdb.OperationalError,e: 1062 | #Often the error will be 1050, which is a good thing: 1063 | #It means we don't need to 1064 | #create the table, because it's there already. 1065 | #But if it's not, something bad is happening. 1066 | if not re.search("1050.*already exists",str(e)): 1067 | raise 1068 | 1069 | return self.queryID 1070 | 1071 | class databaseSchema: 1072 | """ 1073 | This class stores information about the database setup that is used to optimize query creation query 1074 | and so that queries know what tables to include. 1075 | It's broken off like this because it might be usefully wrapped around some of the backend features, 1076 | because it shouldn't be run multiple times in a single query (that spawns two instances of itself), as was happening before. 
1077 | 1078 | It's closely related to some of the classes around variables and variableSets in the Bookworm Creation scripts, 1079 | but is kept separate for now: that allows a bit more flexibility, but is probaby a Bad Thing in the long run. 1080 | """ 1081 | 1082 | def __init__(self,db): 1083 | self.db = db 1084 | self.cursor=db.cursor 1085 | #has of what table each variable is in 1086 | self.tableToLookIn = {} 1087 | #hash of what the root variable for each search term is (eg, 'author_birth' might be crosswalked to 'authorid' in the main catalog.) 1088 | self.anchorFields = {} 1089 | #aliases: a hash showing internal identifications codes that dramatically speed up query time, but which shouldn't be exposed. 1090 | #So you can run a search for "state," say, and the database will group on a 50-element integer code instead of a VARCHAR that 1091 | #has to be long enough to support "Massachusetts" and "North Carolina." 1092 | #A couple are hard-coded in, but most are derived by looking for fields that end in the suffix "__id" later. 1093 | 1094 | if self.db.dbname=="presidio": 1095 | self.aliases = {"classification":"lc1","lat":"pointid","lng":"pointid"} 1096 | else: 1097 | self.aliases = dict() 1098 | 1099 | try: 1100 | #First build using the new streamlined tables; if that fails, 1101 | #build using the old version that hits the INFORMATION_SCHEMA, 1102 | #which is bad practice. 1103 | self.newStyle(db) 1104 | except: 1105 | #The new style will fail on old bookworms: a failure is an easy way to test 1106 | #for oldness, though of course something else might be causing the failure. 1107 | self.oldStyle(db) 1108 | 1109 | 1110 | def newStyle(self,db): 1111 | self.tableToLookIn['bookid'] = 'fastcat' 1112 | self.anchorFields['bookid'] = 'fastcat' 1113 | self.anchorFields['wordid'] = 'wordid' 1114 | self.tableToLookIn['wordid'] = 'wordsheap' 1115 | 1116 | 1117 | tablenames = dict() 1118 | tableDepends = dict() 1119 | db.cursor.execute("SELECT dbname,alias,tablename,dependsOn FROM masterVariableTable JOIN masterTableTable USING (tablename);") 1120 | for row in db.cursor.fetchall(): 1121 | (dbname,alias,tablename,dependsOn) = row 1122 | self.tableToLookIn[dbname] = tablename 1123 | self.anchorFields[tablename] = dependsOn 1124 | self.aliases[dbname] = alias 1125 | 1126 | def oldStyle(self,db): 1127 | 1128 | #This is sorted by engine DESC so that memory table locations will overwrite disk table in the hash. 1129 | 1130 | self.cursor.execute("SELECT ENGINE,TABLE_NAME,COLUMN_NAME,COLUMN_KEY,TABLE_NAME='fastcat' OR TABLE_NAME='wordsheap' AS privileged FROM information_schema.COLUMNS JOIN INFORMATION_SCHEMA.TABLES USING (TABLE_NAME,TABLE_SCHEMA) WHERE TABLE_SCHEMA='%(dbname)s' ORDER BY privileged,ENGINE DESC,TABLE_NAME,COLUMN_KEY DESC;" % self.db.__dict__); 1131 | columnNames = self.cursor.fetchall() 1132 | 1133 | parent = 'bookid' 1134 | previous = None 1135 | for databaseColumn in columnNames: 1136 | if previous != databaseColumn[1]: 1137 | if databaseColumn[3]=='PRI' or databaseColumn[3]=='MUL': 1138 | parent = databaseColumn[2] 1139 | previous = databaseColumn[1] 1140 | else: 1141 | parent = 'bookid' 1142 | else: 1143 | self.anchorFields[databaseColumn[2]] = parent 1144 | if databaseColumn[3]!='PRI' and databaseColumn[3]!="MUL": #if it's a primary key, this isn't the right place to find it. 
1145 | self.tableToLookIn[databaseColumn[2]] = databaseColumn[1] 1146 | if re.search('__id\*?$',databaseColumn[2]): 1147 | self.aliases[re.sub('__id','',databaseColumn[2])]=databaseColumn[2] 1148 | 1149 | try: 1150 | cursor = self.cursor.execute("SELECT dbname,tablename,anchor,alias FROM masterVariableTables") 1151 | for row in cursor.fetchall(): 1152 | if row[0] != row[3]: 1153 | self.aliases[row[0]] = row[3] 1154 | if row[0] != row[2]: 1155 | self.anchorFields[row[0]] = row[2] 1156 | #Should be uncommented, but some temporary issues with the building script 1157 | #self.tableToLookIn[row[0]] = row[1] 1158 | except: 1159 | pass 1160 | self.tableToLookIn['bookid'] = 'fastcat' 1161 | self.anchorFields['bookid'] = 'fastcat' 1162 | self.anchorFields['wordid'] = 'wordid' 1163 | self.tableToLookIn['wordid'] = 'wordsheap' 1164 | ############# 1165 | ##GENERAL#### #These are general purpose functional types of things not implemented in the class. 1166 | ############# 1167 | 1168 | def to_unicode(obj, encoding='utf-8'): 1169 | if isinstance(obj, basestring): 1170 | if not isinstance(obj, unicode): 1171 | obj = unicode(obj, encoding) 1172 | elif isinstance(obj,int): 1173 | obj=unicode(str(obj),encoding) 1174 | else: 1175 | obj = unicode(str(obj),encoding) 1176 | return obj 1177 | 1178 | def where_from_hash(myhash,joiner=" AND ",comp = " = ",escapeStrings=True): 1179 | whereterm = [] 1180 | #The general idea here is that we try to break everything in search_limits down to a list, and then create a whereterm on that joined by whatever the 'joiner' is ("AND" or "OR"), with the comparison as whatever comp is ("=",">=",etc.). 1181 | #For more complicated bits, it gets all recursive until the bits are all in terms of list. 1182 | for key in myhash.keys(): 1183 | values = myhash[key] 1184 | if isinstance(values,basestring) or isinstance(values,int) or isinstance(values,float): 1185 | #This is just human-being handling. You can pass a single value instead of a list if you like, and it will just convert it 1186 | #to a list for you. 1187 | values = [values] 1188 | #Or queries are special, since the default is "AND". This toggles that around for a subportion. 1189 | if key=='$or' or key=="$OR": 1190 | for comparison in values: 1191 | whereterm.append(where_from_hash(comparison,joiner=" OR ",comp=comp)) 1192 | #The or doesn't get populated any farther down. 1193 | elif isinstance(values,dict): 1194 | #Certain function operators can use MySQL terms. These are the only cases that a dict can be passed as a limitations 1195 | operations = {"$gt":">","$ne":"!=","$lt":"<","$grep":" REGEXP ","$gte":">=","$lte":"<=","$eq":"="} 1196 | for operation in values.keys(): 1197 | whereterm.append(where_from_hash({key:values[operation]},comp=operations[operation],joiner=joiner)) 1198 | elif isinstance(values,list): 1199 | #and this is where the magic actually happens: the cases where the key is a string, and the target is a list. 1200 | if isinstance(values[0],dict): 1201 | # If it's a list of dicts, then there's one thing that happens. Currently all types are assumed to be the same: 1202 | # you couldn't pass in, say {"year":[{"$gte":1900},1898]} to catch post-1898 years except for 1899. Not that you 1203 | # should need to. 1204 | for entry in values: 1205 | whereterm.append(where_from_hash(entry)) 1206 | else: 1207 | #Note that about a third of the code is spent on escaping strings. 
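# Illustrative sketch -- a stripped-down toy, not the real where_from_hash().
# The escaping branch continues just below; the overall effect of the function
# is to turn a search_limits hash into a parenthesized SQL boolean, so that an
# input shaped like
#   {"author": ["Hamilton", "Madison"], "year": {"$gte": 1800}}
# comes out roughly as
#   ( ((author = 'Hamilton') OR (author = 'Madison')) AND ((year >= 1800)) )
# (exact whitespace and nesting differ). The same recursion in miniature:
def _toy_where(limits, joiner=" AND ", comp=" = "):
    ops = {"$gt": ">", "$gte": ">=", "$lt": "<", "$lte": "<=", "$ne": "!=", "$eq": "="}
    terms = []
    for key, values in limits.items():
        if isinstance(values, dict):
            for op in values:
                terms.append(_toy_where({key: values[op]}, joiner, ops[op]))
        elif isinstance(values, list):
            quote = "'" if isinstance(values[0], str) else ""
            terms.append("(" + " OR ".join(
                "(%s%s%s%s%s)" % (key, comp, quote, v, quote) for v in values) + ")")
        else:
            terms.append("(%s%s%s)" % (key, comp, values))
    return "(" + joiner.join(terms) + ")"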
1208 | if escapeStrings: 1209 | if isinstance(values[0],basestring): 1210 | quotesep="'" 1211 | else: 1212 | quotesep = "" 1213 | def escape(value): return MySQLdb.escape_string(to_unicode(value)) 1214 | else: 1215 | def escape(value): return to_unicode(value) 1216 | quotesep="" 1217 | 1218 | #Note the "OR" here. There's no way to pass in a query like "year=1876 AND year=1898" as currently set up. 1219 | #Obviously that's no great loss, but there might be something I'm missing that would be desire a similar format somehow. 1220 | #(In cases where the same book could have two different years associated with it) 1221 | whereterm.append(" (" + " OR ".join([" (" + key+comp+quotesep+escape(value)+quotesep+") " for value in values])+ ") ") 1222 | return "(" + joiner.join(whereterm) + ")" 1223 | #This works pretty well, except that it requires very specific sorts of terms going in, I think. 1224 | 1225 | 1226 | 1227 | #I'd rather have all this smoothing stuff done at the client side, but currently it happens here. 1228 | 1229 | #The idea is: this works by default by slurping up from the command line, but you could also load the functions in and run results on your own queries. 1230 | try: 1231 | command = str(sys.argv[1]) 1232 | command = json.loads(command) 1233 | #Got to go before we let anything else happen. 1234 | p = userqueries(command) 1235 | result = p.execute() 1236 | print json.dumps(result) 1237 | except: 1238 | pass 1239 | 1240 | 1241 | 1242 | def debug(string): 1243 | """ 1244 | Makes it easier to debug through a web browser by handling the headers. 1245 | Despite being called a `string`, it can be anything that python can print. 1246 | """ 1247 | print headers('1') 1248 | print "
" 1249 | print string 1250 | print "
" 1251 | -------------------------------------------------------------------------------- /bookworm/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Bookworm-project/BookwormAPI/faac096f74a86ca7a9c8b4e02a3aacfa1f5f7b76/bookworm/__init__.py -------------------------------------------------------------------------------- /bookworm/general_API.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | import MySQLdb 4 | from pandas import merge 5 | from pandas.io.sql import read_sql 6 | from pandas import set_option 7 | from SQLAPI import * 8 | from copy import deepcopy 9 | from collections import defaultdict 10 | import ConfigParser 11 | import os.path 12 | 13 | #Some settings can be overridden here, if no where else. 14 | prefs = dict() 15 | 16 | def find_my_cnf(): 17 | """ 18 | The password will be looked for in these places. 19 | """ 20 | 21 | for file in ["etc/bookworm/my.cnf","/etc/my.cnf","/etc/mysql/my.cnf","/root/.my.cnf"]: 22 | if os.path.exists(file): 23 | return file 24 | 25 | class dbConnect(object): 26 | #This is a read-only account 27 | def __init__(self,prefs=prefs,database="federalist",host="localhost"): 28 | self.dbname = database 29 | 30 | #For back-compatibility: 31 | if "HOST" in prefs: 32 | host=prefs['HOST'] 33 | 34 | self.db = MySQLdb.connect(host=host, 35 | db=database, 36 | read_default_file = find_my_cnf(), 37 | use_unicode='True', 38 | charset='utf8') 39 | 40 | self.cursor = self.db.cursor() 41 | 42 | def calculateAggregates(df,parameters): 43 | 44 | """ 45 | We only collect "WordCoun" and "TextCount" for each query, 46 | but there are a lot of cool things you can do with those: 47 | basic things like frequency, all the way up to TF-IDF. 48 | """ 49 | parameters = set(parameters) 50 | 51 | if "WordsPerMillion" in parameters: 52 | df["WordsPerMillion"] = df["WordCount_x"].multiply(1000000)/df["WordCount_y"] 53 | if "WordCount" in parameters: 54 | df["WordCount"] = df["WordCount_x"] 55 | if "TotalWords" in parameters: 56 | df["TotalWords"] = df["WordCount_y"] 57 | if "SumWords" in parameters: 58 | df["SumWords"] = df["WordCount_y"] + df["WordCount_x"] 59 | if "WordsRatio" in parameters: 60 | df["WordsRatio"] = df["WordCount_x"]/df["WordCount_y"] 61 | 62 | if "TextPercent" in parameters: 63 | df["TextPercent"] = 100*df["TextCount_x"].divide(df["TextCount_y"]) 64 | if "TextCount" in parameters: 65 | df["TextCount"] = df["TextCount_x"] 66 | if "TotalTexts" in parameters: 67 | df["TotalTexts"] = df["TextCount_y"] 68 | 69 | if "HitsPerBook" in parameters: 70 | df["HitsPerMatch"] = df["WordCount_x"]/df["TextCount_x"] 71 | 72 | if "TextLength" in parameters: 73 | df["HitsPerMatch"] = df["WordCount_y"]/df["TextCount_y"] 74 | 75 | if "TFIDF" in parameters: 76 | from numpy import log as log 77 | df.eval("TF = WordCount_x/WordCount_y") 78 | df["TFIDF"] = (df["WordCount_x"]/df["WordCount_y"])*log(df["TextCount_y"]/df['TextCount_x']) 79 | 80 | def DunningLog(df=df,a = "WordCount_x",b = "WordCount_y"): 81 | from numpy import log as log 82 | destination = "Dunning" 83 | df[a] = df[a].replace(0,1) 84 | df[b] = df[b].replace(0,1) 85 | if a=="WordCount_x": 86 | # Dunning comparisons should be to the sums if counting: 87 | c = sum(df[a]) 88 | d = sum(df[b]) 89 | if a=="TextCount_x": 90 | # The max count isn't necessarily the total number of books, but it's a decent proxy. 
91 | c = max(df[a]) 92 | d = max(df[b]) 93 | expectedRate = (df[a] + df[b]).divide(c+d) 94 | E1 = c*expectedRate 95 | E2 = d*expectedRate 96 | diff1 = log(df[a].divide(E1)) 97 | diff2 = log(df[b].divide(E2)) 98 | df[destination] = 2*(df[a].multiply(diff1) + df[b].multiply(diff2)) 99 | # A hack, but a useful one: encode the direction of the significance, 100 | # in the sign, so negative 101 | difference = diff1 0: 275 | merged = merge(df1,df2,on=intersections,how='outer') 276 | else: 277 | """ 278 | Pandas doesn't seem to have a full, unkeyed merge, so I simulate it with a dummy. 279 | """ 280 | df1['dummy_merge_variable'] = 1 281 | df2['dummy_merge_variable'] = 1 282 | merged = merge(df1,df2,on=["dummy_merge_variable"],how='outer') 283 | 284 | merged = merged.fillna(int(0)) 285 | 286 | calculations = self.query['counttype'] 287 | 288 | calcced = calculateAggregates(merged,calculations) 289 | 290 | calcced = calcced.fillna(int(0)) 291 | 292 | final_DataFrame = calcced[self.query['groups'] + self.query['counttype']] 293 | 294 | return final_DataFrame 295 | 296 | def execute(self): 297 | method = self.query['method'] 298 | 299 | 300 | if isinstance(self.query['search_limits'],list): 301 | if self.query['method'] not in ["json","return_json"]: 302 | self.query['search_limits'] = self.query['search_limits'][0] 303 | else: 304 | return self.multi_execute() 305 | 306 | if method=="return_json" or method=="json": 307 | frame = self.data() 308 | return self.return_json() 309 | 310 | if method=="return_tsv" or method=="tsv": 311 | import csv 312 | frame = self.data() 313 | return frame.to_csv(sep="\t",encoding="utf8",index=False,quoting=csv.QUOTE_NONE,escapechar="\\") 314 | 315 | if method=="return_pickle" or method=="DataFrame": 316 | frame = self.data() 317 | from cPickle import dumps as pickleDumps 318 | return pickleDumps(frame,protocol=-1) 319 | 320 | # Temporary catch-all pushes to the old methods: 321 | if method in ["returnPossibleFields","search_results","return_books"]: 322 | query = userquery(self.query) 323 | if method=="return_books": 324 | return query.execute() 325 | return json.dumps(query.execute()) 326 | 327 | 328 | 329 | def multi_execute(self): 330 | """ 331 | Queries may define several search limits in an array 332 | if they use the return_json method. 333 | """ 334 | returnable = [] 335 | for limits in self.query['search_limits']: 336 | child = deepcopy(self.query) 337 | child['search_limits'] = limits 338 | returnable.append(self.__class__(child).return_json(raw_python_object=True)) 339 | 340 | return json.dumps(returnable) 341 | 342 | def return_json(self,raw_python_object=False): 343 | query = self.query 344 | data = self.data() 345 | 346 | 347 | def fixNumpyType(input): 348 | #This is, weirdly, an occasional problem but not a constant one. 349 | if str(input.dtype)=="int64": 350 | return int(input) 351 | else: 352 | return input 353 | 354 | #Define a recursive structure to hold the stuff. 355 | def tree(): 356 | return defaultdict(tree) 357 | returnt = tree() 358 | 359 | import numpy as np 360 | 361 | for row in data.itertuples(index=False): 362 | row = list(row) 363 | destination = returnt 364 | if len(row)==len(query['counttype']): 365 | returnt = [fixNumpyType(num) for num in row] 366 | while len(row) > len(query['counttype']): 367 | key = row.pop(0) 368 | if len(row) == len(query['counttype']): 369 | # Assign the elements. 370 | destination[key] = row 371 | break 372 | # This bit of the loop is where we descend the recursive dictionary. 
373 | destination = destination[key] 374 | if raw_python_object: 375 | return returnt 376 | 377 | try: 378 | return json.dumps(returnt,allow_nan=False) 379 | except ValueError: 380 | return json.dumps(returnt) 381 | kludge = json.dumps(returnt) 382 | kludge = kludge.replace("Infinity","null") 383 | print kludge 384 | 385 | class SQLAPIcall(APIcall): 386 | """ 387 | To make a new backend for the API, you just need to extend the base API call 388 | class like this. 389 | 390 | This one is comically short because all the real work is done in the userquery object. 391 | 392 | But the point is, you need to define a function "generate_pandas_frame" 393 | that accepts an API call and returns a pandas frame. 394 | 395 | But that API call is more limited than the general API; you only need to support "WordCount" and "TextCount" 396 | methods. 397 | """ 398 | 399 | def generate_pandas_frame(self,call): 400 | """ 401 | 402 | This is good example of the query that actually fetches the results. 403 | It creates some SQL, runs it, and returns it as a pandas DataFrame. 404 | 405 | The actual SQL production is handled by the userquery class, which uses more 406 | legacy code. 407 | 408 | """ 409 | con=dbConnect(prefs,self.query['database']) 410 | q = userquery(call).query() 411 | if self.query['method']=="debug": 412 | print q 413 | df = read_sql(q, con.db) 414 | return df 415 | 416 | 417 | -------------------------------------------------------------------------------- /bookworm/knownHosts.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """ 4 | Whenever you add a new bookworm to your server, remember to add a line for it in this file. 5 | 6 | However, always keep the 'default' line listed below in this file. 
7 | """ 8 | 9 | general_prefs = dict() 10 | general_prefs["default"] = {"fastcat": "fastcat", "HOST": "localhost", "separateDataTables": [], "fastword": "wordsheap", "database": "YourDatabaseNameHere", "read_url_head": "THIS_CAN_BE_ANYTHING...ITS_NOT_USED_ANYMORE", "fullcat": "catalog", "fullword": "words", "read_default_file": "/etc/mysql/my.cnf"} 11 | -------------------------------------------------------------------------------- /bookworm/logParser.py: -------------------------------------------------------------------------------- 1 | import urllib 2 | import os 3 | import re 4 | import gzip 5 | import json 6 | import sys 7 | 8 | files = os.listdir("/var/log/apache2") 9 | 10 | words = [] 11 | 12 | for file in files: 13 | reading = None 14 | if re.search("^access.log..*.gz",file): 15 | reading = gzip.open("/var/log/apache2/" + file) 16 | elif re.search("^access.log.*",file): 17 | reading = open("/var/log/apache2/" + file) 18 | else: 19 | continue 20 | sys.stderr.write(file + "\n") 21 | 22 | for line in reading: 23 | matches = re.findall(r"([0-9\.]+).*\[(.*)].*cgi-bin/dbbindings.py/?.query=([^ ]+)",line) 24 | for fullmatch in matches: 25 | t = dict() 26 | t['ip'] = fullmatch[0] 27 | match = fullmatch[2] 28 | try: 29 | data = json.loads(urllib.unquote(match).decode('utf8')) 30 | except ValueError: 31 | continue 32 | try: 33 | if isinstance(data['search_limits'],dict): 34 | data['search_limits'] = [data['search_limits']] 35 | for setting in ['words_collation','database']: 36 | try: 37 | t[setting] = data[setting] 38 | except KeyError: 39 | t[setting] = "" 40 | for limit in data['search_limits']: 41 | p = dict() 42 | for constraint in ["word","TV_show","director"]: 43 | try: 44 | p[constraint] = p[constraint] + "," + (",".join(limit[constraint])) 45 | except KeyError: 46 | try: 47 | p[constraint] = (",".join(limit[constraint])) 48 | except KeyError: 49 | p[constraint] = "" 50 | for key in p.keys(): 51 | t[key] = p[key] 52 | vals = [t[key] for key in ('ip','database','words_collation','word','TV_show','director')] 53 | print "\t".join(vals).encode("utf-8") 54 | 55 | 56 | except KeyError: 57 | raise 58 | 59 | print len(words) 60 | -------------------------------------------------------------------------------- /dbbindings.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | 4 | #So we load in the terms that allow the API implementation to happen for now. 5 | from datetime import datetime 6 | from bookworm.general_API import * 7 | import os 8 | import cgitb 9 | #import MySQLdb 10 | cgitb.enable() 11 | 12 | def headers(method): 13 | if method!="return_tsv": 14 | print "Content-type: text/html\n" 15 | 16 | elif method=="return_tsv": 17 | print "Content-type: text; charset=utf-8" 18 | print "Content-Disposition: filename=Bookworm-data.txt" 19 | print "Pragma: no-cache" 20 | print "Expires: 0\n" 21 | 22 | def debug(string): 23 | """ 24 | Makes it easier to debug through a web browser by handling the headers 25 | No calls should be permanently left in the code ever, or they will break things badly. 26 | """ 27 | print headers('1') 28 | print "
" 29 | print string 30 | print "
" 31 | 32 | 33 | def main(JSONinput): 34 | 35 | query = JSONinput 36 | 37 | try: 38 | #Whether there are multiple search terms, as in the highcharts method. 39 | usingSuccinctStyle = isinstance(query['search_limits'],dict) 40 | except: 41 | #If there are no search limits, it might be a returnPossibleFields query 42 | usingSuccinctStyle = True 43 | 44 | headers(query['method']) 45 | 46 | p = SQLAPIcall(query) 47 | 48 | result = p.execute() 49 | print result 50 | 51 | return True 52 | 53 | 54 | if __name__=="__main__": 55 | form = cgi.FieldStorage() 56 | 57 | #Still supporting two names for the passed parameter. 58 | try: 59 | JSONinput = form["queryTerms"].value 60 | except KeyError: 61 | JSONinput = form["query"].value 62 | 63 | main(json.loads(JSONinput)) 64 | 65 | 66 | -------------------------------------------------------------------------------- /testAPI.py: -------------------------------------------------------------------------------- 1 | import dbbindings 2 | import unittest 3 | import bookworm.general_API as general_API 4 | import bookworm.SQLAPI as SQLAPI 5 | 6 | class SQLfunction(unittest.TestCase): 7 | 8 | def test1(self): 9 | 10 | query = { 11 | "database": "movies", 12 | "method": "return_json", 13 | "search_limits": {"MovieYear":1900}, 14 | "counttype": "WordCount", 15 | "groups": ["TV_show"] 16 | } 17 | 18 | 19 | f = SQLAPI.userquery(query).query() 20 | print f 21 | 22 | 23 | class SQLConnections(unittest.TestCase): 24 | def dbConnectorsWork(self): 25 | from general_API import prefs as prefs 26 | connection = general_API.dbConnect(prefs,"federalist") 27 | tables = connection.cursor.execute("SHOW TABLES") 28 | self.assertTrue(connection.dbname=="federalist") 29 | 30 | def test1(self): 31 | query = { 32 | "database":"federalist", 33 | "search_limits":{}, 34 | "counttype":"TextPercent", 35 | "groups":["author"], 36 | "method":"return_json" 37 | } 38 | 39 | try: 40 | dbbindings.main(query) 41 | worked = True 42 | except: 43 | worked = False 44 | 45 | self.assertTrue(worked) 46 | 47 | def test2(self): 48 | query = { 49 | "database":"federalist", 50 | "search_limits":{"author":"Hamilton"}, 51 | "compare_limits":{"author":"Madison"}, 52 | "counttype":"Dunning", 53 | "groups":["unigram"], 54 | "method":"return_json" 55 | } 56 | 57 | 58 | try: 59 | #dbbindings.main(query) 60 | worked = True 61 | except: 62 | worked = False 63 | 64 | self.assertTrue(worked) 65 | 66 | if __name__=="__main__": 67 | unittest.main() 68 | --------------------------------------------------------------------------------