├── .gitignore
├── LICENSE.md
├── Makefile
├── README.md
├── bookworm
│   ├── #APIimplementation.py#
│   ├── .gitignore
│   ├── APIimplementation.py
│   ├── MetaWorm.py
│   ├── SQLAPI.py
│   ├── __init__.py
│   ├── general_API.py
│   ├── knownHosts.py
│   └── logParser.py
├── dbbindings.py
└── testAPI.py
/.gitignore:
--------------------------------------------------------------------------------
1 | old/*
2 | *~
3 | APIkeys
4 | #*
5 | .#*
6 | .DS_Store
7 | *.cgi
8 | migration.py
9 | shipping.py
10 | genderizer*
11 | *.pyc
12 |
--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
1 | The MIT License (MIT)
2 |
3 | Copyright (c) 2014 Benjamin Schmidt
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy of
6 | this software and associated documentation files (the "Software"), to deal in
7 | the Software without restriction, including without limitation the rights to
8 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
9 | the Software, and to permit persons to whom the Software is furnished to do so,
10 | subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
17 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
18 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
19 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
20 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
21 |
--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
1 | ubuntu-install:
2 | apt-get install python-numpy python-mysqldb
3 | mkdir -p /var/log/presidio
4 | touch /var/log/presidio/log.txt
5 | chown -R www-data:www-data /var/log/presidio
6 | mv ./*.py /usr/lib/cgi-bin/
7 | chmod -R 755 /usr/lib/cgi-bin
8 |
9 | os-x-install:
10 | brew install python-numpy python-mysqldb
11 | mkdir -p /var/log/presidio
12 | touch /var/log/presidio/log.txt
13 | chown -R www /var/log/presidio
14 | chmod -R 755 /usr/lib/cgi-bin
15 | mkdir -p /etc/mysql
16 | ln -s /etc/my.cnf /etc/mysql/my.cnf
17 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## Bookworm API
2 |
3 | **This entire repo is deprecated: the API is now bundled inside the [BookwormDB](http://github.com/bookworm-project/bookwormDB) repo**
4 |
5 |
6 | This is an implementation of the API for Bookworm, written in Python. It currently implements the API on top of a MySQL database, but includes classes that make it easier to implement on other platforms (such as Solr).
7 |
8 | It is used with the [Bookworm GUI](https://github.com/Bookworm-project/BookwormGUI) and can also be used as a standalone tool to query data from your database created by [the BookwormDB repo](https://github.com/Bookworm-project/BookwormDB).
9 | For a more interactive explanation of how the GUI works, see the [D3 bookworm browser](http://benschmidt.org/beta/APISandbox).
10 |
11 | ### General Description
12 |
13 | A single file, currently `dbbindings.py`, calls the module `bookworm/general_API.py`, which implements a general-purpose API; further modules can then implement that API on specific backends. Currently, the only backend is the one for MySQL databases created by [the database repo](http://github.com/bookworm-project/BookwormDB).
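
As an illustration (the keys shown below are ones used in the code; the particular values are only examples), a query is a JSON dictionary roughly along these lines:

```python
# Illustrative query dictionary only; see the defaults in the code for the full set of options.
query = {
    "database": "presidio",                        # which bookworm to run against
    "search_limits": [{"word": ["polka dot"], "LCSH": ["Fiction"]}],
    "groups": ["year"],                            # what to select and group by
    "counttype": "Occurrences_per_Million_Words",  # how to count
    "method": "return_json"                        # how to format the result
}
```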
14 |
15 |
16 | ### Installation
17 |
18 | Currently, you should just clone this repo into your cgi-bin directory, and make sure that `dbbindings.py` is executable.
19 |
20 | #### OS X caveat.
21 |
22 | If you are using Homebrew, the shebang at the beginning of `dbbindings.py` is incorrect (it will not load your installed Python modules). Change it from `#!/usr/bin/env python` to `#!/usr/local/bin/python`, and it should work.
23 |
24 | ### Usage
25 |
26 | If the bookworm is located on your server, there is no need to do anything else--it should be drag-and-drop. (Although on anything but Debian, the settings might require a small amount of tweaking.)
27 |
28 | If you want to have the webserver and the database server on different machines, the database host needs to be specified in the MySQL configuration file that this reads (by default `/etc/mysql/my.cnf`); if you want to have multiple MySQL servers, you may need to get fancier.
29 |
30 | That setting tells the API where to look for the data for a particular bookworm; the benefit of this setup is that the webserver and the database can live on separate machines.
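
For reference, here is a minimal sketch of how the connection is made (based on the `dbConnect` class in `bookworm/APIimplementation.py`; the host and database names below are placeholders):

```python
# Sketch of the connection logic; the real values come from the prefs dictionary and
# from the MySQL option file named in read_default_file (host, user, password, etc.).
import MySQLdb

db = MySQLdb.connect(
    host="your.database.server",             # placeholder: the machine running MySQL
    read_default_file="/etc/mysql/my.cnf",   # the configuration file mentioned above
    use_unicode="True",
    charset="utf8",
    db="presidio",                           # placeholder: the bookworm database name
)
cursor = db.cursor()
```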
31 |
32 |
--------------------------------------------------------------------------------
/bookworm/#APIimplementation.py#:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 |
3 | import sys
4 | import json
5 | import cgi
6 | import re
7 | import numpy #used for smoothing.
8 | import copy
9 |
10 | #These are here so we can support multiple databases with different naming schemes from a single API. A bit ugly to have here; could be part of a configuration file somewhere else, I guess. There are 'fast' and 'full' tables for books and words;
11 | #that's so memory tables can be used in certain cases for fast, hashed matching, but longer form data (like book titles)
12 | #can be stored on disk. Different queries use different types of calls.
13 | #Also, certain metadata fields are stored separately from the main catalog table; I list them manually here to avoid a database call to find out what they are,
14 | #although the latter would be more elegant. The way to do that would be a database call
15 | #of tables with two columns one of which is 'bookid', maybe, or something like that.
16 | #(Or to add it as error handling when a query fails; only then check for missing files.)
17 |
18 | general_prefs = {"presidio":{"HOST":"melville.seas.harvard.edu","database":"presidio","fastcat":"fastcat","fullcat":"open_editions","fastword":"wordsheap","read_default_file":"/etc/mysql/my.cnf","fullword":"words","separateDataTables":["LCSH","gender"],"read_url_head":"http://www.archive.org/stream/"},"arxiv":{"HOST":"10.102.15.45","database":"arxiv","fastcat":"fastcat","fullcat":"catalog","fastword":"wordsheap","fullword":"words","read_default_file":"/etc/mysql/my.cnf","separateDataTables":["genre","fastgenre","archive","subclass"],"read_url_head":"http://www.arxiv.org/abs/"},"jstor":{"HOST":"10.102.15.45","database":"jstor","fastcat":"fastcat","fullcat":"catalog","fastword":"wordsheap","fullword":"words","read_default_file":"/etc/mysql/my.cnf","separateDataTables":["discipline"],"read_url_head":"http://www.arxiv.org/abs/"}, "politweets":{"HOST":"chaucer.fas.harvard.edu","database":"politweets","fastcat":"fastcat","fullcat":"catalog","fastword":"wordsheap","fullword":"words","read_default_file":"/etc/mysql/my.cnf","separateDataTables":[],"read_url_head":"http://www.arxiv.org/abs/"},"LOC":{"HOST":"10.102.15.45","database":"LOC","fastcat":"fastcat","fullcat":"catalog","fastword":"wordsheap","fullword":"words","read_default_file":"/etc/mysql/my.cnf","separateDataTables":[],"read_url_head":"http://www.arxiv.org/abs/"},"ChronAm":{"HOST":"10.102.15.45","database":"LOC","fastcat":"fastcat","fullcat":"catalog","fastword":"wordsheap","fullword":"words","read_default_file":"/etc/mysql/my.cnf","separateDataTables":[],"read_url_head":"http://www.arxiv.org/abs/"}}
19 | #We define prefs to default to the Open Library set at first; later, it can do other things.
20 |
21 | class dbConnect():
22 | #This is a read-only account
23 | def __init__(self,prefs = general_prefs['presidio']):
24 | import MySQLdb
25 | self.db = MySQLdb.connect(host=prefs['HOST'],read_default_file = prefs['read_default_file'],use_unicode='True',charset='utf8',db=prefs['database'])
26 | self.cursor = self.db.cursor()
27 |
28 |
29 | # The basic object here is a userquery: it takes dictionary as input, as defined in the API, and returns a value
30 | # via the 'execute' function whose behavior
31 | # depends on the mode that is passed to it.
32 | # Given the dictionary, it can return a number of objects.
33 | # The "Search_limits" array in the passed dictionary determines how many elements it returns; this lets multiple queries be bundled together.
34 | # Most functions describe a subquery that might be combined into one big query in various ways.
35 |
36 | class userqueries():
37 | #This is a set of queries that are bound together; each element in search limits is iterated over, and we're done.
38 | def __init__(self,outside_dictionary = {"counttype":"Percentage_of_Books","search_limits":[{"word":["polka dot"],"LCSH":["Fiction"]}]},db = None):
39 | #coerce one-element dictionaries to an array.
40 | self.database = outside_dictionary.setdefault('database','presidio')
41 | prefs = general_prefs[self.database]
42 | self.prefs = prefs
43 | self.wordsheap = prefs['fastword']
44 | self.words = prefs['fullword']
45 | if 'search_limits' not in outside_dictionary.keys():
46 | outside_dictionary['search_limits'] = [{}]
47 | if isinstance(outside_dictionary['search_limits'],dict):
48 | #(allowing passing of just single dictionaries instead of arrays)
49 | outside_dictionary['search_limits'] = [outside_dictionary['search_limits']]
50 | self.returnval = []
51 | self.queryInstances = []
52 | for limits in outside_dictionary['search_limits']:
53 | mylimits = outside_dictionary
54 | mylimits['search_limits'] = limits
55 | localQuery = userquery(mylimits)
56 | self.queryInstances.append(localQuery)
57 | self.returnval.append(localQuery.execute())
58 |
59 | def execute(self):
60 | return self.returnval
61 |
62 | class userquery():
63 | def __init__(self,outside_dictionary = {"counttype":"Percentage_of_Books","search_limits":{"word":["polka dot"],"LCSH":["Fiction"]}}):
64 | #Certain constructions require a DB connection already available, so we just start it here, or use the one passed to it.
65 | self.outside_dictionary = outside_dictionary
66 | self.prefs = general_prefs[outside_dictionary.setdefault('database','presidio')]
67 | self.db = dbConnect(self.prefs)
68 | self.cursor = self.db.cursor
69 | self.wordsheap = self.prefs['fastword']
70 | self.words = self.prefs['fullword']
71 |
72 | #I'm now allowing 'search_limits' to either be a dictionary or an array of dictionaries:
73 | #this makes the syntax cleaner on most queries,
74 | #while still allowing some more complicated ones.
75 | if isinstance(outside_dictionary['search_limits'],list):
76 | outside_dictionary['search_limits'] = outside_dictionary['search_limits'][0]
77 | self.defaults(outside_dictionary) #Take some defaults
78 | self.derive_variables() #Derive some useful variables that the query will use.
79 |
80 | def defaults(self,outside_dictionary):
81 | #these are default values;these are the only values that can be set in the query
82 | #search_limits is an array of dictionaries;
83 | #each one contains a set of limits that are mutually independent
84 | #The other limitations are universal for all the search limits being set.
85 |
86 | #Set up a dictionary for the denominator of any fraction if it doesn't already exist:
87 | self.search_limits = outside_dictionary.setdefault('search_limits',[{"word":["polka dot"]}])
88 |
89 | self.words_collation = outside_dictionary.setdefault('words_collation',"Case_Insensitive")
90 | lookups = {"Case_Insensitive":'word',"case_insensitive":"word","Case_Sensitive":"casesens","Correct_Medial_s":'ffix',"All_Words_with_Same_Stem":"stem","Flagged":'wflag'}
91 | self.word_field = lookups[self.words_collation]
92 |
93 | self.groups = []
94 | try:
95 | groups = outside_dictionary['groups']
96 | except:
97 | groups = [outside_dictionary['time_measure']]
98 |
99 | if groups == []:
100 | groups = ["bookid is not null as In_Library"]
101 | if (len (groups) > 1):
102 | pass
103 | #self.groups = credentialCheckandClean(self.groups)
104 | #Define some sort of limitations here.
105 | for group in groups:
106 | group = group
107 | if group=="unigram" or group=="word":
108 | group = "words1." + self.word_field + " as unigram"
109 | if group=="bigram":
110 | group = "CONCAT (words1." + self.word_field + " ,' ' , words2." + self.word_field + ") as bigram"
111 | self.groups.append(group)
112 |
113 | self.selections = ",".join(self.groups)
114 | self.groupings = ",".join([re.sub(".* as","",group) for group in self.groups])
115 |
116 |
117 | self.compare_dictionary = copy.deepcopy(self.outside_dictionary)
118 | if 'compare_limits' in self.outside_dictionary.keys():
119 | self.compare_dictionary['search_limits'] = outside_dictionary['compare_limits']
120 | del outside_dictionary['compare_limits']
121 | else: #if nothing specified, we compare the word to the corpus.
122 | for key in ['word','word1','word2','word3','word4','word5','unigram','bigram']:
123 | try:
124 | del self.compare_dictionary['search_limits'][key]
125 | except:
126 | pass
127 | for key in self.outside_dictionary['search_limits'].keys():
128 | if re.search('words?\d',key):
129 | try:
130 | del self.compare_dictionary['search_limits'][key]
131 | except:
132 | pass
133 |
134 | comparegroups = []
135 | #This is a little tricky behavior here--hopefully it works in all cases. It drops out word groupings.
136 | try:
137 | compareGroups = self.compare_dictionary['groups']
138 | except:
139 | compareGroups = [self.compare_dictionary['time_measure']]
140 | for group in compareGroups:
141 | if not re.match("words",group) and not re.match("[u]?[bn]igram",group):
142 | comparegroups.append(group)
143 | self.compare_dictionary['groups'] = comparegroups
144 | self.time_limits = outside_dictionary.setdefault('time_limits',[0,10000000])
145 | self.time_measure = outside_dictionary.setdefault('time_measure','year')
146 | self.counttype = outside_dictionary.setdefault('counttype',"Occurrences_per_Million_Words")
147 |
148 | self.index = outside_dictionary.setdefault('index',0)
149 | #Ordinarily, the input should be an array of groups that will both select and group by.
150 | #The joins may be screwed up by certain names that exist in multiple tables, so there's an option to do something like
151 | #SELECT catalog.bookid as myid, because WHERE clauses on myid will work but GROUP BY clauses on catalog.bookid may not
152 | #after a sufficiently large number of subqueries.
153 | #This smoothing code really ought to go somewhere else, since it doesn't quite fit into the whole API mentality and is
154 | #more about the webpage.
155 | self.smoothingType = outside_dictionary.setdefault('smoothingType',"triangle")
156 | self.smoothingSpan = outside_dictionary.setdefault('smoothingSpan',3)
157 | self.method = outside_dictionary.setdefault('method',"Nothing")
158 | self.tablename = outside_dictionary.setdefault('tablename','master'+"_bookcounts as bookcounts")
159 |
160 | def derive_variables(self):
161 | #These are locally useful, and depend on the variables
162 | self.limits = self.search_limits
163 | #Treat empty constraints as nothing at all, not as full restrictions.
164 | for key in self.limits.keys():
165 | if self.limits[key] == []:
166 | del self.limits[key]
167 | self.create_catalog_table()
168 | self.make_catwhere()
169 | self.make_wordwheres()
170 |
171 | def create_catalog_table(self):
172 | self.catalog = self.prefs['fastcat'] #'catalog' #Can be replaced with a more complicated query.
173 |
174 | #Rather than just search for "LCSH", this should check query constraints against a list of tables, and join to them.
175 | #So if you query with a limit on LCSH, it joins the table "LCSH" to catalog; and then that table has one column, ALSO
176 | #called "LCSH", which is matched against. This allows a bookid to be a member of multiple catalogs.
177 |
178 | for limitation in self.prefs['separateDataTables']:
179 | #That re.sub thing is in here because sometimes I do queries that involve renaming.
180 | if limitation in [re.sub(" .*","",key) for key in self.limits.keys()] or limitation in [re.sub(" .*","",group) for group in self.groups]:
181 | self.catalog = self.catalog + """ JOIN """ + limitation + """ USING (bookid)"""
182 |
183 | #Here's a feature that's not yet fully implemented: it doesn't work quickly enough, probably because the joins involve a lot of jumping back and forth
184 | if 'hasword' in self.limits.keys():
185 | #This is the sort of code I should have written more of:
186 | #it just generates a new API call to fill a small part of the code here:
187 | #(in this case, it merges the 'catalog' entry with a select query on
188 | #the word in the 'haswords' field. Enough of this could really
189 | #shrink the codebase, I suspect. But for some reason, these joins end up being too slow to run.
190 | #I think that has to do with the temporary table being created; we need to figure out how
191 | #to allow direct access to wordsheap here without having the table aliases for the different versions of wordsheap
192 | #being used overlapping.
193 | if self.limits['hasword'] == []:
194 | del self.limits['hasword']
195 | return
196 | import copy
197 | #deepcopy lets us get a real copy of the dictionary
198 | #that can be changed without affecting the old one.
199 | mydict = copy.deepcopy(self.outside_dictionary)
200 | mydict['search_limits'] = copy.deepcopy(self.limits)
201 | mydict['search_limits']['word'] = copy.deepcopy(mydict['search_limits']['hasword'])
202 | del mydict['search_limits']['hasword']
203 | tempquery = userquery(mydict)
204 | bookids = ''
205 | bookids = tempquery.counts_query()
206 |
207 | #If this is ever going to work, 'catalog' here should be some call to self.prefs['fastcat']
208 | bookids = re.sub("(?s).*catalog[^\.]?[^\.\n]*\n","\n",bookids)
209 | bookids = re.sub("(?s)WHERE.*","\n",bookids)
210 | bookids = re.sub("(words|lookup)([0-9])","has\\1\\2",bookids)
211 | bookids = re.sub("main","hasTable",bookids)
212 | self.catalog = self.catalog + bookids
213 | #del self.limits['hasword']
214 |
215 | def make_catwhere(self):
216 | #Where terms that don't include the words table join. Kept separate so that we can have subqueries only working on one half of the stack.
217 | catlimits = dict()
218 | for key in self.limits.keys():
219 | if key not in ('word','word1','word2','hasword') and not re.search("words\d",key):
220 | catlimits[key] = self.limits[key]
221 | if len(catlimits.keys()) > 0:
222 | self.catwhere = where_from_hash(catlimits)
223 | else:
224 | self.catwhere = "TRUE"
225 |
226 | def make_wordwheres(self):
227 | self.wordswhere = " TRUE "
228 | self.max_word_length = 0
229 | limits = []
230 |
231 | if 'word' in self.limits.keys():
232 | """
233 | This doesn't currently allow mixing of one and two word searches together in a logical way.
234 | It might be possible to just join on both the tables in MySQL--I'm not completely sure what would happen.
235 | But the philosophy has been to keep users from doing those searches as far as possible in any case.
236 | """
237 | for phrase in self.limits['word']:
238 | locallimits = dict()
239 | array = phrase.split(" ")
240 | n=1
241 | for word in array:
242 | locallimits['words'+str(n) + "." + self.word_field] = word
243 | self.max_word_length = max(self.max_word_length,n)
244 | n = n+1
245 | limits.append(where_from_hash(locallimits))
246 | #XXX for backward compatibility
247 | self.words_searched = phrase
248 | #del self.limits['word']
249 | self.wordswhere = '(' + ' OR '.join(limits) + ')'
250 |
251 | wordlimits = dict()
252 |
253 | limitlist = copy.deepcopy(self.limits.keys())
254 |
255 | for key in limitlist:
256 | if re.search("words\d",key):
257 | wordlimits[key] = self.limits[key]
258 | self.max_word_length = max(self.max_word_length,2)
259 | del self.limits[key]
260 |
261 | if len(wordlimits.keys()) > 0:
262 | self.wordswhere = where_from_hash(wordlimits)
263 |
264 |
265 | # def return_wordstableOld(self, words = ['polka dot'], pos=1):
266 | # #This returns an SQL sequence suitable for querying or, probably, joining, that gives a words table only as long as the words that are
267 | # #listed in the query; it works with different word fields
268 | # #The pos value specifies a number to go after the table names, so that we can have more than one table in the join. But those numbers
269 | # #have to be assigned elsewhere, so overlap is a danger if programmed poorly.
270 | # self.lookupname = "lookup" + str(pos)
271 | # self.wordsname = "words" + str(pos)
272 | # if len(words) > 0:
273 | # self.wordwhere = where_from_hash({self.lookupname + ".casesens":words})
274 | # self.wordstable = """
275 | # %(wordsheap)s as %(wordsname)s JOIN
276 | # %(wordsheap)s AS %(lookupname)s
277 | # ON ( %(wordsname)s.%(word_field)s=%(lookupname)s.%(word_field)s
278 | # AND %(wordwhere)s ) """ % self.__dict__
279 | # else:
280 | # #We want to have some words returned even if _none_ are the query so that they can be selected. Having all the joins doesn't allow that,
281 | # #because in certain cases (merging by stems, eg) it would have multiple rows returned for a single word.
282 | # self.wordstable = """
283 | # %(wordsheap)s as %(wordsname)s """ % self.__dict__
284 | # return self.wordstable
285 |
286 | def build_wordstables(self):
287 | #Deduce the words tables we're joining against. The iterating on this can be made more general to get 3 or four grams in pretty easily.
288 | #This relies on a determination already having been made about whether this is a unigram or bigram search; that's reflected in the keys passed.
289 | if (self.max_word_length == 2 or re.search("words2",self.selections)):
290 | self.maintable = 'master_bigrams'
291 | self.main = '''
292 | JOIN
293 | master_bigrams as main
294 | ON ('''+ self.prefs['fastcat'] +'''.bookid=main.bookid)
295 | '''
296 | self.wordstables = """
297 | JOIN %(wordsheap)s as words1 ON (main.word1 = words1.wordid)
298 | JOIN %(wordsheap)s as words2 ON (main.word2 = words2.wordid) """ % self.__dict__
299 |
300 | #I use a regex here to do a blanket search for any sort of word limitations. That has some messy side effects (make sure the 'hasword'
301 | #key has already been eliminated, for example!) but generally works.
302 | elif self.max_word_length == 1 or re.search("word",self.selections):
303 | self.maintable = 'master_bookcounts'
304 | self.main = '''
305 | JOIN
306 | master_bookcounts as main
307 | ON (''' + self.prefs['fastcat'] + '''.bookid=main.bookid)'''
308 | self.tablename = 'master_bookcounts'
309 | self.wordstables = """
310 | JOIN ( %(wordsheap)s as words1) ON (main.wordid = words1.wordid)
311 | """ % self.__dict__
312 | #Have _no_ words table if no words searched for or grouped by; instead just use nwords. This
313 | #isn't strictly necessary, but means the API can be used for the slug-filling queries, and some others.
314 | else:
315 | self.main = " "
316 | self.operation = self.catoperation[self.counttype] #Why did I do this?
317 | self.wordstables = " "
318 | self.wordswhere = " TRUE " #Just a dummy thing. Shouldn't take any time, right?
319 |
320 | def counts_query(self,countname='count'):
321 | self.countname=countname
322 | self.bookoperation = {"Occurrences_per_Million_Words":"sum(main.count)","Raw_Counts":"sum(main.count)","Percentage_of_Books":"count(DISTINCT " + self.prefs['fastcat'] + ".bookid)","Number_of_Books":"count(DISTINCT "+ self.prefs['fastcat'] + ".bookid)"}
323 | self.catoperation = {"Occurrences_per_Million_Words":"sum(nwords)","Raw_Counts":"sum(nwords)","Percentage_of_Books":"count(nwords)","Number_of_Books":"count(nwords)"}
324 | self.operation = self.bookoperation[self.counttype]
325 | self.build_wordstables()
326 | countsQuery = """
327 | SELECT
328 | %(selections)s,
329 | %(operation)s as %(countname)s
330 | FROM
331 | %(catalog)s
332 | %(main)s
333 | %(wordstables)s
334 | WHERE
335 | %(catwhere)s AND %(wordswhere)s
336 | GROUP BY
337 | %(groupings)s
338 | """ % self.__dict__
339 | return countsQuery
340 |
341 | def ratio_query(self):
342 | finalcountcommands = {"Occurrences_per_Million_Words":"IFNULL(count,0)*1000000/total","Raw_Counts":"IFNULL(count,0)","Percentage_of_Books":"IFNULL(count,0)*100/total","Number_of_Books":"IFNULL(count,0)"}
343 | self.countcommand = finalcountcommands[self.counttype]
344 | #if True: #In the case that we're not using a superset of words; this can be changed later
345 | # supersetGroups = [group for group in self.groups if not re.match('word',group)]
346 | # self.finalgroupings = self.groupings
347 | # for key in self.limits.keys():
348 | # if re.match('word',key):
349 | # del self.limits[key]
350 |
351 | self.denominator = userquery(outside_dictionary = self.compare_dictionary)
352 | self.supersetquery = self.denominator.counts_query(countname='total')
353 |
354 | if re.search("In_Library",self.denominator.selections):
355 | self.selections = self.selections + ", fastcat.bookid is not null as In_Library"
356 |
357 | #See above: In_Library is a dummy variable so that there's always something to join on.
358 | self.mainquery = self.counts_query()
359 |
360 |
361 | """
362 | We launch a whole new userquery instance here to build the denominator, based on the 'compare_dictionary' option (which in most
363 | cases is the search_limits without the word keys; see above).
364 | We then get the counts_query results out of that result.
365 | """
366 |
367 |
368 | self.totalMergeTerms = "USING (" + self.denominator.groupings + " ) "
369 |
370 |
371 | self.totalselections = ",".join([re.sub(".* as","",group) for group in self.groups])
372 |
373 | query = """
374 | SELECT
375 | %(totalselections)s,
376 | %(countcommand)s as value
377 | FROM
378 | ( %(mainquery)s
379 | ) as tmp
380 | RIGHT JOIN
381 | ( %(supersetquery)s ) as totaller
382 | %(totalMergeTerms)s
383 | GROUP BY %(groupings)s;""" % self.__dict__
384 | return query
385 |
386 | def return_slug_data(self,force=False):
387 | #Rather than understand this error, I'm just returning 0 if it fails.
388 | #Probably that's the right thing to do, though it may cause trouble later.
389 | #It's just a punishment for not later using a decent API call with "Raw_Counts" to extract these counts out, and relying on this ugly method.
390 | try:
391 | temp_words = self.return_n_words(force = True)
392 | temp_counts = self.return_n_books(force = True)
393 | except:
394 | temp_words = 0
395 | temp_counts = 0
396 | return [temp_counts,temp_words]
397 |
398 | def return_n_books(self,force=False):
399 | if (not hasattr(self,'nbooks')) or force:
400 | query = "SELECT count(*) from " + self.catalog + " WHERE " + self.catwhere
401 | silent = self.cursor.execute(query)
402 | self.counts = int(self.cursor.fetchall()[0][0])
403 | return self.counts
404 |
405 | def return_n_words(self,force=False):
406 | if (not hasattr(self,'nwords')) or force:
407 | query = "SELECT sum(nwords) from " + self.catalog + " WHERE " + self.catwhere
408 | silent = self.cursor.execute(query)
409 | self.nwords = int(self.cursor.fetchall()[0][0])
410 | return self.nwords
411 |
412 | def ranked_query(self,percentile_to_return = 99,addwhere = ""):
413 | #NOT CURRENTLY IN USE ANYWHERE--DELETE???
414 | ##This returns a list of bookids in order by how well they match the sort terms.
415 | ## Using an IDF term will give better search results for case-sensitive searches, but is currently disabled
416 | ##
417 | self.LIMIT = int((100-percentile_to_return) * self.return_n_books()/100)
418 | countQuery = """
419 | SELECT
420 | bookid,
421 | sum(main.count*1000/nwords%(idfterm)s) as score
422 | FROM %(catalog)s LEFT JOIN %(tablename)s
423 | USING (bookid)
424 | WHERE %(catwhere)s AND %(wordswhere)s
425 | GROUP BY bookid
426 | ORDER BY score DESC
427 | LIMIT %(LIMIT)s
428 | """ % self.__dict__
429 | return countQuery
430 |
431 | def bibliography_query(self,limit = "100"):
432 | #I'd like to redo this at some point so it could work as an API call.
433 | self.limit = limit
434 | self.ordertype = "sum(main.count*10000/nwords)"
435 | try:
436 | if self.outside_dictionary['ordertype'] == "random":
437 | if self.counttype=="Raw_Counts" or self.counttype=="Number_of_Books":
438 | self.ordertype = "RAND()"
439 | else:
440 | self.ordertype = "LOG(1-RAND())/sum(main.count)"
441 | except KeyError:
442 | pass
443 |
444 | #If IDF searching is enabled, we could add a term like '*IDF' here to overweight better selecting words
445 | #in the event of a multiple search.
446 | self.idfterm = ""
447 | prep = self.counts_query()
448 |
449 | bibQuery = """
450 | SELECT searchstring
451 | FROM """ % self.__dict__ + self.prefs['fullcat'] + """ RIGHT JOIN (
452 | SELECT
453 | """+ self.prefs['fastcat'] + """.bookid, %(ordertype)s as ordering
454 | FROM
455 | %(catalog)s
456 | %(main)s
457 | %(wordstables)s
458 | WHERE
459 | %(catwhere)s AND %(wordswhere)s
460 | GROUP BY bookid ORDER BY %(ordertype)s DESC LIMIT %(limit)s
461 | ) as tmp USING(bookid) ORDER BY ordering DESC;
462 | """ % self.__dict__
463 | return bibQuery
464 |
465 | def disk_query(self,limit="100"):
466 | pass
467 |
468 | def return_books(self):
469 | #This preps up the display elements for a search.
470 | #All this needs to be rewritten.
471 | silent = self.cursor.execute(self.bibliography_query())
472 | returnarray = []
473 | for line in self.cursor.fetchall():
474 | returnarray.append(line[0])
475 | if not returnarray:
476 | returnarray.append("No results for this particular point: try again without smoothing")
477 | newerarray = self.custom_SearchString_additions(returnarray)
478 | return json.dumps(newerarray)
479 |
480 | def getActualSearchedWords(self):
481 | if len(self.wordswhere) > 7:
482 | words = self.outside_dictionary['search_limits']['word']
483 | #Break bigrams into single words.
484 | words = ' '.join(words).split(' ')
485 | self.cursor.execute("""SELECT word FROM wordsheap WHERE """ + where_from_hash({self.word_field:words}))
486 | self.actualWords =[item[0] for item in self.cursor.fetchall()]
487 | else:
488 | self.actualWords = ["tasty","mistake","happened","here"]
489 |
490 | def custom_SearchString_additions(self,returnarray):
491 | db = self.outside_dictionary['database']
492 | if db in ('jstor','presidio','ChronAm','LOC'):
493 | self.getActualSearchedWords()
494 | if db=='jstor':
495 | joiner = "&searchText="
496 | preface = "?Search=yes&searchText="
497 | urlRegEx = "http://www.jstor.org/stable/\d+"
498 | if db=='presidio':
499 | joiner = "+"
500 | preface = "#page/1/mode/2up/search/"
501 | urlRegEx = 'http://archive.org/stream/[^"# ><]*'
502 | if db in ('ChronAm','LOC'):
503 | preface = "/;words="
504 | joiner = "+"
505 | urlRegEx = 'http://chroniclingamerica.loc.gov[^\"><]*/seq-\d'
506 | newarray = []
507 | for string in returnarray:
508 | base = re.findall(urlRegEx,string)[0]
509 | newcore = ' search inside '
510 | string = re.sub("^\n","",string)
511 | string = re.sub(" | $","",string)
512 | string = string+newcore
513 | newarray.append(string)
514 | #Arxiv is messier, requiring a whole different URL interface: http://search.arxiv.org:8081/paper.jsp?r=1204.3352&qs=network
515 | return newarray
516 |
517 | def return_query_values(self,query = "ratio_query"):
518 | #The API returns a dictionary with years pointing to values.
519 | values = []
520 | querytext = getattr(self,query)()
521 | silent = self.cursor.execute(querytext)
522 | #Gets the results
523 | mydict = dict(self.cursor.fetchall())
524 | try:
525 | for key in mydict.keys():
526 | #Only return results inside the time limits
527 | if key >= self.time_limits[0] and key <= self.time_limits[1]:
528 | mydict[key] = str(mydict[key])
529 | else:
530 | del mydict[key]
531 | mydict = smooth_function(mydict,smooth_method = self.smoothingType,span = self.smoothingSpan)
532 |
533 | except:
534 | mydict = {0:"0"}
535 |
536 | #This is a good place to change some values.
537 | try:
538 | return {'index':self.index, 'Name':self.words_searched,"values":mydict,'words_searched':""}
539 | except:
540 | return{'values':mydict}
541 |
542 | def arrayNest(self,array,returnt):
543 | #A recursive function to transform a list into a nested array
544 | if len(array)==2:
545 | try:
546 | returnt[array[0]] = float(array[1])
547 | except:
548 | returnt[array[0]] = array[1]
549 | else:
550 | try:
551 | returnt[array[0]] = self.arrayNest(array[1:len(array)],returnt[array[0]])
552 | except KeyError:
553 | returnt[array[0]] = self.arrayNest(array[1:len(array)],dict())
554 | return returnt
555 |
556 | def return_json(self,query='ratio_query'):
557 | if self.counttype=="Raw_Counts" or self.counttype=="Number_of_Books":
558 | query="counts_query"
559 | querytext = getattr(self,query)()
560 | silent = self.cursor.execute(querytext)
561 | names = [to_unicode(item[0]) for item in self.cursor.description]
562 | returnt = dict()
563 | lines = self.cursor.fetchall()
564 | for line in lines:
565 | returnt = self.arrayNest(line,returnt)
566 | return returnt
567 |
568 | def return_tsv(self,query = "ratio_query"):
569 | if self.counttype=="Raw_Counts" or self.counttype=="Number_of_Books":
570 | query="counts_query"
571 | querytext = getattr(self,query)()
572 | silent = self.cursor.execute(querytext)
573 | results = ["\t".join([to_unicode(item[0]) for item in self.cursor.description])]
574 | lines = self.cursor.fetchall()
575 | for line in lines:
576 | items = []
577 | for item in line:
578 | item = to_unicode(item)
579 | item = re.sub("\t","",item)
580 | items.append(item)
581 | results.append("\t".join(items))
582 | return "\n".join(results)
583 |
584 | def export_data(self,query1="ratio_query"):
585 | self.smoothing=0
586 | return self.return_query_values(query=query1)
587 |
588 | def execute(self):
589 | #This performs the query using the method specified in the passed parameters.
590 | if self.method=="Nothing":
591 | pass
592 | else:
593 | return getattr(self,self.method)()
594 |
595 |
596 | #############
597 | ##GENERAL#### #These are general purpose functional types of things not implemented in the class.
598 | #############
599 |
600 | def to_unicode(obj, encoding='utf-8'):
601 | if isinstance(obj, basestring):
602 | if not isinstance(obj, unicode):
603 | obj = unicode(obj, encoding)
604 | elif isinstance(obj,int):
605 | obj=unicode(str(obj),encoding)
606 | else:
607 | obj = unicode(str(obj),encoding)
608 | return obj
609 |
610 | def where_from_hash(myhash,joiner=" AND ",comp = " = "):
611 | whereterm = []
612 | #The general idea here is that we try to break everything in search_limits down to a list, and then create a whereterm on that joined by whatever the 'joiner' is ("AND" or "OR"), with the comparison as whatever comp is ("=",">=",etc.).
613 | #For more complicated bits, it gets all recursive until the bits are in terms of list.
614 | for key in myhash.keys():
615 | values = myhash[key]
616 | if isinstance(values,basestring) or isinstance(values,int) or isinstance(values,float):
617 | #This is just error handling. You can pass a single value instead of a list if you like, and it will just convert it
618 | #to a list for you.
619 | values = [values]
620 | #Or queries are special, since the default is "AND". This toggles that around for a subportion.
621 | if key=='$or' or key=="$OR":
622 | for comparison in values:
623 | whereterm.append(where_from_hash(comparison,joiner=" OR ",comp=comp))
624 | #The or doesn't get populated any farther down.
625 | elif isinstance(values,dict):
626 | #Certain function operators can use MySQL terms. These are the only cases that a dict can be passed as a limitations
627 | operations = {"$gt":">","$lt":"<","$grep":" REGEXP ","$gte":">=","$lte":"<=","$eq":"="}
628 | for operation in values.keys():
629 | whereterm.append(where_from_hash({key:values[operation]},comp=operations[operation],joiner=joiner))
630 | elif isinstance(values,list):
631 | #and this is where the magic actually happens
632 | if isinstance(values[0],dict):
633 | for entry in values:
634 | whereterm.append(where_from_hash(entry))
635 | else:
636 | if isinstance(values[0],basestring):
637 | quotesep="'"
638 | else:
639 | quotesep = ""
640 | #Note the "OR" here. There's no way to pass in a query like "year=1876 AND year=1898" as currently set up.
641 | #Obviously that's no great loss, but there might be something I'm missing that would be.
642 | whereterm.append(" (" + " OR ".join([" (" + key+comp+quotesep+str(value)+quotesep+") " for value in values])+ ") ")
643 | return "(" + joiner.join(whereterm) + ")"
644 | #This works pretty well, except that it requires very specific sorts of terms going in, I think.
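#An illustrative example (the field names and values here are chosen only for demonstration):
#    where_from_hash({"LCSH": ["Fiction"], "year": {"$lt": 1900}})
#returns, up to whitespace and dictionary ordering, something like
#    ( ( (LCSH = 'Fiction') ) AND ( ( (year<1900) ) ))
#String values get quoted, lists are OR-ed together, and operator dictionaries such as
#{"$lt": ...} or {"$gte": ...} swap in the matching comparison instead of "=".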
645 |
646 |
647 |
648 | #I'd rather have all this smoothing stuff done at the client side, but currently it happens here.
649 | def smooth_function(zinput,smooth_method = 'lowess',span = .05):
650 | if smooth_method not in ['lowess','triangle','rectangle']:
651 | return zinput
652 | xarray = []
653 | yarray = []
654 | years = zinput.keys()
655 | years.sort()
656 | for key in years:
657 | if zinput[key]!='None':
658 | xarray.append(float(key))
659 | yarray.append(float(zinput[key]))
660 | from numpy import array
661 | x = array(xarray)
662 | y = array(yarray)
663 | if smooth_method == 'lowess':
664 | #print "starting lowess smoothing
"
665 | from Bio.Statistics.lowess import lowess
666 | smoothed = lowess(x,y,float(span)/100,3)
667 | x = [int(p) for p in x]
668 | returnval = dict(zip(x,smoothed))
669 | return returnval
670 | if smooth_method == 'rectangle':
671 | from math import log
672 | #print "starting triangle smoothing\n"
673 | span = int(span) #Takes the floor--so no smoothing on a span < 1.
674 | returnval = zinput
675 | windowsize = span*2 + 1
676 | from numpy import average
677 | for i in range(len(xarray)):
678 | surrounding = array(range(windowsize),dtype=float)
679 | weights = array(range(windowsize),dtype=float)
680 | for j in range(windowsize):
681 | key_dist = j - span #if span is 2, the zeroth element is 2 to the left, the middle element is 0 off, etc.
682 | workingon = i + key_dist
683 | if workingon >= 0 and workingon < len(xarray):
684 | surrounding[j] = float(yarray[workingon])
685 | weights[j] = 1
686 | else:
687 | surrounding[j] = 0
688 | weights[j] = 0
689 | returnval[xarray[i]] = round(average(surrounding,weights=weights),3)
690 | return returnval
691 | if smooth_method == 'triangle':
692 | from math import log
693 | #print "starting triangle smoothing\n"
694 | span = int(span) #Takes the floor--so no smoothing on a span < 1.
695 | returnval = zinput
696 | windowsize = span*2 + 1
697 | from numpy import average
698 | for i in range(len(xarray)):
699 | surrounding = array(range(windowsize),dtype=float)
700 | weights = array(range(windowsize),dtype=float)
701 | for j in range(windowsize):
702 | key_dist = j - span #if span is 2, the zeroth element is 2 to the left, the middle element is 0 off, etc.
703 | workingon = i + key_dist
704 | if workingon >= 0 and workingon < len(xarray):
705 | surrounding[j] = float(yarray[workingon])
706 | #This isn't actually triangular smoothing: I dampen it by the logs, to keep the peaks from being too big.
707 | #The minimum is '2', since log(1) == 0, which is a nonsense weight.
708 | weights[j] = log(span + 2 - abs(key_dist))
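#For example, with span = 2 the five positions in the window get key_dist values -2..2,
#so the raw weights are log(2), log(3), log(4), log(3), log(2): a peaked, roughly
#triangular window whose peak is flattened by the log.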
709 | else:
710 | surrounding[j] = 0
711 | weights[j] = 0
712 |
713 | returnval[xarray[i]] = round(average(surrounding,weights=weights),3)
714 | return returnval
715 |
716 | #The idea is: this works by default by slurping up from the command line, but you could also load the functions in and run results on your own queries.
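#For instance (illustrative values only), from the shell:
#    python APIimplementation.py '{"database": "presidio", "search_limits": {"word": ["polka dot"]}, "groups": ["year"], "counttype": "Occurrences_per_Million_Words", "method": "return_json"}'
#or, from other Python code:
#    result = userqueries(json.loads(query_string)).execute()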
717 | try:
718 | command = str(sys.argv[1])
719 | command = json.loads(command)
720 | #Got to go before we let anything else happen.
721 | print command
722 | p = userqueries(command)
723 | result = p.execute()
724 | print json.dumps(result)
725 | except:
726 | pass
727 |
728 |
--------------------------------------------------------------------------------
/bookworm/.gitignore:
--------------------------------------------------------------------------------
1 | old/*
2 | *~
3 | APIkeys
4 | #*
5 | .#*
--------------------------------------------------------------------------------
/bookworm/APIimplementation.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 |
3 | import sys
4 | import json
5 | import cgi
6 | import re
7 | import numpy #used for smoothing.
8 | import copy
9 | import decimal
10 | """
11 | #These are here so we can support multiple databases with different naming schemes from a single API.
12 | #A bit ugly to have here; could be part of a configuration file somewhere else, I guess. There are 'fast' and 'full' tables for books and words;
13 | #that's so memory tables can be used in certain cases for fast, hashed matching, but longer form data (like book titles)
14 | #can be stored on disk. Different queries use different types of calls.
15 | #Also, certain metadata fields are stored separately from the main catalog table;
16 | #I list them manually here to avoid a database call to find out what they are,
17 | #although the latter would be more elegant. The way to do that would be a database call
18 | #of tables with two columns one of which is 'bookid', maybe, or something like that.
19 | #(Or to add it as error handling when a query fails; only then check for missing files.)
20 | """
21 |
22 | general_prefs = {"presidio":{"HOST":"melville.seas.harvard.edu","database":"presidio","fastcat":"fastcat","fullcat":"open_editions","fastword":"wordsheap","read_default_file":"/etc/mysql/my.cnf","fullword":"words","separateDataTables":["LCSH","gender"],"read_url_head":"http://www.archive.org/stream/"},"arxiv":{"HOST":"10.102.15.45","database":"arxiv","fastcat":"fastcat","fullcat":"catalog","fastword":"wordsheap","fullword":"words","read_default_file":"/etc/mysql/my.cnf","separateDataTables":["genre","fastgenre","archive","subclass"],"read_url_head":"http://www.arxiv.org/abs/"},"jstor":{"HOST":"10.102.15.45","database":"jstor","fastcat":"fastcat","fullcat":"catalog","fastword":"wordsheap","fullword":"words","read_default_file":"/etc/mysql/my.cnf","separateDataTables":["discipline"],"read_url_head":"http://www.arxiv.org/abs/"}, "politweets":{"HOST":"chaucer.fas.harvard.edu","database":"politweets","fastcat":"fastcat","fullcat":"catalog","fastword":"wordsheap","fullword":"words","read_default_file":"/etc/mysql/my.cnf","separateDataTables":[],"read_url_head":"http://www.arxiv.org/abs/"},"LOC":{"HOST":"10.102.15.45","database":"LOC","fastcat":"fastcat","fullcat":"catalog","fastword":"wordsheap","fullword":"words","read_default_file":"/etc/mysql/my.cnf","separateDataTables":[],"read_url_head":"http://www.arxiv.org/abs/"},"ChronAm":{"HOST":"10.102.15.45","database":"ChronAm","fastcat":"fastcat","fullcat":"catalog","fastword":"wordsheap","fullword":"words","read_default_file":"/etc/mysql/my.cnf","separateDataTables":["subjects"],"read_url_head":"http://www.arxiv.org/abs/"},"ngrams":{"fastcat": "fastcat", "HOST": "10.102.15.45", "separateDataTables": [], "fastword": "wordsheap", "database": "ngrams", "read_url_head": "arxiv.culturomics.org", "fullcat": "catalog", "fullword": "words", "read_default_file": "/etc/mysql/my.cnf"},"OL":{"HOST":"10.102.15.45","database":"OL","fastcat":"fastcat","fullcat":"catalog","fastword":"wordsheap","fullword":"words","read_default_file":"/etc/mysql/my.cnf","separateDataTables":["subjects"],"read_url_head":"http://www.arxiv.org/abs/"}}
23 |
24 | general_prefs['OL'] = {"fastcat": "fastcat", "HOST": "10.102.15.45", "separateDataTables": ["authors", "publishers", "authors", "subjects"], "fastword": "wordsheap", "database": "OL", "read_url_head": "arxiv.culturomics.org", "fullcat":"catalog", "fullword": "words", "read_default_file": "/etc/mysql/my.cnf"}
25 |
26 | #We define prefs to default to the Open Library set at first; later, it can do other things.
27 |
28 | class dbConnect():
29 | #This is a read-only account
30 | def __init__(self,prefs = general_prefs['presidio']):
31 | import MySQLdb
32 | self.db = MySQLdb.connect(host=prefs['HOST'],read_default_file = prefs['read_default_file'],use_unicode='True',charset='utf8',db=prefs['database'])
33 | self.cursor = self.db.cursor()
34 |
35 |
36 | # The basic object here is a userquery: it takes dictionary as input, as defined in the API, and returns a value
37 | # via the 'execute' function whose behavior
38 | # depends on the mode that is passed to it.
39 | # Given the dictionary, it can return a number of objects.
40 | # The "Search_limits" array in the passed dictionary determines how many elements it returns; this lets multiple queries be bundled together.
41 | # Most functions describe a subquery that might be combined into one big query in various ways.
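# As an illustration (the values here are only examples), a dictionary like
#   {"database": "presidio",
#    "search_limits": [{"word": ["polka dot"]}, {"word": ["banjo"]}],
#    "groups": ["year"],
#    "counttype": ["Occurrences_per_Million_Words"],
#    "method": "return_json"}
# makes userqueries build one userquery per entry in search_limits and return the
# two results together as a list from execute().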
42 |
43 | class userqueries():
44 | #This is a set of queries that are bound together; each element in search limits is iterated over, and we're done.
45 | def __init__(self,outside_dictionary = {"counttype":["Percentage_of_Books"],"search_limits":[{"word":["polka dot"],"LCSH":["Fiction"]}]},db = None):
46 | self.database = outside_dictionary.setdefault('database','presidio')
47 | prefs = general_prefs[self.database]
48 | self.prefs = prefs
49 | self.wordsheap = prefs['fastword']
50 | self.words = prefs['fullword']
51 | if 'search_limits' not in outside_dictionary.keys():
52 | outside_dictionary['search_limits'] = [{}]
53 | #coerce one-element dictionaries to an array.
54 | if isinstance(outside_dictionary['search_limits'],dict):
55 | #(allowing passing of just single dictionaries instead of arrays)
56 | outside_dictionary['search_limits'] = [outside_dictionary['search_limits']]
57 | self.returnval = []
58 | self.queryInstances = []
59 | for limits in outside_dictionary['search_limits']:
60 | mylimits = outside_dictionary
61 | mylimits['search_limits'] = limits
62 | localQuery = userquery(mylimits)
63 | self.queryInstances.append(localQuery)
64 | self.returnval.append(localQuery.execute())
65 |
66 | def execute(self):
67 | return self.returnval
68 |
69 | class userquery():
70 | def __init__(self,outside_dictionary = {"counttype":["Percentage_of_Books"],"search_limits":{"word":["polka dot"],"LCSH":["Fiction"]}}):
71 | #Certain constructions require a DB connection already available, so we just start it here, or use the one passed to it.
72 | self.outside_dictionary = outside_dictionary
73 | self.prefs = general_prefs[outside_dictionary.setdefault('database','presidio')]
74 | self.db = dbConnect(self.prefs)
75 | self.cursor = self.db.cursor
76 | self.wordsheap = self.prefs['fastword']
77 | self.words = self.prefs['fullword']
78 |
79 | #I'm now allowing 'search_limits' to either be a dictionary or an array of dictionaries:
80 | #this makes the syntax cleaner on most queries,
81 | #while still allowing some long ones from the Bookworm website.
82 | if isinstance(outside_dictionary['search_limits'],list):
83 | outside_dictionary['search_limits'] = outside_dictionary['search_limits'][0]
84 | self.defaults(outside_dictionary) #Take some defaults
85 | self.derive_variables() #Derive some useful variables that the query will use.
86 |
87 | def defaults(self,outside_dictionary):
88 | #these are default values;these are the only values that can be set in the query
89 | #search_limits is an array of dictionaries;
90 | #each one contains a set of limits that are mutually independent
91 | #The other limitations are universal for all the search limits being set.
92 |
93 | #Set up a dictionary for the denominator of any fraction if it doesn't already exist:
94 | self.search_limits = outside_dictionary.setdefault('search_limits',[{"word":["polka dot"]}])
95 |
96 | self.words_collation = outside_dictionary.setdefault('words_collation',"Case_Insensitive")
97 |
98 | lookups = {"Case_Insensitive":'word',"case_insensitive":"word","Case_Sensitive":"casesens","Correct_Medial_s":'ffix',"All_Words_with_Same_Stem":"stem","Flagged":'wflag'}
99 |
100 | self.word_field = lookups[self.words_collation]
101 |
102 | self.groups = []
103 | try:
104 | groups = outside_dictionary['groups']
105 | except:
106 | groups = [outside_dictionary['time_measure']]
107 |
108 | if groups == []:
109 | #Set an arbitrary column name if nothing else is set.
110 | groups = ["bookid is not null as In_Library"]
111 |
112 | if (len (groups) > 1):
113 | pass
114 | #self.groups = credentialCheckandClean(self.groups)
115 | #Define some sort of limitations here, if not done in dbbindings.py
116 |
117 | for group in groups:
118 | group = group
119 | if group=="unigram" or group=="word":
120 | group = "words1." + self.word_field + " as unigram"
121 | if group=="bigram":
122 | group = "CONCAT (words1." + self.word_field + " ,' ' , words2." + self.word_field + ") as bigram"
123 | self.groups.append(group)
124 |
125 | self.selections = ",".join(self.groups)
126 | self.groupings = ",".join([re.sub(".* as","",group) for group in self.groups])
127 |
128 |
129 | self.compare_dictionary = copy.deepcopy(self.outside_dictionary)
130 | if 'compare_limits' in self.outside_dictionary.keys():
131 | self.compare_dictionary['search_limits'] = outside_dictionary['compare_limits']
132 | del outside_dictionary['compare_limits']
133 | else: #if nothing specified, we compare the word to the corpus.
134 | for key in ['word','word1','word2','word3','word4','word5','unigram','bigram']:
135 | try:
136 | del self.compare_dictionary['search_limits'][key]
137 | except:
138 | pass
139 | for key in self.outside_dictionary['search_limits'].keys():
140 | if re.search('words?\d',key):
141 | try:
142 | del self.compare_dictionary['search_limits'][key]
143 | except:
144 | pass
145 |
146 | comparegroups = []
147 | #This is a little tricky behavior here--hopefully it works in all cases. It drops out word groupings.
148 | try:
149 | compareGroups = self.compare_dictionary['groups']
150 | except:
151 | compareGroups = [self.compare_dictionary['time_measure']]
152 | for group in compareGroups:
153 | if not re.match("words",group) and not re.match("[u]?[bn]igram",group):
154 | comparegroups.append(group)
155 | self.compare_dictionary['groups'] = comparegroups
156 | self.time_limits = outside_dictionary.setdefault('time_limits',[0,10000000])
157 | self.time_measure = outside_dictionary.setdefault('time_measure','year')
158 | self.counttype = outside_dictionary.setdefault('counttype',["Occurrences_per_Million_Words"])
159 | if isinstance(self.counttype,basestring):
160 | self.counttype = [self.counttype]
161 | self.index = outside_dictionary.setdefault('index',0)
162 | #Ordinarily, the input should be an array of groups that will both select and group by.
163 | #The joins may be screwed up by certain names that exist in multiple tables, so there's an option to do something like
164 | #SELECT catalog.bookid as myid, because WHERE clauses on myid will work but GROUP BY clauses on catalog.bookid may not
165 | #after a sufficiently large number of subqueries.
166 | #This smoothing code really ought to go somewhere else, since it doesn't quite fit into the whole API mentality and is
167 | #more about the webpage.
168 | self.smoothingType = outside_dictionary.setdefault('smoothingType',"triangle")
169 | self.smoothingSpan = outside_dictionary.setdefault('smoothingSpan',3)
170 | self.method = outside_dictionary.setdefault('method',"Nothing")
171 | self.tablename = outside_dictionary.setdefault('tablename','master'+"_bookcounts as bookcounts")
172 |
173 | def derive_variables(self):
174 | #These are locally useful, and depend on the variables
175 | self.limits = self.search_limits
176 | #Treat empty constraints as nothing at all, not as full restrictions.
177 | for key in self.limits.keys():
178 | if self.limits[key] == []:
179 | del self.limits[key]
180 | self.set_operations()
181 | self.create_catalog_table()
182 | self.make_catwhere()
183 | self.make_wordwheres()
184 |
185 | def create_catalog_table(self):
186 | self.catalog = self.prefs['fastcat'] #'catalog' #Can be replaced with a more complicated query in the event of longer joins.
187 | """
188 | This should check query constraints against a list of tables, and join to them.
189 | So if you query with a limit on LCSH, and LCSH is listed as being in a separate table,
190 | it joins the table "LCSH" to catalog; and then that table has one column, ALSO
191 | called "LCSH", which is matched against. This allows a bookid to be a member of multiple catalogs.
192 | """
193 |
194 |
195 |
196 | for limitation in self.prefs['separateDataTables']:
197 | #That re.sub thing is in here because sometimes I do queries that involve renaming.
198 | if limitation in [re.sub(" .*","",key) for key in self.limits.keys()] or limitation in [re.sub(" .*","",group) for group in self.groups]:
199 | self.catalog = self.catalog + """ JOIN """ + limitation + """ USING (bookid)"""
200 |
201 | """
202 | Here it just pulls every variable and where to look for it.
203 | """
204 |
205 | tableToLookIn = {}
206 | #This is sorted by engine DESC so that memory table locations will overwrite disk table in the hash.
207 | self.cursor.execute("SELECT ENGINE,TABLE_NAME,COLUMN_NAME,COLUMN_KEY FROM information_schema.COLUMNS JOIN INFORMATION_SCHEMA.TABLES USING (TABLE_NAME,TABLE_SCHEMA) WHERE TABLE_SCHEMA='" + self.outside_dictionary['database']+ "' ORDER BY ENGINE DESC,TABLE_NAME;");
208 | columnNames = self.cursor.fetchall()
209 |
210 | for databaseColumn in columnNames:
211 | tableToLookIn[databaseColumn[2]] = databaseColumn[1]
212 |
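#Illustrative only: for a typical schema the mapping might end up looking something like
#  {"bookid": "fastcat", "nwords": "fastcat", "year": "fastcat", "LCSH": "LCSH", ...}
#Since the rows come back ORDER BY ENGINE DESC, a column that exists both in a disk
#table (e.g. 'catalog') and in a MEMORY copy (e.g. 'fastcat') ends up pointing at the
#memory copy, which is the one the fast queries want to join against.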
213 | self.relevantTables = set()
214 |
215 | for columnInQuery in [re.sub(" .*","",key) for key in self.limits.keys()] + [re.sub(" .*","",group) for group in self.groups]:
216 | if not re.search('\.',columnInQuery): #Lets me keep a little bit of SQL sauce for my own queries
217 | try:
218 | self.relevantTables.add(tableToLookIn[columnInQuery])
219 | except KeyError:
220 | pass
221 | #Could warn as well, but this helps backwards compatibility.
222 |
223 | self.catalog = "fastcat"
224 | for table in self.relevantTables:
225 | if table!="fastcat" and table!="words" and table!="wordsheap":
226 | self.catalog = self.catalog + """ NATURAL JOIN """ + table + " "
227 |
228 | #Here's a feature that's not yet fully implemented: it doesn't work quickly enough, probably because the joins involve a lot of jumping back and forth.
229 | if 'hasword' in self.limits.keys():
230 | """
231 | This is the sort of code I'm trying to move towards
232 | it just generates a new API call to fill a small part of the code here:
233 | (in this case, it merges the 'catalog' entry with a select query on
234 | the word in the 'haswords' field. Enough of this could really
235 | shrink the codebase, I suspect. It should be possible in MySQL 6.0, from what I've read, where subqueried tables will have indexes written for them by the query optimizer.
236 | """
237 |
238 | if self.limits['hasword'] == []:
239 | del self.limits['hasword']
240 | return
241 |
242 | #deepcopy lets us get a real copy of the dictionary
243 | #that can be changed without affecting the old one.
244 | mydict = copy.deepcopy(self.outside_dictionary)
245 | mydict['search_limits'] = copy.deepcopy(self.limits)
246 | mydict['search_limits']['word'] = copy.deepcopy(mydict['search_limits']['hasword'])
247 | del mydict['search_limits']['hasword']
248 | tempquery = userquery(mydict)
249 | bookids = ''
250 | bookids = tempquery.counts_query()
251 |
252 | #If this is ever going to work, 'catalog' here should be some call to self.prefs['fastcat']
253 | bookids = re.sub("(?s).*catalog[^\.]?[^\.\n]*\n","\n",bookids)
254 | bookids = re.sub("(?s)WHERE.*","\n",bookids)
255 | bookids = re.sub("(words|lookup)([0-9])","has\\1\\2",bookids)
256 | bookids = re.sub("main","hasTable",bookids)
257 | self.catalog = self.catalog + bookids
258 | #del self.limits['hasword']
259 |
260 | def make_catwhere(self):
261 | #Where terms that don't include the words table join. Kept separate so that we can have subqueries only working on one half of the stack.
262 | catlimits = dict()
263 | for key in self.limits.keys():
264 | if key not in ('word','word1','word2','hasword') and not re.search("words\d",key):
265 | catlimits[key] = self.limits[key]
266 | if len(catlimits.keys()) > 0:
267 | self.catwhere = where_from_hash(catlimits)
268 | else:
269 | self.catwhere = "TRUE"
270 |
271 | def make_wordwheres(self):
272 | self.wordswhere = " TRUE "
273 | self.max_word_length = 0
274 | limits = []
275 |
276 | if 'word' in self.limits.keys():
277 | """
278 | This doesn't currently allow mixing of one and two word searches together in a logical way.
279 | It might be possible to just join on both the tables in MySQL--I'm not completely sure what would happen.
280 | But the philosophy has been to keep users from doing those searches as far as possible in any case.
281 | """
282 | for phrase in self.limits['word']:
283 | locallimits = dict()
284 | array = phrase.split(" ")
285 | n=1
286 | for word in array:
287 | selectString = "(SELECT " + self.word_field + " FROM wordsheap WHERE casesens='" + word + "')"
288 | locallimits['words'+str(n) + "." + self.word_field] = selectString
289 | self.max_word_length = max(self.max_word_length,n)
290 | n = n+1
291 | limits.append(where_from_hash(locallimits,quotesep=""))
292 | 				#XXX for backward compatibility
293 | self.words_searched = phrase
294 | self.wordswhere = '(' + ' OR '.join(limits) + ')'
295 |
296 | wordlimits = dict()
297 |
298 | limitlist = copy.deepcopy(self.limits.keys())
299 |
300 | for key in limitlist:
301 | if re.search("words\d",key):
302 | wordlimits[key] = self.limits[key]
303 | self.max_word_length = max(self.max_word_length,2)
304 | del self.limits[key]
305 |
306 | if len(wordlimits.keys()) > 0:
307 | self.wordswhere = where_from_hash(wordlimits)
308 |
309 |
310 | def build_wordstables(self):
311 | 	#Deduce the words tables we're joining against. The iteration here could be made more general to handle 3- or 4-grams pretty easily.
312 | #This relies on a determination already having been made about whether this is a unigram or bigram search; that's reflected in the keys passed.
313 | if (self.max_word_length == 2 or re.search("words2",self.selections)):
314 |
315 | self.maintable = 'master_bigrams'
316 |
317 | self.main = '''
318 | JOIN
319 | master_bigrams as main
320 | ON ('''+ self.prefs['fastcat'] +'''.bookid=main.bookid)
321 | '''
322 |
323 | self.wordstables = """
324 | JOIN %(wordsheap)s as words1 ON (main.word1 = words1.wordid)
325 | JOIN %(wordsheap)s as words2 ON (main.word2 = words2.wordid) """ % self.__dict__
326 |
327 | 		#I use a regex here to do a blanket search for any sort of word limitations. That has some messy side effects (make sure the 'hasword'
328 | #key has already been eliminated, for example!) but generally works.
329 |
330 | elif self.max_word_length == 1 or re.search("[^h][^a][^s]word",self.selections):
331 | self.maintable = 'master_bookcounts'
332 | self.main = '''
333 | JOIN
334 | master_bookcounts as main
335 | ON (''' + self.prefs['fastcat'] + '''.bookid=main.bookid)'''
336 | self.tablename = 'master_bookcounts'
337 | self.wordstables = """
338 | JOIN ( %(wordsheap)s as words1) ON (main.wordid = words1.wordid)
339 | """ % self.__dict__
340 |
341 | else:
342 | """
343 | Have _no_ words table if no words searched for or grouped by; instead just use nwords. This
344 | means that we can use the same basic functions both to build the counts for word searches and
345 | for metadata searches, which is valuable because there is a metadata-only search built in to every single ratio
346 | query. (To get the denominator values).
347 | """
348 | self.main = " "
349 | self.operation = ','.join(self.catoperations)
350 | """
351 | 			This, above, is super important: the operation used is relative to the counttype, and changes to use 'catoperation' instead of 'bookoperation'.
352 | 			That's how the denominator queries avoid having to do a table scan on full bookcounts that would take hours, and instead take
353 | milliseconds.
354 | """
355 | self.wordstables = " "
356 | self.wordswhere = " TRUE " #Just a dummy thing to make the SQL writing easier. Shouldn't take any time.
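			#A rough sketch of the query this branch yields for a metadata-only (denominator)
			#search grouped by year; the exact catalog join depends on the limits in play:
			#  SELECT year, sum(nwords) as WordCount
			#  FROM fastcat
			#  WHERE <catwhere> AND TRUE
			#  GROUP BY year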
357 |
358 | def set_operations(self):
359 |
360 | """
361 | This is the code that allows multiple values to be selected.
362 | """
363 |
364 | backCompatability = {"Occurrences_per_Million_Words":"WordsPerMillion","Raw_Counts":"WordCount","Percentage_of_Books":"TextPercent","Number_of_Books":"TextCount"}
365 |
366 | for oldKey in backCompatability.keys():
367 | self.counttype = [re.sub(oldKey,backCompatability[oldKey],entry) for entry in self.counttype]
368 |
369 | self.bookoperation = {}
370 | self.catoperation = {}
371 | self.finaloperation = {}
372 |
373 | #Text statistics
374 | self.bookoperation['TextPercent'] = "count(DISTINCT " + self.prefs['fastcat'] + ".bookid) as TextCount"
375 | self.bookoperation['TextRatio'] = "count(DISTINCT " + self.prefs['fastcat'] + ".bookid) as TextCount"
376 | self.bookoperation['TextCount'] = "count(DISTINCT " + self.prefs['fastcat'] + ".bookid) as TextCount"
377 | #Word Statistics
378 | self.bookoperation['WordCount'] = "sum(main.count) as WordCount"
379 | self.bookoperation['WordsPerMillion'] = "sum(main.count) as WordCount"
380 | self.bookoperation['WordsRatio'] = "sum(main.count) as WordCount"
381 | """
382 | +Total Numbers for comparisons/significance assessments
383 | This is a little tricky. The total words is EITHER the denominator (as in a query against words per Million) or the numerator+denominator (if you're comparing
384 | 		Pittsburg and Pittsburgh, say, and want to know the total number of uses of the lemma). For now, "TotalWords" means the former and "SumWords" the latter,
385 | 		on the theory that 'TotalWords' is more intuitive and only I (Ben) will be using SumWords all that much.
386 | """
387 | self.bookoperation['TotalWords'] = self.bookoperation['WordsPerMillion']
388 | self.bookoperation['SumWords'] = self.bookoperation['WordsPerMillion']
389 | self.bookoperation['TotalTexts'] = self.bookoperation['TextCount']
390 | self.bookoperation['SumTexts'] = self.bookoperation['TextCount']
391 |
392 | for stattype in self.bookoperation.keys():
393 | if re.search("Word",stattype):
394 | self.catoperation[stattype] = "sum(nwords) as WordCount"
395 | if re.search("Text",stattype):
396 | self.catoperation[stattype] = "count(nwords) as TextCount"
397 |
398 | self.finaloperation['TextPercent'] = "IFNULL(numerator.TextCount,0)/IFNULL(denominator.TextCount,0)*100 as TextPercent"
399 | 		self.finaloperation['TextRatio'] = "IFNULL(numerator.TextCount,0)/IFNULL(denominator.TextCount,0) as TextRatio"
400 | self.finaloperation['TextCount'] = "IFNULL(numerator.TextCount,0) as TextCount"
401 |
402 | self.finaloperation['WordsPerMillion'] = "IFNULL(numerator.WordCount,0)*100000000/IFNULL(denominator.WordCount,0)/100 as WordsPerMillion"
403 | self.finaloperation['WordsRatio'] = "IFNULL(numerator.WordCount,0)/IFNULL(denominator.WordCount,0) as WordsRatio"
404 | self.finaloperation['WordCount'] = "IFNULL(numerator.WordCount,0) as WordCount"
405 |
406 | self.finaloperation['TotalWords'] = "IFNULL(denominator.WordCount,0) as TotalWords"
407 | self.finaloperation['SumWords'] = "IFNULL(denominator.WordCount,0) + IFNULL(numerator.WordCount,0) as SumWords"
408 | self.finaloperation['TotalTexts'] = "IFNULL(denominator.TextCount,0) as TotalTexts"
409 | self.finaloperation['SumTexts'] = "IFNULL(denominator.TextCount,0) + IFNULL(numerator.TextCount,0) as SumTexts"
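		#Worked numbers for the formulas above (purely illustrative values): if the numerator
		#sub-query finds WordCount = 500 and the denominator finds WordCount = 2,000,000, then
		#  WordsPerMillion = 500*100000000/2000000/100 = 250
		#  TotalWords = 2,000,000 and SumWords = 2,000,500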
410 |
411 | """
412 | The values here will be chosen in build_wordstables; that's what decides if it uses the 'bookoperation' or 'catoperation' dictionary to build out.
413 | """
414 |
415 | self.finaloperations = list()
416 | self.bookoperations = set()
417 | self.catoperations = set()
418 |
419 | for summaryStat in self.counttype:
420 | self.catoperations.add(self.catoperation[summaryStat])
421 | self.bookoperations.add(self.bookoperation[summaryStat])
422 | self.finaloperations.append(self.finaloperation[summaryStat])
423 |
424 | #self.catoperation
425 |
426 | def counts_query(self):
427 | #self.bookoperation = {"Occurrences_per_Million_Words":"sum(main.count)","Raw_Counts":"sum(main.count)","Percentage_of_Books":"count(DISTINCT " + self.prefs['fastcat'] + ".bookid)","Number_of_Books":"count(DISTINCT "+ self.prefs['fastcat'] + ".bookid)"}
428 | #self.catoperation = {"Occurrences_per_Million_Words":"sum(nwords)","Raw_Counts":"sum(nwords)","Percentage_of_Books":"count(nwords)","Number_of_Books":"count(nwords)"}
429 |
430 | self.operation = ','.join(self.bookoperations)
431 |
432 | self.build_wordstables()
433 | countsQuery = """
434 | SELECT
435 | %(selections)s,
436 | %(operation)s
437 | FROM
438 | %(catalog)s
439 | %(main)s
440 | %(wordstables)s
441 | WHERE
442 | %(catwhere)s AND %(wordswhere)s
443 | GROUP BY
444 | %(groupings)s
445 | """ % self.__dict__
446 | return countsQuery
447 |
448 | def ratio_query(self):
449 | #if True: #In the case that we're not using a superset of words; this can be changed later
450 | # supersetGroups = [group for group in self.groups if not re.match('word',group)]
451 | # self.finalgroupings = self.groupings
452 | # for key in self.limits.keys():
453 | # if re.match('word',key):
454 | # del self.limits[key]
455 |
456 | self.denominator = userquery(outside_dictionary = self.compare_dictionary)
457 | self.supersetquery = self.denominator.counts_query()
458 |
459 | if re.search("In_Library",self.denominator.selections):
460 | self.selections = self.selections + ", fastcat.bookid is not null as In_Library"
461 |
462 | #See above: In_Library is a dummy variable so that there's always something to join on.
463 | self.mainquery = self.counts_query()
464 |
465 | self.countcommand = ','.join(self.finaloperations)
466 |
467 | """
468 | We launch a whole new userquery instance here to build the denominator, based on the 'compare_dictionary' option (which in most
469 | 		cases is the search_limits without the word keys; see above).
470 | We then get the counts_query results out of that result.
471 | """
472 |
473 | self.totalMergeTerms = "USING (" + self.denominator.groupings + " ) "
474 | self.totalselections = ",".join([re.sub(".* as","",group) for group in self.groups])
475 |
476 | query = """
477 | SELECT
478 | %(totalselections)s,
479 | %(countcommand)s
480 | FROM
481 | ( %(mainquery)s
482 | ) as numerator
483 | RIGHT OUTER JOIN
484 | ( %(supersetquery)s ) as denominator
485 | %(totalMergeTerms)s
486 | GROUP BY %(groupings)s;""" % self.__dict__
487 | return query
488 |
489 |
490 | def return_slug_data(self,force=False):
491 | #Rather than understand this error, I'm just returning 0 if it fails.
492 | #Probably that's the right thing to do, though it may cause trouble later.
493 | #It's just a punishment for not later using a decent API call with "Raw_Counts" to extract these counts out, and relying on this ugly method.
494 | #Please, citizens of the future, NEVER USE THIS METHOD.
495 | try:
496 | temp_words = self.return_n_words(force = True)
497 | temp_counts = self.return_n_books(force = True)
498 | except:
499 | temp_words = 0
500 | temp_counts = 0
501 | return [temp_counts,temp_words]
502 |
503 | def return_n_books(self,force=False): #deprecated
504 | if (not hasattr(self,'nbooks')) or force:
505 | query = "SELECT count(*) from " + self.catalog + " WHERE " + self.catwhere
506 | silent = self.cursor.execute(query)
507 | 			self.nbooks = int(self.cursor.fetchall()[0][0])
508 | 		return self.nbooks
509 |
510 | def return_n_words(self,force=False): #deprecated
511 | if (not hasattr(self,'nwords')) or force:
512 | query = "SELECT sum(nwords) from " + self.catalog + " WHERE " + self.catwhere
513 | silent = self.cursor.execute(query)
514 | self.nwords = int(self.cursor.fetchall()[0][0])
515 | return self.nwords
516 |
517 | def ranked_query(self,percentile_to_return = 99,addwhere = ""):
518 | #NOT CURRENTLY IN USE ANYWHERE--DELETE???
519 | ##This returns a list of bookids in order by how well they match the sort terms.
520 | ## Using an IDF term will give better search results for case-sensitive searches, but is currently disabled
521 | ##
522 | self.LIMIT = int((100-percentile_to_return) * self.return_n_books()/100)
523 | countQuery = """
524 | SELECT
525 | bookid,
526 | sum(main.count*1000/nwords%(idfterm)s) as score
527 | FROM %(catalog)s LEFT JOIN %(tablename)s
528 | USING (bookid)
529 | WHERE %(catwhere)s AND %(wordswhere)s
530 | GROUP BY bookid
531 | ORDER BY score DESC
532 | LIMIT %(LIMIT)s
533 | """ % self.__dict__
534 | return countQuery
535 |
536 | def bibliography_query(self,limit = "100"):
537 | #I'd like to redo this at some point so it could work as an API call.
538 | self.limit = limit
539 | self.ordertype = "sum(main.count*10000/nwords)"
540 | try:
541 | if self.outside_dictionary['ordertype'] == "random":
542 | if self.counttype==["Raw_Counts"] or self.counttype==["Number_of_Books"] or self.counttype==['WordCount'] or self.counttype==['BookCount']:
543 | self.ordertype = "RAND()"
544 | else:
545 | self.ordertype = "LOG(1-RAND())/sum(main.count)"
546 | except KeyError:
547 | pass
548 |
549 | #If IDF searching is enabled, we could add a term like '*IDF' here to overweight better selecting words
550 | #in the event of a multiple search.
551 | self.idfterm = ""
552 | prep = self.counts_query()
553 |
554 | bibQuery = """
555 | SELECT searchstring
556 | 		FROM """ + self.prefs['fullcat'] + """ RIGHT JOIN (
557 | SELECT
558 | """+ self.prefs['fastcat'] + """.bookid, %(ordertype)s as ordering
559 | FROM
560 | %(catalog)s
561 | %(main)s
562 | %(wordstables)s
563 | WHERE
564 | %(catwhere)s AND %(wordswhere)s
565 | GROUP BY bookid ORDER BY %(ordertype)s DESC LIMIT %(limit)s
566 | ) as tmp USING(bookid) ORDER BY ordering DESC;
567 | """ % self.__dict__
568 | return bibQuery
569 |
570 | def disk_query(self,limit="100"):
571 | pass
572 |
573 | def return_books(self):
574 | #This preps up the display elements for a search: it returns an array with a single string for each book, sorted in the best possible way
575 | silent = self.cursor.execute(self.bibliography_query())
576 | returnarray = []
577 | for line in self.cursor.fetchall():
578 | returnarray.append(line[0])
579 | if not returnarray:
580 | #why would someone request a search with no locations? Turns out (usually) because the smoothing tricked them.
581 | returnarray.append("No results for this particular point: try again without smoothing")
582 | newerarray = self.custom_SearchString_additions(returnarray)
583 | return json.dumps(newerarray)
584 |
585 | def search_results(self):
586 | #This is an alias that is handled slightly differently in APIimplementation (no "RESULTS" bit in front). Once
587 | #that legacy code is cleared out, they can be one and the same.
588 | return json.loads(self.return_books())
589 |
590 | def getActualSearchedWords(self):
591 | if len(self.wordswhere) > 7:
592 | words = self.outside_dictionary['search_limits']['word']
593 | #Break bigrams into single words.
594 | words = ' '.join(words).split(' ')
595 | self.cursor.execute("""SELECT word FROM wordsheap WHERE """ + where_from_hash({self.word_field:words}))
596 | self.actualWords =[item[0] for item in self.cursor.fetchall()]
597 | else:
598 | self.actualWords = ["tasty","mistake","happened","here"]
599 |
600 | def custom_SearchString_additions(self,returnarray):
601 | db = self.outside_dictionary['database']
602 | if db in ('jstor','presidio','ChronAm','LOC'):
603 | self.getActualSearchedWords()
604 | if db=='jstor':
605 | joiner = "&searchText="
606 | preface = "?Search=yes&searchText="
607 | urlRegEx = "http://www.jstor.org/stable/\d+"
608 | if db=='presidio':
609 | joiner = "+"
610 | preface = "#page/1/mode/2up/search/"
611 | urlRegEx = 'http://archive.org/stream/[^"# ><]*'
612 | if db in ('ChronAm','LOC'):
613 | preface = "/;words="
614 | joiner = "+"
615 | urlRegEx = 'http://chroniclingamerica.loc.gov[^\"><]*/seq-\d+'
616 | newarray = []
617 | for string in returnarray:
618 | base = re.findall(urlRegEx,string)[0]
619 | newcore = ' search inside '
620 | string = re.sub("^","",string)
621 | string = re.sub(" | $","",string)
622 | string = string+newcore
623 | newarray.append(string)
624 | #Arxiv is messier, requiring a whole different URL interface: http://search.arxiv.org:8081/paper.jsp?r=1204.3352&qs=network
625 | else:
626 | newarray = returnarray
627 | return newarray
628 |
629 | def return_query_values(self,query = "ratio_query"):
630 | #The API returns a dictionary with years pointing to values.
631 | values = []
632 | querytext = getattr(self,query)()
633 | silent = self.cursor.execute(querytext)
634 | #Gets the results
635 | mydict = dict(self.cursor.fetchall())
636 | try:
637 | for key in mydict.keys():
638 | #Only return results inside the time limits
639 | if key >= self.time_limits[0] and key <= self.time_limits[1]:
640 | mydict[key] = str(mydict[key])
641 | else:
642 | del mydict[key]
643 | mydict = smooth_function(mydict,smooth_method = self.smoothingType,span = self.smoothingSpan)
644 |
645 | except:
646 | mydict = {0:"0"}
647 |
648 | #This is a good place to change some values.
649 | try:
650 | return {'index':self.index, 'Name':self.words_searched,"values":mydict,'words_searched':""}
651 | except:
652 | return{'values':mydict}
653 |
654 | def arrayNest(self,array,returnt,endLength=1):
655 | #A recursive function to transform a list into a nested array
656 | key = array[0]
657 | key = to_unicode(key)
658 | if len(array)==endLength+1:
659 | #This is the condition where we have the last two, which is where we no longer need to nest anymore:
660 | #it's just the last value[key] = value
661 | value = list(array[1:])
662 | for i in range(len(value)):
663 | try:
664 | value[i] = float(value[i])
665 | except:
666 | pass
667 | returnt[key] = value
668 | else:
669 | try:
670 | returnt[key] = self.arrayNest(array[1:len(array)],returnt[key],endLength=endLength)
671 | except KeyError:
672 | returnt[key] = self.arrayNest(array[1:len(array)],dict(),endLength=endLength)
673 | return returnt
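	#Illustrative call (invented values): a result row grouped by LCSH and year with one
	#counttype nests like this:
	#  arrayNest(("Fiction","1900",42.0),{},endLength=1) -> {u"Fiction": {u"1900": [42.0]}}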
674 |
675 | def return_json(self,query='ratio_query'):
676 | querytext = getattr(self,query)()
677 | silent = self.cursor.execute(querytext)
678 | names = [to_unicode(item[0]) for item in self.cursor.description]
679 | returnt = dict()
680 | lines = self.cursor.fetchall()
681 | for line in lines:
682 | returnt = self.arrayNest(line,returnt,endLength = len(self.counttype))
683 | return returnt
684 |
685 | def return_tsv(self,query = "ratio_query"):
686 | if self.outside_dictionary['counttype']=="Raw_Counts" or self.outside_dictionary['counttype']==["Raw_Counts"]:
687 | query="counts_query"
688 | #This allows much speedier access to counts data if you're willing not to know about all the zeroes.
689 | querytext = getattr(self,query)()
690 | silent = self.cursor.execute(querytext)
691 | results = ["\t".join([to_unicode(item[0]) for item in self.cursor.description])]
692 | lines = self.cursor.fetchall()
693 | for line in lines:
694 | items = []
695 | for item in line:
696 | item = to_unicode(item)
697 | item = re.sub("\t","",item)
698 | items.append(item)
699 | results.append("\t".join(items))
700 | return "\n".join(results)
701 |
702 | def export_data(self,query1="ratio_query"):
703 | self.smoothing=0
704 | return self.return_query_values(query=query1)
705 |
706 | def execute(self):
707 | #This performs the query using the method specified in the passed parameters.
708 | if self.method=="Nothing":
709 | pass
710 | else:
711 | return getattr(self,self.method)()
712 |
713 |
714 | #############
715 | ##GENERAL#### #These are general purpose functional types of things not implemented in the class.
716 | #############
717 |
718 | def to_unicode(obj, encoding='utf-8'):
719 | if isinstance(obj, basestring):
720 | if not isinstance(obj, unicode):
721 | obj = unicode(obj, encoding)
722 | elif isinstance(obj,int):
723 | obj=unicode(str(obj),encoding)
724 | else:
725 | obj = unicode(str(obj),encoding)
726 | return obj
727 |
728 | def where_from_hash(myhash,joiner=" AND ",comp = " = ",quotesep=None):
729 | whereterm = []
730 | #The general idea here is that we try to break everything in search_limits down to a list, and then create a whereterm on that joined by whatever the 'joiner' is ("AND" or "OR"), with the comparison as whatever comp is ("=",">=",etc.).
731 | #For more complicated bits, it gets all recursive until the bits are in terms of list.
732 | for key in myhash.keys():
733 | values = myhash[key]
734 | if isinstance(values,basestring) or isinstance(values,int) or isinstance(values,float):
735 | #This is just error handling. You can pass a single value instead of a list if you like, and it will just convert it
736 | #to a list for you.
737 | values = [values]
738 | #Or queries are special, since the default is "AND". This toggles that around for a subportion.
739 | if key=='$or' or key=="$OR":
740 | for comparison in values:
741 | whereterm.append(where_from_hash(comparison,joiner=" OR ",comp=comp))
742 | #The or doesn't get populated any farther down.
743 | elif isinstance(values,dict):
744 | 		#Certain function operators can use MySQL terms. These are the only cases where a dict can be passed as a limitation
745 | operations = {"$gt":">","$ne":"!=","$lt":"<","$grep":" REGEXP ","$gte":">=","$lte":"<=","$eq":"="}
746 | for operation in values.keys():
747 | whereterm.append(where_from_hash({key:values[operation]},comp=operations[operation],joiner=joiner))
748 | elif isinstance(values,list):
749 | #and this is where the magic actually happens
750 | if isinstance(values[0],dict):
751 | for entry in values:
752 | whereterm.append(where_from_hash(entry))
753 | else:
754 | if quotesep is None:
755 | if isinstance(values[0],basestring):
756 | quotesep="'"
757 | else:
758 | quotesep = ""
759 | #Note the "OR" here. There's no way to pass in a query like "year=1876 AND year=1898" as currently set up.
760 | #Obviously that's no great loss, but there might be something I'm missing that would be.
761 | whereterm.append(" (" + " OR ".join([" (" + key+comp+quotesep+to_unicode(value)+quotesep+") " for value in values])+ ") ")
762 | return "(" + joiner.join(whereterm) + ")"
763 | #This works pretty well, except that it requires very specific sorts of terms going in, I think.
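#Illustrative calls (invented field names and values; the exact whitespace differs, and
#key order follows dict iteration order):
#  where_from_hash({"year": [1900, 1901], "LCSH": ["Fiction"]})
#    -> "( ((year = 1900) OR (year = 1901)) AND ((LCSH = 'Fiction')) )"
#  where_from_hash({"year": {"$lt": 1900}})
#    -> "( ((year < 1900)) )"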
764 |
765 |
766 |
767 | #I'd rather have all this smoothing stuff done at the client side, but currently it happens here.
768 | def smooth_function(zinput,smooth_method = 'lowess',span = .05):
769 | if smooth_method not in ['lowess','triangle','rectangle']:
770 | return zinput
771 | xarray = []
772 | yarray = []
773 | years = zinput.keys()
774 | years.sort()
775 | for key in years:
776 | if zinput[key]!='None':
777 | xarray.append(float(key))
778 | yarray.append(float(zinput[key]))
779 | from numpy import array
780 | x = array(xarray)
781 | y = array(yarray)
782 | if smooth_method == 'lowess':
783 | 		#print "starting lowess smoothing"
784 | from Bio.Statistics.lowess import lowess
785 | smoothed = lowess(x,y,float(span)/100,3)
786 | x = [int(p) for p in x]
787 | returnval = dict(zip(x,smoothed))
788 | return returnval
789 | if smooth_method == 'rectangle':
790 | from math import log
791 | 		#print "starting rectangle smoothing"
792 | span = int(span) #Takes the floor--so no smoothing on a span < 1.
793 | returnval = zinput
794 | windowsize = span*2 + 1
795 | from numpy import average
796 | for i in range(len(xarray)):
797 | surrounding = array(range(windowsize),dtype=float)
798 | weights = array(range(windowsize),dtype=float)
799 | for j in range(windowsize):
800 | 				key_dist = j - span #if span is 2, the zeroth element is -2, the second element is 0 off, etc.
801 | workingon = i + key_dist
802 | if workingon >= 0 and workingon < len(xarray):
803 | surrounding[j] = float(yarray[workingon])
804 | weights[j] = 1
805 | else:
806 | surrounding[j] = 0
807 | weights[j] = 0
808 | returnval[xarray[i]] = round(average(surrounding,weights=weights),3)
809 | return returnval
810 | if smooth_method == 'triangle':
811 | from math import log
812 | 		#print "starting triangle smoothing"
813 | span = int(span) #Takes the floor--so no smoothing on a span < 1.
814 | returnval = zinput
815 | windowsize = span*2 + 1
816 | from numpy import average
817 | for i in range(len(xarray)):
818 | surrounding = array(range(windowsize),dtype=float)
819 | weights = array(range(windowsize),dtype=float)
820 | for j in range(windowsize):
821 | 				key_dist = j - span #if span is 2, the zeroth element is -2, the second element is 0 off, etc.
822 | workingon = i + key_dist
823 | if workingon >= 0 and workingon < len(xarray):
824 | surrounding[j] = float(yarray[workingon])
825 | #This isn't actually triangular smoothing: I dampen it by the logs, to keep the peaks from being too too big.
826 | 					#The minimum is '2', since log(1) == 0, which is a nonsense weight.
827 | weights[j] = log(span + 2 - abs(key_dist))
828 | else:
829 | surrounding[j] = 0
830 | weights[j] = 0
831 |
832 | returnval[xarray[i]] = round(average(surrounding,weights=weights),3)
833 | return returnval
834 |
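#Worked numbers for the 'rectangle' branch (illustrative): with span=1 the window is three
#years wide, and edge years just average over whichever neighbors exist, so
#  smooth_function({1900: 1.0, 1901: 2.0, 1902: 3.0}, smooth_method='rectangle', span=1)
#    -> {1900: 1.5, 1901: 2.0, 1902: 2.5}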
835 | #The idea is: this works by default by slurping up from the command line, but you could also load the functions in and run results on your own queries.
836 | try:
837 | command = str(sys.argv[1])
838 | command = json.loads(command)
839 | #Got to go before we let anything else happen.
840 | print command
841 | p = userqueries(command)
842 | result = p.execute()
843 | print json.dumps(result)
844 | except:
845 | pass
846 |
847 |
--------------------------------------------------------------------------------
/bookworm/MetaWorm.py:
--------------------------------------------------------------------------------
1 | import pandas
2 | import json
3 | import copy
4 | import threading
5 | import time
6 | from collections import defaultdict
7 |
8 | def hostlist(dblist):
9 | #This could do something fancier, but for now we look by default only on localhost.
10 | return ["localhost"]*len(dblist)
11 |
12 | class childQuery(threading.Thread):
13 | def __init__(self,dictJSON,host):
14 |         super(childQuery, self).__init__()
15 |         self.dict = json.dumps(dictJSON)
16 | self.host = host
17 |
18 | def runQuery(self):
19 | #make a webquery, assign it to self.data
20 | url = self.host + "/cgi-bin/bookwormAPI?query=" + self.dict
21 |
22 | def parseResults(self):
23 | pass
24 | #return json.loads(self.data)
25 |
26 | def run(self):
27 | self.runQuery()
28 |
29 | def flatten(dictOfdicts):
30 | """
31 | Recursive function: transforms a dict with nested entries like
32 | foo["a"]["b"]["c"] = 3
33 | to one with tuple entries like
34 | fooPrime[("a","b","c")] = 3
35 | """
36 | output = []
37 | for (key,value) in dictOfdicts.iteritems():
38 |         if not isinstance(value,dict):
39 |             output.append([(key,),value])
40 | else:
41 | children = flatten(value)
42 | for child in children:
43 | output.append([(key,) + child[0],child[1]])
44 | return output
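#Illustrative (invented keys and values); ordering follows dict iteration order:
#  flatten({"1900": {"Fiction": [5, 2]}, "1901": [7, 3]})
#    -> [[("1900", "Fiction"), [5, 2]], [("1901",), [7, 3]]]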
45 |
46 | def animate(dictOfTuples):
47 | """
48 | opposite of flatten
49 | """
50 |
51 | def tree():
52 | return defaultdict(tree)
53 |
54 | output = defaultdict(tree)
55 |
56 |
57 |
58 | def combineDicts(master,new):
59 | """
60 | instead of a dict of dicts of arbitrary depth, use a dict of tuples to store.
61 | """
62 |
63 | for (keysequence, valuesequence) in flatten(new):
64 | try:
65 | master[keysequence] = map(sum,zip(master[keysequence],valuesequence))
66 | except KeyError:
67 | master[keysequence] = valuesequence
68 |     return master
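#Illustrative (invented values): merging one host's nested counts into the running totals:
#  combineDicts({("1900",): [5]}, {"1900": [3]}) -> {("1900",): [8]}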
69 |
70 | class MetaQuery(object):
71 | def __init__(self,dictJSON):
72 |         self.outside_dictionary = dictJSON
73 |
74 | def setDefaults(self):
75 | for specialKey in ["database","host"]:
76 | try:
77 | if isinstance(self.outside_dictionary[specialKey],basestring):
78 | #coerce strings to list:
79 | self.outside_dictionary[specialKey] = [self.outside_dictionary[specialKey]]
80 | except KeyError:
81 | #It's OK not to define host.
82 | if specialKey=="host":
83 | pass
84 |
85 | if 'host' not in self.outside_dictionary:
86 | #Build a hostlist: usually just localhost a bunch of times.
87 | self.outside_dictionary['host'] = hostlist(self.outside_dictionary['database'])
88 |
89 | for (target, dest) in [("database","host"),("host","database")]:
90 | #Expand out so you can search for the same database on multiple databases, or multiple databases on the same host.
91 | if len(self.outside_dictionary[target])==1 and len(self.outside_dictionary[dest]) != 1:
92 | self.outside_dictionary[target] = self.outside_dictionary[target] * len(self.outside_dictionary[dest])
93 |
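        #Illustrative (hypothetical database names): {"database": ["bw1", "bw2"]} with no
        #'host' key gets host = ["localhost", "localhost"], while {"database": ["bw1"],
        #"host": ["serverA", "serverB"]} gets database expanded to ["bw1", "bw1"].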
94 |
95 | def buildChildren(self):
96 | desiredCounts = []
97 | for (host,dbname) in zip(self.outside_dictionary["host"],self.outside_dictionary["database"]):
98 | query = copy.deepcopy(self.outside_dictionary)
99 | del(query['host'])
100 | query['database'] = dbname
101 |
102 | desiredCounts.append(childQuery(query,host))
103 | self.children = desiredCounts
104 |
105 | def runChildren(self):
106 | for child in self.children:
107 | child.start()
108 |
109 | def combineChildren(self):
110 | complete = dict()
111 |         while any(child.is_alive() for child in self.children):
112 |             time.sleep(.05)
113 |         for child in self.children:
114 |             complete = combineDicts(complete,child.parseResults())
115 |         return complete
116 |
117 | def return_json(self):
118 | pass
119 |
120 |
121 |
122 |
--------------------------------------------------------------------------------
/bookworm/SQLAPI.py:
--------------------------------------------------------------------------------
1 | #!/usr/local/bin/python
2 |
3 | import sys
4 | import json
5 | import cgi
6 | import re
7 | import numpy #used for smoothing.
8 | import copy
9 | import decimal
10 | import MySQLdb
11 | import warnings
12 | import hashlib
13 |
14 | """
15 | There are 'fast' and 'full' tables for books and words;
16 | that's so memory tables can be used in certain cases for fast, hashed matching, while longer-form data (like book titles)
17 | can be stored on disk. Different queries use different types of calls.
18 | Also, certain metadata fields are stored separately from the main catalog table.
19 | """
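#A sketch of what a general_prefs entry (imported below from knownHosts.py) is assumed to
#look like, judging only from the keys this module reads; the real values live in knownHosts.py:
#  {'HOST': 'localhost', 'read_default_file': '/etc/mysql/my.cnf',
#   'fastcat': 'fastcat', 'fullcat': 'catalog',
#   'fastword': 'wordsheap', 'fullword': 'words'}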
20 |
21 | from knownHosts import *
22 |
23 | class dbConnect(object):
24 | #This is a read-only account
25 | def __init__(self,prefs):
26 | self.dbname = prefs['database']
27 | self.db = MySQLdb.connect(host=prefs['HOST'],read_default_file = prefs['read_default_file'],use_unicode='True',charset='utf8',db=prefs['database'])
28 | self.cursor = self.db.cursor()
29 |
30 | # The basic object here is a 'userquery': it takes a dictionary as input, as defined in the API, and returns a value
31 | # via the 'execute' function whose behavior
32 | # depends on the mode that is passed to it.
33 | # Given the dictionary, it can return a number of objects.
34 | # The "Search_limits" array in the passed dictionary determines how many elements it returns; this lets multiple queries be bundled together.
35 | # Most functions describe a subquery that might be combined into one big query in various ways.
36 |
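# A sketch of a typical call, using field names that appear elsewhere in this module
# (the database name and search terms are illustrative):
#   query = {"database": "presidio", "method": "return_json",
#            "counttype": ["WordCount"], "groups": ["year"],
#            "search_limits": {"word": ["whale"], "LCSH": ["Fiction"]}}
#   userqueries(query).execute()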
37 | class userqueries:
38 | #This is a set of userqueries that are bound together; each element in search limits is iterated over, and we're done.
39 | #currently used for various different groups sent in a bundle (multiple lines on a Bookworm chart).
40 | #A sufficiently sophisticated 'group by' search might make this unnecessary.
41 | #But until that day, it's useful to be able to return lists of elements, which happens in here.
42 |
43 | def __init__(self,outside_dictionary = {"counttype":["Percentage_of_Books"],"search_limits":[{"word":["polka dot"],"LCSH":["Fiction"]}]},db = None):
44 | try:
45 | self.database = outside_dictionary.setdefault('database', 'default')
46 | prefs = general_prefs[self.database]
47 | except KeyError: #If it's not in the option, use some default preferences and search on localhost. This will work in most cases here on out.
48 | prefs = general_prefs['default']
49 | prefs['database'] = self.database
50 | self.prefs = prefs
51 |
52 | self.wordsheap = prefs['fastword']
53 | self.words = prefs['fullword']
54 | if 'search_limits' not in outside_dictionary.keys():
55 | outside_dictionary['search_limits'] = [{}]
56 | #coerce one-element dictionaries to an array.
57 | if isinstance(outside_dictionary['search_limits'],dict):
58 | #(allowing passing of just single dictionaries instead of arrays)
59 | outside_dictionary['search_limits'] = [outside_dictionary['search_limits']]
60 | self.returnval = []
61 | self.queryInstances = []
62 | db = dbConnect(prefs)
63 | databaseScheme = databaseSchema(db)
64 | for limits in outside_dictionary['search_limits']:
65 | mylimits = copy.deepcopy(outside_dictionary)
66 | mylimits['search_limits'] = limits
67 | localQuery = userquery(mylimits,db=db,databaseScheme=databaseScheme)
68 | self.queryInstances.append(localQuery)
69 | self.returnval.append(localQuery.execute())
70 |
71 | def execute(self):
72 |
73 | return self.returnval
74 |
75 |
76 | class userquery:
77 | def __init__(self,outside_dictionary = {"counttype":["Percentage_of_Books"],"search_limits":{"word":["polka dot"],"LCSH":["Fiction"]}},db=None,databaseScheme=None):
78 | #Certain constructions require a DB connection already available, so we just start it here, or use the one passed to it.
79 | try:
80 | self.prefs = general_prefs[outside_dictionary['database']]
81 | except KeyError:
82 | #If it's not in the option, use some default preferences and search on localhost. This will work in most cases here on out.
83 | self.prefs = general_prefs['default']
84 | self.prefs['database'] = outside_dictionary['database']
85 | self.outside_dictionary = outside_dictionary
86 | #self.prefs = general_prefs[outside_dictionary.setdefault('database','presidio')]
87 | self.db = db
88 | if db is None:
89 | self.db = dbConnect(self.prefs)
90 | self.databaseScheme = databaseScheme
91 | if databaseScheme is None:
92 | self.databaseScheme = databaseSchema(self.db)
93 |
94 | self.cursor = self.db.cursor
95 | self.wordsheap = self.prefs['fastword']
96 | self.words = self.prefs['fullword']
97 | """
98 | I'm now allowing 'search_limits' to either be a dictionary or an array of dictionaries:
99 | this makes the syntax cleaner on most queries,
100 | while still allowing some long ones from the Bookworm website.
101 | """
102 | try:
103 | if isinstance(outside_dictionary['search_limits'],list):
104 | outside_dictionary['search_limits'] = outside_dictionary['search_limits'][0]
105 | except:
106 | outside_dictionary['search_limits'] = dict()
107 | #outside_dictionary = self.limitCategoricalQueries(outside_dictionary)
108 | self.defaults(outside_dictionary) #Take some defaults
109 | self.derive_variables() #Derive some useful variables that the query will use.
110 |
111 | def defaults(self,outside_dictionary):
112 | #these are default values;these are the only values that can be set in the query
113 | #search_limits is an array of dictionaries;
114 | #each one contains a set of limits that are mutually independent
115 | #The other limitations are universal for all the search limits being set.
116 |
117 | #Set up a dictionary for the denominator of any fraction if it doesn't already exist:
118 | self.search_limits = outside_dictionary.setdefault('search_limits',[{"word":["polka dot"]}])
119 | self.words_collation = outside_dictionary.setdefault('words_collation',"Case_Insensitive")
120 |
121 | lookups = {"Case_Insensitive":'word','lowercase':'lowercase','casesens':'casesens',"case_insensitive":"word","Case_Sensitive":"casesens","All_Words_with_Same_Stem":"stem",'stem':'stem'}
122 | self.word_field = str(MySQLdb.escape_string(lookups[self.words_collation]))
123 |
124 | self.time_limits = outside_dictionary.setdefault('time_limits',[0,10000000])
125 | self.time_measure = outside_dictionary.setdefault('time_measure','year')
126 |
127 | self.groups = set()
128 | self.outerGroups = [] #[] #Only used on the final join; directionality matters, unlike for the other ones.
129 | self.finalMergeTables=set()
130 | try:
131 | groups = outside_dictionary['groups']
132 | except:
133 | groups = [outside_dictionary['time_measure']]
134 |
135 | if groups == [] or groups == ["unigram"]:
136 | #Set an arbitrary column name that will always be true if nothing else is set.
137 | groups.insert(0,"1 as In_Library")
138 |
139 | if (len (groups) > 1):
140 | pass
141 | #self.groups = credentialCheckandClean(self.groups)
142 | #Define some sort of limitations here, if not done in dbbindings.py
143 |
144 | for group in groups:
145 |
146 | #There's a special set of rules for how to handle unigram and bigrams
147 | multigramSearch = re.match("(unigram|bigram|trigram)(\d)?",group)
148 |
149 | if multigramSearch:
150 | if group=="unigram":
151 | gramPos = "1"
152 | gramType = "unigram"
153 |
154 | else:
155 | gramType = multigramSearch.groups()[0]
156 | try:
157 | gramPos = multigramSearch.groups()[1]
158 | except:
159 | print "currently you must specify which bigram element you want (eg, 'bigram1')"
160 | raise
161 |
162 | lookupTableName = "%sLookup%s" %(gramType,gramPos)
163 | self.outerGroups.append("%s.%s as %s" %(lookupTableName,self.word_field,group))
164 | self.finalMergeTables.add(" JOIN wordsheap as %s ON %s.wordid=w%s" %(lookupTableName,lookupTableName,gramPos))
165 | self.groups.add("words%s.wordid as w%s" %(gramPos,gramPos))
166 |
167 | else:
168 | self.outerGroups.append(group)
169 | try:
170 | if self.databaseScheme.aliases[group] != group:
171 | #Search on the ID field, not the basic field.
172 | #debug(self.databaseScheme.aliases.keys())
173 | self.groups.add(self.databaseScheme.aliases[group])
174 | table = self.databaseScheme.tableToLookIn[group]
175 |
176 | joinfield = self.databaseScheme.aliases[group]
177 | self.finalMergeTables.add(" JOIN " + table + " USING (" + joinfield + ") ")
178 | else:
179 | self.groups.add(group)
180 | except KeyError:
181 | self.groups.add(group)
182 |
183 | """
184 | 		There are the selections, which can include table refs, and the groupings, which may not;
185 | 		and the final join suffix to enable fast lookup.
186 | """
187 |
188 | self.selections = ",".join(self.groups)
189 | self.groupings = ",".join([re.sub(".* as","",group) for group in self.groups])
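		#Illustrative: an aliased entry like "1 as In_Library" is selected verbatim but grouped
		#by its bare alias, so selections contains "1 as In_Library" while groupings contains
		#" In_Library" (the re.sub strips everything up to the " as").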
190 |
191 | self.joinSuffix = "" + " ".join(self.finalMergeTables)
192 |
193 | """
194 | Define the comparison set if a comparison is being done.
195 | """
196 | #Deprecated--tagged for deletion
197 | #self.determineOutsideDictionary()
198 |
199 | #This is a little tricky behavior here--hopefully it works in all cases. It drops out word groupings.
200 |
201 | self.counttype = outside_dictionary.setdefault('counttype',["WordCount"])
202 |
203 | if isinstance(self.counttype,basestring):
204 | self.counttype = [self.counttype]
205 |
206 | #index is deprecated,but the old version uses it.
207 | self.index = outside_dictionary.setdefault('index',0)
208 | """
209 | 	#Ordinarily, the input should be an array of groups that will both select and group by.
210 | #The joins may be screwed up by certain names that exist in multiple tables, so there's an option to do something like
211 | #SELECT catalog.bookid as myid, because WHERE clauses on myid will work but GROUP BY clauses on catalog.bookid may not
212 | #after a sufficiently large number of subqueries.
213 | #This smoothing code really ought to go somewhere else, since it doesn't quite fit into the whole API mentality and is
214 | #more about the webpage. It is only included here as a stopgap: NO FURTHER APPLICATIONS USING IT SHOULD BE BUILT.
215 | """
216 |
217 | self.smoothingType = outside_dictionary.setdefault('smoothingType',"triangle")
218 | self.smoothingSpan = outside_dictionary.setdefault('smoothingSpan',3)
219 | self.method = outside_dictionary.setdefault('method',"Nothing")
220 |
221 | def determineOutsideDictionary(self):
222 | """
223 | deprecated--tagged for deletion.
224 | """
225 | self.compare_dictionary = copy.deepcopy(self.outside_dictionary)
226 | if 'compare_limits' in self.outside_dictionary.keys():
227 | self.compare_dictionary['search_limits'] = self.outside_dictionary['compare_limits']
228 | del self.outside_dictionary['compare_limits']
229 | elif sum([bool(re.search(r'\*',string)) for string in self.outside_dictionary['search_limits'].keys()]) > 0:
230 | #If any keys have stars at the end, drop them from the compare set
231 | #This is often a _very_ helpful definition for succinct comparison queries of many types.
232 | #The cost is that an asterisk doesn't allow you
233 |
234 | for key in self.outside_dictionary['search_limits'].keys():
235 | if re.search(r'\*',key):
236 | #rename the main one to not have a star
237 | self.outside_dictionary['search_limits'][re.sub(r'\*','',key)] = self.outside_dictionary['search_limits'][key]
238 | #drop it from the compare_limits and delete the version in the search_limits with a star
239 | del self.outside_dictionary['search_limits'][key]
240 | del self.compare_dictionary['search_limits'][key]
241 | else: #if nothing specified, we compare the word to the corpus.
242 | deleted = False
243 | for key in self.outside_dictionary['search_limits'].keys():
244 | if re.search('words?\d',key) or re.search('gram$',key) or re.match(r'word',key):
245 | del self.compare_dictionary['search_limits'][key]
246 | deleted = True
247 | if not deleted:
248 | #If there are no words keys, just delete the first key of any type.
249 | #Sort order can't be assumed, but this is a useful failure mechanism of last resort. Maybe.
250 | try:
251 | del self.compare_dictionary['search_limits'][self.outside_dictionary['search_limits'].keys()[0]]
252 | except:
253 | pass
254 | """
255 | The grouping behavior here is not desirable, but I'm not quite sure how yet.
256 | Aha--one way is that it accidentally drops out a bunch of options. I'm just disabling it: let's see what goes wrong now.
257 | """
258 | try:
259 | pass#self.compare_dictionary['groups'] = [group for group in self.compare_dictionary['groups'] if not re.match('word',group) and not re.match("[u]?[bn]igram",group)]# topicfix? and not re.match("topic",group)]
260 | except:
261 | self.compare_dictionary['groups'] = [self.compare_dictionary['time_measure']]
262 |
263 |
264 | def derive_variables(self):
265 | #These are locally useful, and depend on the search limits put in.
266 | self.limits = self.search_limits
267 | #Treat empty constraints as nothing at all, not as full restrictions.
268 | for key in self.limits.keys():
269 | if self.limits[key] == []:
270 | del self.limits[key]
271 | self.set_operations()
272 | self.create_catalog_table()
273 | self.make_catwhere()
274 | self.make_wordwheres()
275 |
276 | def tablesNeededForQuery(self,fieldNames=[]):
277 | db = self.db
278 | neededTables = set()
279 | tablenames = dict()
280 | tableDepends = dict()
281 | db.cursor.execute("SELECT dbname,alias,tablename,dependsOn FROM masterVariableTable JOIN masterTableTable USING (tablename);")
282 | for row in db.cursor.fetchall():
283 | tablenames[row[0]] = row[2]
284 | tableDepends[row[2]] = row[3]
285 |
286 | for fieldname in fieldNames:
287 | parent = ""
288 | try:
289 | current = tablenames[fieldname]
290 | neededTables.add(current)
291 | n = 1
292 | while parent not in ['fastcat','wordsheap']:
293 | parent = tableDepends[current]
294 | neededTables.add(parent)
295 | current = parent;
296 | n+=1
297 | if n > 100:
298 | raise TypeError("Unable to handle this; seems like a recursion loop in the table definitions.")
299 | #This will add 'fastcat' or 'wordsheap' exactly once per entry
300 | except KeyError:
301 | pass
302 |
303 | return neededTables
304 |
305 | def create_catalog_table(self):
306 | self.catalog = self.prefs['fastcat'] #'catalog' #Can be replaced with a more complicated query in the event of longer joins.
307 |
308 | """
309 | This should check query constraints against a list of tables, and join to them.
310 | So if you query with a limit on LCSH, and LCSH is listed as being in a separate table,
311 | it joins the table "LCSH" to catalog; and then that table has one column, ALSO
312 | called "LCSH", which is matched against. This allows a bookid to be a member of multiple catalogs.
313 | """
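		#Illustrative: a query limited on LCSH, where LCSH lives in its own table, ends up with
		#something like self.catalog = "fastcat NATURAL JOIN LCSH ", built by the loop at the
		#end of this method.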
314 |
315 | #for limitation in self.prefs['separateDataTables']:
316 | # #That re.sub thing is in here because sometimes I do queries that involve renaming.
317 | # if limitation in [re.sub(" .*","",key) for key in self.limits.keys()] or limitation in [re.sub(" .*","",group) for group in self.groups]:
318 | # self.catalog = self.catalog + """ JOIN """ + limitation + """ USING (bookid)"""
319 |
320 | """
321 | Here it just pulls every variable and where to look for it.
322 | """
323 |
324 |
325 | self.relevantTables = set()
326 |
327 | databaseScheme = self.databaseScheme
328 | columns = []
329 | for columnInQuery in [re.sub(" .*","",key) for key in self.limits.keys()] + [re.sub(" .*","",group) for group in self.groups]:
330 | columns.append(columnInQuery)
331 | try:
332 | self.relevantTables.add(databaseScheme.tableToLookIn[columnInQuery])
333 | try:
334 | self.relevantTables.add(databaseScheme.tableToLookIn[databaseScheme.anchorFields[columnInQuery]])
335 | try:
336 | self.relevantTables.add(databaseScheme.tableToLookIn[databaseScheme.anchorFields[databaseScheme.anchorFields[columnInQuery]]])
337 | except KeyError:
338 | pass
339 | except KeyError:
340 | pass
341 | except KeyError:
342 | pass
343 | 				#Could raise as well--shouldn't be errors--but this helps back-compatibility.
344 |
345 | # if "catalog" in self.relevantTables and self.method != "bibliography_query":
346 | # self.relevantTables.remove('catalog')
347 | try:
348 | moreTables = self.tablesNeededForQuery(columns)
349 | except MySQLdb.ProgrammingError:
350 | #What happens on old-style Bookworm constructions.
351 | moreTables = set()
352 | self.relevantTables = self.relevantTables.union(moreTables)
353 | self.catalog = "fastcat"
354 | for table in self.relevantTables:
355 | if table!="fastcat" and table!="words" and table!="wordsheap" and table!="master_bookcounts" and table!="master_bigrams":
356 | self.catalog = self.catalog + """ NATURAL JOIN """ + table + " "
357 |
358 | def make_catwhere(self):
359 | #Where terms that don't include the words table join. Kept separate so that we can have subqueries only working on one half of the stack.
360 | catlimits = dict()
361 | for key in self.limits.keys():
362 | 			###Warning--none of these phrases can be used in a bookworm as custom table names.
363 | if key not in ('word','word1','word2','hasword') and not re.search("words\d",key):
364 | catlimits[key] = self.limits[key]
365 | if len(catlimits.keys()) > 0:
366 | self.catwhere = where_from_hash(catlimits)
367 | else:
368 | self.catwhere = "TRUE"
369 | if 'hasword' in self.limits.keys():
370 | """
371 | Because derived tables don't carry indexes, we're just making the new tables
372 | with indexes on the fly to be stored in a temporary database, "bookworm_scratch"
373 | Each time a hasword query is performed, the results of that query are permanently cached;
374 | they're stored as a table that can be used in the future.
375 |
376 | This will create problems if database contents are changed; there needs to be some mechanism for
377 | clearing out the cache periodically.
378 | """
379 |
380 | if self.limits['hasword'] == []:
381 | del self.limits['hasword']
382 | return
383 |
384 | #deepcopy lets us get a real copy of the dictionary
385 | #that can be changed without affecting the old one.
386 | mydict = copy.deepcopy(self.outside_dictionary)
387 | # This may make it take longer than it should; we might want the list to
388 | # just be every bookid with the given word rather than
389 | # filtering by the limits as well.
390 | # It's not obvious to me which will be faster.
391 | mydict['search_limits'] = copy.deepcopy(self.limits)
392 | if isinstance(mydict['search_limits']['hasword'],basestring):
393 | #Make sure it's an array
394 | mydict['search_limits']['hasword'] = [mydict['search_limits']['hasword']]
395 | """
396 | 			Ideally, this would shuffle into an order ensuring that the
397 | 			rarest words were nested deepest.
398 | 			That would speed up query execution by ensuring there
399 | 			wasn't some massive search for 'the' being
400 | 			done at the end.
401 | 
402 | 			Instead, it just pops off the last element and sets up a
403 | 			recursive nested join for every element in the
404 | 			array.
405 | """
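			#Illustrative: search_limits of {"hasword": ["whale", "ship"]} becomes a sub-query
			#with {"word": ["ship"], "hasword": ["whale"]}; that sub-query recurses on "whale",
			#and each level contributes one cached bookid table to the catalog join.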
406 | mydict['search_limits']['word'] = [mydict['search_limits']['hasword'].pop()]
407 | if len(mydict['search_limits']['hasword'])==0:
408 | del mydict['search_limits']['hasword']
409 | tempquery = userquery(mydict,databaseScheme=self.databaseScheme)
410 | listofBookids = tempquery.bookid_query()
411 |
412 | #Unique identifier for the query that persists across the
413 | #various subqueries.
414 | queryID = hashlib.sha1(listofBookids).hexdigest()[:20]
415 |
416 | tmpcatalog = "bookworm_scratch.tmp" + re.sub("-","",queryID)
417 |
418 | try:
419 | self.cursor.execute("CREATE TABLE %s (bookid MEDIUMINT, PRIMARY KEY (bookid)) ENGINE=MYISAM;" %tmpcatalog)
420 | self.cursor.execute("INSERT IGNORE INTO %s %s;" %(tmpcatalog,listofBookids))
421 |
422 | except MySQLdb.OperationalError,e:
423 | #Usually the error will be 1050, which is a good thing: it means we don't need to
424 | #create the table.
425 | #If it's not, something bad is happening.
426 | if not re.search("1050.*already exists",str(e)):
427 | raise
428 | self.catalog += " NATURAL JOIN %s "%(tmpcatalog)
429 |
430 |
431 | def make_wordwheres(self):
432 | self.wordswhere = " TRUE "
433 | self.max_word_length = 0
434 | limits = []
435 | """
436 | "unigram" or "bigram" can be used as an alias for "word" in the search_limits field.
437 | """
438 |
439 | for gramterm in ['unigram','bigram']:
440 | if gramterm in self.limits.keys() and not "word" in self.limits.keys():
441 | self.limits['word'] = self.limits[gramterm]
442 | del self.limits[gramterm]
443 |
444 | if 'word' in self.limits.keys():
445 | """
446 | This doesn't currently allow mixing of one and two word searches together in a logical way.
447 | It might be possible to just join on both the tables in MySQL--I'm not completely sure what would happen.
448 | But the philosophy has been to keep users from doing those searches as far as possible in any case.
449 | """
450 | for phrase in self.limits['word']:
451 | locallimits = dict()
452 | array = phrase.split(" ")
453 | n = 0
454 | for word in array:
455 | n += 1
456 | searchingFor = word
457 | if self.word_field=="stem":
458 | from nltk import PorterStemmer
459 | searchingFor = PorterStemmer().stem_word(searchingFor)
460 | if self.word_field=="case_insensitive" or self.word_field=="Case_Insensitive":
461 | searchingFor = searchingFor.lower()
462 |
463 | selectString = "SELECT wordid FROM wordsheap WHERE %s = %%s" % self.word_field
464 | cursor = self.db.cursor
465 | try:
466 | 						cursor.execute(selectString, (searchingFor,))
467 | except MySQLdb.Error, e:
468 | # Return HTML error code and log the following
469 | # print e
470 | # print cursor._last_executed
471 | print ''
472 | for row in cursor.fetchall():
473 | wordid = row[0]
474 | try:
475 | locallimits['words'+str(n) + ".wordid"] += [wordid]
476 | except KeyError:
477 | locallimits['words'+str(n) + ".wordid"] = [wordid]
478 | self.max_word_length = max(self.max_word_length,n)
479 |
480 | #Strings have already been escaped, so don't need to be escaped again.
481 | if len(locallimits.keys()) > 0:
482 | limits.append(where_from_hash(locallimits,comp = " = ",escapeStrings=False))
483 | 				#XXX for backward compatibility
484 | self.words_searched = phrase
485 | #XXX end deprecated block
486 | self.wordswhere = "(" + ' OR '.join(limits) + ")"
487 | if limits == []:
488 | #In the case that nothing has been found, tell it explicitly to search for
489 | #a condition when nothing will be found.
490 | self.wordswhere = "words1.wordid=-1"
491 |
492 |
493 | wordlimits = dict()
494 |
495 | limitlist = copy.deepcopy(self.limits.keys())
496 |
497 | for key in limitlist:
498 | if re.search("words\d",key):
499 | wordlimits[key] = self.limits[key]
500 | self.max_word_length = max(self.max_word_length,2)
501 | del self.limits[key]
502 |
503 | if len(wordlimits.keys()) > 0:
504 | self.wordswhere = where_from_hash(wordlimits)
505 |
506 | return self.wordswhere
507 |
508 | def build_wordstables(self):
509 | 		#Deduce the words tables we're joining against. The iteration here could be made more general to handle 3- or 4-grams pretty easily.
510 | #This relies on a determination already having been made about whether this is a unigram or bigram search; that's reflected in the self.selections
511 | #variable.
512 |
513 |
514 | """
515 | We also now check for whether it needs the topic assignments: this could be generalized, with difficulty, for any other kind of plugin.
516 | """
517 |
518 | needsBigrams = (self.max_word_length == 2 or re.search("words2",self.selections))
519 | needsUnigrams = self.max_word_length == 1 or re.search("[^h][^a][^s]word",self.selections)
520 | needsTopics = bool(re.search("topic",self.selections)) or ("topic" in self.limits.keys())
521 |
522 | if needsBigrams:
523 |
524 | self.maintable = 'master_bigrams'
525 |
526 | self.main = '''
527 | JOIN
528 | master_bigrams as main
529 | ON ('''+ self.prefs['fastcat'] +'''.bookid=main.bookid)
530 | '''
531 |
532 | self.wordstables = """
533 | JOIN %(wordsheap)s as words1 ON (main.word1 = words1.wordid)
534 | JOIN %(wordsheap)s as words2 ON (main.word2 = words2.wordid) """ % self.__dict__
535 |
536 | 			#I use a regex here to do a blanket search for any sort of word limitations. That has some messy side effects (make sure the 'hasword'
537 | #key has already been eliminated, for example!) but generally works.
538 |
539 | elif needsTopics and needsUnigrams:
540 | self.maintable = 'master_topicWords'
541 | self.main = '''
542 | NATURAL JOIN
543 | master_topicWords as main
544 | '''
545 | self.wordstables = """
546 | JOIN ( %(wordsheap)s as words1) ON (main.wordid = words1.wordid)
547 | """ % self.__dict__
548 |
549 | elif needsUnigrams:
550 | self.maintable = 'master_bookcounts'
551 | self.main = '''
552 | NATURAL JOIN
553 | master_bookcounts as main
554 | '''
555 | #ON (''' + self.prefs['fastcat'] + '''.bookid=main.bookid)'''
556 | self.wordstables = """
557 | JOIN ( %(wordsheap)s as words1) ON (main.wordid = words1.wordid)
558 | """ % self.__dict__
559 |
560 | elif needsTopics:
561 | self.maintable = 'master_topicCounts'
562 | self.main = '''
563 | NATURAL JOIN
564 | master_topicCounts as main '''
565 | self.wordstables = " "
566 | self.wordswhere = " TRUE "
567 |
568 | else:
569 | """
570 | Have _no_ words table if no words searched for or grouped by;
571 | instead just use nwords. This
572 | means that we can use the same basic functions both to build the
573 | counts for word searches and
574 | for metadata searches, which is valuable because there is a
575 | metadata-only search built in to every single ratio
576 | query. (To get the denominator values).
577 |
578 | Call this OLAP, if you like.
579 | """
580 | self.main = " "
581 | self.operation = ','.join(self.catoperations)
582 | """
583 | 		This, above, is super important: the operation used is relative to the counttype, and changes to use 'catoperation' instead of 'bookoperation'.
584 | 		That's how the denominator queries avoid having to do a table scan on full bookcounts that would take hours, and instead take
585 | milliseconds.
586 | """
587 | self.wordstables = " "
588 | self.wordswhere = " TRUE "
589 | #Just a dummy thing to make the SQL writing easier. Shouldn't take any time. Will usually be extended with actual conditions.
590 |
591 | def set_operations(self):
592 | """
593 | This is the code that allows multiple values to be selected.
594 |
595 | 		All can be removed when we kill back compatibility! It's all handled now by the general_API, not the SQL_API.
596 | """
597 |
598 |
599 | backCompatability = {"Occurrences_per_Million_Words":"WordsPerMillion","Raw_Counts":"WordCount","Percentage_of_Books":"TextPercent","Number_of_Books":"TextCount"}
600 |
601 | for oldKey in backCompatability.keys():
602 | self.counttype = [re.sub(oldKey,backCompatability[oldKey],entry) for entry in self.counttype]
603 |
604 | self.bookoperation = {}
605 | self.catoperation = {}
606 | self.finaloperation = {}
607 |
608 | #Text statistics
609 | self.bookoperation['TextPercent'] = "count(DISTINCT " + self.prefs['fastcat'] + ".bookid) as TextCount"
610 | self.bookoperation['TextRatio'] = "count(DISTINCT " + self.prefs['fastcat'] + ".bookid) as TextCount"
611 | self.bookoperation['TextCount'] = "count(DISTINCT " + self.prefs['fastcat'] + ".bookid) as TextCount"
612 |
613 | #Word Statistics
614 | self.bookoperation['WordCount'] = "sum(main.count) as WordCount"
615 | self.bookoperation['WordsPerMillion'] = "sum(main.count) as WordCount"
616 | self.bookoperation['WordsRatio'] = "sum(main.count) as WordCount"
617 |
618 |
619 | """
620 | +Total Numbers for comparisons/significance assessments
621 | This is a little tricky. The total words is EITHER the denominator (as in a query against words per Million) or the numerator+denominator (if you're comparing
622 | 		Pittsburg and Pittsburgh, say, and want to know the total number of uses of the lemma). For now, "TotalWords" means the former and "SumWords" the latter,
623 | 		on the theory that 'TotalWords' is more intuitive and only I (Ben) will be using SumWords all that much.
624 | """
625 | self.bookoperation['TotalWords'] = self.bookoperation['WordsPerMillion']
626 | self.bookoperation['SumWords'] = self.bookoperation['WordsPerMillion']
627 | self.bookoperation['TotalTexts'] = self.bookoperation['TextCount']
628 | self.bookoperation['SumTexts'] = self.bookoperation['TextCount']
629 |
630 | for stattype in self.bookoperation.keys():
631 | if re.search("Word",stattype):
632 | self.catoperation[stattype] = "sum(nwords) as WordCount"
633 | if re.search("Text",stattype):
634 | self.catoperation[stattype] = "count(nwords) as TextCount"
635 |
636 | self.finaloperation['TextPercent'] = "IFNULL(numerator.TextCount,0)/IFNULL(denominator.TextCount,0)*100 as TextPercent"
637 | self.finaloperation['TextRatio'] = "IFNULL(numerator.TextCount,0)/IFNULL(denominator.TextCount,0) as TextRatio"
638 | self.finaloperation['TextCount'] = "IFNULL(numerator.TextCount,0) as TextCount"
639 |
640 | self.finaloperation['WordsPerMillion'] = "IFNULL(numerator.WordCount,0)*100000000/IFNULL(denominator.WordCount,0)/100 as WordsPerMillion"
641 | self.finaloperation['WordsRatio'] = "IFNULL(numerator.WordCount,0)/IFNULL(denominator.WordCount,0) as WordsRatio"
642 | self.finaloperation['WordCount'] = "IFNULL(numerator.WordCount,0) as WordCount"
643 |
644 | self.finaloperation['TotalWords'] = "IFNULL(denominator.WordCount,0) as TotalWords"
645 | self.finaloperation['SumWords'] = "IFNULL(denominator.WordCount,0) + IFNULL(numerator.WordCount,0) as SumWords"
646 | self.finaloperation['TotalTexts'] = "IFNULL(denominator.TextCount,0) as TotalTexts"
647 | self.finaloperation['SumTexts'] = "IFNULL(denominator.TextCount,0) + IFNULL(numerator.TextCount,0) as SumTexts"
648 |
649 | """
650 | The values here will be chosen in build_wordstables; that's what decides if it uses the 'bookoperation' or 'catoperation' dictionary to build out.
651 | """
652 |
653 | self.finaloperations = list()
654 | self.bookoperations = set()
655 | self.catoperations = set()
656 |
657 | for summaryStat in self.counttype:
658 | self.catoperations.add(self.catoperation[summaryStat])
659 | self.bookoperations.add(self.bookoperation[summaryStat])
660 | self.finaloperations.append(self.finaloperation[summaryStat])
661 |
662 | def counts_query(self):
663 |
664 | self.operation = ','.join(self.bookoperations)
665 | self.build_wordstables()
666 |
667 | countsQuery = """
668 | SELECT
669 | %(selections)s,
670 | %(operation)s
671 | FROM
672 | %(catalog)s
673 | %(main)s
674 | %(wordstables)s
675 | WHERE
676 | %(catwhere)s AND %(wordswhere)s
677 | GROUP BY
678 | %(groupings)s
679 | """ % self.__dict__
680 | return countsQuery
681 |
682 | def bookid_query(self):
683 | #A temporary method to set up the hasword query.
684 | self.operation = ','.join(self.bookoperations)
685 | self.build_wordstables()
686 |
687 | countsQuery = """
688 | SELECT
689 | main.bookid as bookid
690 | FROM
691 | %(catalog)s
692 | %(main)s
693 | %(wordstables)s
694 | WHERE
695 | %(catwhere)s AND %(wordswhere)s
696 | """ % self.__dict__
697 | return countsQuery
698 |
699 | def debug_query(self):
700 | query = self.ratio_query(materialize = False)
701 | return json.dumps(self.denominator.groupings.split(",")) + query
702 |
703 | def query(self,materialize=False):
704 | """
705 | We launch a whole new userquery instance here to build the denominator, based on the 'compare_dictionary' option (which in most
706 | cases is the search_limits without the keys, see above; it can also be specially defined using asterisks as a shorthand to identify other fields to drop).
707 | We then get the counts_query results out of that.
708 | """
709 |
710 | """
711 | self.denominator = userquery(outside_dictionary = self.compare_dictionary,db=self.db,databaseScheme=self.databaseScheme)
712 | self.supersetquery = self.denominator.counts_query()
713 | supersetIndices = self.denominator.groupings.split(",")
714 | if materialize:
715 | self.supersetquery = derived_table(self.supersetquery,self.db,indices=supersetIndices).materialize()
716 | """
717 | self.mainquery = self.counts_query()
718 | self.countcommand = ','.join(self.finaloperations)
719 | self.totalselections = ",".join([group for group in self.outerGroups if group!="1 as In_Library" and group != ""])
720 | if self.totalselections != "": self.totalselections += ", "
721 |
722 | query = """
723 | SELECT
724 | %(totalselections)s
725 | %(countcommand)s
726 | FROM
727 | (%(mainquery)s) as numerator
728 | %(joinSuffix)s
729 | GROUP BY %(groupings)s;""" % self.__dict__
730 |
731 | return query
732 |
733 |
734 | def returnPossibleFields(self):
735 | try:
736 | self.cursor.execute("SELECT name,type,description,tablename,dbname,anchor FROM masterVariableTable WHERE status='public'")
737 | colnames = [line[0] for line in self.cursor.description]
738 | returnset = []
739 | for line in self.cursor.fetchall():
740 | thisEntry = {}
741 | for i in range(len(line)):
742 | thisEntry[colnames[i]] = line[i]
743 | returnset.append(thisEntry)
744 | except:
745 | returnset=[]
746 | return returnset
747 |
748 | def return_slug_data(self,force=False):
749 | #Rather than understand this error, I'm just returning 0 if it fails.
750 | #Probably that's the right thing to do, though it may cause trouble later.
751 | #It's just the penalty for relying on this ugly method instead of a decent API call with "Raw_Counts" to extract these counts.
752 | #Please, citizens of the future, NEVER USE THIS METHOD.
753 | try:
754 | temp_words = self.return_n_words(force = True)
755 | temp_counts = self.return_n_books(force = True)
756 | except:
757 | temp_words = 0
758 | temp_counts = 0
759 | return [temp_counts,temp_words]
760 |
761 | def return_n_books(self,force=False): #deprecated
762 | if (not hasattr(self,'counts')) or force:
763 | query = "SELECT count(*) from " + self.catalog + " WHERE " + self.catwhere
764 | silent = self.cursor.execute(query)
765 | self.counts = int(self.cursor.fetchall()[0][0])
766 | return self.counts
767 |
768 | def return_n_words(self,force=False): #deprecated
769 | if (not hasattr(self,'nwords')) or force:
770 | query = "SELECT sum(nwords) from " + self.catalog + " WHERE " + self.catwhere
771 | silent = self.cursor.execute(query)
772 | self.nwords = int(self.cursor.fetchall()[0][0])
773 | return self.nwords
774 |
775 | def bibliography_query(self,limit = "100"):
776 | #I'd like to redo this at some point so it could work as an API call more naturally.
777 | self.limit = limit
778 | self.ordertype = "sum(main.count*10000/nwords)"
779 | try:
780 | if self.outside_dictionary['ordertype'] == "random":
781 | if self.counttype==["Raw_Counts"] or self.counttype==["Number_of_Books"] or self.counttype==['WordCount'] or self.counttype==['BookCount'] or self.counttype==['TextCount']:
782 | self.ordertype = "RAND()"
783 | else:
784 | #This is based on an attempt to match various distributions I found on the web somewhere to give
785 | #weighted results based on the counts. It's not perfect, but might be good enough. Actually doing a weighted random search is not easy without
786 | #massive memory usage inside SQL.
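#Concretely: each book is ordered by the key LOG(1-RAND())/w, where w is its hit count. LOG(1-RAND())
#is negative, so dividing by a larger w pulls the key toward zero, and the DESC ordering used below
#therefore favors (without guaranteeing) the more heavily weighted books.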
787 | self.ordertype = "LOG(1-RAND())/sum(main.count)"
788 | except KeyError:
789 | pass
790 |
791 | #If IDF searching is enabled, we could add a term like '*IDF' here to up-weight the more distinctive words
792 | #in the event of a multi-word search.
793 | self.idfterm = ""
794 | prep = self.counts_query()
795 |
796 |
797 | if self.main == " ":
798 | self.ordertype="RAND()"
799 |
800 | bibQuery = """
801 | SELECT searchstring
802 | FROM """ % self.__dict__ + self.prefs['fullcat'] + """ RIGHT JOIN (
803 | SELECT
804 | """+ self.prefs['fastcat'] + """.bookid, %(ordertype)s as ordering
805 | FROM
806 | %(catalog)s
807 | %(main)s
808 | %(wordstables)s
809 | WHERE
810 | %(catwhere)s AND %(wordswhere)s
811 | GROUP BY bookid ORDER BY %(ordertype)s DESC LIMIT %(limit)s
812 | ) as tmp USING(bookid) ORDER BY ordering DESC;
813 | """ % self.__dict__
814 | return bibQuery
815 |
816 | def disk_query(self,limit="100"):
817 | pass
818 |
819 | def return_books(self):
820 | #This preps up the display elements for a search: it returns an array with a single string for each book, sorted in the best possible way
821 | silent = self.cursor.execute(self.bibliography_query())
822 | returnarray = []
823 | for line in self.cursor.fetchall():
824 | returnarray.append(line[0])
825 | if not returnarray:
826 | #why would someone request a search with no locations? Turns out (usually) because the smoothing tricked them.
827 | returnarray.append("No results for this particular point: try again without smoothing")
828 | newerarray = self.custom_SearchString_additions(returnarray)
829 | return json.dumps(newerarray)
830 |
831 | def search_results(self):
832 | #This is an alias that is handled slightly differently in APIimplementation (no "RESULTS" bit in front). Once
833 | #that legacy code is cleared out, they can be one and the same.
834 | return json.loads(self.return_books())
835 |
836 | def getActualSearchedWords(self):
837 | if len(self.wordswhere) > 7:
838 | words = self.outside_dictionary['search_limits']['word']
839 | #Break bigrams into single words.
840 | words = ' '.join(words).split(' ')
841 | self.cursor.execute("""SELECT word FROM wordsheap WHERE """ + where_from_hash({self.word_field:words}))
842 | self.actualWords =[item[0] for item in self.cursor.fetchall()]
843 | else:
844 | self.actualWords = ["tasty","mistake","happened","here"]
845 |
846 | def custom_SearchString_additions(self,returnarray):
847 | """
848 | It's nice to highlight the words searched for. This will be on partner web sites, so requires custom code for different databases
849 | """
850 | db = self.outside_dictionary['database']
851 | if db in ('jstor','presidio','ChronAm','LOC','OL'):
852 | self.getActualSearchedWords()
853 | if db=='jstor':
854 | joiner = "&searchText="
855 | preface = "?Search=yes&searchText="
856 | urlRegEx = "http://www.jstor.org/stable/\d+"
857 | if db=='presidio' or db=='OL':
858 | joiner = "+"
859 | preface = "#page/1/mode/2up/search/"
860 | urlRegEx = 'http://archive.org/stream/[^"# ><]*'
861 | if db in ('ChronAm','LOC'):
862 | preface = "/;words="
863 | joiner = "+"
864 | urlRegEx = 'http://chroniclingamerica.loc.gov[^\"><]*/seq-\d+'
865 | newarray = []
866 | for string in returnarray:
867 | try:
868 | base = re.findall(urlRegEx,string)[0]
869 | newcore = ' <a href = "' + base + preface + joiner.join(self.actualWords) + '"> search inside </a>'
870 | string = re.sub("^","",string)
871 | string = re.sub(" | $","",string)
872 | string = string+newcore
873 | except IndexError:
874 | pass
875 | newarray.append(string)
876 | #Arxiv is messier, requiring a whole different URL interface: http://search.arxiv.org:8081/paper.jsp?r=1204.3352&qs=netwokr
877 | else:
878 | newarray = returnarray
879 | return newarray
880 |
881 | def return_query_values(self,query = "ratio_query"):
882 | #The API returns a dictionary with years pointing to values.
883 | """
884 | DEPRECATED: use 'return_json' or 'return_tsv' (the latter only works with single 'search_limits' options) instead
885 | """
886 | values = []
887 | querytext = getattr(self,query)()
888 | silent = self.cursor.execute(querytext)
889 | #Gets the results
890 | mydict = dict(self.cursor.fetchall())
891 | try:
892 | for key in mydict.keys():
893 | #Only return results inside the time limits
894 | if key >= self.time_limits[0] and key <= self.time_limits[1]:
895 | mydict[key] = str(mydict[key])
896 | else:
897 | del mydict[key]
898 | mydict = smooth_function(mydict,smooth_method = self.smoothingType,span = self.smoothingSpan)
899 |
900 | except:
901 | mydict = {0:"0"}
902 |
903 | #This is a good place to change some values.
904 | try:
905 | return {'index':self.index, 'Name':self.words_searched,"values":mydict,'words_searched':""}
906 | except:
907 | return{'values':mydict}
908 |
909 |
910 | def return_tsv(self,query = "ratio_query"):
911 | if self.outside_dictionary['counttype']=="Raw_Counts" or self.outside_dictionary['counttype']==["Raw_Counts"]:
912 | query="counts_query"
913 | #This allows much speedier access to counts data if you're willing not to know about all the zeroes.
914 | #Will not work as well once the id_fields are in use.
915 | querytext = getattr(self,query)()
916 | silent = self.cursor.execute(querytext)
917 | results = ["\t".join([to_unicode(item[0]) for item in self.cursor.description])]
918 | lines = self.cursor.fetchall()
919 | for line in lines:
920 | items = []
921 | for item in line:
922 | item = to_unicode(item)
923 | item = re.sub("\t","",item)
924 | items.append(item)
925 | results.append("\t".join(items))
926 | return "\n".join(results)
927 |
928 | def export_data(self,query1="ratio_query"):
929 | self.smoothing=0
930 | return self.return_query_values(query=query1)
931 |
932 | def execute(self):
933 | #This performs the query using the method specified in the passed parameters.
934 | if self.method=="Nothing":
935 | pass
936 | else:
937 | value = getattr(self,self.method)()
938 | return value
939 |
940 | class derived_table(object):
941 | """
942 | MySQL/MariaDB doesn't have good subquery materialization,
943 | so I'm implementing it by hand.
944 | """
945 | def __init__(self,SQLstring,db,indices = [],dbToPutIn = "bookworm_scratch"):
946 | """
947 | initialize with the code to create the table; the database it will be in
948 | (to prevent conflicts with other identical queries in other dbs);
949 | and the list of all tables to be indexed
950 | (optional, but which can really speed up joins)
951 | """
952 | self.query = SQLstring
953 | self.db = db
954 | #Each query is identified by a unique key hashed
955 | #from the query and the dbname.
956 | self.queryID = dbToPutIn + "." + "derived" + hashlib.sha1(self.query + db.dbname).hexdigest()
957 | self.indices = "(" + ",".join(["INDEX(%s)" % index for index in indices]) + ")" if indices != [] else ""
958 |
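#Typical use, mirroring the commented-out call in userquery.query() (names here are illustrative):
#  tableName = derived_table(countsSQL, db, indices=["bookid"]).materialize()
#  db.cursor.execute("SELECT * FROM " + tableName)
#materialize() returns the scratch-table name, so the enclosing query can select from or join against it.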
959 | def setStorageEngines(self,temp):
960 | """
961 | Chooses where and how to store tables.
962 | """
963 | self.tempString = "TEMPORARY" if temp else ""
964 | self.engine = "MEMORY" if temp else "MYISAM"
965 |
966 | def checkCache(self):
967 | """
968 | Checks what's already been calculated.
969 | """
970 | try:
971 | (self.count,self.created,self.modified,self.createCode,self.data) = self.db.cursor.execute("SELECT count,created,modified,createCode,data FROM bookworm_scratch.cache WHERE fieldname='%s'" %self.queryID)[0]
972 | return True
973 | except:
974 | (self.count,self.created,self.modified,self.createCode,self.data) = [None]*5
975 | return False
976 |
977 | def fillTableWithData(self,data):
978 | dataCode = "INSERT INTO %s values ("%self.queryID + ", ".join(["%s"]*len(data[0])) + ")"
979 | self.db.cursor.executemany(dataCode,data)
980 | self.db.db.commit()
981 |
982 | def materializeFromCache(self,temp):
983 | if self.data is not None:
984 | #Datacode should never exist without createCode also.
985 | self.db.cursor.execute(self.createCode)
986 | self.fillTableWithData(pickle.loads(self.data))
987 | return True
988 | else:
989 | return False
990 |
991 |
992 | def createFromCacheWithDataFromBookworm(self,temp,postDataToCache=False):
993 | """
994 | If the create code exists but the data does not.
995 | This uses a form of query that MySQL can cache,
996 | unlike the normal subqueries OR the CREATE TABLE ... INSERT
997 | used by materializeFromBookworm.
998 |
999 | You can also post the data itself, but that's turned off by default:
1000 | because why wouldn't it have been posted the first time?
1001 | Probably it's too large or something, is why.
1002 | """
1003 | if self.createCode is None:
1004 | return False
1005 | self.db.cursor.execute(self.createCode)
1006 | self.db.cursor.execute(self.query)
1007 | data = [row for row in self.db.cursor.fetchall()]
1008 | self.newdata = pickle.dumps(data,protocol=-1)
1009 | self.fillTableWithData(data)
1010 | if postDataToCache:
1011 | self.updateCache(postQueries=["UPDATE bookworm_scratch.cache SET data='%s' WHERE fieldname='%s'" %(MySQLdb.escape_string(self.newdata),self.queryID)])
1012 | else:
1013 | self.updateCache()
1014 | return True
1015 |
1016 | def materializeFromBookworm(self,temp,postDataToCache=True,postCreateToCache=True):
1017 | import cPickle as pickle
1018 | self.db.cursor.execute("CREATE %(tempString)s TABLE %(queryID)s %(indices)s ENGINE=%(engine)s %(query)s;" % self.__dict__)
1019 | self.db.cursor.execute("SHOW CREATE TABLE %s" %self.queryID)
1020 | self.newCreateCode = self.db.cursor.fetchall()[0][1]
1021 | self.db.cursor.execute("SELECT * FROM %s" %self.queryID)
1022 | #coerce the results to a list of tuples, then pickle it.
1023 | self.newdata = pickle.dumps([row for row in self.db.cursor.fetchall()],protocol=-1)
1024 |
1025 | if postDataToCache:
1026 | self.updateCache(postQueries=["UPDATE bookworm_scratch.cache SET data='%s',createCode='%s' WHERE fieldname='%s'" %(MySQLdb.escape_string(self.newdata),MySQLdb.escape_string(self.newCreateCode),self.queryID)])
1027 |
1028 | if postCreateToCache:
1029 | self.updateCache(postQueries=["UPDATE bookworm_scratch.cache SET createCode='%s' WHERE fieldname='%s'" %(MySQLdb.escape_string(self.newCreateCode),self.queryID)])
1030 |
1031 |
1032 | def updateCache(self,postQueries=[]):
1033 | q1 = """
1034 | INSERT INTO bookworm_scratch.cache (fieldname,created,modified,count) VALUES
1035 | ('%s',NOW(),NOW(),1) ON DUPLICATE KEY UPDATE count = count + 1,modified=NOW();""" %self.queryID
1036 | result = self.db.cursor.execute(q1)
1037 | for query in postQueries:
1038 | self.db.cursor.execute(query)
1039 | self.db.db.commit()
1040 |
1041 | def materialize(self,temp="default"):
1042 | """
1043 | materializes the table, by default in memory in the bookworm_scratch
1044 | database. If temp is false, the table will be stored on disk, available
1045 | for future users too. This should be used sparingly, because you don't want too many
1046 | tables piling up on disk.
1047 |
1048 | Returns the tableID, which the superquery to this one may need to know.
1049 | """
1050 | if temp=="default":
1051 | temp=True
1052 |
1053 | self.checkCache()
1054 | self.setStorageEngines(temp)
1055 |
1056 | try:
1057 | if not self.materializeFromCache(temp):
1058 | if not self.createFromCacheWithDataFromBookworm(temp):
1059 | self.materializeFromBookworm(temp)
1060 |
1061 | except MySQLdb.OperationalError,e:
1062 | #Often the error will be 1050, which is a good thing:
1063 | #It means we don't need to
1064 | #create the table, because it's there already.
1065 | #But if it's not, something bad is happening.
1066 | if not re.search("1050.*already exists",str(e)):
1067 | raise
1068 |
1069 | return self.queryID
1070 |
1071 | class databaseSchema:
1072 | """
1073 | This class stores information about the database setup that is used to optimize query creation
1074 | and so that queries know what tables to include.
1075 | It's broken off like this because it might be usefully wrapped around some of the backend features,
1076 | and because it shouldn't be built multiple times for a single query (which spawns two instances of itself), as was happening before.
1077 | 
1078 | It's closely related to some of the classes around variables and variableSets in the Bookworm Creation scripts,
1079 | but is kept separate for now: that allows a bit more flexibility, but is probably a Bad Thing in the long run.
1080 | """
1081 |
1082 | def __init__(self,db):
1083 | self.db = db
1084 | self.cursor=db.cursor
1085 | #hash of what table each variable is in
1086 | self.tableToLookIn = {}
1087 | #hash of what the root variable for each search term is (eg, 'author_birth' might be crosswalked to 'authorid' in the main catalog.)
1088 | self.anchorFields = {}
1089 | #aliases: a hash of internal identification codes that dramatically speed up query time, but which shouldn't be exposed.
1090 | #So you can run a search for "state," say, and the database will group on a 50-element integer code instead of a VARCHAR that
1091 | #has to be long enough to support "Massachusetts" and "North Carolina."
1092 | #A couple are hard-coded in, but most are derived by looking for fields that end in the suffix "__id" later.
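#Illustration (field and table names hypothetical): a metadata field "state" might end up with
#aliases["state"] = "state__id" (so grouping happens on the integer code), tableToLookIn["state"]
#pointing at the satellite table that holds it, and an anchorFields entry recording how that table
#joins back toward the catalog.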
1093 |
1094 | if self.db.dbname=="presidio":
1095 | self.aliases = {"classification":"lc1","lat":"pointid","lng":"pointid"}
1096 | else:
1097 | self.aliases = dict()
1098 |
1099 | try:
1100 | #First build using the new streamlined tables; if that fails,
1101 | #build using the old version that hits the INFORMATION_SCHEMA,
1102 | #which is bad practice.
1103 | self.newStyle(db)
1104 | except:
1105 | #The new style will fail on old bookworms: a failure is an easy way to test
1106 | #for oldness, though of course something else might be causing the failure.
1107 | self.oldStyle(db)
1108 |
1109 |
1110 | def newStyle(self,db):
1111 | self.tableToLookIn['bookid'] = 'fastcat'
1112 | self.anchorFields['bookid'] = 'fastcat'
1113 | self.anchorFields['wordid'] = 'wordid'
1114 | self.tableToLookIn['wordid'] = 'wordsheap'
1115 |
1116 |
1117 | tablenames = dict()
1118 | tableDepends = dict()
1119 | db.cursor.execute("SELECT dbname,alias,tablename,dependsOn FROM masterVariableTable JOIN masterTableTable USING (tablename);")
1120 | for row in db.cursor.fetchall():
1121 | (dbname,alias,tablename,dependsOn) = row
1122 | self.tableToLookIn[dbname] = tablename
1123 | self.anchorFields[tablename] = dependsOn
1124 | self.aliases[dbname] = alias
1125 |
1126 | def oldStyle(self,db):
1127 |
1128 | #This is sorted by engine DESC so that memory table locations will overwrite disk table in the hash.
1129 |
1130 | self.cursor.execute("SELECT ENGINE,TABLE_NAME,COLUMN_NAME,COLUMN_KEY,TABLE_NAME='fastcat' OR TABLE_NAME='wordsheap' AS privileged FROM information_schema.COLUMNS JOIN INFORMATION_SCHEMA.TABLES USING (TABLE_NAME,TABLE_SCHEMA) WHERE TABLE_SCHEMA='%(dbname)s' ORDER BY privileged,ENGINE DESC,TABLE_NAME,COLUMN_KEY DESC;" % self.db.__dict__);
1131 | columnNames = self.cursor.fetchall()
1132 |
1133 | parent = 'bookid'
1134 | previous = None
1135 | for databaseColumn in columnNames:
1136 | if previous != databaseColumn[1]:
1137 | if databaseColumn[3]=='PRI' or databaseColumn[3]=='MUL':
1138 | parent = databaseColumn[2]
1139 | previous = databaseColumn[1]
1140 | else:
1141 | parent = 'bookid'
1142 | else:
1143 | self.anchorFields[databaseColumn[2]] = parent
1144 | if databaseColumn[3]!='PRI' and databaseColumn[3]!="MUL": #if it's a primary key, this isn't the right place to find it.
1145 | self.tableToLookIn[databaseColumn[2]] = databaseColumn[1]
1146 | if re.search('__id\*?$',databaseColumn[2]):
1147 | self.aliases[re.sub('__id','',databaseColumn[2])]=databaseColumn[2]
1148 |
1149 | try:
1150 | self.cursor.execute("SELECT dbname,tablename,anchor,alias FROM masterVariableTables")
1151 | for row in self.cursor.fetchall():
1152 | if row[0] != row[3]:
1153 | self.aliases[row[0]] = row[3]
1154 | if row[0] != row[2]:
1155 | self.anchorFields[row[0]] = row[2]
1156 | #Should be uncommented, but some temporary issues with the building script
1157 | #self.tableToLookIn[row[0]] = row[1]
1158 | except:
1159 | pass
1160 | self.tableToLookIn['bookid'] = 'fastcat'
1161 | self.anchorFields['bookid'] = 'fastcat'
1162 | self.anchorFields['wordid'] = 'wordid'
1163 | self.tableToLookIn['wordid'] = 'wordsheap'
1164 | #############
1165 | ##GENERAL#### #These are general purpose functional types of things not implemented in the class.
1166 | #############
1167 |
1168 | def to_unicode(obj, encoding='utf-8'):
1169 | if isinstance(obj, basestring):
1170 | if not isinstance(obj, unicode):
1171 | obj = unicode(obj, encoding)
1172 | elif isinstance(obj,int):
1173 | obj=unicode(str(obj),encoding)
1174 | else:
1175 | obj = unicode(str(obj),encoding)
1176 | return obj
1177 |
1178 | def where_from_hash(myhash,joiner=" AND ",comp = " = ",escapeStrings=True):
1179 | whereterm = []
1180 | #The general idea here is that we try to break everything in search_limits down to a list, and then create a whereterm on that joined by whatever the 'joiner' is ("AND" or "OR"), with the comparison as whatever comp is ("=",">=",etc.).
1181 | #For more complicated bits, it gets all recursive until the bits are all in terms of list.
1182 | for key in myhash.keys():
1183 | values = myhash[key]
1184 | if isinstance(values,basestring) or isinstance(values,int) or isinstance(values,float):
1185 | #This is just human-being handling. You can pass a single value instead of a list if you like, and it will just convert it
1186 | #to a list for you.
1187 | values = [values]
1188 | #Or queries are special, since the default is "AND". This toggles that around for a subportion.
1189 | if key=='$or' or key=="$OR":
1190 | for comparison in values:
1191 | whereterm.append(where_from_hash(comparison,joiner=" OR ",comp=comp))
1192 | #The or doesn't get populated any farther down.
1193 | elif isinstance(values,dict):
1194 | #Certain function operators can use MySQL terms. These are the only cases where a dict can be passed as a limitation.
1195 | operations = {"$gt":">","$ne":"!=","$lt":"<","$grep":" REGEXP ","$gte":">=","$lte":"<=","$eq":"="}
1196 | for operation in values.keys():
1197 | whereterm.append(where_from_hash({key:values[operation]},comp=operations[operation],joiner=joiner))
1198 | elif isinstance(values,list):
1199 | #and this is where the magic actually happens: the cases where the key is a string, and the target is a list.
1200 | if isinstance(values[0],dict):
1201 | # If it's a list of dicts, then there's one thing that happens. Currently all types are assumed to be the same:
1202 | # you couldn't pass in, say {"year":[{"$gte":1900},1898]} to catch post-1898 years except for 1899. Not that you
1203 | # should need to.
1204 | for entry in values:
1205 | whereterm.append(where_from_hash(entry))
1206 | else:
1207 | #Note that about a third of the code is spent on escaping strings.
1208 | if escapeStrings:
1209 | if isinstance(values[0],basestring):
1210 | quotesep="'"
1211 | else:
1212 | quotesep = ""
1213 | def escape(value): return MySQLdb.escape_string(to_unicode(value))
1214 | else:
1215 | def escape(value): return to_unicode(value)
1216 | quotesep=""
1217 |
1218 | #Note the "OR" here. There's no way to pass in a query like "year=1876 AND year=1898" as currently set up.
1219 | #Obviously that's no great loss, but there might be something I'm missing that would call for a similar format somehow.
1220 | #(In cases where the same book could have two different years associated with it)
1221 | whereterm.append(" (" + " OR ".join([" (" + key+comp+quotesep+escape(value)+quotesep+") " for value in values])+ ") ")
1222 | return "(" + joiner.join(whereterm) + ")"
1223 | #This works pretty well, except that it requires very specific sorts of terms going in, I think.
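#For example (escaping and exact whitespace approximate):
#  where_from_hash({"author": ["Hamilton", "Madison"], "year": {"$gte": 1788}})
#returns something like
#  (( (author = 'Hamilton') OR (author = 'Madison') ) AND ( (year >= 1788) ))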
1224 |
1225 |
1226 |
1227 | #I'd rather have all this smoothing stuff done at the client side, but currently it happens here.
1228 |
1229 | #The idea is: this works by default by slurping up from the command line, but you could also load the functions in and run results on your own queries.
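#For instance (database and field names purely illustrative), the script can be invoked as:
#  python bookworm/SQLAPI.py '{"database": "federalist", "method": "return_tsv", "search_limits": {"word": ["government"]}, "counttype": ["WordCount"], "groups": ["author"]}'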
1230 | try:
1231 | command = str(sys.argv[1])
1232 | command = json.loads(command)
1233 | #Got to go before we let anything else happen.
1234 | p = userqueries(command)
1235 | result = p.execute()
1236 | print json.dumps(result)
1237 | except:
1238 | pass
1239 |
1240 |
1241 |
1242 | def debug(string):
1243 | """
1244 | Makes it easier to debug through a web browser by handling the headers.
1245 | Despite being called a `string`, it can be anything that python can print.
1246 | """
1247 | print headers('1')
1248 | print "
"
1249 | print string
1250 | print "
"
1251 |
--------------------------------------------------------------------------------
/bookworm/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bookworm-project/BookwormAPI/faac096f74a86ca7a9c8b4e02a3aacfa1f5f7b76/bookworm/__init__.py
--------------------------------------------------------------------------------
/bookworm/general_API.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 |
3 | import MySQLdb
4 | from pandas import merge
5 | from pandas.io.sql import read_sql
6 | from pandas import set_option
7 | from SQLAPI import *
8 | from copy import deepcopy
9 | from collections import defaultdict
10 | import ConfigParser
11 | import os.path
12 |
13 | #Some settings can be overridden here, if nowhere else.
14 | prefs = dict()
15 |
16 | def find_my_cnf():
17 | """
18 | The password will be looked for in these places.
19 | """
20 |
21 | for file in ["etc/bookworm/my.cnf","/etc/my.cnf","/etc/mysql/my.cnf","/root/.my.cnf"]:
22 | if os.path.exists(file):
23 | return file
24 |
25 | class dbConnect(object):
26 | #This is a read-only account
27 | def __init__(self,prefs=prefs,database="federalist",host="localhost"):
28 | self.dbname = database
29 |
30 | #For back-compatibility:
31 | if "HOST" in prefs:
32 | host=prefs['HOST']
33 |
34 | self.db = MySQLdb.connect(host=host,
35 | db=database,
36 | read_default_file = find_my_cnf(),
37 | use_unicode='True',
38 | charset='utf8')
39 |
40 | self.cursor = self.db.cursor()
41 |
42 | def calculateAggregates(df,parameters):
43 |
44 | """
45 | We only collect "WordCoun" and "TextCount" for each query,
46 | but there are a lot of cool things you can do with those:
47 | basic things like frequency, all the way up to TF-IDF.
48 | """
49 | parameters = set(parameters)
50 |
51 | if "WordsPerMillion" in parameters:
52 | df["WordsPerMillion"] = df["WordCount_x"].multiply(1000000)/df["WordCount_y"]
53 | if "WordCount" in parameters:
54 | df["WordCount"] = df["WordCount_x"]
55 | if "TotalWords" in parameters:
56 | df["TotalWords"] = df["WordCount_y"]
57 | if "SumWords" in parameters:
58 | df["SumWords"] = df["WordCount_y"] + df["WordCount_x"]
59 | if "WordsRatio" in parameters:
60 | df["WordsRatio"] = df["WordCount_x"]/df["WordCount_y"]
61 |
62 | if "TextPercent" in parameters:
63 | df["TextPercent"] = 100*df["TextCount_x"].divide(df["TextCount_y"])
64 | if "TextCount" in parameters:
65 | df["TextCount"] = df["TextCount_x"]
66 | if "TotalTexts" in parameters:
67 | df["TotalTexts"] = df["TextCount_y"]
68 |
69 | if "HitsPerBook" in parameters:
70 | df["HitsPerMatch"] = df["WordCount_x"]/df["TextCount_x"]
71 |
72 | if "TextLength" in parameters:
73 | df["HitsPerMatch"] = df["WordCount_y"]/df["TextCount_y"]
74 |
75 | if "TFIDF" in parameters:
76 | from numpy import log as log
77 | df.eval("TF = WordCount_x/WordCount_y")
78 | df["TFIDF"] = (df["WordCount_x"]/df["WordCount_y"])*log(df["TextCount_y"]/df['TextCount_x'])
79 |
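#Dunning's log-likelihood statistic (G2), in the form computed below:
#  E1 = c*(a+b)/(c+d),  E2 = d*(a+b)/(c+d)
#  G2 = 2*( a*log(a/E1) + b*log(b/E2) )
#where a and b are the two observed counts and c and d are the totals they are drawn from.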
80 | def DunningLog(df=df,a = "WordCount_x",b = "WordCount_y"):
81 | from numpy import log as log
82 | destination = "Dunning"
83 | df[a] = df[a].replace(0,1)
84 | df[b] = df[b].replace(0,1)
85 | if a=="WordCount_x":
86 | # Dunning comparisons should be to the sums if counting:
87 | c = sum(df[a])
88 | d = sum(df[b])
89 | if a=="TextCount_x":
90 | # The max count isn't necessarily the total number of books, but it's a decent proxy.
91 | c = max(df[a])
92 | d = max(df[b])
93 | expectedRate = (df[a] + df[b]).divide(c+d)
94 | E1 = c*expectedRate
95 | E2 = d*expectedRate
96 | diff1 = log(df[a].divide(E1))
97 | diff2 = log(df[b].divide(E2))
98 | df[destination] = 2*(df[a].multiply(diff1) + df[b].multiply(diff2))
99 | # A hack, but a useful one: encode the direction of the significance,
100 | # in the sign, so negative values mark words that are relatively rarer in the first set.
101 | difference = diff1 < diff2
102 | df.ix[difference,destination] = -1*df.ix[difference,destination]
103 | return df[destination]
274 | if len(intersections) > 0:
275 | merged = merge(df1,df2,on=intersections,how='outer')
276 | else:
277 | """
278 | Pandas doesn't seem to have a full, unkeyed merge, so I simulate it with a dummy.
279 | """
280 | df1['dummy_merge_variable'] = 1
281 | df2['dummy_merge_variable'] = 1
282 | merged = merge(df1,df2,on=["dummy_merge_variable"],how='outer')
283 |
284 | merged = merged.fillna(int(0))
285 |
286 | calculations = self.query['counttype']
287 |
288 | calcced = calculateAggregates(merged,calculations)
289 |
290 | calcced = calcced.fillna(int(0))
291 |
292 | final_DataFrame = calcced[self.query['groups'] + self.query['counttype']]
293 |
294 | return final_DataFrame
295 |
296 | def execute(self):
297 | method = self.query['method']
298 |
299 |
300 | if isinstance(self.query['search_limits'],list):
301 | if self.query['method'] not in ["json","return_json"]:
302 | self.query['search_limits'] = self.query['search_limits'][0]
303 | else:
304 | return self.multi_execute()
305 |
306 | if method=="return_json" or method=="json":
307 | frame = self.data()
308 | return self.return_json()
309 |
310 | if method=="return_tsv" or method=="tsv":
311 | import csv
312 | frame = self.data()
313 | return frame.to_csv(sep="\t",encoding="utf8",index=False,quoting=csv.QUOTE_NONE,escapechar="\\")
314 |
315 | if method=="return_pickle" or method=="DataFrame":
316 | frame = self.data()
317 | from cPickle import dumps as pickleDumps
318 | return pickleDumps(frame,protocol=-1)
319 |
320 | # Temporary catch-all pushes to the old methods:
321 | if method in ["returnPossibleFields","search_results","return_books"]:
322 | query = userquery(self.query)
323 | if method=="return_books":
324 | return query.execute()
325 | return json.dumps(query.execute())
326 |
327 |
328 |
329 | def multi_execute(self):
330 | """
331 | Queries may define several search limits in an array
332 | if they use the return_json method.
333 | """
334 | returnable = []
335 | for limits in self.query['search_limits']:
336 | child = deepcopy(self.query)
337 | child['search_limits'] = limits
338 | returnable.append(self.__class__(child).return_json(raw_python_object=True))
339 |
340 | return json.dumps(returnable)
341 |
342 | def return_json(self,raw_python_object=False):
343 | query = self.query
344 | data = self.data()
345 |
346 |
347 | def fixNumpyType(input):
348 | #This is, weirdly, an occasional problem but not a constant one.
349 | if str(input.dtype)=="int64":
350 | return int(input)
351 | else:
352 | return input
353 |
354 | #Define a recursive structure to hold the stuff.
355 | def tree():
356 | return defaultdict(tree)
357 | returnt = tree()
358 |
359 | import numpy as np
360 |
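#Each row comes back as (group1, group2, ..., count1, count2, ...); the loop below nests the group
#values into a dictionary tree whose leaves are the count lists, so that (illustratively)
#groups=["author","year"] with counttype=["WordCount"] yields {"Hamilton": {"1788": [1234]}, ...}.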
361 | for row in data.itertuples(index=False):
362 | row = list(row)
363 | destination = returnt
364 | if len(row)==len(query['counttype']):
365 | returnt = [fixNumpyType(num) for num in row]
366 | while len(row) > len(query['counttype']):
367 | key = row.pop(0)
368 | if len(row) == len(query['counttype']):
369 | # Assign the elements.
370 | destination[key] = row
371 | break
372 | # This bit of the loop is where we descend the recursive dictionary.
373 | destination = destination[key]
374 | if raw_python_object:
375 | return returnt
376 |
377 | try:
378 | return json.dumps(returnt,allow_nan=False)
379 | except ValueError:
380 | #JSON has no representation for infinity, so substitute null rather than return invalid JSON.
381 | kludge = json.dumps(returnt)
382 | kludge = kludge.replace("Infinity","null")
383 | return kludge
384 |
385 | class SQLAPIcall(APIcall):
386 | """
387 | To make a new backend for the API, you just need to extend the base API call
388 | class like this.
389 |
390 | This one is comically short because all the real work is done in the userquery object.
391 |
392 | But the point is, you need to define a function "generate_pandas_frame"
393 | that accepts an API call and returns a pandas frame.
394 |
395 | But that API call is more limited than the general API; you only need to support "WordCount" and "TextCount"
396 | methods.
397 | """
398 |
399 | def generate_pandas_frame(self,call):
400 | """
401 |
402 | This is a good example of the query that actually fetches the results.
403 | It creates some SQL, runs it, and returns it as a pandas DataFrame.
404 |
405 | The actual SQL production is handled by the userquery class, which uses more
406 | legacy code.
407 |
408 | """
409 | con=dbConnect(prefs,self.query['database'])
410 | q = userquery(call).query()
411 | if self.query['method']=="debug":
412 | print q
413 | df = read_sql(q, con.db)
414 | return df
415 |
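#A minimal sketch of what another backend could look like (class and helper names hypothetical;
#nothing like "run_solr_query" exists in this repo). The only contract is that generate_pandas_frame
#returns a pandas DataFrame holding the requested group columns plus WordCount and/or TextCount.
#
#  class SolrAPIcall(APIcall):
#      def generate_pandas_frame(self, call):
#          from pandas import DataFrame
#          records = run_solr_query(call)   #hypothetical helper: one dict per grouping combination
#          return DataFrame(records)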
416 |
417 |
--------------------------------------------------------------------------------
/bookworm/knownHosts.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 |
3 | """
4 | Whenever you add a new bookworm to your server, remember to add a line for it in this file.
5 |
6 | However, always keep the 'default' line listed below in this file.
7 | """
8 |
9 | general_prefs = dict()
10 | general_prefs["default"] = {"fastcat": "fastcat", "HOST": "localhost", "separateDataTables": [], "fastword": "wordsheap", "database": "YourDatabaseNameHere", "read_url_head": "THIS_CAN_BE_ANYTHING...ITS_NOT_USED_ANYMORE", "fullcat": "catalog", "fullword": "words", "read_default_file": "/etc/mysql/my.cnf"}
11 |
--------------------------------------------------------------------------------
/bookworm/logParser.py:
--------------------------------------------------------------------------------
1 | import urllib
2 | import os
3 | import re
4 | import gzip
5 | import json
6 | import sys
7 |
8 | files = os.listdir("/var/log/apache2")
9 |
10 | words = []
11 |
12 | for file in files:
13 | reading = None
14 | if re.search("^access.log..*.gz",file):
15 | reading = gzip.open("/var/log/apache2/" + file)
16 | elif re.search("^access.log.*",file):
17 | reading = open("/var/log/apache2/" + file)
18 | else:
19 | continue
20 | sys.stderr.write(file + "\n")
21 |
22 | for line in reading:
23 | matches = re.findall(r"([0-9\.]+).*\[(.*)].*cgi-bin/dbbindings.py/?.query=([^ ]+)",line)
24 | for fullmatch in matches:
25 | t = dict()
26 | t['ip'] = fullmatch[0]
27 | match = fullmatch[2]
28 | try:
29 | data = json.loads(urllib.unquote(match).decode('utf8'))
30 | except ValueError:
31 | continue
32 | try:
33 | if isinstance(data['search_limits'],dict):
34 | data['search_limits'] = [data['search_limits']]
35 | for setting in ['words_collation','database']:
36 | try:
37 | t[setting] = data[setting]
38 | except KeyError:
39 | t[setting] = ""
40 | for limit in data['search_limits']:
41 | p = dict()
42 | for constraint in ["word","TV_show","director"]:
43 | try:
44 | p[constraint] = p[constraint] + "," + (",".join(limit[constraint]))
45 | except KeyError:
46 | try:
47 | p[constraint] = (",".join(limit[constraint]))
48 | except KeyError:
49 | p[constraint] = ""
50 | for key in p.keys():
51 | t[key] = p[key]
52 | vals = [t[key] for key in ('ip','database','words_collation','word','TV_show','director')]
53 | print "\t".join(vals).encode("utf-8")
54 |
55 |
56 | except KeyError:
57 | raise
58 |
59 | print len(words)
60 |
--------------------------------------------------------------------------------
/dbbindings.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 |
3 |
4 | #So we load in the terms that allow the API implementation to happen for now.
5 | from datetime import datetime
6 | from bookworm.general_API import *
7 | import os
8 | import cgitb
9 | #import MySQLdb
10 | cgitb.enable()
11 |
12 | def headers(method):
13 | if method!="return_tsv":
14 | print "Content-type: text/html\n"
15 |
16 | elif method=="return_tsv":
17 | print "Content-type: text; charset=utf-8"
18 | print "Content-Disposition: filename=Bookworm-data.txt"
19 | print "Pragma: no-cache"
20 | print "Expires: 0\n"
21 |
22 | def debug(string):
23 | """
24 | Makes it easier to debug through a web browser by handling the headers
25 | No calls should be permanently left in the code ever, or they will break things badly.
26 | """
27 | print headers('1')
28 | print "
"
29 | print string
30 | print "
"
31 |
32 |
33 | def main(JSONinput):
34 |
35 | query = JSONinput
36 |
37 | try:
38 | #Whether there are multiple search terms, as in the highcharts method.
39 | usingSuccinctStyle = isinstance(query['search_limits'],dict)
40 | except:
41 | #If there are no search limits, it might be a returnPossibleFields query
42 | usingSuccinctStyle = True
43 |
44 | headers(query['method'])
45 |
46 | p = SQLAPIcall(query)
47 |
48 | result = p.execute()
49 | print result
50 |
51 | return True
52 |
53 |
54 | if __name__=="__main__":
55 | form = cgi.FieldStorage()
56 |
57 | #Still supporting two names for the passed parameter.
58 | try:
59 | JSONinput = form["queryTerms"].value
60 | except KeyError:
61 | JSONinput = form["query"].value
62 |
63 | main(json.loads(JSONinput))
64 |
65 |
66 |
--------------------------------------------------------------------------------
/testAPI.py:
--------------------------------------------------------------------------------
1 | import dbbindings
2 | import unittest
3 | import bookworm.general_API as general_API
4 | import bookworm.SQLAPI as SQLAPI
5 |
6 | class SQLfunction(unittest.TestCase):
7 |
8 | def test1(self):
9 |
10 | query = {
11 | "database": "movies",
12 | "method": "return_json",
13 | "search_limits": {"MovieYear":1900},
14 | "counttype": "WordCount",
15 | "groups": ["TV_show"]
16 | }
17 |
18 |
19 | f = SQLAPI.userquery(query).query()
20 | print f
21 |
22 |
23 | class SQLConnections(unittest.TestCase):
24 | def dbConnectorsWork(self):
25 | from bookworm.general_API import prefs
26 | connection = general_API.dbConnect(prefs,"federalist")
27 | tables = connection.cursor.execute("SHOW TABLES")
28 | self.assertTrue(connection.dbname=="federalist")
29 |
30 | def test1(self):
31 | query = {
32 | "database":"federalist",
33 | "search_limits":{},
34 | "counttype":"TextPercent",
35 | "groups":["author"],
36 | "method":"return_json"
37 | }
38 |
39 | try:
40 | dbbindings.main(query)
41 | worked = True
42 | except:
43 | worked = False
44 |
45 | self.assertTrue(worked)
46 |
47 | def test2(self):
48 | query = {
49 | "database":"federalist",
50 | "search_limits":{"author":"Hamilton"},
51 | "compare_limits":{"author":"Madison"},
52 | "counttype":"Dunning",
53 | "groups":["unigram"],
54 | "method":"return_json"
55 | }
56 |
57 |
58 | try:
59 | #dbbindings.main(query)
60 | worked = True
61 | except:
62 | worked = False
63 |
64 | self.assertTrue(worked)
65 |
66 | if __name__=="__main__":
67 | unittest.main()
68 |
--------------------------------------------------------------------------------