├── .gitignore ├── LICENSE ├── README.md ├── config.py ├── lc_parse.py ├── reconcile.py ├── requirements.txt └── text.py /.gitignore: -------------------------------------------------------------------------------- 1 | *pyc 2 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | BSD-style license 2 | ================= 3 | 4 | Copyright (c) 2010, Michael Stephens. 5 | 6 | All rights reserved. 7 | 8 | Redistribution and use in source and binary forms, with or without modification, 9 | are permitted provided that the following conditions are met: 10 | 11 | * Redistributions of source code must retain the above copyright notice, 12 | this list of conditions and the following disclaimer. 13 | * Redistributions in binary form must reproduce the above copyright notice, 14 | this list of conditions and the following disclaimer in the documentation 15 | and/or other materials provided with the distribution. 16 | * Neither the name of Sunlight Labs nor the names of its contributors may be 17 | used to endorse or promote products derived from this software without 18 | specific prior written permission. 19 | 20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 21 | "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 22 | LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR 23 | A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR 24 | CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, 25 | EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, 26 | PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR 27 | PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF 28 | LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING 29 | NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 30 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 31 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## About 2 | 3 | An OpenRefine reconciliation service for [GeoNames](http://www.geonames.org/). 4 | 5 | **Tested with and working on Python 2.7.10 and 3.4.3** 6 | 7 | The service queries the [GeoNames API](http://www.geonames.org/export/web-services.html) 8 | and provides normalized scores across queries for reconciling in Refine. 9 | 10 | I'm just a small-town metadataist in a big code world, so please don't assume I did something 'the hard way' because I had a theory or opinion or whatnot. I probably just don't know that an easier way exists. So please share your corrections and thoughts (but please don't be a jerk about it either). 
11 | 12 | If you'd like to hear my thoughts about why to do this instead of creating a column by pulling in URLs, or what I do with this data once I export my data to metadata records, or whether we should even have to keep coordinates in bibliographic metadata records, see some thoughts here: http://christinaharlow.com/thoughts-on-geospatial-metadata and http://christinaharlow.com/walkthrough-of-geonames-recon-service 13 | 14 | ## Provenance 15 | 16 | Michael Stephens wrote a [demo reconciliation service](https://github.com/mikejs/reconcile-demo) and Ted Lawless wrote a [FAST reconciliation service](https://github.com/lawlesst/fast-reconcile) that this code basically repeats, but for a different API. 17 | 18 | Please give any thanks for this work to Ted Lawless, and any complaints to Christina. Also give thanks to Trevor Muñoz for some cleanups to make this code easier to work with. 19 | 20 | ## Special Notes 21 | 22 | This came out of frustration that the Library of Congress authorities are: 23 | 24 | - haphazard in including latitude and longitude in the authority records for geographic names/subjects (although coordinates are now a requirement for these authorities, many existing records still lack them) 25 | - formulated, for primary headings (not as subdivisions), with abbreviations that often return no results from the GeoNames API; many of the U.S. state abbreviations are unique to the Library of Congress authorities (for example, 'Calif.' in headings for cities in California) 26 | 27 | So this service takes Library of Congress authorities headings (or headings formulated to mimic the LoC authorities structure), expands U.S. abbreviations, then reconciles against GeoNames. The returned GeoNames 'name' gives both the GeoNames name for the location and the coordinates. 
There are, no doubt, better ways to handle getting both in an OpenRefine reconciliation service, but this was a quick hack to get both while I continue to explore how OpenRefine Reconciliation Services are structured. 28 | 29 | ## Instructions 30 | 31 | Before getting started, you'll need Python on your computer (this was built with Python 2.7.8, updated to work with Python 3.4, and most recently tested with Python 2.7.10 and 3.4.3) and should be comfortable using OpenRefine/Google Refine. 32 | 33 | This reconciliation service also requires a GeoNames API username. You can find and use the one in the original code for testing, but you'll run up against the maximum request limits quickly, so it is strongly recommended that you get your own (free, quick and easy to obtain) GeoNames account. 34 | 35 | To do so, go to this webpage and register: http://www.geonames.org/login 36 | After your account is activated, enable it for free web services: http://www.geonames.org/manageaccount 37 | 38 | - Once you have your GeoNames username, create an environment variable on your computer with your GeoNames username, like so: 39 | - Open the Command Line Interface of your choice (Terminal is the default on a Mac) 40 | - Type in $ export GEONAMES_USERNAME="username" (replacing username with your username) 41 | - You may need to restart your terminal window, but probably not. 42 | - Go ahead and clone/download/get a copy of this code repository on your computer. 43 | - In the Command Line Interface, change to the directory where you downloaded this code (cd directory/with/code/ ) 44 | - Type in: python reconcile.py --debug (you don't need to use debug, but it is helpful for knowing what this service is up to while you are working with it). 45 | - You should see a screen telling you that the service is 'Running on http://0.0.0.0:5000/'. 46 | - Leaving that terminal window open and the service running, go start up OpenRefine (however you normally go about it). Open a project in OpenRefine. 
47 | - On the column you would like to reconcile with GeoNames, click on the arrow at the top, choose 'Reconcile' > 'Start Reconciling...' 48 | - Click on the 'Add Standard Service' button in the bottom left corner. 49 | - Now enter the URL that the local service is running on - if you've changed nothing in the code except your GeoNames API username, it should be 'http://0.0.0.0:5000/reconcile'. Click Add Service. 50 | - If nothing happens upon entering 'http://0.0.0.0:5000/reconcile', try 'http://localhost:5000/reconcile' or 'http://127.0.0.1:5000/reconcile' instead. 51 | - You should now be greeted by a list of possible reconciliation types for the GeoNames Reconciliation Service. They should be fairly straightforward to understand; use /geonames/all if you need the broadest search capabilities possible. 52 | - Click 'Start Reconciling' in the bottom right corner. 53 | - Once finished, you should see the closest options that the GeoNames API found for each cell. You can click on the options and be taken to the GeoNames site for that entry. Once you find the appropriate reconciliation choice, click the single arrow box beside it to use that choice just for the one cell, or the double arrows box to use that choice for all other cells containing that text. 54 | - Once you've made or rejected your reconciliation choices, you then need to store the GeoNames name, id, and coordinates (or any subset of those that you want to keep) in your OpenRefine project. This is important: 55 | 56 | **Although it appears that you have retrieved your reconciled data into your OpenRefine project, OpenRefine is actually still storing the original data. You need to explicitly save the reconciled data in order to make sure it appears/exists when you export your data. 
Annoying as a mosquito in your bedroom, I know, but please learn from my own mistakes, sweat and confusion.** 57 | 58 | - So, depending on whether or not you wish to keep the original data, you can replace the column with the reconciled data or add a column that contains the reconciled data. I'll do the latter here. On the reconciled data column, click the arrow at the top, then choose 'Edit Columns' > 'Add a new column based on this column' 59 | - In the GREL box that appears, put the following depending on what you want to pull: 60 | - Name and Coordinates: `cell.recon.match.name` (will pull the GeoNames name plus coordinates, separated by a | - you can split that column later to have just the name, then the coordinates) 61 | - URI: `cell.recon.match.id` (will pull the GeoNames URI/link) 62 | - Coordinates Only: `replace(substring(cell.recon.match.name, indexOf(cell.recon.match.name, ' | ')), ' | ', '')` 63 | - Name, Coordinates, and URI, each separated by | (for easier column splitting later): `cell.recon.match.name + " | " + cell.recon.match.id` 64 | 65 | I'll maybe make a screencast of this work later if I get time or there is interest. 66 | 67 | Holla if you have questions - email is charlow2(at)utk(dot)edu and Twitter handle is @cm_harlow 68 | 69 | 70 | ## Plans for Improvement 71 | 72 | I'm hoping to build in a way to search within reconciliation cells next. 73 | 74 | I'd like to expand the extremely rudimentary (but gets the job done) LoC geographic names abbreviation parser/expander to handle other LoC Authorities abbreviation oddities. I'm afraid to say that, since even the state abbreviations vary in their construction, these will need to be added on a case-by-case basis. 75 | 76 | I'd also like to build in a way to use other columns as additional search properties. 
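For reference, each reconciliation candidate's name comes back from this service in the 'GeoNames name | lat, lng' format that reconcile.py builds, so after export you can also take it apart outside of GREL. A minimal Python sketch of that split (a hypothetical post-export helper, not part of this repo):

```python
def split_name_coords(name_coords):
    """Split the service's 'Name | lat, lng' value into name, lat, lng."""
    name, _, coords = name_coords.partition(' | ')
    lat, _, lng = coords.partition(', ')
    return name, lat, lng

# Example with a made-up GeoNames result string:
print(split_name_coords('Knoxville | 35.96064, -83.92074'))
# → ('Knoxville', '35.96064', '-83.92074')
```

This mirrors the 'Coordinates Only' GREL expression above, just in Python, and keeps lat/lng as strings the way the API returns them.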
77 | 78 | Finally, finding a better way to handle the API username updates, as well as parsing the name plus coordinates (instead of the hack I've put into this for the time being), would be great. 79 | 80 | -------------------------------------------------------------------------------- /config.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | GEONAMES_USERNAME = os.environ['GEONAMES_USERNAME'] 4 | -------------------------------------------------------------------------------- /lc_parse.py: -------------------------------------------------------------------------------- 1 | from unicodedata import normalize as ucnorm, category 2 | 3 | def lc2geonames(text, PY3): 4 | if PY3: 5 | if not isinstance(text, str): 6 | text = str(text, 'utf-8') 7 | else: 8 | if not isinstance(text, unicode): 9 | text = unicode(text) 10 | if '(' in text: 11 | text = text.replace('Ala.)', ', Alabama') 12 | text = text.replace('Alaska)', ', Alaska') 13 | text = text.replace('Ariz.)', ', Arizona') 14 | text = text.replace('Ark.)', ', Arkansas') 15 | text = text.replace('Calif.)', ', California') 16 | text = text.replace('Colo.)', ', Colorado') 17 | text = text.replace('Conn.)', ', Connecticut') 18 | text = text.replace('Del.)', ', Delaware') 19 | text = text.replace('D.C.)', ', District of Columbia') 20 | text = text.replace('Fla.)', ', Florida') 21 | text = text.replace('Ga.)', ', Georgia') 22 | text = text.replace('Hawaii)', ', Hawaii') 23 | text = text.replace('Idaho)', ', Idaho') 24 | text = text.replace('Ill.)', ', Illinois') 25 | text = text.replace('Ind.)', ', Indiana') 26 | text = text.replace('Iowa)', ', Iowa') 27 | text = text.replace('Kan.)', ', Kansas') 28 | text = text.replace('Ky.)', ', Kentucky') 29 | text = text.replace('La.)', ', Louisiana') 30 | text = text.replace('Me.)', ', Maine') 31 | text = text.replace('Md.)', ', Maryland') 32 | text = text.replace('Mass.)', ', Massachusetts') 33 | text = text.replace('Mich.)', ', 
Michigan') 34 | text = text.replace('Minn.)', ', Minnesota') 35 | text = text.replace('Miss.)', ', Mississippi') 36 | text = text.replace('Mo.)', ', Missouri') 37 | text = text.replace('Mont.)', ', Montana') 38 | text = text.replace('Neb.)', ', Nebraska') 39 | text = text.replace('Nev.)', ', Nevada') 40 | text = text.replace('N.H.)', ', New Hampshire') 41 | text = text.replace('N.J.)', ', New Jersey') 42 | text = text.replace('N.M.)', ', New Mexico') 43 | text = text.replace('N.Y.)', ', New York') 44 | text = text.replace('N.C.)', ', North Carolina') 45 | text = text.replace('N.D.)', ', North Dakota') 46 | text = text.replace('Ohio)', ', Ohio') 47 | text = text.replace('Okla.)', ', Oklahoma') 48 | text = text.replace('Or.)', ', Oregon') 49 | text = text.replace('Pa.)', ', Pennsylvania') 50 | text = text.replace('R.I.)', ', Rhode Island') 51 | text = text.replace('S.C.)', ', South Carolina') 52 | text = text.replace('S.D.)', ', South Dakota') 53 | text = text.replace('Tenn.)', ', Tennessee') 54 | text = text.replace('Tex.)', ', Texas') 55 | text = text.replace('Utah)', ', Utah') 56 | text = text.replace('Vt.)', ', Vermont') 57 | text = text.replace('Va.)', ', Virginia') 58 | text = text.replace('Wash.)', ', Washington') 59 | text = text.replace('W. Va.)', ', West Virginia') 60 | text = text.replace('Wis.)', ', Wisconsin') 61 | text = text.replace('Wyo.)', ', Wyoming') 62 | text = text.replace(' (', ' ') 63 | text = text.replace(')', ' ') 64 | text = text.replace(':', ',') 65 | return text 66 | -------------------------------------------------------------------------------- /reconcile.py: -------------------------------------------------------------------------------- 1 | """ 2 | An OpenRefine reconciliation service for the GeoNames API. 
3 | 4 | GeoNames API documentation: http://www.geonames.org/export/web-services.html 5 | 6 | This code is adapted from the wonderful Ted Lawless' work at https://github.com/lawlesst/fast-reconcile 7 | """ 8 | from flask import Flask 9 | from flask import request 10 | from flask import jsonify 11 | import json 12 | from operator import itemgetter 13 | import urllib 14 | from fuzzywuzzy import fuzz 15 | import requests 16 | from sys import version_info 17 | 18 | app = Flask(__name__) 19 | app.config.from_object('config') 20 | 21 | geonames_username = app.config['GEONAMES_USERNAME'] 22 | 23 | #If it's installed, use the requests_cache library to 24 | #cache calls to the GeoNames API. 25 | try: 26 | import requests_cache 27 | requests_cache.install_cache('geonames_cache') 28 | except ImportError: 29 | app.logger.debug("No request cache found.") 30 | pass 31 | 32 | #See if Python 3 for unicode/str use decisions 33 | PY3 = version_info > (3,) 34 | 35 | #Help text processing 36 | import text 37 | import lc_parse 38 | 39 | #Create base URLs/URIs 40 | api_base_url = 'http://api.geonames.org/searchJSON?username=' + geonames_username + '&isNameRequired=yes&' 41 | geonames_uri_base = 'http://sws.geonames.org/{0}' 42 | 43 | #Map the GeoNames query indexes to service types 44 | default_query = { 45 | "id": "/geonames/all", 46 | "name": "All GeoNames terms", 47 | "index": "q" 48 | } 49 | 50 | refine_to_geonames = [ 51 | { 52 | "id": "/geonames/name", 53 | "name": "Place Name", 54 | "index": "name" 55 | }, 56 | { 57 | "id": "/geonames/name_startsWith", 58 | "name": "Place Name Starts With", 59 | "index": "name_startsWith" 60 | }, 61 | { 62 | "id": "/geonames/name_equals", 63 | "name": "Exact Place Name", 64 | "index": "name_equals" 65 | } 66 | ] 67 | refine_to_geonames.append(default_query) 68 | 69 | 70 | #Make a copy of the GeoNames mappings. 71 | query_types = [{'id': item['id'], 'name': item['name']} for item in refine_to_geonames] 72 | 73 | # Basic service metadata. 
There are a number of other documented options 74 | # but this is all we need for a simple service. 75 | metadata = { 76 | "name": "GeoNames Reconciliation Service", 77 | "identifierSpace" : "http://localhost/identifier", 78 | "schemaSpace" : "http://localhost/schema", 79 | "defaultTypes": query_types, 80 | "view": { 81 | "url": "{{id}}" 82 | } 83 | } 84 | 85 | def make_uri(geonames_id): 86 | """ 87 | Prepare a GeoNames url from the ID returned by the API. 88 | """ 89 | geonames_uri = geonames_uri_base.format(geonames_id) 90 | return geonames_uri 91 | 92 | 93 | def jsonpify(obj): 94 | """ 95 | Helper to support JSONP 96 | """ 97 | try: 98 | callback = request.args['callback'] 99 | response = app.make_response("%s(%s)" % (callback, json.dumps(obj))) 100 | response.mimetype = "text/javascript" 101 | return response 102 | except KeyError: 103 | return jsonify(obj) 104 | 105 | 106 | def search(raw_query, query_type='/geonames/all'): 107 | """ 108 | Hit the GeoNames API for names. 109 | """ 110 | out = [] 111 | unique_geonames_ids = [] 112 | mid_query = lc_parse.lc2geonames(raw_query, PY3) 113 | query = text.normalize(mid_query, PY3).strip() 114 | query_type_meta = [i for i in refine_to_geonames if i['id'] == query_type] 115 | if query_type_meta == []: 116 | query_type_meta = [default_query] 117 | query_index = query_type_meta[0]['index'] 118 | try: 119 | if PY3: 120 | url = api_base_url + query_index + '=' + urllib.parse.quote(query) 121 | else: 122 | url = api_base_url + query_index + '=' + urllib.quote(query) 123 | app.logger.debug("GeoNames API url is " + url) 124 | resp = requests.get(url) 125 | results = resp.json() 126 | except requests.exceptions.RequestException as e: 127 | app.logger.warning(e) 128 | return out 129 | for position, item in enumerate(results.get('geonames', [])): 130 | match = False 131 | name = item.get('name', '') 132 | alternate = item.get('toponymName', '') 133 | if alternate: 134 | alt = alternate 135 | else: 136 | alt = '' 137 | geonames_id = 
item.get('geonameId') 138 | geonames_uri = make_uri(geonames_id) 139 | lat = item.get('lat', '') 140 | lng = item.get('lng', '') 141 | #Way to cheat + get name + coordinates into results: 142 | name_coords = name + ' | ' + lat + ', ' + lng 143 | #Avoid returning duplicates: 144 | if geonames_id in unique_geonames_ids: 145 | continue 146 | else: 147 | unique_geonames_ids.append(geonames_id) 148 | score_1 = fuzz.token_sort_ratio(query, name) 149 | score_2 = fuzz.token_sort_ratio(query, alt) 150 | score = max(score_1, score_2) 151 | if query == text.normalize(name, PY3): 152 | match = True 153 | elif query == text.normalize(alt, PY3): 154 | match = True 155 | resource = { 156 | "id": geonames_uri, 157 | "name": name_coords, 158 | "score": score, 159 | "match": match, 160 | "type": query_type_meta 161 | } 162 | out.append(resource) 163 | #Sort this list by score 164 | sorted_out = sorted(out, key=itemgetter('score'), reverse=True) 165 | #Refine will only handle the top three matches. 166 | return sorted_out[:3] 167 | 168 | 169 | @app.route("/reconcile", methods=['POST', 'GET']) 170 | def reconcile(): 171 | # If a 'queries' parameter is supplied then it is a dictionary 172 | # of (key, query) pairs representing a batch of queries. We 173 | # should return a dictionary of (key, results) pairs. 174 | queries = request.form.get('queries') 175 | if queries: 176 | queries = json.loads(queries) 177 | results = {} 178 | for (key, query) in queries.items(): 179 | qtype = query.get('type') 180 | if qtype is None: 181 | return jsonpify(metadata) 182 | data = search(query['query'], query_type=qtype) 183 | results[key] = {"result": data} 184 | return jsonpify(results) 185 | # If neither a 'query' nor 'queries' parameter is supplied then 186 | # we should return the service metadata. 
187 | return jsonpify(metadata) 188 | 189 | if __name__ == '__main__': 190 | from optparse import OptionParser 191 | oparser = OptionParser() 192 | oparser.add_option('-d', '--debug', action='store_true', default=False) 193 | opts, args = oparser.parse_args() 194 | app.debug = opts.debug 195 | app.run(host='0.0.0.0') 196 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | flask==1.0 2 | fuzzywuzzy==0.18.0 3 | python-Levenshtein==0.12.0 4 | requests==2.20.0 5 | -------------------------------------------------------------------------------- /text.py: -------------------------------------------------------------------------------- 1 | """ 2 | Taken from the Helmut project. 3 | https://github.com/okfn/helmut/blob/master/helmut/text.py 4 | """ 5 | 6 | from unicodedata import normalize as ucnorm, category 7 | 8 | def normalize(text, PY3): 9 | """ Simplify a piece of text to generate a more canonical 10 | representation. This involves lowercasing, stripping trailing 11 | spaces, removing symbols, diacritical marks (umlauts) and 12 | converting all newlines etc. to single spaces. 13 | """ 14 | if PY3: 15 | if not isinstance(text, str): 16 | text = str(text, 'utf-8') 17 | else: 18 | if not isinstance(text, unicode): 19 | text = unicode(text) 20 | text = text.lower() 21 | decomposed = ucnorm('NFKD', text) 22 | filtered = [] 23 | for char in decomposed: 24 | cat = category(char) 25 | if cat.startswith('C'): 26 | filtered.append(' ') 27 | elif cat.startswith('M'): 28 | # marks, such as umlauts 29 | continue 30 | elif cat.startswith('Z'): 31 | # newlines, non-breaking etc. 
32 | filtered.append(' ') 33 | elif cat.startswith('S'): 34 | # symbols, such as currency 35 | continue 36 | else: 37 | filtered.append(char) 38 | text = u''.join(filtered) 39 | while '  ' in text: 40 | text = text.replace('  ', ' ') 41 | #remove hyphens 42 | text = text.replace('-', ' ') 43 | text = text.strip() 44 | return ucnorm('NFKC', text) 45 | 46 | def url_slug(text, PY3): 47 | text = normalize(text, PY3) 48 | text = text.replace(' ', '-') 49 | text = text.replace('.', '_') 50 | return text 51 | 52 | def tokenize(text, PY3, splits='COPZ'): 53 | token = [] 54 | if PY3: 55 | for c in (text if isinstance(text, str) else str(text, 'utf-8')): 56 | if category(c)[0] in splits: 57 | if len(token): 58 | yield u''.join(token) 59 | token = [] 60 | else: 61 | token.append(c) 62 | else: 63 | for c in unicode(text): 64 | if category(c)[0] in splits: 65 | if len(token): 66 | yield u''.join(token) 67 | token = [] 68 | else: 69 | token.append(c) 70 | if len(token): 71 | yield u''.join(token) 72 | --------------------------------------------------------------------------------
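As a footnote to the files above, here is a self-contained sketch of what lc_parse.lc2geonames does to a LoC-style heading before reconcile.py queries GeoNames. It inlines just two entries of the abbreviation table and assumes Python 3; the real function covers all the U.S. states:

```python
def lc2geonames_sketch(text):
    # Expand a couple of LoC state abbreviations (a subset of lc_parse.py's
    # table), then strip the parenthetical punctuation the way lc_parse.py does.
    for abbrev, expansion in [('Calif.)', ', California'), ('Tenn.)', ', Tennessee')]:
        text = text.replace(abbrev, expansion)
    text = text.replace(' (', ' ')
    text = text.replace(')', ' ')
    text = text.replace(':', ',')
    return text

print(lc2geonames_sketch('Los Angeles (Calif.)'))
# → Los Angeles , California
```

The stray comma and spacing are then smoothed over by text.normalize before the query is sent, which is why the expansion can afford to be this rough.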