├── README.md
└── src
    ├── create_index.sh
    └── update.py

/README.md:
--------------------------------------------------------------------------------
Elasticsearch For Beginners: Index and Search Hacker News
================


#### Big picture plz?

Hacker News officially released their [API](http://blog.ycombinator.com/hacker-news-api) this October, giving access to a vast amount of news articles, comments, polls, job postings, etc., all delivered as JSON, which is easy to feed into Elasticsearch.

[Elasticsearch](http://elasticsearch.org) is currently the most popular open-source search engine, used for a wide variety of use cases. It works natively with JSON documents, so this sounds like a perfect fit.

It runs on a [DigitalOcean 512MB droplet](https://m.do.co/c/c9b25dec9715), which hosts the Elasticsearch node and a simple Tornado app for the frontend. Crontab runs the update every 5 minutes.


#### Prerequisites

Set up Elasticsearch and make sure it's running at [http://localhost:9200](http://localhost:9200).

See [here](https://www.elastic.co/guide/en/elasticsearch/guide/current/running-elasticsearch.html) if you need more information on how to install Elasticsearch.

I use Python and [Tornado](https://github.com/tornadoweb/tornado/) for the scripts that import and query the data.



#### Aight, so what are we doing?

We'll start by loading the IDs of the Top 100 HN stories, retrieve detailed information about each item and then index them in Elasticsearch.


Top 100 Stories:

`curl https://hacker-news.firebaseio.com/v0/topstories.json?print=pretty`

The result looks something like this:

```
[ 8605204, 8604814, 8602936, 8604489, 8604533, 8604626, 8605207, 8605186,
  ...
  8603147, 8602037 ]
```

We can now loop through the IDs and retrieve more detailed information for each item:

`curl https://hacker-news.firebaseio.com/v0/item/8605204.json?print=pretty`

yields this:

```
{
  "by" : "davecheney",
  "id" : 8605204,
  "kids" : [ 8605567, 8605461, 8605280, 8605824, 8605404, 8605601, 8605246, 8605323, 8605712, 8605346, 8605743, 8605242, 8605321, 8605268 ],
  "score" : 260,
  "text" : "",
  "time" : 1415926359,
  "title" : "Go is moving to GitHub",
  "type" : "story",
  "url" : "https://groups.google.com/forum/#!topic/golang-dev/sckirqOWepg"
}
```

And store the JSON document in Elasticsearch:

`curl -XPUT http://localhost:9200/hn/story/***item['id']*** -d @doc.json`

where `***item['id']***` is the ID of the document we just retrieved and `@doc.json` is the body of the document we just downloaded.
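
Before diving into the repo, here is a minimal, synchronous sketch of that whole fetch-and-index loop using nothing but the Python 3 standard library. The host, index and type names (`localhost:9200`, `hn`, `story`) are the ones used throughout this README; the 5-item limit is just for illustration. The real, asynchronous Tornado version follows in the next section.

```
import json
import urllib.request

HN_API = "https://hacker-news.firebaseio.com/v0"
ES_URL = "http://localhost:9200/hn"


def fetch_json(url):
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


top_ids = fetch_json("%s/topstories.json" % HN_API)

for item_id in top_ids[:5]:  # just the first 5 items for this sketch
    item = fetch_json("%s/item/%s.json" % (HN_API, item_id))
    if not item or item.get("type") != "story":
        continue
    # PUT the document, using the HN item ID as the Elasticsearch document ID
    request = urllib.request.Request(
        "%s/story/%s" % (ES_URL, item["id"]),
        data=json.dumps(item).encode("utf-8"),
        method="PUT",
    )
    print(urllib.request.urlopen(request).status)
```
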

#### Got it, show me some real code!

Check out the full Python code here: [src/update.py](src/update.py)

This is the loop over the top 100 IDs:

```
response = yield http_client.fetch('https://hacker-news.firebaseio.com/v0/topstories.json?print=pretty')
top100_ids = json.loads(response.body)

for item_id in top100_ids:
    yield download_and_index_item(item_id)

print("Done")
```

and this (shortened) piece downloads and indexes the individual items:

```
def download_and_index_item(item_id):

    url = "https://hacker-news.firebaseio.com/v0/item/%s.json?print=pretty" % item_id
    response = yield http_client.fetch(url)
    item = json.loads(response.body)

    # all sorts of clean-up of "item"

    es_url = "http://localhost:9200/hn/%s/%s" % (item['type'], item['id'])
    request = HTTPRequest(es_url, method="PUT", body=json.dumps(item), request_timeout=10)
    response = yield http_client.fetch(request)
    if response.code not in (200, 201):
        print("\nfailed to add item %s" % item['id'])
    else:
        sys.stdout.write('.')
```


#### Ok, but where's the data?

Once we have a batch of HN articles in ES, we can run queries.

`curl "http://localhost:9200/hn/story/_search?pretty"`

gives us all the stories (well, the first 10 really, since ES returns 10 results by default).

All stories for a given user:

`curl "http://localhost:9200/hn/story/_search?q=by:davecheney&pretty"`

We can also run aggregations to see who posted the most stories and which domains are the most popular:

```
curl -XGET 'http://localhost:9200/hn/story/_search?search_type=count' -d '
{ "aggs" : { "domains" : { "terms" : { "field" : "domain", "size": 11 } }, "by" : { "terms" : { "field" : "by", "size": 5 } } } }'
```

returning something like this:

```
{ "aggregations": {
    "by": {
      "buckets": [
        { "doc_count": 5,
          "key": "luu" },
        { "doc_count": 3,
          "key": "benbreen" },
        { "doc_count": 3,
          "key": "dnetesn" },
        ...
      ]
    },
    "domains": {
      "buckets": [
        { "doc_count": 6,
          "key": "github.com" },
        { "doc_count": 4,
          "key": "medium.com" },
        ...
      ]
    }
  }
}
```
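
Since the rest of the project is Python, here is a rough sketch (standard-library Python 3) of sending that same aggregation from a script instead of curl. The host and index names are the ones assumed throughout this README:

```
import json
import urllib.request

query = {
    "aggs": {
        "domains": {"terms": {"field": "domain", "size": 11}},
        "by": {"terms": {"field": "by", "size": 5}},
    }
}

# urllib sends this as a POST, which the _search endpoint accepts as well
request = urllib.request.Request(
    "http://localhost:9200/hn/story/_search?search_type=count",
    data=json.dumps(query).encode("utf-8"),
)
result = json.loads(urllib.request.urlopen(request).read().decode("utf-8"))

# print the most popular domains and the most active posters
for agg in ("domains", "by"):
    for bucket in result["aggregations"][agg]["buckets"]:
        print(agg, bucket["key"], bucket["doc_count"])
```
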


#### What can we do better?

##### Field Mappings

Elasticsearch does a pretty good job of figuring out what type a field is, but sometimes it can use a little help.
Run this query to see how ES maps each field of the `story` type:

`curl -XGET 'http://localhost:9200/hn/_mapping/story'`

It all looks pretty straightforward, but one mapping sticks out:

```
"time": {
    "type": "long"
},
```

The type `long` is ok, but what we really want is the type `date` so we can take advantage of the built-in date operators and aggregations.
Let's set up an index mapping for `time`. Note that the mapping has to be in place before the documents are indexed, so delete and re-create the index and then re-run the import; `src/create_index.sh` does exactly that:

```
curl -XPUT "http://localhost:9200/hn/" -d '{
    "mappings" : {
        "story" : {
            "properties" : {
                "time" : { "type" : "date" }
            }
        }
    }
}'
```
That should do the trick, so now we can run a query to see how many stories are being posted to the HN Top 100 per week:

```
curl -XGET 'http://localhost:9200/hn/story/_search?search_type=count' -d '
{
    "aggs" : {
        "articles_over_time" : {
            "date_histogram" : {
                "field" : "time",
                "interval" : "1w"
            }
        }
    }
}
'
```
Result:

```
{ "aggregations": {
    "articles_over_time": {
      "buckets": [
        { "doc_count": 1609,
          "key": 1413158400000,
          "key_as_string": "2014-10-13T00:00:00.000Z"
        },
        { "doc_count": 1195,
          "key": 1413763200000,
          "key_as_string": "2014-10-20T00:00:00.000Z"
        },
        { "doc_count": 1236,
          "key": 1414368000000,
          "key_as_string": "2014-10-27T00:00:00.000Z"
        },
        { "doc_count": 1304,
          "key": 1414972800000,
          "key_as_string": "2014-11-03T00:00:00.000Z"
        }
      ]
    }
  }
}
```



##### Other possible future improvements

- use the bulk API (see the sketch below)
- more interesting queries
- simple web interface to query ES
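
For the first item on that list, a rough sketch of what a bulk version of the indexing step could look like (standard-library Python 3; `items` stands for the list of cleaned-up item dicts that `update.py` currently PUTs one by one):

```
import json
import urllib.request


def bulk_index(items, es_host="http://localhost:9200"):
    # the _bulk body is newline-delimited JSON: an action line, then the
    # document itself, and it has to end with a trailing newline
    lines = []
    for item in items:
        lines.append(json.dumps({"index": {"_index": "hn",
                                           "_type": item["type"],
                                           "_id": item["id"]}}))
        lines.append(json.dumps(item))
    body = "\n".join(lines) + "\n"

    request = urllib.request.Request("%s/_bulk" % es_host,
                                     data=body.encode("utf-8"))
    response = json.loads(urllib.request.urlopen(request).read().decode("utf-8"))
    print("bulk errors:", response["errors"])
```
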

#### Feedback

Open pull requests, issues or email me at o@21zoo.com

--------------------------------------------------------------------------------
/src/create_index.sh:
--------------------------------------------------------------------------------
#!/bin/sh

# careful: this deletes ALL existing indices before re-creating the "hn" index
curl -XDELETE "http://localhost:9200/*"

curl -XPUT "http://localhost:9200/hn/" -d '{
    "settings" : {
        "index" : {
            "number_of_shards" : 2,
            "number_of_replicas" : 0
        }
    },
    "mappings" : {
        "story" : {
            "_source" : { "enabled" : true },
            "properties" : {
                "time" : { "type" : "date" },
                "domain": { "type" : "string", "index" : "not_analyzed" },
                "by": { "type" : "string", "index" : "not_analyzed" }
            }
        },
        "job" : {
            "_source" : { "enabled" : true },
            "properties" : {
                "time" : { "type" : "date" },
                "domain": { "type" : "string", "index" : "not_analyzed" },
                "by": { "type" : "string", "index" : "not_analyzed" }
            }
        },
        "poll" : {
            "_source" : { "enabled" : true },
            "properties" : {
                "time" : { "type" : "date" },
                "domain": { "type" : "string", "index" : "not_analyzed" },
                "by": { "type" : "string", "index" : "not_analyzed" }
            }
        }
    }
}'
echo ""

--------------------------------------------------------------------------------
/src/update.py:
--------------------------------------------------------------------------------
from tornado.httpclient import AsyncHTTPClient, HTTPRequest
from tornado.ioloop import IOLoop
import tornado.gen
import tornado.options
import json
import sys
try:
    from urllib.parse import urlparse
except ImportError:  # Python 2 fallback
    from urlparse import urlparse


STORIES_ONLY = True

http_client = AsyncHTTPClient()


@tornado.gen.coroutine
def download_and_index_item(item_id):

    url = "https://hacker-news.firebaseio.com/v0/item/%s.json?print=pretty" % item_id
    h = {'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36"}
    response = yield http_client.fetch(url, headers=h)
    item = json.loads(response.body.decode('utf-8'))

    # the comment tree is not needed
    if 'kids' in item:
        item.pop('kids')

    if STORIES_ONLY and item['type'] != 'story':
        print("\nskipped item %s" % item['id'])
        return

    # fall back to the HN discussion page if the item has no URL of its own
    if 'url' not in item or not item['url']:
        item['url'] = "http://news.ycombinator.com/item?id=%s" % item['id']
        item['domain'] = "news.ycombinator.com"
    else:
        u = urlparse(item['url'])
        item['domain'] = u.hostname.replace("www.", "") if u.hostname else ""

    # ES expects dates as milliseconds since the epoch
    item['time'] = int(item['time']) * 1000

    es_url = "http://localhost:9200/hn/%s/%s" % (item['type'], item['id'])
    request = HTTPRequest(es_url, method="PUT", body=json.dumps(item), request_timeout=10)
    response = yield http_client.fetch(request)
    if response.code not in (200, 201):
        print("\nfailed to add item %s" % item['id'])
    else:
        sys.stdout.write('.')
        sys.stdout.flush()


@tornado.gen.coroutine
def download_topstories():
    response = yield http_client.fetch('https://hacker-news.firebaseio.com/v0/topstories.json?print=pretty')
    top100_ids = json.loads(response.body.decode('utf-8'))
    print("Got Top 100")

    for item_id in top100_ids:
        yield download_and_index_item(item_id)

    print("Done")


if __name__ == '__main__':
    print("Starting")
    IOLoop.instance().run_sync(download_topstories)
--------------------------------------------------------------------------------