├── README.md
└── src
    ├── create_index.sh
    └── update.py

/README.md:
--------------------------------------------------------------------------------
Elasticsearch For Beginners: Index and Search Hacker News
================


#### Big picture plz?

Hacker News officially released their [API](http://blog.ycombinator.com/hacker-news-api) this October, giving access to a vast amount of news articles, comments, polls, job postings, etc., all delivered as JSON, which is easy to feed into Elasticsearch.

[Elasticsearch](http://elasticsearch.org) is currently the most popular open-source search engine, used for a wide variety of use cases. It works natively with JSON documents, so this sounds like a perfect fit.

It runs on a [DigitalOcean 512MB droplet](https://m.do.co/c/c9b25dec9715), which hosts the Elasticsearch node and a simple Tornado app for the frontend. Crontab runs the update every 5 minutes.


#### Prerequisites

Set up Elasticsearch and make sure it's running at [http://localhost:9200](http://localhost:9200).

See [here](https://www.elastic.co/guide/en/elasticsearch/guide/current/running-elasticsearch.html) if you need more information on how to install Elasticsearch.

I use Python and [Tornado](https://github.com/tornadoweb/tornado/) for the scripts that import and query the data.



#### Aight, so what are we doing?

We'll start by loading the IDs of the Top 100 HN stories, retrieve detailed information about each item and then index them in Elasticsearch.


Top 100 Stories:

`curl https://hacker-news.firebaseio.com/v0/topstories.json?print=pretty`

The result looks something like this:

```
[ 8605204, 8604814, 8602936, 8604489, 8604533, 8604626, 8605207, 8605186,
  ...
  8603147, 8602037 ]
```

We can now loop through the IDs and retrieve more detailed information for each item:

`curl https://hacker-news.firebaseio.com/v0/item/8605204.json?print=pretty`

yields this:

```
{
  "by" : "davecheney",
  "id" : 8605204,
  "kids" : [ 8605567, 8605461, 8605280, 8605824, 8605404, 8605601, 8605246, 8605323, 8605712, 8605346, 8605743, 8605242, 8605321, 8605268 ],
  "score" : 260,
  "text" : "",
  "time" : 1415926359,
  "title" : "Go is moving to GitHub",
  "type" : "story",
  "url" : "https://groups.google.com/forum/#!topic/golang-dev/sckirqOWepg"
}
```

And store the JSON document in Elasticsearch:

`curl -XPUT http://localhost:9200/hn/story/***item['id']*** -d @doc.json`

where `***item['id']***` is the ID of the document we just retrieved and `@doc.json` is the body of the document we just downloaded.
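
Before diving into the repo, here is a minimal, synchronous sketch of that whole fetch-and-index loop using nothing but the Python 3 standard library. The host, index and type names (`localhost:9200`, `hn`, `story`) are the ones used throughout this README; the 5-item limit is just for illustration. The real, asynchronous Tornado version follows in the next section.

```
import json
import urllib.request

HN_API = "https://hacker-news.firebaseio.com/v0"
ES_URL = "http://localhost:9200/hn"


def fetch_json(url):
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


top_ids = fetch_json("%s/topstories.json" % HN_API)

for item_id in top_ids[:5]:  # just the first 5 items for this sketch
    item = fetch_json("%s/item/%s.json" % (HN_API, item_id))
    if not item or item.get("type") != "story":
        continue
    # PUT the document, using the HN item ID as the Elasticsearch document ID
    request = urllib.request.Request(
        "%s/story/%s" % (ES_URL, item["id"]),
        data=json.dumps(item).encode("utf-8"),
        method="PUT",
    )
    print(urllib.request.urlopen(request).status)
```
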

#### Got it, show me some real code!

Check out the full Python code here: [src/update.py](src/update.py)

This is the loop over the top 100 IDs:

```
response = yield http_client.fetch('https://hacker-news.firebaseio.com/v0/topstories.json?print=pretty')
top100_ids = json.loads(response.body)

for item_id in top100_ids:
    yield download_and_index_item(item_id)

print("Done")
```

and this (shortened) piece downloads and indexes the individual items:

```
def download_and_index_item(item_id):

    url = "https://hacker-news.firebaseio.com/v0/item/%s.json?print=pretty" % item_id
    response = yield http_client.fetch(url)
    item = json.loads(response.body)

    # all sorts of clean-up of "item"

    es_url = "http://localhost:9200/hn/%s/%s" % (item['type'], item['id'])
    request = HTTPRequest(es_url, method="PUT", body=json.dumps(item), request_timeout=10)
    response = yield http_client.fetch(request)
    if response.code not in (200, 201):
        print("\nfailed to add item %s" % item['id'])
    else:
        sys.stdout.write('.')
```


#### Ok, but where's the data?

Once we have a batch of HN articles in ES, we can run queries.

`curl "http://localhost:9200/hn/story/_search?pretty"`

gives us all the stories (well, the first 10 really, since ES returns 10 results by default).

All stories for a given user:

`curl "http://localhost:9200/hn/story/_search?q=by:davecheney&pretty"`

We can also run aggregations to see who posted the most stories and which domains are the most popular:

```
curl -XGET 'http://localhost:9200/hn/story/_search?search_type=count' -d '
{ "aggs" : { "domains" : { "terms" : { "field" : "domain", "size": 11 } }, "by" : { "terms" : { "field" : "by", "size": 5 } } } }'
```

returning something like this:

```
{ "aggregations": {
    "by": {
      "buckets": [
        { "doc_count": 5,
          "key": "luu" },
        { "doc_count": 3,
          "key": "benbreen" },
        { "doc_count": 3,
          "key": "dnetesn" },
        ...
      ]
    },
    "domains": {
      "buckets": [
        { "doc_count": 6,
          "key": "github.com" },
        { "doc_count": 4,
          "key": "medium.com" },
        ...
      ]
    }
  }
}
```
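
Since the rest of the project is Python, here is a rough sketch (standard-library Python 3) of sending that same aggregation from a script instead of curl. The host and index names are the ones assumed throughout this README:

```
import json
import urllib.request

query = {
    "aggs": {
        "domains": {"terms": {"field": "domain", "size": 11}},
        "by": {"terms": {"field": "by", "size": 5}},
    }
}

# urllib sends this as a POST, which the _search endpoint accepts as well
request = urllib.request.Request(
    "http://localhost:9200/hn/story/_search?search_type=count",
    data=json.dumps(query).encode("utf-8"),
)
result = json.loads(urllib.request.urlopen(request).read().decode("utf-8"))

# print the most popular domains and the most active posters
for agg in ("domains", "by"):
    for bucket in result["aggregations"][agg]["buckets"]:
        print(agg, bucket["key"], bucket["doc_count"])
```
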


#### What can we do better?

##### Field Mappings

Elasticsearch does a pretty good job of figuring out what type a field is, but sometimes it can use a little help.
Run this query to see how ES maps each field of the `story` type:

`curl -XGET 'http://localhost:9200/hn/_mapping/story'`

It all looks pretty straightforward, but one mapping sticks out:

```
"time": {
    "type": "long"
},
```

The type `long` is ok, but what we really want is the type `date` so we can take advantage of the built-in date operators and aggregations.
Let's set up an index mapping for `time`. Note that the mapping has to be in place before the documents are indexed, so delete and re-create the index and then re-run the import; `src/create_index.sh` does exactly that:

```
curl -XPUT "http://localhost:9200/hn/" -d '{
    "mappings" : {
        "story" : {
            "properties" : {
                "time" : { "type" : "date" }
            }
        }
    }
}'
```
That should do the trick, so now we can run a query to see how many stories are being posted to the HN Top 100 per week:

```
curl -XGET 'http://localhost:9200/hn/story/_search?search_type=count' -d '
{
    "aggs" : {
        "articles_over_time" : {
            "date_histogram" : {
                "field" : "time",
                "interval" : "1w"
            }
        }
    }
}
'
```
Result:

```
{ "aggregations": {
    "articles_over_time": {
      "buckets": [
        { "doc_count": 1609,
          "key": 1413158400000,
          "key_as_string": "2014-10-13T00:00:00.000Z"
        },
        { "doc_count": 1195,
          "key": 1413763200000,
          "key_as_string": "2014-10-20T00:00:00.000Z"
        },
        { "doc_count": 1236,
          "key": 1414368000000,
          "key_as_string": "2014-10-27T00:00:00.000Z"
        },
        { "doc_count": 1304,
          "key": 1414972800000,
          "key_as_string": "2014-11-03T00:00:00.000Z"
        }
      ]
    }
  }
}
```



##### Other possible future improvements

- use the bulk API (see the sketch below)
- more interesting queries
- simple web interface to query ES
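
For the first item on that list, a rough sketch of what a bulk version of the indexing step could look like (standard-library Python 3; `items` stands for the list of cleaned-up item dicts that `update.py` currently PUTs one by one):

```
import json
import urllib.request


def bulk_index(items, es_host="http://localhost:9200"):
    # the _bulk body is newline-delimited JSON: an action line, then the
    # document itself, and it has to end with a trailing newline
    lines = []
    for item in items:
        lines.append(json.dumps({"index": {"_index": "hn",
                                           "_type": item["type"],
                                           "_id": item["id"]}}))
        lines.append(json.dumps(item))
    body = "\n".join(lines) + "\n"

    request = urllib.request.Request("%s/_bulk" % es_host,
                                     data=body.encode("utf-8"))
    response = json.loads(urllib.request.urlopen(request).read().decode("utf-8"))
    print("bulk errors:", response["errors"])
```
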

#### Feedback

Open pull requests, issues or email me at o@21zoo.com

--------------------------------------------------------------------------------
/src/create_index.sh:
--------------------------------------------------------------------------------
#!/bin/sh

# careful: this deletes ALL existing indices before re-creating the "hn" index
curl -XDELETE "http://localhost:9200/*"

curl -XPUT "http://localhost:9200/hn/" -d '{
    "settings" : {
        "index" : {
            "number_of_shards" : 2,
            "number_of_replicas" : 0
        }
    },
    "mappings" : {
        "story" : {
            "_source" : { "enabled" : true },
            "properties" : {
                "time" : { "type" : "date" },
                "domain": { "type" : "string", "index" : "not_analyzed" },
                "by": { "type" : "string", "index" : "not_analyzed" }
            }
        },
        "job" : {
            "_source" : { "enabled" : true },
            "properties" : {
                "time" : { "type" : "date" },
                "domain": { "type" : "string", "index" : "not_analyzed" },
                "by": { "type" : "string", "index" : "not_analyzed" }
            }
        },
        "poll" : {
            "_source" : { "enabled" : true },
            "properties" : {
                "time" : { "type" : "date" },
                "domain": { "type" : "string", "index" : "not_analyzed" },
                "by": { "type" : "string", "index" : "not_analyzed" }
            }
        }
    }
}'
echo ""

--------------------------------------------------------------------------------
/src/update.py:
--------------------------------------------------------------------------------
from tornado.httpclient import AsyncHTTPClient, HTTPRequest
from tornado.ioloop import IOLoop
import tornado.gen
import tornado.options
import json
import sys
try:
    from urllib.parse import urlparse
except ImportError:  # Python 2 fallback
    from urlparse import urlparse


STORIES_ONLY = True

http_client = AsyncHTTPClient()


@tornado.gen.coroutine
def download_and_index_item(item_id):

    url = "https://hacker-news.firebaseio.com/v0/item/%s.json?print=pretty" % item_id
    h = {'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36"}
    response = yield http_client.fetch(url, headers=h)
    item = json.loads(response.body.decode('utf-8'))

    # the comment tree is not needed
    if 'kids' in item:
        item.pop('kids')

    if STORIES_ONLY and item['type'] != 'story':
        print("\nskipped item %s" % item['id'])
        return

    # fall back to the HN discussion page if the item has no URL of its own
    if 'url' not in item or not item['url']:
        item['url'] = "http://news.ycombinator.com/item?id=%s" % item['id']
        item['domain'] = "news.ycombinator.com"
    else:
        u = urlparse(item['url'])
        item['domain'] = u.hostname.replace("www.", "") if u.hostname else ""

    # ES expects dates as milliseconds since the epoch
    item['time'] = int(item['time']) * 1000

    es_url = "http://localhost:9200/hn/%s/%s" % (item['type'], item['id'])
    request = HTTPRequest(es_url, method="PUT", body=json.dumps(item), request_timeout=10)
    response = yield http_client.fetch(request)
    if response.code not in (200, 201):
        print("\nfailed to add item %s" % item['id'])
    else:
        sys.stdout.write('.')
        sys.stdout.flush()


@tornado.gen.coroutine
def download_topstories():
    response = yield http_client.fetch('https://hacker-news.firebaseio.com/v0/topstories.json?print=pretty')
    top100_ids = json.loads(response.body.decode('utf-8'))
    print("Got Top 100")

    for item_id in top100_ids:
        yield download_and_index_item(item_id)

    print("Done")


if __name__ == '__main__':
    print("Starting")
    IOLoop.instance().run_sync(download_topstories)
--------------------------------------------------------------------------------