├── .github
    └── workflows
    │   └── tests.yml
├── .gitignore
├── README.md
├── requirements.txt
├── sample.mbox
└── src
    └── index_emails.py


/.github/workflows/tests.yml:
--------------------------------------------------------------------------------
 1 | name: CI
 2 | 
 3 | on:
 4 |   push:
 5 |     branches:
 6 |       - master
 7 |   pull_request:
 8 | 
 9 | jobs:
10 |   tests:
11 |     runs-on: ubuntu-latest
12 |     services:
13 |       es:
14 |         image: docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2
15 |         ports:
16 |           - 9200:9200
17 |         options: >-
18 |           --env http.port=9200
19 |           --env discovery.type=single-node
20 | 
21 |     steps:
22 |       - name: Checkout code
23 |         uses: actions/checkout@v4
24 | 
25 |       - name: Set up Python
26 |         uses: actions/setup-python@v5
27 |         with:
28 |           python-version: 3.12
29 | 
30 |       - name: Install dependencies
31 |         run: |
32 |           pip install --upgrade pip
33 |           pip install -r requirements.txt
34 | 
35 |       - name: Wait for Elasticsearch
36 |         run: |
37 |           sleep 10
38 |           curl -s http://localhost:9200
39 | 
40 |       - name: Run tests
41 |         run: python3 src/index_emails.py --infile=sample.mbox --es-url=http://localhost:9200
42 | 
43 | 


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | venv
2 | .idea
3 | *.pyc
4 | .vscode
5 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | Elasticsearch For Beginners: Indexing your Gmail Inbox (and more: Supports any mbox and MH mailboxes)
  2 | =======================
  3 | 
  4 | #### What's this all about?
  5 | 
  6 | I recently looked at my Gmail inbox and noticed that I have well over 50k emails, taking up about 12GB of space but there is no good way to tell what emails take up space, who sent them to, who emails me, etc
  7 | 
  8 | Goal of this tutorial is to load an entire Gmail inbox into Elasticsearch using bulk indexing and then start querying the cluster to get a better picture of what's going on.
  9 | 
 10 | 
 11 | #### Prerequisites
 12 | 
 13 | Set up [Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/guide/current/running-elasticsearch.html) and make sure it's running at [http://localhost:9200](http://localhost:9200)
 14 | 
 15 | A quick way to run Elasticsearch is using Docker: (the cors settings aren't really needed but come in handy if you want to use e.g. [dejavu](https://dejavu.appbase.io/) to explore the index)
 16 | ```
 17 | docker run --name es -d -p 9200:9200 -e http.port=9200 -e http.cors.enabled=true -e 'http.cors.allow-origin=*' -e http.cors.allow-headers=X-Requested-With,X-Auth-Token,Content-Type,Content-Length,Authorization -e http.cors.allow-credentials=true -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2
 18 | ```
 19 | 
 20 | I use Python and [Tornado](https://github.com/tornadoweb/tornado/) for the scripts to import and query the data. Also `beautifulsoup4` for the stripping HTML/JS/CSS (if you want to use the body indexing flag).
 21 | 
 22 | Install the dependencies by running:
 23 | 
 24 | `pip3 install -r requirements.txt`
 25 | 
 26 | 
 27 | #### Aight, where do we start?
 28 | 
 29 | First, go [here](https://www.google.com/settings/takeout/custom/gmail) and download your Gmail mailbox, depending on the amount of emails you have accumulated this might take a while.
 30 | There's also a small `sample.mbox` file included in the repo for you to play around with while you're waiting for Google to prepare your download.
 31 | 
 32 | The downloaded archive is in the [mbox format](http://en.wikipedia.org/wiki/Mbox) and Python provides libraries to work with the mbox format so that's easy.
 33 | 
 34 | You can run the code (assuming Elasticsearch is running at localhost:9200) with the sammple mbox file like this:
 35 | ```
 36 | $ python3 src/index_emails.py --infile=sample.mbox
 37 | [I index_emails:173] Starting import from file sample.mbox
 38 | [I index_emails:101] Upload: OK - upload took: 1033ms, total messages uploaded:      3
 39 | [I index_emails:197] Import done - total count 16
 40 | $
 41 | ```
 42 | 
 43 | Note: All examples focus on Gmail inboxes. Substitute any `--infile=` parameters with `--indir=` pointing to an MH directory to make them work with MH mailboxes instead.
 44 | 
 45 | #### The Source Code
 46 | 
 47 | The overall program will look something like this:
 48 | 
 49 | ```python
 50 | mbox = mailbox.mbox('emails.mbox') // or mailbox.MH('inbox/')
 51 | 
 52 | for msg in mbox:
 53 |     item = convert_msg_to_json(msg)
 54 | 	upload_item_to_es(item)
 55 | 
 56 | print "Done!"
 57 | ```
 58 | 
 59 | #### Ok, tell me more about the details
 60 | 
 61 | The full Python code is here: [src/index_emails.py](src/index_emails.py)
 62 | 
 63 | 
 64 | ##### Turn mailbox into JSON
 65 | 
 66 | First, we got to turn the messages into JSON so we can insert it into Elasticsearch. [Here](http://nbviewer.ipython.org/github/furukama/Mining-the-Social-Web-2nd-Edition/blob/master/ipynb/Chapter%206%20-%20Mining%20Mailboxes.ipynb) is some sample code that was very useful when it came to normalizing and cleaning up the data.
 67 | 
 68 | A good first step:
 69 | 
 70 | ```python
 71 | def convert_msg_to_json(msg):
 72 |     result = {'parts': []}
 73 |     for (k, v) in msg.items():
 74 |         result[k.lower()] = v.decode('utf-8', 'ignore')
 75 | 
 76 | ```
 77 | 
 78 | Additionally, you also want to parse and normalize the `From` and `To` email addresses:
 79 | 
 80 | ```python
 81 | for k in ['to', 'cc', 'bcc']:
 82 |     if not result.get(k):
 83 |         continue
 84 |     emails_split = result[k].replace('\n', '').replace('\t', '').replace('\r', '').replace(' ', '').encode('utf8').decode('utf-8', 'ignore').split(',')
 85 |     result[k] = [ normalize_email(e) for e in emails_split]
 86 | 
 87 | if "from" in result:
 88 |     result['from'] = normalize_email(result['from'])
 89 | ```
 90 | 
 91 | Elasticsearch expects timestamps to be in microseconds so let's convert the date accordingly
 92 | 
 93 | ```python
 94 | if "date" in result:
 95 |     tt = email.utils.parsedate_tz(result['date'])
 96 |     result['date_ts'] = int(calendar.timegm(tt) - tt[9]) * 1000
 97 | ```
 98 | 
 99 | We also need to split up and normalize the labels
100 | 
101 | ```python
102 | labels = []
103 | if "x-gmail-labels" in result:
104 |     labels = [l.strip().lower() for l in result["x-gmail-labels"].split(',')]
105 |     del result["x-gmail-labels"]
106 | result['labels'] = labels
107 | ```
108 | 
109 | Email size is also interesting so let's break that out
110 | 
111 | ```python
112 | parts = json_msg.get("parts", [])
113 | json_msg['content_size_total'] = 0
114 | for part in parts:
115 |     json_msg['content_size_total'] += len(part.get('content', ""))
116 | 
117 | ```
118 | 
119 | 
120 | ##### Index the data with Elasticsearch
121 | 
122 | The most simple approach is a PUT request per item:
123 | 
124 | ```python
125 | def upload_item_to_es(item):
126 |     es_url = "http://localhost:9200/gmail/email/%s" % (item['message-id'])
127 |     request = HTTPRequest(es_url, method="PUT", body=json.dumps(item), request_timeout=10)
128 |     response = yield http_client.fetch(request)
129 |     if not response.code in [200, 201]:
130 |         print "\nfailed to add item %s" % item['message-id']
131 | 
132 | ```
133 | 
134 | However, Elasticsearch provides a better method for importing large chunks of data: [bulk indexing](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html)
135 | Instead of making a HTTP request per document and indexing individually, we batch them in chunks of eg. 1000 documents and then index them.<br>
136 | Bulk messages are of the format:
137 | 
138 | ```
139 | cmd\n
140 | doc\n
141 | cmd\n
142 | doc\n
143 | ...
144 | ```
145 | 
146 | where `cmd` is the control message for each `doc` we want to index.
147 | For our example, `cmd` would look like this:
148 | 
149 | ```
150 | cmd = {'index': {'_index': 'gmail', '_type': 'email', '_id': item['message-id']}}`
151 | ```
152 | 
153 | The final code looks something like this:
154 | 
155 | ```python
156 | upload_data = list()
157 | for msg in mbox:
158 |     item = convert_msg_to_json(msg)
159 |     upload_data.append(item)
160 |     if len(upload_data) == 100:
161 |         upload_batch(upload_data)
162 |         upload_data = list()
163 | 
164 | if upload_data:
165 |     upload_batch(upload_data)
166 | 
167 | ```
168 | and
169 | 
170 | ```python
171 | def upload_batch(upload_data):
172 | 
173 |     upload_data_txt = ""
174 |     for item in upload_data:
175 |         cmd = {'index': {'_index': 'gmail', '_type': 'email', '_id': item['message-id']}}
176 |         upload_data_txt += json.dumps(cmd) + "\n"
177 |         upload_data_txt += json.dumps(item) + "\n"
178 | 
179 |     request = HTTPRequest("http://localhost:9200/_bulk", method="POST", body=upload_data_txt, request_timeout=240)
180 |     response = http_client.fetch(request)
181 |     result = json.loads(response.body)
182 | 	if 'errors' in result:
183 | 	    print result['errors']
184 | ```
185 | 
186 | 
187 | 
188 | #### Ok, show me some data!
189 | 
190 | After indexing all your emails, we can start running queries.
191 | 
192 | 
193 | ##### Filters
194 | 
195 | If you want to search for emails from the last 6 months, you can use the range filter and search for `gte` the current time (`now`) minus 6 month:
196 | 
197 | ```
198 | curl -XGET 'http://localhost:9200/gmail/email/_search?pretty' -d '{
199 | "filter": { "range" : { "date_ts" : { "gte": "now-6M" } } } }
200 | '
201 | ```
202 | 
203 | or you can filter for all emails from 2014 by using `gte` and `lt`
204 | 
205 | ```
206 | curl -XGET 'http://localhost:9200/gmail/email/_search?pretty' -d '{
207 | "filter": { "range" : { "date_ts" : { "gte": "2013-01-01T00:00:00.000Z", "lt": "2014-01-01T00:00:00.000Z" } } } }
208 | '
209 | ```
210 | 
211 | You can also quickly query for certain fields via the `q` parameter. This example shows you all your Amazon shipping info emails:
212 | 
213 | ```
214 | curl "localhost:9200/gmail/email/_search?pretty&q=from:ship-confirm@amazon.com"
215 | ```
216 | 
217 | ##### Aggregation queries
218 | 
219 | Aggregation queries let us bucket data by a given key and count the number of messages per bucket.
220 | For example, number of messages grouped by recipient:
221 | 
222 | ```
223 | curl -XGET 'http://localhost:9200/gmail/email/_search?pretty&search_type=count' -d '{
224 | "aggs": { "emails": { "terms" : { "field" : "to",  "size": 10 }
225 | } } }
226 | '
227 | ```
228 | 
229 | Result:
230 | 
231 | ```
232 | "aggregations" : {
233 | "emails" : {
234 |   "buckets" : [ {
235 |        "key" : "noreply@github.com",
236 |        "doc_count" : 1920
237 |   }, { "key" : "oliver@gmail.com",
238 |        "doc_count" : 1326
239 |   }, { "key" : "michael@gmail.com",
240 |        "doc_count" : 263
241 |   }, { "key" : "david@gmail.com",
242 |        "doc_count" : 232
243 |   }
244 |   ...
245 |   ]
246 | }
247 | ```
248 | 
249 | This one gives us the number of emails per label:
250 | 
251 | ```
252 | curl -XGET 'http://localhost:9200/gmail/email/_search?pretty&search_type=count' -d '{
253 | "aggs": { "labels": { "terms" : { "field" : "labels",  "size": 10 }
254 | } } }
255 | '
256 | ```
257 | 
258 | Result:
259 | 
260 | ```
261 | "hits" : {
262 |   "total" : 51794,
263 | },
264 | "aggregations" : {
265 | "labels" : {
266 |   "buckets" : [       {
267 |        "key" : "important",
268 |        "doc_count" : 15430
269 |   }, { "key" : "github",
270 |        "doc_count" : 4928
271 |   }, { "key" : "sent",
272 |        "doc_count" : 4285
273 |   }, { "key" : "unread",
274 |        "doc_count" : 510
275 |   },
276 |   ...
277 |    ]
278 | }
279 | ```
280 | 
281 | Use a `date histogram` you can also count how many emails you sent and received per year:
282 | 
283 | ```
284 | curl -s "localhost:9200/gmail/email/_search?pretty&search_type=count" -d '
285 | { "aggs": {
286 |     "years": {
287 |       "date_histogram": {
288 |         "field": "date_ts", "interval": "year"
289 | }}}}
290 | '
291 | ```
292 | 
293 | Result:
294 | 
295 | ```
296 | "aggregations" : {
297 | "years" : {
298 |   "buckets" : [ {
299 |     "key_as_string" : "2004-01-01T00:00:00.000Z",
300 |     "key" : 1072915200000,
301 |     "doc_count" : 585
302 |   }, {
303 | ...
304 |   }, {
305 |     "key_as_string" : "2013-01-01T00:00:00.000Z",
306 |     "key" : 1356998400000,
307 |     "doc_count" : 12832
308 |   }, {
309 |     "key_as_string" : "2014-01-01T00:00:00.000Z",
310 |     "key" : 1388534400000,
311 |     "doc_count" : 7283
312 |   } ]
313 | }
314 | ```
315 | 
316 | Write aggregation queries to work out how much you spent on Amazon/Steam:
317 | 
318 | ```
319 | GET _search
320 | {
321 |   "query": {
322 |     "match_all": {}
323 |       },
324 |       "size": 0,
325 |       "aggs": {
326 |         "group_by_company": {
327 |           "terms": {
328 |             "field": "order_details.merchant"
329 |             },
330 |             "aggs": {
331 |               "total_spent": {
332 |                 "sum": {
333 |                   "field": "order_details.order_total"
334 |                 }
335 |                 },
336 |                 "postage": {
337 |                   "sum": {
338 |                     "field": "order_details.postage"
339 |                   }
340 |                 }
341 |               }
342 |             }
343 |           }
344 |         }
345 | ```
346 | 
347 | 
348 | #### Todo
349 | 
350 | - more interesting queries
351 | - schema tweaks
352 | - multi-part message parsing
353 | - blurb about performance
354 | - ...
355 | 
356 | 
357 | 
358 | #### Feedback
359 | 
360 | Open a pull requests or an issue!
361 | 


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | beautifulsoup4==4.6.0
2 | chardet==3.0.4
3 | tornado==6.5.0
4 | 


--------------------------------------------------------------------------------
/sample.mbox:
--------------------------------------------------------------------------------
  1 |     
  2 | 	
  3 |     
  4 | From nobody Mon Sep 17 00:00:00 2001
  5 | From: A (zzz)
  6 |       U
  7 |       Thor
  8 |       <a.u.thor@example.com> (Comment)
  9 | Date: Fri, 9 Jun 2006 00:44:16 -0700
 10 | Subject: [PATCH] a commit.
 11 | 
 12 | Here is a patch from A U Thor.
 13 | 
 14 | ---
 15 |  foo |    2 +-
 16 |  1 files changed, 1 insertions(+), 1 deletions(-)
 17 | 
 18 | diff --git a/foo b/foo
 19 | index 9123cdc..918dcf8 100644
 20 | --- a/foo
 21 | +++ b/foo
 22 | @@ -1 +1 @@
 23 | -Fri Jun  9 00:44:04 PDT 2006
 24 | +Fri Jun  9 00:44:13 PDT 2006
 25 | -- 
 26 | 1.4.0.g6f2b
 27 | 
 28 | From nobody Mon Sep 17 00:00:00 2001
 29 | From: A U Thor <a.u.thor@example.com>
 30 | Date: Fri, 9 Jun 2006 00:44:16 -0700
 31 | Subject: [PATCH] another patch
 32 | 
 33 | Here is a patch from A U Thor.  This addresses the issue raised in the
 34 | message:
 35 | 
 36 | From: Nit Picker <nit.picker@example.net>
 37 | Subject: foo is too old
 38 | Message-Id: <nitpicker.12121212@example.net>
 39 | 
 40 | Hopefully this would fix the problem stated there.
 41 | 
 42 | 
 43 | I have included an extra blank line above, but it does not have to be
 44 | stripped away here, along with the               		   
 45 | whitespaces at the end of the above line.  They are expected to be squashed
 46 | when the message is made into a commit log by stripspace,
 47 | Also, there are three blank lines after this paragraph,
 48 | two truly blank and another full of spaces in between.
 49 | 
 50 |             
 51 | 
 52 | Hope this helps.
 53 | 
 54 | ---
 55 |  foo |    2 +-
 56 |  1 files changed, 1 insertions(+), 1 deletions(-)
 57 | 
 58 | diff --git a/foo b/foo
 59 | index 9123cdc..918dcf8 100644
 60 | --- a/foo
 61 | +++ b/foo
 62 | @@ -1 +1 @@
 63 | -Fri Jun  9 00:44:04 PDT 2006
 64 | +Fri Jun  9 00:44:13 PDT 2006
 65 | -- 
 66 | 1.4.0.g6f2b
 67 | 
 68 | From nobody Mon Sep 17 00:00:00 2001
 69 | From: Junio C Hamano <junio@kernel.org>
 70 | Date: Fri, 9 Jun 2006 00:44:16 -0700
 71 | Subject: re: [PATCH] another patch
 72 | 
 73 | From: A U Thor <a.u.thor@example.com>
 74 | Subject: [PATCH] third patch
 75 | 
 76 | Here is a patch from A U Thor.  This addresses the issue raised in the
 77 | message:
 78 | 
 79 | From: Nit Picker <nit.picker@example.net>
 80 | Subject: foo is too old
 81 | Message-Id: <nitpicker.12121212@example.net>
 82 | 
 83 | Hopefully this would fix the problem stated there.
 84 | 
 85 | ---
 86 |  foo |    2 +-
 87 |  1 files changed, 1 insertions(+), 1 deletions(-)
 88 | 
 89 | diff --git a/foo b/foo
 90 | index 9123cdc..918dcf8 100644
 91 | --- a/foo
 92 | +++ b/foo
 93 | @@ -1 +1 @@
 94 | -Fri Jun  9 00:44:04 PDT 2006
 95 | +Fri Jun  9 00:44:13 PDT 2006
 96 | -- 
 97 | 1.4.0.g6f2b
 98 | 
 99 | From nobody Sat Aug 27 23:07:49 2005
100 | Path: news.gmane.org!not-for-mail
101 | Message-ID: <20050721.091036.01119516.yoshfuji@linux-ipv6.org>
102 | From: YOSHIFUJI Hideaki / =?ISO-2022-JP?B?GyRCNUhGIzFRTEAbKEI=?= 
103 | 	<yoshfuji@linux-ipv6.org>
104 | Newsgroups: gmane.comp.version-control.git
105 | Subject: [PATCH 1/2] GIT: Try all addresses for given remote name
106 | Date: Thu, 21 Jul 2005 09:10:36 -0400 (EDT)
107 | Lines: 99
108 | Organization: USAGI/WIDE Project
109 | Approved: news@gmane.org
110 | NNTP-Posting-Host: main.gmane.org
111 | Mime-Version: 1.0
112 | Content-Type: Text/Plain; charset=us-ascii
113 | Content-Transfer-Encoding: 7bit
114 | X-Trace: sea.gmane.org 1121951434 29350 80.91.229.2 (21 Jul 2005 13:10:34 GMT)
115 | X-Complaints-To: usenet@sea.gmane.org
116 | NNTP-Posting-Date: Thu, 21 Jul 2005 13:10:34 +0000 (UTC)
117 | 
118 | Hello.
119 | 
120 | Try all addresses for given remote name until it succeeds.
121 | Also supports IPv6.
122 | 
123 | Signed-of-by: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
124 | 
125 | diff --git a/connect.c b/connect.c
126 | --- a/connect.c
127 | +++ b/connect.c
128 | @@ -96,42 +96,57 @@ static enum protocol get_protocol(const 
129 |  	die("I don't handle protocol '%s'", name);
130 |  }
131 |  
132 | -static void lookup_host(const char *host, struct sockaddr *in)
133 | -{
134 | -	struct addrinfo *res;
135 | -	int ret;
136 | -
137 | -	ret = getaddrinfo(host, NULL, NULL, &res);
138 | -	if (ret)
139 | -		die("Unable to look up %s (%s)", host, gai_strerror(ret));
140 | -	*in = *res->ai_addr;
141 | -	freeaddrinfo(res);
142 | -}
143 | +#define STR_(s)	# s
144 | +#define STR(s)	STR_(s)
145 |  
146 |  static int git_tcp_connect(int fd[2], const char *prog, char *host, char *path)
147 |  {
148 | -	struct sockaddr addr;
149 | -	int port = DEFAULT_GIT_PORT, sockfd;
150 | -	char *colon;
151 | -
152 | -	colon = strchr(host, ':');
153 | -	if (colon) {
154 | -		char *end;
155 | -		unsigned long n = strtoul(colon+1, &end, 0);
156 | -		if (colon[1] && !*end) {
157 | -			*colon = 0;
158 | -			port = n;
159 | +	int sockfd = -1;
160 | +	char *colon, *end;
161 | +	char *port = STR(DEFAULT_GIT_PORT);
162 | +	struct addrinfo hints, *ai0, *ai;
163 | +	int gai;
164 | +
165 | +	if (host[0] == '[') {
166 | +		end = strchr(host + 1, ']');
167 | +		if (end) {
168 | +			*end = 0;
169 | +			end++;
170 | +			host++;
171 | +		} else
172 | +			end = host;
173 | +	} else
174 | +		end = host;
175 | +	colon = strchr(end, ':');
176 | +
177 | +	if (colon)
178 | +		port = colon + 1;
179 | +
180 | +	memset(&hints, 0, sizeof(hints));
181 | +	hints.ai_socktype = SOCK_STREAM;
182 | +	hints.ai_protocol = IPPROTO_TCP;
183 | +
184 | +	gai = getaddrinfo(host, port, &hints, &ai);
185 | +	if (gai)
186 | +		die("Unable to look up %s (%s)", host, gai_strerror(gai));
187 | +
188 | +	for (ai0 = ai; ai; ai = ai->ai_next) {
189 | +		sockfd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
190 | +		if (sockfd < 0)
191 | +			continue;
192 | +		if (connect(sockfd, ai->ai_addr, ai->ai_addrlen) < 0) {
193 | +			close(sockfd);
194 | +			sockfd = -1;
195 | +			continue;
196 |  		}
197 | +		break;
198 |  	}
199 |  
200 | -	lookup_host(host, &addr);
201 | -	((struct sockaddr_in *)&addr)->sin_port = htons(port);
202 | +	freeaddrinfo(ai0);
203 |  
204 | -	sockfd = socket(PF_INET, SOCK_STREAM, IPPROTO_IP);
205 |  	if (sockfd < 0)
206 |  		die("unable to create socket (%s)", strerror(errno));
207 | -	if (connect(sockfd, (void *)&addr, sizeof(addr)) < 0)
208 | -		die("unable to connect (%s)", strerror(errno));
209 | +
210 |  	fd[0] = sockfd;
211 |  	fd[1] = sockfd;
212 |  	packet_write(sockfd, "%s %s\n", prog, path);
213 | 
214 | -- 
215 | YOSHIFUJI Hideaki @ USAGI Project  <yoshfuji@linux-ipv6.org>
216 | GPG-FP  : 9022 65EB 1ECF 3AD1 0BDF  80D8 4807 F894 E062 0EEA
217 | 
218 | From nobody Sat Aug 27 23:07:49 2005
219 | Path: news.gmane.org!not-for-mail
220 | Message-ID: <u5tacjjdpxq.fsf@lysator.liu.se>
221 | From: =?ISO8859-1?Q?David_K=E5gedal?= <davidk@lysator.liu.se>
222 | Newsgroups: gmane.comp.version-control.git
223 | Subject: [PATCH] Fixed two bugs in git-cvsimport-script.
224 | Date: Mon, 15 Aug 2005 20:18:25 +0200
225 | Lines: 83
226 | Approved: news@gmane.org
227 | NNTP-Posting-Host: main.gmane.org
228 | Mime-Version: 1.0
229 | Content-Type: text/plain; charset=ISO8859-1
230 | Content-Transfer-Encoding: QUOTED-PRINTABLE
231 | X-Trace: sea.gmane.org 1124130247 31839 80.91.229.2 (15 Aug 2005 18:24:07 GMT)
232 | X-Complaints-To: usenet@sea.gmane.org
233 | NNTP-Posting-Date: Mon, 15 Aug 2005 18:24:07 +0000 (UTC)
234 | Cc: "Junio C. Hamano" <junkio@cox.net>
235 | Original-X-From: git-owner@vger.kernel.org Mon Aug 15 20:24:05 2005
236 | 
237 | The git-cvsimport-script had a copule of small bugs that prevented me
238 | from importing a big CVS repository.
239 | 
240 | The first was that it didn't handle removed files with a multi-digit
241 | primary revision number.
242 | 
243 | The second was that it was asking the CVS server for "F" messages,
244 | although they were not handled.
245 | 
246 | I also updated the documentation for that script to correspond to
247 | actual flags.
248 | 
249 | Signed-off-by: David K=E5gedal <davidk@lysator.liu.se>
250 | ---
251 | 
252 |  Documentation/git-cvsimport-script.txt |    9 ++++++++-
253 |  git-cvsimport-script                   |    4 ++--
254 |  2 files changed, 10 insertions(+), 3 deletions(-)
255 | 
256 | 50452f9c0c2df1f04d83a26266ba704b13861632
257 | diff --git a/Documentation/git-cvsimport-script.txt b/Documentation/git=
258 | -cvsimport-script.txt
259 | --- a/Documentation/git-cvsimport-script.txt
260 | +++ b/Documentation/git-cvsimport-script.txt
261 | @@ -29,6 +29,10 @@ OPTIONS
262 |  	currently, only the :local:, :ext: and :pserver: access methods=20
263 |  	are supported.
264 | =20
265 | +-C <target-dir>::
266 | +        The GIT repository to import to.  If the directory doesn't
267 | +        exist, it will be created.  Default is the current directory.
268 | +
269 |  -i::
270 |  	Import-only: don't perform a checkout after importing.  This option
271 |  	ensures the working directory and cache remain untouched and will
272 | @@ -44,7 +48,7 @@ OPTIONS
273 | =20
274 |  -p <options-for-cvsps>::
275 |  	Additional options for cvsps.
276 | -	The options '-x' and '-A' are implicit and should not be used here.
277 | +	The options '-u' and '-A' are implicit and should not be used here.
278 | =20
279 |  	If you need to pass multiple options, separate them with a comma.
280 | =20
281 | @@ -57,6 +61,9 @@ OPTIONS
282 |  -h::
283 |  	Print a short usage message and exit.
284 | =20
285 | +-z <fuzz>::
286 | +        Pass the timestamp fuzz factor to cvsps.
287 | +
288 |  OUTPUT
289 |  ------
290 |  If '-v' is specified, the script reports what it is doing.
291 | diff --git a/git-cvsimport-script b/git-cvsimport-script
292 | --- a/git-cvsimport-script
293 | +++ b/git-cvsimport-script
294 | @@ -190,7 +190,7 @@ sub conn {
295 |  	$self->{'socketo'}->write("Root $repo\n");
296 | =20
297 |  	# Trial and error says that this probably is the minimum set
298 | -	$self->{'socketo'}->write("Valid-responses ok error Valid-requests Mo=
299 | de M Mbinary E F Checked-in Created Updated Merged Removed\n");
300 | +	$self->{'socketo'}->write("Valid-responses ok error Valid-requests Mo=
301 | de M Mbinary E Checked-in Created Updated Merged Removed\n");
302 | =20
303 |  	$self->{'socketo'}->write("valid-requests\n");
304 |  	$self->{'socketo'}->flush();
305 | @@ -691,7 +691,7 @@ while(<CVS>) {
306 |  		unlink($tmpname);
307 |  		my $mode =3D pmode($cvs->{'mode'});
308 |  		push(@new,[$mode, $sha, $fn]); # may be resurrected!
309 | -	} elsif($state =3D=3D 9 and /^\s+(\S+):\d(?:\.\d+)+->(\d(?:\.\d+)+)\(=
310 | DEAD\)\s*$/) {
311 | +	} elsif($state =3D=3D 9 and /^\s+(\S+):\d+(?:\.\d+)+->(\d+(?:\.\d+)+)=
312 | \(DEAD\)\s*$/) {
313 |  		my $fn =3D $1;
314 |  		$fn =3D~ s#^/+##;
315 |  		push(@old,$fn);
316 | 
317 | --=20
318 | David K=E5gedal
319 | -
320 | To unsubscribe from this list: send the line "unsubscribe git" in
321 | the body of a message to majordomo@vger.kernel.org
322 | More majordomo info at  http://vger.kernel.org/majordomo-info.html
323 | 
324 | From nobody Mon Sep 17 00:00:00 2001
325 | From: A U Thor <a.u.thor@example.com>
326 | References: <Pine.LNX.4.640.0001@woody.linux-foundation.org>
327 |  <Pine.LNX.4.640.0002@woody.linux-foundation.org>
328 |  <Pine.LNX.4.640.0003@woody.linux-foundation.org>
329 |  <Pine.LNX.4.640.0004@woody.linux-foundation.org>
330 |  <Pine.LNX.4.640.0005@woody.linux-foundation.org>
331 |  <Pine.LNX.4.640.0006@woody.linux-foundation.org>
332 |  <Pine.LNX.4.640.0007@woody.linux-foundation.org>
333 |  <Pine.LNX.4.640.0008@woody.linux-foundation.org>
334 |  <Pine.LNX.4.640.0009@woody.linux-foundation.org>
335 |  <Pine.LNX.4.640.0010@woody.linux-foundation.org>
336 |  <Pine.LNX.4.640.0011@woody.linux-foundation.org>
337 |  <Pine.LNX.4.640.0012@woody.linux-foundation.org>
338 |  <Pine.LNX.4.640.0013@woody.linux-foundation.org>
339 |  <Pine.LNX.4.640.0014@woody.linux-foundation.org>
340 |  <Pine.LNX.4.640.0015@woody.linux-foundation.org>
341 |  <Pine.LNX.4.640.0016@woody.linux-foundation.org>
342 |  <Pine.LNX.4.640.0017@woody.linux-foundation.org>
343 |  <Pine.LNX.4.640.0018@woody.linux-foundation.org>
344 |  <Pine.LNX.4.640.0019@woody.linux-foundation.org>
345 |  <Pine.LNX.4.640.0020@woody.linux-foundation.org>
346 |  <Pine.LNX.4.640.0021@woody.linux-foundation.org>
347 |  <Pine.LNX.4.640.0022@woody.linux-foundation.org>
348 |  <Pine.LNX.4.640.0023@woody.linux-foundation.org>
349 |  <Pine.LNX.4.640.0024@woody.linux-foundation.org>
350 |  <Pine.LNX.4.640.0025@woody.linux-foundation.org>
351 |  <Pine.LNX.4.640.0026@woody.linux-foundation.org>
352 |  <Pine.LNX.4.640.0027@woody.linux-foundation.org>
353 |  <Pine.LNX.4.640.0028@woody.linux-foundation.org>
354 |  <Pine.LNX.4.640.0029@woody.linux-foundation.org>
355 |  <Pine.LNX.4.640.0030@woody.linux-foundation.org>
356 |  <Pine.LNX.4.640.0031@woody.linux-foundation.org>
357 |  <Pine.LNX.4.640.0032@woody.linux-foundation.org>
358 |  <Pine.LNX.4.640.0033@woody.linux-foundation.org>
359 |  <Pine.LNX.4.640.0034@woody.linux-foundation.org>
360 |  <Pine.LNX.4.640.0035@woody.linux-foundation.org>
361 |  <Pine.LNX.4.640.0036@woody.linux-foundation.org>
362 |  <Pine.LNX.4.640.0037@woody.linux-foundation.org>
363 |  <Pine.LNX.4.640.0038@woody.linux-foundation.org>
364 |  <Pine.LNX.4.640.0039@woody.linux-foundation.org>
365 |  <Pine.LNX.4.640.0040@woody.linux-foundation.org>
366 |  <Pine.LNX.4.640.0041@woody.linux-foundation.org>
367 |  <Pine.LNX.4.640.0042@woody.linux-foundation.org>
368 |  <Pine.LNX.4.640.0043@woody.linux-foundation.org>
369 |  <Pine.LNX.4.640.0044@woody.linux-foundation.org>
370 |  <Pine.LNX.4.640.0045@woody.linux-foundation.org>
371 |  <Pine.LNX.4.640.0046@woody.linux-foundation.org>
372 |  <Pine.LNX.4.640.0047@woody.linux-foundation.org>
373 |  <Pine.LNX.4.640.0048@woody.linux-foundation.org>
374 |  <Pine.LNX.4.640.0049@woody.linux-foundation.org>
375 |  <Pine.LNX.4.640.0050@woody.linux-foundation.org>
376 | Date: Fri, 9 Jun 2006 00:44:16 -0700
377 | Subject: [PATCH] a commit.
378 | 
379 | Here is a patch from A U Thor.
380 | 
381 | ---
382 |  foo |    2 +-
383 |  1 files changed, 1 insertions(+), 1 deletions(-)
384 | 
385 | diff --git a/foo b/foo
386 | index 9123cdc..918dcf8 100644
387 | --- a/foo
388 | +++ b/foo
389 | @@ -1 +1 @@
390 | -Fri Jun  9 00:44:04 PDT 2006
391 | +Fri Jun  9 00:44:13 PDT 2006
392 | -- 
393 | 1.4.0.g6f2b
394 | 
395 | From nobody Mon Sep 17 00:00:00 2001
396 | From: A U Thor <a.u.thor@example.com>
397 | Date: Fri, 9 Jun 2006 00:44:16 -0700
398 | Subject: [PATCH] another patch
399 | 
400 | Here is an empty patch from A U Thor.
401 | 
402 | From nobody Mon Sep 17 00:00:00 2001
403 | From: Junio C Hamano <junio@kernel.org>
404 | Date: Fri, 9 Jun 2006 00:44:16 -0700
405 | Subject: re: [PATCH] another patch
406 | 
407 | From: A U Thor <a.u.thor@example.com>
408 | Subject: [PATCH] another patch
409 | >Here is an empty patch from A U Thor.
410 | 
411 | Hey you forgot the patch!
412 | 
413 | From nobody Mon Sep 17 00:00:00 2001
414 | From: A U Thor <a.u.thor@example.com>
415 | Date: Mon, 17 Sep 2001 00:00:00 +0900
416 | Mime-Version: 1.0
417 | Content-Type: Text/Plain; charset=us-ascii
418 | Content-Transfer-Encoding: Quoted-Printable
419 | 
420 | =0A=0AFrom: F U Bar <f.u.bar@example.com>
421 | Subject: [PATCH] updates=0A=0AThis is to fix diff-format documentation.
422 | 
423 | diff --git a/Documentation/diff-format.txt b/Documentation/diff-format.txt
424 | index b426a14..97756ec 100644
425 | --- a/Documentation/diff-format.txt
426 | +++ b/Documentation/diff-format.txt
427 | @@ -81,7 +81,7 @@ The "diff" formatting options can be customized via the
428 |  environment variable 'GIT_DIFF_OPTS'.  For example, if you
429 |  prefer context diff:
430 | =20
431 | -      GIT_DIFF_OPTS=3D-c git-diff-index -p $(cat .git/HEAD)
432 | +      GIT_DIFF_OPTS=3D-c git-diff-index -p HEAD
433 | =20
434 | =20
435 |  2. When the environment variable 'GIT_EXTERNAL_DIFF' is set, the
436 | From b9704a518e21158433baa2cc2d591fea687967f6 Mon Sep 17 00:00:00 2001
437 | From: =?UTF-8?q?Lukas=20Sandstr=C3=B6m?= <lukass@etek.chalmers.se>
438 | Date: Thu, 10 Jul 2008 23:41:33 +0200
439 | Subject: Re: discussion that lead to this patch
440 | MIME-Version: 1.0
441 | Content-Type: text/plain; charset=UTF-8
442 | Content-Transfer-Encoding: 8bit
443 | 
444 | [PATCH] git-mailinfo: Fix getting the subject from the body
445 | 
446 | "Subject: " isn't in the static array "header", and thus
447 | memcmp("Subject: ", header[i], 7) will never match.
448 | 
449 | Signed-off-by: Lukas Sandström <lukass@etek.chalmers.se>
450 | Signed-off-by: Junio C Hamano <gitster@pobox.com>
451 | ---
452 |  builtin-mailinfo.c |    2 +-
453 |  1 files changed, 1 insertions(+), 1 deletions(-)
454 | 
455 | diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c
456 | index 962aa34..2d1520f 100644
457 | --- a/builtin-mailinfo.c
458 | +++ b/builtin-mailinfo.c
459 | @@ -334,7 +334,7 @@ static int check_header(char *line, unsigned linesize, char **hdr_data, int over
460 |  		return 1;
461 |  	if (!memcmp("[PATCH]", line, 7) && isspace(line[7])) {
462 |  		for (i = 0; header[i]; i++) {
463 | -			if (!memcmp("Subject: ", header[i], 9)) {
464 | +			if (!memcmp("Subject", header[i], 7)) {
465 |  				if (! handle_header(line, hdr_data[i], 0)) {
466 |  					return 1;
467 |  				}
468 | -- 
469 | 1.5.6.2.455.g1efb2
470 | 
471 | From nobody Fri Aug  8 22:24:03 2008
472 | Date: Fri, 8 Aug 2008 13:08:37 +0200 (CEST)
473 | From: A U Thor <a.u.thor@example.com>
474 | Subject: [PATCH 3/3 v2] Xyzzy
475 | MIME-Version: 1.0
476 | Content-Type: multipart/mixed; boundary="=-=-="
477 | 
478 | --=-=-=
479 | Content-Type: text/plain; charset=ISO8859-15
480 | Content-Transfer-Encoding: quoted-printable
481 | 
482 | Here comes a commit log message, and
483 | its second line is here.
484 | ---
485 |  builtin-mailinfo.c  |    4 ++--
486 | 
487 | diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c
488 | index 3e5fe51..aabfe5c 100644
489 | --- a/builtin-mailinfo.c
490 | +++ b/builtin-mailinfo.c
491 | @@ -758,8 +758,8 @@ static void handle_body(void)
492 |  		/* process any boundary lines */
493 |  		if (*content_top && is_multipart_boundary(&line)) {
494 |  			/* flush any leftover */
495 | -			if (line.len)
496 | -				handle_filter(&line);
497 | +			if (prev.len)
498 | +				handle_filter(&prev);
499 | =20
500 |  			if (!handle_boundary())
501 |  				goto handle_body_out;
502 | --=20
503 | 1.6.0.rc2
504 | 
505 | --=-=-=--
506 | 
507 | From bda@mnsspb.ru Wed Nov 12 17:54:41 2008
508 | From: Dmitriy Blinov <bda@mnsspb.ru>
509 | To: navy-patches@dinar.mns.mnsspb.ru
510 | Date: Wed, 12 Nov 2008 17:54:41 +0300
511 | Message-Id: <1226501681-24923-1-git-send-email-bda@mnsspb.ru>
512 | X-Mailer: git-send-email 1.5.6.5
513 | MIME-Version: 1.0
514 | Content-Type: text/plain;
515 |   charset=utf-8
516 | Content-Transfer-Encoding: 8bit
517 | Subject: [Navy-patches] [PATCH]
518 | 	=?utf-8?b?0JjQt9C80LXQvdGR0L0g0YHQv9C40YHQvtC6INC/0LA=?=
519 | 	=?utf-8?b?0LrQtdGC0L7QsiDQvdC10L7QsdGF0L7QtNC40LzRi9GFINC00LvRjyA=?=
520 | 	=?utf-8?b?0YHQsdC+0YDQutC4?=
521 | 
522 | textlive-* исправлены на texlive-*
523 | docutils заменён на python-docutils
524 | 
525 | Действительно, оказалось, что rest2web вытягивает за собой
526 | python-docutils. В то время как сам rest2web не нужен.
527 | 
528 | Signed-off-by: Dmitriy Blinov <bda@mnsspb.ru>
529 | ---
530 |  howto/build_navy.txt |    6 +++---
531 |  1 files changed, 3 insertions(+), 3 deletions(-)
532 | 
533 | diff --git a/howto/build_navy.txt b/howto/build_navy.txt
534 | index 3fd3afb..0ee807e 100644
535 | --- a/howto/build_navy.txt
536 | +++ b/howto/build_navy.txt
537 | @@ -119,8 +119,8 @@
538 |     - libxv-dev
539 |     - libusplash-dev
540 |     - latex-make
541 | -   - textlive-lang-cyrillic
542 | -   - textlive-latex-extra
543 | +   - texlive-lang-cyrillic
544 | +   - texlive-latex-extra
545 |     - dia
546 |     - python-pyrex
547 |     - libtool
548 | @@ -128,7 +128,7 @@
549 |     - sox
550 |     - cython
551 |     - imagemagick
552 | -   - docutils
553 | +   - python-docutils
554 |  
555 |  #. на машине dinar: добавить свой открытый ssh-ключ в authorized_keys2 пользователя ddev
556 |  #. на своей машине: отредактировать /etc/sudoers (команда ``visudo``) примерно следующим образом::
557 | -- 
558 | 1.5.6.5
559 | From nobody Mon Sep 17 00:00:00 2001
560 | From: <a.u.thor@example.com> (A U Thor)
561 | Date: Fri, 9 Jun 2006 00:44:16 -0700
562 | Subject: [PATCH] a patch
563 | 
564 | From nobody Mon Sep 17 00:00:00 2001
565 | From: Junio Hamano <junkio@cox.net>
566 | Date: Thu, 20 Aug 2009 17:18:22 -0700
567 | Subject: Why doesn't git-am does not like >8 scissors mark?
568 | 
569 | Subject: [PATCH] BLAH ONE
570 | 
571 | In real life, we will see a discussion that inspired this patch
572 | discussing related and unrelated things around >8 scissors mark
573 | in this part of the message.
574 | 
575 | Subject: [PATCH] BLAH TWO
576 | 
577 | And then we will see the scissors.
578 | 
579 |  This line is not a scissors mark -- >8 -- but talks about it.
580 |  - - >8 - - please remove everything above this line - - >8 - -
581 | 
582 | Subject: [PATCH] Teach mailinfo to ignore everything before -- >8 -- mark
583 | From: Junio C Hamano <gitster@pobox.com>
584 | 
585 | This teaches mailinfo the scissors -- >8 -- mark; the command ignores
586 | everything before it in the message body.
587 | 
588 | Signed-off-by: Junio C Hamano <gitster@pobox.com>
589 | ---
590 |  builtin-mailinfo.c |   37 ++++++++++++++++++++++++++++++++++++-
591 |  1 files changed, 36 insertions(+), 1 deletions(-)
592 | 
593 | diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c
594 | index b0b5d8f..461c47e 100644
595 | --- a/builtin-mailinfo.c
596 | +++ b/builtin-mailinfo.c
597 | @@ -712,6 +712,34 @@ static inline int patchbreak(const struct strbuf *line)
598 |  	return 0;
599 |  }
600 |  
601 | +static int scissors(const struct strbuf *line)
602 | +{
603 | +	size_t i, len = line->len;
604 | +	int scissors_dashes_seen = 0;
605 | +	const char *buf = line->buf;
606 | +
607 | +	for (i = 0; i < len; i++) {
608 | +		if (isspace(buf[i]))
609 | +			continue;
610 | +		if (buf[i] == '-') {
611 | +			scissors_dashes_seen |= 02;
612 | +			continue;
613 | +		}
614 | +		if (i + 1 < len && !memcmp(buf + i, ">8", 2)) {
615 | +			scissors_dashes_seen |= 01;
616 | +			i++;
617 | +			continue;
618 | +		}
619 | +		if (i + 7 < len && !memcmp(buf + i, "cut here", 8)) {
620 | +			i += 7;
621 | +			continue;
622 | +		}
623 | +		/* everything else --- not scissors */
624 | +		break;
625 | +	}
626 | +	return scissors_dashes_seen == 03;
627 | +}
628 | +
629 |  static int handle_commit_msg(struct strbuf *line)
630 |  {
631 |  	static int still_looking = 1;
632 | @@ -723,10 +751,17 @@ static int handle_commit_msg(struct strbuf *line)
633 |  		strbuf_ltrim(line);
634 |  		if (!line->len)
635 |  			return 0;
636 | -		if ((still_looking = check_header(line, s_hdr_data, 0)) != 0)
637 | +		still_looking = check_header(line, s_hdr_data, 0);
638 | +		if (still_looking)
639 |  			return 0;
640 |  	}
641 |  
642 | +	if (scissors(line)) {
643 | +		fseek(cmitmsg, 0L, SEEK_SET);
644 | +		still_looking = 1;
645 | +		return 0;
646 | +	}
647 | +
648 |  	/* normalize the log message to UTF-8. */
649 |  	if (metainfo_charset)
650 |  		convert_to_utf8(line, charset.buf);
651 | -- 
652 | 1.6.4.1
653 | From nobody Mon Sep 17 00:00:00 2001
654 | From: A U Thor <a.u.thor@example.com>
655 | Subject: check bogus body header (from)
656 | Date: Fri, 9 Jun 2006 00:44:16 -0700
657 | 
658 | From: bogosity
659 |   - a list
660 |   - of stuff
661 | ---
662 | diff --git a/foo b/foo
663 | index e69de29..d95f3ad 100644
664 | --- a/foo
665 | +++ b/foo
666 | @@ -0,0 +1 @@
667 | +content
668 | 
669 | From nobody Mon Sep 17 00:00:00 2001
670 | From: A U Thor <a.u.thor@example.com>
671 | Subject: check bogus body header (date)
672 | Date: Fri, 9 Jun 2006 00:44:16 -0700
673 | 
674 | Date: bogus
675 | 
676 | and some content
677 | 
678 | ---
679 | diff --git a/foo b/foo
680 | index e69de29..d95f3ad 100644
681 | --- a/foo
682 | +++ b/foo
683 | @@ -0,0 +1 @@
684 | +content
685 | 
686 | 


--------------------------------------------------------------------------------
/src/index_emails.py:
--------------------------------------------------------------------------------
  1 | from tornado.httpclient import AsyncHTTPClient, HTTPRequest
  2 | from tornado.ioloop import IOLoop
  3 | import tornado.options
  4 | import json
  5 | import time
  6 | import calendar
  7 | import email.utils
  8 | import mailbox
  9 | import email
 10 | import quopri
 11 | import chardet
 12 | from bs4 import BeautifulSoup
 13 | import logging
 14 | 
 15 | http_client = AsyncHTTPClient()
 16 | 
 17 | DEFAULT_BATCH_SIZE = 500
 18 | DEFAULT_ES_URL = "http://localhost:9200"
 19 | DEFAULT_INDEX_NAME = "gmail"
 20 | 
 21 | 
 22 | def strip_html_css_js(msg):
 23 |     soup = BeautifulSoup(msg, "html.parser")  # create a new bs4 object from the html data loaded
 24 |     for script in soup(["script", "style"]):  # remove all javascript and stylesheet code
 25 |         script.extract()
 26 |     # get text
 27 |     text = soup.get_text()
 28 |     # break into lines and remove leading and trailing space on each
 29 |     lines = (line.strip() for line in text.splitlines())
 30 |     # break multi-headlines into a line each
 31 |     chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
 32 |     # drop blank lines
 33 |     text = '\n'.join(chunk for chunk in chunks if chunk)
 34 |     return text
 35 | 
 36 | 
 37 | async def delete_index():
 38 |     try:
 39 |         url = "%s/%s" % (tornado.options.options.es_url, tornado.options.options.index_name)
 40 |         request = HTTPRequest(url, method="DELETE", request_timeout=240, headers={"Content-Type": "application/json"})
 41 |         response = await http_client.fetch(request)
 42 |         logging.info('Delete index done   %s' % response.body)
 43 |     except:
 44 |         pass
 45 | 
 46 | 
 47 | async def create_index():
 48 | 
 49 |     schema = {
 50 |         "settings": {
 51 |             "number_of_shards": tornado.options.options.num_of_shards,
 52 |             "number_of_replicas": 0
 53 |         },
 54 |         "mappings": {
 55 |             "email": {
 56 |                 "_source": {"enabled": True},
 57 |                 "properties": {
 58 |                     "from": {"type": "string", "index": "not_analyzed"},
 59 |                     "return-path": {"type": "string", "index": "not_analyzed"},
 60 |                     "delivered-to": {"type": "string", "index": "not_analyzed"},
 61 |                     "message-id": {"type": "string", "index": "not_analyzed"},
 62 |                     "to": {"type": "string", "index": "not_analyzed"},
 63 |                     "date_ts": {"type": "date"},
 64 |                 },
 65 |             }
 66 |         },
 67 |         "refresh": True
 68 |     }
 69 | 
 70 |     body = json.dumps(schema)
 71 |     url = "%s/%s" % (tornado.options.options.es_url, tornado.options.options.index_name)
 72 |     try:
 73 |         request = HTTPRequest(url, method="PUT", body=body, request_timeout=240, headers={"Content-Type": "application/json"})
 74 |         response = await http_client.fetch(request)
 75 |         logging.info('Create index done   %s' % response.body)
 76 |     except:
 77 |         pass
 78 | 
 79 | 
 80 | total_uploaded = 0
 81 | 
 82 | 
 83 | async def upload_batch(upload_data):
 84 |     if tornado.options.options.dry_run:
 85 |         logging.info("Dry run, not uploading")
 86 |         return
 87 |     upload_data_txt = ""
 88 |     for item in upload_data:
 89 |         cmd = {'index': {'_index': tornado.options.options.index_name, '_type': 'email', '_id': item['message-id']}}
 90 |         try:
 91 |             json_cmd = json.dumps(cmd) + "\n"
 92 |             json_item = json.dumps(item) + "\n"
 93 |         except:
 94 |             logging.warn('Skipping mail with message id %s because of exception converting to JSON (invalid characters?).' % item['message-id'])
 95 |             continue
 96 |         upload_data_txt += json_cmd
 97 |         upload_data_txt += json_item
 98 | 
 99 |     request = HTTPRequest(tornado.options.options.es_url + "/_bulk", method="POST", body=upload_data_txt, request_timeout=240, headers={"Content-Type": "application/json"})
100 |     response = await http_client.fetch(request)
101 |     result = json.loads(response.body)
102 | 
103 |     global total_uploaded
104 |     total_uploaded += len(upload_data)
105 |     res_txt = "OK" if not result['errors'] else "FAILED"
106 |     logging.info("Upload: %s - upload took: %4dms, total messages uploaded: %6d" % (res_txt, result['took'], total_uploaded))
107 | 
108 | 
109 | def normalize_email(email_in):
110 |     parsed = email.utils.parseaddr(email_in)
111 |     return parsed[1]
112 | 
113 | 
114 | def convert_msg_to_json(msg):
115 | 
116 |     def parse_message_parts(current_msg):
117 |         if current_msg.is_multipart():
118 |             for mpart in current_msg.get_payload():
119 |                 if mpart is not None:
120 |                     content_type = str(mpart.get_content_type())
121 |                     if not tornado.options.options.text_only or (content_type.startswith("text") or content_type.startswith("multipart")):
122 |                         parse_message_parts(mpart)
123 |         else:
124 |             result['body'] += strip_html_css_js(current_msg.get_payload(decode=True))
125 | 
126 |     result = {'parts': []}
127 |     if 'message-id' not in msg:
128 |         return None
129 | 
130 |     for (k, v) in msg.items():
131 |         result[k.lower()] = v
132 | 
133 |     for k in ['to', 'cc', 'bcc']:
134 |         if not result.get(k):
135 |             continue
136 |         emails_split = str(result[k]).replace('\n', '').replace('\t', '').replace('\r', '').replace(' ', '').encode('utf8').decode('utf-8', 'ignore').split(',')
137 |         result[k] = [normalize_email(e) for e in emails_split]
138 | 
139 |     if "from" in result:
140 |         result['from'] = normalize_email(str(result['from']))
141 | 
142 |     if "date" in result:
143 |         try:
144 |             tt = email.utils.parsedate_tz(result['date'])
145 |             tz = tt[9] if len(tt) == 10 and tt[9] else 0
146 |             result['date_ts'] = int(calendar.timegm(tt) - tz) * 1000
147 |         except:
148 |             return None
149 | 
150 |     labels = []
151 |     if "x-gmail-labels" in result:
152 |         labels = [l.strip().lower() for l in result["x-gmail-labels"].split(',')]
153 |         del result["x-gmail-labels"]
154 |     result['labels'] = labels
155 | 
156 |     # Bodies...
157 |     if tornado.options.options.index_bodies:
158 |         result['body'] = ''
159 |         parse_message_parts(msg)
160 |         result['body_size'] = len(result['body'])
161 | 
162 |     parts = result.get("parts", [])
163 |     result['content_size_total'] = 0
164 |     for part in parts:
165 |         result['content_size_total'] += len(part.get('content', ""))
166 | 
167 |     if not tornado.options.options.index_x_headers:
168 |         result = {key: result[key] for key in result if not key.startswith("x-")}
169 | 
170 |     return result
171 | 
172 | 
173 | async def load_from_file():
174 | 
175 |     if tornado.options.options.init:
176 |         await delete_index()
177 |     await create_index()
178 | 
179 |     if tornado.options.options.skip:
180 |         logging.info("Skipping first %d messages" % tornado.options.options.skip)
181 | 
182 |     upload_data = list()
183 | 
184 |     if tornado.options.options.infile:
185 |         logging.info("Starting import from mbox file %s" % tornado.options.options.infile)
186 |         mbox = mailbox.mbox(tornado.options.options.infile)
187 |     else:
188 |         logging.info("Starting import from MH directory %s" % tornado.options.options.indir)
189 |         mbox = mailbox.MH(tornado.options.options.indir, factory=None, create=False)
190 | 
191 |     #Skipping on keys to avoid expensive read operations on skipped messages
192 |     msgkeys = mbox.keys()[tornado.options.options.skip:]
193 | 
194 |     for msgkey in msgkeys:
195 |         msg = mbox[msgkey]
196 |         item = convert_msg_to_json(msg)
197 | 
198 |         if item:
199 |             upload_data.append(item)
200 |             if len(upload_data) == tornado.options.options.batch_size:
201 |                 await upload_batch(upload_data)
202 |                 upload_data = list()
203 | 
204 |     # upload remaining items in `upload_batch`
205 |     if upload_data:
206 |         await upload_batch(upload_data)
207 | 
208 |     logging.info("Import done - total count %d" % len(mbox.keys()))
209 | 
210 | 
211 | if __name__ == '__main__':
212 | 
213 |     tornado.options.define("es_url", type=str, default=DEFAULT_ES_URL,
214 |                            help="URL of your Elasticsearch node")
215 | 
216 |     tornado.options.define("index_name", type=str, default=DEFAULT_INDEX_NAME,
217 |                            help="Name of the index to store your messages")
218 | 
219 |     tornado.options.define("infile", type=str, default=None,
220 |                            help="Input file (supported mailbox format: mbox). Mutually exclusive to --indir")
221 | 
222 |     tornado.options.define("indir", type=str, default=None,
223 |                            help="Input directory (supported mailbox format: mh). Mutually exclusive to --infile")
224 | 
225 |     tornado.options.define("init", type=bool, default=False,
226 |                            help="Force deleting and re-initializing the Elasticsearch index")
227 | 
228 |     tornado.options.define("batch_size", type=int, default=DEFAULT_BATCH_SIZE,
229 |                            help="Elasticsearch bulk index batch size")
230 | 
231 |     tornado.options.define("skip", type=int, default=0,
232 |                            help="Number of messages to skip from mailbox")
233 | 
234 |     tornado.options.define("num_of_shards", type=int, default=2,
235 |                            help="Number of shards for ES index")
236 | 
237 |     tornado.options.define("index_bodies", type=bool, default=False,
238 |                            help="Will index all body content, stripped of HTML/CSS/JS etc. Adds fields: 'body' and \
239 |                                     'body_size'")
240 | 
241 |     tornado.options.define("text_only", type=bool, default=False,
242 |                            help='Only parse message body multiparts declared as text (ignoring images etc.).')
243 | 
244 |     tornado.options.define("index_x_headers", type=bool, default=True,
245 |                            help='Index x-* fields from headers')
246 | 
247 |     tornado.options.define("dry_run", type=bool, default=False,
248 |                            help='Do not upload to Elastic Search, just process messages')
249 | 
250 |     tornado.options.parse_command_line()
251 | 
252 |     # Exactly one of {infile, indir} must be set
253 |     if bool(tornado.options.options.infile) ^ bool(tornado.options.options.indir):
254 |         IOLoop.instance().run_sync(load_from_file)
255 |     else:
256 |         tornado.options.print_help()
257 | 


--------------------------------------------------------------------------------