├── .github
│   └── workflows
│       └── tests.yml
├── .gitignore
├── README.md
├── requirements.txt
├── sample.mbox
└── src
    └── index_emails.py

/.github/workflows/tests.yml:
--------------------------------------------------------------------------------
name: CI

on:
  push:
    branches:
      - master
  pull_request:

jobs:
  tests:
    runs-on: ubuntu-latest
    services:
      es:
        image: docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2
        ports:
          - 9200:9200
        options: >-
          --env http.port=9200
          --env discovery.type=single-node

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: |
          pip install --upgrade pip
          pip install -r requirements.txt

      - name: Wait for Elasticsearch
        run: timeout 60 bash -c 'until curl -s http://localhost:9200; do sleep 2; done'

      - name: Run tests
        run: python3 src/index_emails.py --infile=sample.mbox --es-url=http://localhost:9200

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
venv
.idea
*.pyc
.vscode

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
Elasticsearch For Beginners: Indexing your Gmail Inbox (and more: supports any mbox and MH mailboxes)
=======================

#### What's this all about?

I recently looked at my Gmail inbox and noticed that I have well over 50k emails, taking up about 12GB of space, but there is no good way to tell which emails take up the space, who I send email to, who emails me, etc.

The goal of this tutorial is to load an entire Gmail inbox into Elasticsearch using bulk indexing and then start querying the cluster to get a better picture of what's going on.


#### Prerequisites

Set up [Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/guide/current/running-elasticsearch.html) and make sure it's running at [http://localhost:9200](http://localhost:9200)

A quick way to run Elasticsearch is using Docker (the CORS settings aren't strictly needed but come in handy if you want to use e.g. [dejavu](https://dejavu.appbase.io/) to explore the index):
```
docker run --name es -d -p 9200:9200 -e http.port=9200 -e http.cors.enabled=true -e 'http.cors.allow-origin=*' -e http.cors.allow-headers=X-Requested-With,X-Auth-Token,Content-Type,Content-Length,Authorization -e http.cors.allow-credentials=true -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2
```

I use Python and [Tornado](https://github.com/tornadoweb/tornado/) for the scripts to import and query the data, plus `beautifulsoup4` for stripping HTML/JS/CSS (if you want to use the body indexing flag).

Install the dependencies by running:

`pip3 install -r requirements.txt`


#### Aight, where do we start?

First, go [here](https://www.google.com/settings/takeout/custom/gmail) and download your Gmail mailbox. Depending on the number of emails you have accumulated, this might take a while.
There's also a small `sample.mbox` file included in the repo for you to play around with while you're waiting for Google to prepare your download.

The downloaded archive is in the [mbox format](http://en.wikipedia.org/wiki/Mbox), and Python provides libraries to work with it, so that part is easy.

You can run the code (assuming Elasticsearch is running at localhost:9200) with the sample mbox file like this:
```
$ python3 src/index_emails.py --infile=sample.mbox
[I index_emails:173] Starting import from file sample.mbox
[I index_emails:101] Upload: OK - upload took: 1033ms, total messages uploaded: 3
[I index_emails:197] Import done - total count 16
$
```

Note: All examples focus on Gmail inboxes. Substitute any `--infile=` parameter with `--indir=` pointing to an MH directory to make them work with MH mailboxes instead, as in the example below.
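For example, assuming your MH mail lives in a local `Mail/inbox` directory (adjust the path to your setup), the import becomes:

```
python3 src/index_emails.py --indir=Mail/inbox
```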
#### The Source Code

The overall program will look something like this:

```python
mbox = mailbox.mbox('emails.mbox')  # or mailbox.MH('inbox/')

for msg in mbox:
    item = convert_msg_to_json(msg)
    upload_item_to_es(item)

print("Done!")
```

#### Ok, tell me more about the details

The full Python code is here: [src/index_emails.py](src/index_emails.py)


##### Turn mailbox into JSON

First, we have to turn the messages into JSON so we can insert them into Elasticsearch. [Here](http://nbviewer.ipython.org/github/furukama/Mining-the-Social-Web-2nd-Edition/blob/master/ipynb/Chapter%206%20-%20Mining%20Mailboxes.ipynb) is some sample code that was very useful when it came to normalizing and cleaning up the data.

A good first step:

```python
def convert_msg_to_json(msg):
    result = {'parts': []}
    for (k, v) in msg.items():
        result[k.lower()] = v
```

Additionally, you also want to parse and normalize the `From` and `To` email addresses:

```python
for k in ['to', 'cc', 'bcc']:
    if not result.get(k):
        continue
    emails_split = str(result[k]).replace('\n', '').replace('\t', '').replace('\r', '').replace(' ', '').encode('utf-8', 'ignore').decode('utf-8', 'ignore').split(',')
    result[k] = [normalize_email(e) for e in emails_split]

if "from" in result:
    result['from'] = normalize_email(result['from'])
```
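The `normalize_email` helper used above keeps just the address part of headers like `"Oliver <oliver@gmail.com>"`; it's a thin wrapper around `email.utils.parseaddr`, as it appears in [src/index_emails.py](src/index_emails.py):

```python
import email.utils

def normalize_email(email_in):
    # ('Oliver', 'oliver@gmail.com') -> keep only the address part
    parsed = email.utils.parseaddr(email_in)
    return parsed[1]
```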
Elasticsearch expects timestamps of `date` fields to be in milliseconds since the epoch, so let's convert the date accordingly:

```python
if "date" in result:
    tt = email.utils.parsedate_tz(result['date'])
    result['date_ts'] = int(calendar.timegm(tt) - tt[9]) * 1000
```
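To see what this does, here is the conversion applied step by step to the `Date:` header of the first message in `sample.mbox`:

```python
>>> import calendar, email.utils
>>> tt = email.utils.parsedate_tz('Fri, 9 Jun 2006 00:44:16 -0700')
>>> tt[9]                                 # UTC offset in seconds
-25200
>>> (calendar.timegm(tt) - tt[9]) * 1000  # epoch milliseconds, normalized to UTC
1149839056000
```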
We also need to split up and normalize the labels:

```python
labels = []
if "x-gmail-labels" in result:
    labels = [l.strip().lower() for l in result["x-gmail-labels"].split(',')]
    del result["x-gmail-labels"]
result['labels'] = labels
```

Email size is also interesting, so let's break that out:

```python
parts = result.get("parts", [])
result['content_size_total'] = 0
for part in parts:
    result['content_size_total'] += len(part.get('content', ""))
```


##### Index the data with Elasticsearch

The simplest approach is one PUT request per item:

```python
async def upload_item_to_es(item):
    es_url = "http://localhost:9200/gmail/_doc/%s" % item['message-id']
    request = HTTPRequest(es_url, method="PUT", body=json.dumps(item), request_timeout=10)
    response = await http_client.fetch(request)
    if response.code not in (200, 201):
        print("failed to add item %s" % item['message-id'])
```

However, Elasticsearch provides a better method for importing large chunks of data: [bulk indexing](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html).
Instead of making an HTTP request per document and indexing individually, we batch them in chunks of e.g. 1,000 documents and then index them.
Bulk messages are of the format:

```
cmd\n
doc\n
cmd\n
doc\n
...
```

where `cmd` is the control message for each `doc` we want to index.
For our example, `cmd` would look like this:

```
cmd = {'index': {'_index': 'gmail', '_id': item['message-id']}}
```

The final code looks something like this:

```python
upload_data = list()
for msg in mbox:
    item = convert_msg_to_json(msg)
    upload_data.append(item)
    if len(upload_data) == 100:
        await upload_batch(upload_data)
        upload_data = list()

if upload_data:
    await upload_batch(upload_data)
```

and

```python
async def upload_batch(upload_data):
    upload_data_txt = ""
    for item in upload_data:
        cmd = {'index': {'_index': 'gmail', '_id': item['message-id']}}
        upload_data_txt += json.dumps(cmd) + "\n"
        upload_data_txt += json.dumps(item) + "\n"

    request = HTTPRequest("http://localhost:9200/_bulk", method="POST", body=upload_data_txt,
                          headers={"Content-Type": "application/json"}, request_timeout=240)
    response = await http_client.fetch(request)
    result = json.loads(response.body)
    if result['errors']:
        print(result['errors'])
```
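Once an import has finished, a quick sanity check that the documents actually made it into the index is the `_count` API (`gmail` is the default index name used throughout):

```
curl -s 'localhost:9200/gmail/_count?pretty'
```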
#### Ok, show me some data!

After indexing all your emails, we can start running queries.


##### Filters

If you want to search for emails from the last 6 months, you can use a `range` query with `gte` set to the current time (`now`) minus 6 months:

```
curl -XGET -H 'Content-Type: application/json' 'http://localhost:9200/gmail/_search?pretty' -d '{
"query": { "range" : { "date_ts" : { "gte": "now-6M" } } }
}'
```

or you can filter for all emails from 2013 by using `gte` and `lt`:

```
curl -XGET -H 'Content-Type: application/json' 'http://localhost:9200/gmail/_search?pretty' -d '{
"query": { "range" : { "date_ts" : { "gte": "2013-01-01T00:00:00.000Z", "lt": "2014-01-01T00:00:00.000Z" } } }
}'
```

You can also quickly query for certain fields via the `q` parameter. This example shows you all your Amazon shipping info emails:

```
curl "localhost:9200/gmail/_search?pretty&q=from:ship-confirm@amazon.com"
```

##### Aggregation queries

Aggregation queries let us bucket data by a given key and count the number of messages per bucket.
For example, the number of messages grouped by recipient:

```
curl -XGET -H 'Content-Type: application/json' 'http://localhost:9200/gmail/_search?pretty' -d '{
"size": 0,
"aggs": { "emails": { "terms" : { "field" : "to", "size": 10 } } }
}'
```

Result:

```
"aggregations" : {
  "emails" : {
    "buckets" : [ {
      "key" : "noreply@github.com",
      "doc_count" : 1920
    }, {
      "key" : "oliver@gmail.com",
      "doc_count" : 1326
    }, {
      "key" : "michael@gmail.com",
      "doc_count" : 263
    }, {
      "key" : "david@gmail.com",
      "doc_count" : 232
    },
    ...
    ]
  }
}
```

This one gives us the number of emails per label:

```
curl -XGET -H 'Content-Type: application/json' 'http://localhost:9200/gmail/_search?pretty' -d '{
"size": 0,
"aggs": { "labels": { "terms" : { "field" : "labels", "size": 10 } } }
}'
```

Result:

```
"hits" : {
  "total" : 51794
},
"aggregations" : {
  "labels" : {
    "buckets" : [ {
      "key" : "important",
      "doc_count" : 15430
    }, {
      "key" : "github",
      "doc_count" : 4928
    }, {
      "key" : "sent",
      "doc_count" : 4285
    }, {
      "key" : "unread",
      "doc_count" : 510
    },
    ...
    ]
  }
}
```

Using a `date_histogram` aggregation, you can also count how many emails you sent and received per year:

```
curl -s -H 'Content-Type: application/json' "localhost:9200/gmail/_search?pretty" -d '
{ "size": 0,
  "aggs": {
    "years": {
      "date_histogram": {
        "field": "date_ts", "calendar_interval": "year"
      }}}}
'
```

Result:

```
"aggregations" : {
  "years" : {
    "buckets" : [ {
      "key_as_string" : "2004-01-01T00:00:00.000Z",
      "key" : 1072915200000,
      "doc_count" : 585
    }, {
      ...
    }, {
      "key_as_string" : "2013-01-01T00:00:00.000Z",
      "key" : 1356998400000,
      "doc_count" : 12832
    }, {
      "key_as_string" : "2014-01-01T00:00:00.000Z",
      "key" : 1388534400000,
      "doc_count" : 7283
    } ]
  }
}
```

You can also write aggregation queries to work out how much you spent on Amazon/Steam. Note that the indexer doesn't extract order details for you, so this assumes you've first parsed fields like `order_details.merchant`, `order_details.order_total` and `order_details.postage` out of the email bodies:

```
GET _search
{
  "query": {
    "match_all": {}
  },
  "size": 0,
  "aggs": {
    "group_by_company": {
      "terms": {
        "field": "order_details.merchant"
      },
      "aggs": {
        "total_spent": {
          "sum": {
            "field": "order_details.order_total"
          }
        },
        "postage": {
          "sum": {
            "field": "order_details.postage"
          }
        }
      }
    }
  }
}
```


#### Todo

- more interesting queries
- schema tweaks
- multi-part message parsing
- blurb about performance
- ...


#### Feedback

Open a pull request or an issue!

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
beautifulsoup4==4.6.0
chardet==3.0.4
tornado==6.5.0

--------------------------------------------------------------------------------
/sample.mbox:
--------------------------------------------------------------------------------
1 | 2 | 3 | 4 | From nobody Mon Sep 17 00:00:00 2001 5 | From: A (zzz) 6 | U 7 | Thor 8 | (Comment) 9 | Date: Fri, 9 Jun 2006 00:44:16 -0700 10 | Subject: [PATCH] a commit. 11 | 12 | Here is a patch from A U Thor. 13 | 14 | --- 15 | foo | 2 +- 16 | 1 files changed, 1 insertions(+), 1 deletions(-) 17 | 18 | diff --git a/foo b/foo 19 | index 9123cdc..918dcf8 100644 20 | --- a/foo 21 | +++ b/foo 22 | @@ -1 +1 @@ 23 | -Fri Jun 9 00:44:04 PDT 2006 24 | +Fri Jun 9 00:44:13 PDT 2006 25 | -- 26 | 1.4.0.g6f2b 27 | 28 | From nobody Mon Sep 17 00:00:00 2001 29 | From: A U Thor 30 | Date: Fri, 9 Jun 2006 00:44:16 -0700 31 | Subject: [PATCH] another patch 32 | 33 | Here is a patch from A U Thor.
This addresses the issue raised in the 34 | message: 35 | 36 | From: Nit Picker 37 | Subject: foo is too old 38 | Message-Id: 39 | 40 | Hopefully this would fix the problem stated there. 41 | 42 | 43 | I have included an extra blank line above, but it does not have to be 44 | stripped away here, along with the 45 | whitespaces at the end of the above line. They are expected to be squashed 46 | when the message is made into a commit log by stripspace, 47 | Also, there are three blank lines after this paragraph, 48 | two truly blank and another full of spaces in between. 49 | 50 | 51 | 52 | Hope this helps. 53 | 54 | --- 55 | foo | 2 +- 56 | 1 files changed, 1 insertions(+), 1 deletions(-) 57 | 58 | diff --git a/foo b/foo 59 | index 9123cdc..918dcf8 100644 60 | --- a/foo 61 | +++ b/foo 62 | @@ -1 +1 @@ 63 | -Fri Jun 9 00:44:04 PDT 2006 64 | +Fri Jun 9 00:44:13 PDT 2006 65 | -- 66 | 1.4.0.g6f2b 67 | 68 | From nobody Mon Sep 17 00:00:00 2001 69 | From: Junio C Hamano 70 | Date: Fri, 9 Jun 2006 00:44:16 -0700 71 | Subject: re: [PATCH] another patch 72 | 73 | From: A U Thor 74 | Subject: [PATCH] third patch 75 | 76 | Here is a patch from A U Thor. This addresses the issue raised in the 77 | message: 78 | 79 | From: Nit Picker 80 | Subject: foo is too old 81 | Message-Id: 82 | 83 | Hopefully this would fix the problem stated there. 84 | 85 | --- 86 | foo | 2 +- 87 | 1 files changed, 1 insertions(+), 1 deletions(-) 88 | 89 | diff --git a/foo b/foo 90 | index 9123cdc..918dcf8 100644 91 | --- a/foo 92 | +++ b/foo 93 | @@ -1 +1 @@ 94 | -Fri Jun 9 00:44:04 PDT 2006 95 | +Fri Jun 9 00:44:13 PDT 2006 96 | -- 97 | 1.4.0.g6f2b 98 | 99 | From nobody Sat Aug 27 23:07:49 2005 100 | Path: news.gmane.org!not-for-mail 101 | Message-ID: <20050721.091036.01119516.yoshfuji@linux-ipv6.org> 102 | From: YOSHIFUJI Hideaki / =?ISO-2022-JP?B?GyRCNUhGIzFRTEAbKEI=?= 103 | 104 | Newsgroups: gmane.comp.version-control.git 105 | Subject: [PATCH 1/2] GIT: Try all addresses for given remote name 106 | Date: Thu, 21 Jul 2005 09:10:36 -0400 (EDT) 107 | Lines: 99 108 | Organization: USAGI/WIDE Project 109 | Approved: news@gmane.org 110 | NNTP-Posting-Host: main.gmane.org 111 | Mime-Version: 1.0 112 | Content-Type: Text/Plain; charset=us-ascii 113 | Content-Transfer-Encoding: 7bit 114 | X-Trace: sea.gmane.org 1121951434 29350 80.91.229.2 (21 Jul 2005 13:10:34 GMT) 115 | X-Complaints-To: usenet@sea.gmane.org 116 | NNTP-Posting-Date: Thu, 21 Jul 2005 13:10:34 +0000 (UTC) 117 | 118 | Hello. 119 | 120 | Try all addresses for given remote name until it succeeds. 121 | Also supports IPv6. 
122 | 123 | Signed-of-by: Hideaki YOSHIFUJI 124 | 125 | diff --git a/connect.c b/connect.c 126 | --- a/connect.c 127 | +++ b/connect.c 128 | @@ -96,42 +96,57 @@ static enum protocol get_protocol(const 129 | die("I don't handle protocol '%s'", name); 130 | } 131 | 132 | -static void lookup_host(const char *host, struct sockaddr *in) 133 | -{ 134 | - struct addrinfo *res; 135 | - int ret; 136 | - 137 | - ret = getaddrinfo(host, NULL, NULL, &res); 138 | - if (ret) 139 | - die("Unable to look up %s (%s)", host, gai_strerror(ret)); 140 | - *in = *res->ai_addr; 141 | - freeaddrinfo(res); 142 | -} 143 | +#define STR_(s) # s 144 | +#define STR(s) STR_(s) 145 | 146 | static int git_tcp_connect(int fd[2], const char *prog, char *host, char *path) 147 | { 148 | - struct sockaddr addr; 149 | - int port = DEFAULT_GIT_PORT, sockfd; 150 | - char *colon; 151 | - 152 | - colon = strchr(host, ':'); 153 | - if (colon) { 154 | - char *end; 155 | - unsigned long n = strtoul(colon+1, &end, 0); 156 | - if (colon[1] && !*end) { 157 | - *colon = 0; 158 | - port = n; 159 | + int sockfd = -1; 160 | + char *colon, *end; 161 | + char *port = STR(DEFAULT_GIT_PORT); 162 | + struct addrinfo hints, *ai0, *ai; 163 | + int gai; 164 | + 165 | + if (host[0] == '[') { 166 | + end = strchr(host + 1, ']'); 167 | + if (end) { 168 | + *end = 0; 169 | + end++; 170 | + host++; 171 | + } else 172 | + end = host; 173 | + } else 174 | + end = host; 175 | + colon = strchr(end, ':'); 176 | + 177 | + if (colon) 178 | + port = colon + 1; 179 | + 180 | + memset(&hints, 0, sizeof(hints)); 181 | + hints.ai_socktype = SOCK_STREAM; 182 | + hints.ai_protocol = IPPROTO_TCP; 183 | + 184 | + gai = getaddrinfo(host, port, &hints, &ai); 185 | + if (gai) 186 | + die("Unable to look up %s (%s)", host, gai_strerror(gai)); 187 | + 188 | + for (ai0 = ai; ai; ai = ai->ai_next) { 189 | + sockfd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol); 190 | + if (sockfd < 0) 191 | + continue; 192 | + if (connect(sockfd, ai->ai_addr, ai->ai_addrlen) < 0) { 193 | + close(sockfd); 194 | + sockfd = -1; 195 | + continue; 196 | } 197 | + break; 198 | } 199 | 200 | - lookup_host(host, &addr); 201 | - ((struct sockaddr_in *)&addr)->sin_port = htons(port); 202 | + freeaddrinfo(ai0); 203 | 204 | - sockfd = socket(PF_INET, SOCK_STREAM, IPPROTO_IP); 205 | if (sockfd < 0) 206 | die("unable to create socket (%s)", strerror(errno)); 207 | - if (connect(sockfd, (void *)&addr, sizeof(addr)) < 0) 208 | - die("unable to connect (%s)", strerror(errno)); 209 | + 210 | fd[0] = sockfd; 211 | fd[1] = sockfd; 212 | packet_write(sockfd, "%s %s\n", prog, path); 213 | 214 | -- 215 | YOSHIFUJI Hideaki @ USAGI Project 216 | GPG-FP : 9022 65EB 1ECF 3AD1 0BDF 80D8 4807 F894 E062 0EEA 217 | 218 | From nobody Sat Aug 27 23:07:49 2005 219 | Path: news.gmane.org!not-for-mail 220 | Message-ID: 221 | From: =?ISO8859-1?Q?David_K=E5gedal?= 222 | Newsgroups: gmane.comp.version-control.git 223 | Subject: [PATCH] Fixed two bugs in git-cvsimport-script. 224 | Date: Mon, 15 Aug 2005 20:18:25 +0200 225 | Lines: 83 226 | Approved: news@gmane.org 227 | NNTP-Posting-Host: main.gmane.org 228 | Mime-Version: 1.0 229 | Content-Type: text/plain; charset=ISO8859-1 230 | Content-Transfer-Encoding: QUOTED-PRINTABLE 231 | X-Trace: sea.gmane.org 1124130247 31839 80.91.229.2 (15 Aug 2005 18:24:07 GMT) 232 | X-Complaints-To: usenet@sea.gmane.org 233 | NNTP-Posting-Date: Mon, 15 Aug 2005 18:24:07 +0000 (UTC) 234 | Cc: "Junio C. 
Hamano" 235 | Original-X-From: git-owner@vger.kernel.org Mon Aug 15 20:24:05 2005 236 | 237 | The git-cvsimport-script had a copule of small bugs that prevented me 238 | from importing a big CVS repository. 239 | 240 | The first was that it didn't handle removed files with a multi-digit 241 | primary revision number. 242 | 243 | The second was that it was asking the CVS server for "F" messages, 244 | although they were not handled. 245 | 246 | I also updated the documentation for that script to correspond to 247 | actual flags. 248 | 249 | Signed-off-by: David K=E5gedal 250 | --- 251 | 252 | Documentation/git-cvsimport-script.txt | 9 ++++++++- 253 | git-cvsimport-script | 4 ++-- 254 | 2 files changed, 10 insertions(+), 3 deletions(-) 255 | 256 | 50452f9c0c2df1f04d83a26266ba704b13861632 257 | diff --git a/Documentation/git-cvsimport-script.txt b/Documentation/git= 258 | -cvsimport-script.txt 259 | --- a/Documentation/git-cvsimport-script.txt 260 | +++ b/Documentation/git-cvsimport-script.txt 261 | @@ -29,6 +29,10 @@ OPTIONS 262 | currently, only the :local:, :ext: and :pserver: access methods=20 263 | are supported. 264 | =20 265 | +-C :: 266 | + The GIT repository to import to. If the directory doesn't 267 | + exist, it will be created. Default is the current directory. 268 | + 269 | -i:: 270 | Import-only: don't perform a checkout after importing. This option 271 | ensures the working directory and cache remain untouched and will 272 | @@ -44,7 +48,7 @@ OPTIONS 273 | =20 274 | -p :: 275 | Additional options for cvsps. 276 | - The options '-x' and '-A' are implicit and should not be used here. 277 | + The options '-u' and '-A' are implicit and should not be used here. 278 | =20 279 | If you need to pass multiple options, separate them with a comma. 280 | =20 281 | @@ -57,6 +61,9 @@ OPTIONS 282 | -h:: 283 | Print a short usage message and exit. 284 | =20 285 | +-z :: 286 | + Pass the timestamp fuzz factor to cvsps. 287 | + 288 | OUTPUT 289 | ------ 290 | If '-v' is specified, the script reports what it is doing. 291 | diff --git a/git-cvsimport-script b/git-cvsimport-script 292 | --- a/git-cvsimport-script 293 | +++ b/git-cvsimport-script 294 | @@ -190,7 +190,7 @@ sub conn { 295 | $self->{'socketo'}->write("Root $repo\n"); 296 | =20 297 | # Trial and error says that this probably is the minimum set 298 | - $self->{'socketo'}->write("Valid-responses ok error Valid-requests Mo= 299 | de M Mbinary E F Checked-in Created Updated Merged Removed\n"); 300 | + $self->{'socketo'}->write("Valid-responses ok error Valid-requests Mo= 301 | de M Mbinary E Checked-in Created Updated Merged Removed\n"); 302 | =20 303 | $self->{'socketo'}->write("valid-requests\n"); 304 | $self->{'socketo'}->flush(); 305 | @@ -691,7 +691,7 @@ while() { 306 | unlink($tmpname); 307 | my $mode =3D pmode($cvs->{'mode'}); 308 | push(@new,[$mode, $sha, $fn]); # may be resurrected! 
309 | - } elsif($state =3D=3D 9 and /^\s+(\S+):\d(?:\.\d+)+->(\d(?:\.\d+)+)\(= 310 | DEAD\)\s*$/) { 311 | + } elsif($state =3D=3D 9 and /^\s+(\S+):\d+(?:\.\d+)+->(\d+(?:\.\d+)+)= 312 | \(DEAD\)\s*$/) { 313 | my $fn =3D $1; 314 | $fn =3D~ s#^/+##; 315 | push(@old,$fn); 316 | 317 | --=20 318 | David K=E5gedal 319 | - 320 | To unsubscribe from this list: send the line "unsubscribe git" in 321 | the body of a message to majordomo@vger.kernel.org 322 | More majordomo info at http://vger.kernel.org/majordomo-info.html 323 | 324 | From nobody Mon Sep 17 00:00:00 2001 325 | From: A U Thor 326 | References: 327 | 328 | 329 | 330 | 331 | 332 | 333 | 334 | 335 | 336 | 337 | 338 | 339 | 340 | 341 | 342 | 343 | 344 | 345 | 346 | 347 | 348 | 349 | 350 | 351 | 352 | 353 | 354 | 355 | 356 | 357 | 358 | 359 | 360 | 361 | 362 | 363 | 364 | 365 | 366 | 367 | 368 | 369 | 370 | 371 | 372 | 373 | 374 | 375 | 376 | Date: Fri, 9 Jun 2006 00:44:16 -0700 377 | Subject: [PATCH] a commit. 378 | 379 | Here is a patch from A U Thor. 380 | 381 | --- 382 | foo | 2 +- 383 | 1 files changed, 1 insertions(+), 1 deletions(-) 384 | 385 | diff --git a/foo b/foo 386 | index 9123cdc..918dcf8 100644 387 | --- a/foo 388 | +++ b/foo 389 | @@ -1 +1 @@ 390 | -Fri Jun 9 00:44:04 PDT 2006 391 | +Fri Jun 9 00:44:13 PDT 2006 392 | -- 393 | 1.4.0.g6f2b 394 | 395 | From nobody Mon Sep 17 00:00:00 2001 396 | From: A U Thor 397 | Date: Fri, 9 Jun 2006 00:44:16 -0700 398 | Subject: [PATCH] another patch 399 | 400 | Here is an empty patch from A U Thor. 401 | 402 | From nobody Mon Sep 17 00:00:00 2001 403 | From: Junio C Hamano 404 | Date: Fri, 9 Jun 2006 00:44:16 -0700 405 | Subject: re: [PATCH] another patch 406 | 407 | From: A U Thor 408 | Subject: [PATCH] another patch 409 | >Here is an empty patch from A U Thor. 410 | 411 | Hey you forgot the patch! 412 | 413 | From nobody Mon Sep 17 00:00:00 2001 414 | From: A U Thor 415 | Date: Mon, 17 Sep 2001 00:00:00 +0900 416 | Mime-Version: 1.0 417 | Content-Type: Text/Plain; charset=us-ascii 418 | Content-Transfer-Encoding: Quoted-Printable 419 | 420 | =0A=0AFrom: F U Bar 421 | Subject: [PATCH] updates=0A=0AThis is to fix diff-format documentation. 422 | 423 | diff --git a/Documentation/diff-format.txt b/Documentation/diff-format.txt 424 | index b426a14..97756ec 100644 425 | --- a/Documentation/diff-format.txt 426 | +++ b/Documentation/diff-format.txt 427 | @@ -81,7 +81,7 @@ The "diff" formatting options can be customized via the 428 | environment variable 'GIT_DIFF_OPTS'. For example, if you 429 | prefer context diff: 430 | =20 431 | - GIT_DIFF_OPTS=3D-c git-diff-index -p $(cat .git/HEAD) 432 | + GIT_DIFF_OPTS=3D-c git-diff-index -p HEAD 433 | =20 434 | =20 435 | 2. When the environment variable 'GIT_EXTERNAL_DIFF' is set, the 436 | From b9704a518e21158433baa2cc2d591fea687967f6 Mon Sep 17 00:00:00 2001 437 | From: =?UTF-8?q?Lukas=20Sandstr=C3=B6m?= 438 | Date: Thu, 10 Jul 2008 23:41:33 +0200 439 | Subject: Re: discussion that lead to this patch 440 | MIME-Version: 1.0 441 | Content-Type: text/plain; charset=UTF-8 442 | Content-Transfer-Encoding: 8bit 443 | 444 | [PATCH] git-mailinfo: Fix getting the subject from the body 445 | 446 | "Subject: " isn't in the static array "header", and thus 447 | memcmp("Subject: ", header[i], 7) will never match. 
448 | 449 | Signed-off-by: Lukas Sandström 450 | Signed-off-by: Junio C Hamano 451 | --- 452 | builtin-mailinfo.c | 2 +- 453 | 1 files changed, 1 insertions(+), 1 deletions(-) 454 | 455 | diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c 456 | index 962aa34..2d1520f 100644 457 | --- a/builtin-mailinfo.c 458 | +++ b/builtin-mailinfo.c 459 | @@ -334,7 +334,7 @@ static int check_header(char *line, unsigned linesize, char **hdr_data, int over 460 | return 1; 461 | if (!memcmp("[PATCH]", line, 7) && isspace(line[7])) { 462 | for (i = 0; header[i]; i++) { 463 | - if (!memcmp("Subject: ", header[i], 9)) { 464 | + if (!memcmp("Subject", header[i], 7)) { 465 | if (! handle_header(line, hdr_data[i], 0)) { 466 | return 1; 467 | } 468 | -- 469 | 1.5.6.2.455.g1efb2 470 | 471 | From nobody Fri Aug 8 22:24:03 2008 472 | Date: Fri, 8 Aug 2008 13:08:37 +0200 (CEST) 473 | From: A U Thor 474 | Subject: [PATCH 3/3 v2] Xyzzy 475 | MIME-Version: 1.0 476 | Content-Type: multipart/mixed; boundary="=-=-=" 477 | 478 | --=-=-= 479 | Content-Type: text/plain; charset=ISO8859-15 480 | Content-Transfer-Encoding: quoted-printable 481 | 482 | Here comes a commit log message, and 483 | its second line is here. 484 | --- 485 | builtin-mailinfo.c | 4 ++-- 486 | 487 | diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c 488 | index 3e5fe51..aabfe5c 100644 489 | --- a/builtin-mailinfo.c 490 | +++ b/builtin-mailinfo.c 491 | @@ -758,8 +758,8 @@ static void handle_body(void) 492 | /* process any boundary lines */ 493 | if (*content_top && is_multipart_boundary(&line)) { 494 | /* flush any leftover */ 495 | - if (line.len) 496 | - handle_filter(&line); 497 | + if (prev.len) 498 | + handle_filter(&prev); 499 | =20 500 | if (!handle_boundary()) 501 | goto handle_body_out; 502 | --=20 503 | 1.6.0.rc2 504 | 505 | --=-=-=-- 506 | 507 | From bda@mnsspb.ru Wed Nov 12 17:54:41 2008 508 | From: Dmitriy Blinov 509 | To: navy-patches@dinar.mns.mnsspb.ru 510 | Date: Wed, 12 Nov 2008 17:54:41 +0300 511 | Message-Id: <1226501681-24923-1-git-send-email-bda@mnsspb.ru> 512 | X-Mailer: git-send-email 1.5.6.5 513 | MIME-Version: 1.0 514 | Content-Type: text/plain; 515 | charset=utf-8 516 | Content-Transfer-Encoding: 8bit 517 | Subject: [Navy-patches] [PATCH] 518 | =?utf-8?b?0JjQt9C80LXQvdGR0L0g0YHQv9C40YHQvtC6INC/0LA=?= 519 | =?utf-8?b?0LrQtdGC0L7QsiDQvdC10L7QsdGF0L7QtNC40LzRi9GFINC00LvRjyA=?= 520 | =?utf-8?b?0YHQsdC+0YDQutC4?= 521 | 522 | textlive-* исправлены на texlive-* 523 | docutils заменён на python-docutils 524 | 525 | Действительно, оказалось, что rest2web вытягивает за собой 526 | python-docutils. В то время как сам rest2web не нужен. 527 | 528 | Signed-off-by: Dmitriy Blinov 529 | --- 530 | howto/build_navy.txt | 6 +++--- 531 | 1 files changed, 3 insertions(+), 3 deletions(-) 532 | 533 | diff --git a/howto/build_navy.txt b/howto/build_navy.txt 534 | index 3fd3afb..0ee807e 100644 535 | --- a/howto/build_navy.txt 536 | +++ b/howto/build_navy.txt 537 | @@ -119,8 +119,8 @@ 538 | - libxv-dev 539 | - libusplash-dev 540 | - latex-make 541 | - - textlive-lang-cyrillic 542 | - - textlive-latex-extra 543 | + - texlive-lang-cyrillic 544 | + - texlive-latex-extra 545 | - dia 546 | - python-pyrex 547 | - libtool 548 | @@ -128,7 +128,7 @@ 549 | - sox 550 | - cython 551 | - imagemagick 552 | - - docutils 553 | + - python-docutils 554 | 555 | #. на машине dinar: добавить свой открытый ssh-ключ в authorized_keys2 пользователя ddev 556 | #. 
на своей машине: отредактировать /etc/sudoers (команда ``visudo``) примерно следующим образом:: 557 | -- 558 | 1.5.6.5 559 | From nobody Mon Sep 17 00:00:00 2001 560 | From: (A U Thor) 561 | Date: Fri, 9 Jun 2006 00:44:16 -0700 562 | Subject: [PATCH] a patch 563 | 564 | From nobody Mon Sep 17 00:00:00 2001 565 | From: Junio Hamano 566 | Date: Thu, 20 Aug 2009 17:18:22 -0700 567 | Subject: Why doesn't git-am does not like >8 scissors mark? 568 | 569 | Subject: [PATCH] BLAH ONE 570 | 571 | In real life, we will see a discussion that inspired this patch 572 | discussing related and unrelated things around >8 scissors mark 573 | in this part of the message. 574 | 575 | Subject: [PATCH] BLAH TWO 576 | 577 | And then we will see the scissors. 578 | 579 | This line is not a scissors mark -- >8 -- but talks about it. 580 | - - >8 - - please remove everything above this line - - >8 - - 581 | 582 | Subject: [PATCH] Teach mailinfo to ignore everything before -- >8 -- mark 583 | From: Junio C Hamano 584 | 585 | This teaches mailinfo the scissors -- >8 -- mark; the command ignores 586 | everything before it in the message body. 587 | 588 | Signed-off-by: Junio C Hamano 589 | --- 590 | builtin-mailinfo.c | 37 ++++++++++++++++++++++++++++++++++++- 591 | 1 files changed, 36 insertions(+), 1 deletions(-) 592 | 593 | diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c 594 | index b0b5d8f..461c47e 100644 595 | --- a/builtin-mailinfo.c 596 | +++ b/builtin-mailinfo.c 597 | @@ -712,6 +712,34 @@ static inline int patchbreak(const struct strbuf *line) 598 | return 0; 599 | } 600 | 601 | +static int scissors(const struct strbuf *line) 602 | +{ 603 | + size_t i, len = line->len; 604 | + int scissors_dashes_seen = 0; 605 | + const char *buf = line->buf; 606 | + 607 | + for (i = 0; i < len; i++) { 608 | + if (isspace(buf[i])) 609 | + continue; 610 | + if (buf[i] == '-') { 611 | + scissors_dashes_seen |= 02; 612 | + continue; 613 | + } 614 | + if (i + 1 < len && !memcmp(buf + i, ">8", 2)) { 615 | + scissors_dashes_seen |= 01; 616 | + i++; 617 | + continue; 618 | + } 619 | + if (i + 7 < len && !memcmp(buf + i, "cut here", 8)) { 620 | + i += 7; 621 | + continue; 622 | + } 623 | + /* everything else --- not scissors */ 624 | + break; 625 | + } 626 | + return scissors_dashes_seen == 03; 627 | +} 628 | + 629 | static int handle_commit_msg(struct strbuf *line) 630 | { 631 | static int still_looking = 1; 632 | @@ -723,10 +751,17 @@ static int handle_commit_msg(struct strbuf *line) 633 | strbuf_ltrim(line); 634 | if (!line->len) 635 | return 0; 636 | - if ((still_looking = check_header(line, s_hdr_data, 0)) != 0) 637 | + still_looking = check_header(line, s_hdr_data, 0); 638 | + if (still_looking) 639 | return 0; 640 | } 641 | 642 | + if (scissors(line)) { 643 | + fseek(cmitmsg, 0L, SEEK_SET); 644 | + still_looking = 1; 645 | + return 0; 646 | + } 647 | + 648 | /* normalize the log message to UTF-8. 
*/ 649 | if (metainfo_charset) 650 | convert_to_utf8(line, charset.buf); 651 | -- 652 | 1.6.4.1 653 | From nobody Mon Sep 17 00:00:00 2001 654 | From: A U Thor 655 | Subject: check bogus body header (from) 656 | Date: Fri, 9 Jun 2006 00:44:16 -0700 657 | 658 | From: bogosity 659 | - a list 660 | - of stuff 661 | --- 662 | diff --git a/foo b/foo 663 | index e69de29..d95f3ad 100644 664 | --- a/foo 665 | +++ b/foo 666 | @@ -0,0 +1 @@ 667 | +content 668 | 669 | From nobody Mon Sep 17 00:00:00 2001 670 | From: A U Thor 671 | Subject: check bogus body header (date) 672 | Date: Fri, 9 Jun 2006 00:44:16 -0700 673 | 674 | Date: bogus 675 | 676 | and some content 677 | 678 | --- 679 | diff --git a/foo b/foo 680 | index e69de29..d95f3ad 100644 681 | --- a/foo 682 | +++ b/foo 683 | @@ -0,0 +1 @@ 684 | +content 685 | 686 |
--------------------------------------------------------------------------------
/src/index_emails.py:
--------------------------------------------------------------------------------
from tornado.httpclient import AsyncHTTPClient, HTTPRequest
from tornado.ioloop import IOLoop
import tornado.options
import json
import time
import calendar
import email.utils
import mailbox
import email
import quopri
import chardet
from bs4 import BeautifulSoup
import logging

http_client = AsyncHTTPClient()

DEFAULT_BATCH_SIZE = 500
DEFAULT_ES_URL = "http://localhost:9200"
DEFAULT_INDEX_NAME = "gmail"


def strip_html_css_js(msg):
    soup = BeautifulSoup(msg, "html.parser")  # create a new bs4 object from the html data loaded
    for script in soup(["script", "style"]):  # remove all javascript and stylesheet code
        script.extract()
    # get text
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text


async def delete_index():
    try:
        url = "%s/%s" % (tornado.options.options.es_url, tornado.options.options.index_name)
        request = HTTPRequest(url, method="DELETE", request_timeout=240, headers={"Content-Type": "application/json"})
        response = await http_client.fetch(request)
        logging.info('Delete index done %s' % response.body)
    except Exception:
        # the index probably didn't exist yet - nothing to delete
        pass


async def create_index():

    # ES 7.x typeless mapping; "string" + "not_analyzed" fields became "keyword"
    schema = {
        "settings": {
            "number_of_shards": tornado.options.options.num_of_shards,
            "number_of_replicas": 0
        },
        "mappings": {
            "_source": {"enabled": True},
            "properties": {
                "from": {"type": "keyword"},
                "return-path": {"type": "keyword"},
                "delivered-to": {"type": "keyword"},
                "message-id": {"type": "keyword"},
                "to": {"type": "keyword"},
                "labels": {"type": "keyword"},
                "date_ts": {"type": "date"},
            },
        }
    }

    body = json.dumps(schema)
    url = "%s/%s" % (tornado.options.options.es_url, tornado.options.options.index_name)
    try:
        request = HTTPRequest(url, method="PUT", body=body, request_timeout=240, headers={"Content-Type": "application/json"})
        response = await http_client.fetch(request)
        logging.info('Create index done %s' % response.body)
    except Exception as e:
        # don't swallow mapping errors silently - if the create fails, dynamic mappings take over
        logging.warning('Create index failed: %s' % e)


total_uploaded = 0


async def upload_batch(upload_data):
    if tornado.options.options.dry_run:
        logging.info("Dry run, not uploading")
        return
    upload_data_txt = ""
    for item in upload_data:
        cmd = {'index': {'_index': tornado.options.options.index_name, '_id': item['message-id']}}
        try:
            json_cmd = json.dumps(cmd) + "\n"
            json_item = json.dumps(item) + "\n"
        except Exception:
            logging.warning('Skipping mail with message id %s because of exception converting to JSON (invalid characters?).' % item['message-id'])
            continue
        upload_data_txt += json_cmd
        upload_data_txt += json_item

    request = HTTPRequest(tornado.options.options.es_url + "/_bulk", method="POST", body=upload_data_txt, request_timeout=240, headers={"Content-Type": "application/json"})
    response = await http_client.fetch(request)
    result = json.loads(response.body)

    global total_uploaded
    total_uploaded += len(upload_data)
    res_txt = "OK" if not result['errors'] else "FAILED"
    logging.info("Upload: %s - upload took: %4dms, total messages uploaded: %6d" % (res_txt, result['took'], total_uploaded))


def normalize_email(email_in):
    parsed = email.utils.parseaddr(email_in)
    return parsed[1]


def convert_msg_to_json(msg):

    def parse_message_parts(current_msg):
        if current_msg.is_multipart():
            for mpart in current_msg.get_payload():
                if mpart is not None:
                    content_type = str(mpart.get_content_type())
                    if not tornado.options.options.text_only or (content_type.startswith("text") or content_type.startswith("multipart")):
                        parse_message_parts(mpart)
        else:
            result['body'] += strip_html_css_js(current_msg.get_payload(decode=True))

    result = {'parts': []}
    if 'message-id' not in msg:
        return None

    for (k, v) in msg.items():
        result[k.lower()] = v

    for k in ['to', 'cc', 'bcc']:
        if not result.get(k):
            continue
        emails_split = str(result[k]).replace('\n', '').replace('\t', '').replace('\r', '').replace(' ', '').encode('utf-8', 'ignore').decode('utf-8', 'ignore').split(',')
        result[k] = [normalize_email(e) for e in emails_split]

    if "from" in result:
        result['from'] = normalize_email(str(result['from']))

    if "date" in result:
        try:
            tt = email.utils.parsedate_tz(result['date'])
            tz = tt[9] if len(tt) == 10 and tt[9] else 0
            result['date_ts'] = int(calendar.timegm(tt) - tz) * 1000
        except Exception:
            # unparseable date - skip the message
            return None

    labels = []
    if "x-gmail-labels" in result:
        labels = [l.strip().lower() for l in result["x-gmail-labels"].split(',')]
        del result["x-gmail-labels"]
    result['labels'] = labels
    # Bodies...
    if tornado.options.options.index_bodies:
        result['body'] = ''
        parse_message_parts(msg)
        result['body_size'] = len(result['body'])

    parts = result.get("parts", [])
    result['content_size_total'] = 0
    for part in parts:
        result['content_size_total'] += len(part.get('content', ""))

    if not tornado.options.options.index_x_headers:
        result = {key: result[key] for key in result if not key.startswith("x-")}

    return result


async def load_from_file():

    if tornado.options.options.init:
        await delete_index()
        await create_index()

    if tornado.options.options.skip:
        logging.info("Skipping first %d messages" % tornado.options.options.skip)

    upload_data = list()

    if tornado.options.options.infile:
        logging.info("Starting import from mbox file %s" % tornado.options.options.infile)
        mbox = mailbox.mbox(tornado.options.options.infile)
    else:
        logging.info("Starting import from MH directory %s" % tornado.options.options.indir)
        mbox = mailbox.MH(tornado.options.options.indir, factory=None, create=False)

    # Skip on keys to avoid expensive read operations on skipped messages
    msgkeys = mbox.keys()[tornado.options.options.skip:]

    for msgkey in msgkeys:
        msg = mbox[msgkey]
        item = convert_msg_to_json(msg)

        if item:
            upload_data.append(item)
            if len(upload_data) == tornado.options.options.batch_size:
                await upload_batch(upload_data)
                upload_data = list()

    # upload any remaining items
    if upload_data:
        await upload_batch(upload_data)

    logging.info("Import done - total count %d" % len(mbox.keys()))


if __name__ == '__main__':

    tornado.options.define("es_url", type=str, default=DEFAULT_ES_URL,
                           help="URL of your Elasticsearch node")

    tornado.options.define("index_name", type=str, default=DEFAULT_INDEX_NAME,
                           help="Name of the index to store your messages")

    tornado.options.define("infile", type=str, default=None,
                           help="Input file (supported mailbox format: mbox). Mutually exclusive with --indir")

    tornado.options.define("indir", type=str, default=None,
                           help="Input directory (supported mailbox format: MH). Mutually exclusive with --infile")

    tornado.options.define("init", type=bool, default=False,
                           help="Force deleting and re-initializing the Elasticsearch index")

    tornado.options.define("batch_size", type=int, default=DEFAULT_BATCH_SIZE,
                           help="Elasticsearch bulk index batch size")

    tornado.options.define("skip", type=int, default=0,
                           help="Number of messages to skip from the mailbox")

    tornado.options.define("num_of_shards", type=int, default=2,
                           help="Number of shards for the ES index")

    tornado.options.define("index_bodies", type=bool, default=False,
                           help="Will index all body content, stripped of HTML/CSS/JS etc. Adds fields: 'body' and 'body_size'")
    tornado.options.define("text_only", type=bool, default=False,
                           help='Only parse message body multiparts declared as text (ignoring images etc.).')

    tornado.options.define("index_x_headers", type=bool, default=True,
                           help='Index x-* fields from headers')

    tornado.options.define("dry_run", type=bool, default=False,
                           help='Do not upload to Elasticsearch, just process messages')

    tornado.options.parse_command_line()

    # Exactly one of {infile, indir} must be set
    if bool(tornado.options.options.infile) ^ bool(tornado.options.options.indir):
        IOLoop.instance().run_sync(load_from_file)
    else:
        tornado.options.print_help()

--------------------------------------------------------------------------------