├── .gitignore ├── Dockerfile ├── Dockerfile-compose ├── LICENSE ├── README.md ├── VERSION ├── docker-compose.yml ├── example-config.cfg ├── imapbox.py ├── logo.png ├── mailboxresource.py ├── message.py └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | __pycache__/ 2 | *.py[cod] 3 | .tags 4 | config.cfg 5 | .idea 6 | tmp/** 7 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM python:slim-buster 2 | 3 | WORKDIR /opt/bin/ 4 | 5 | # Copy source files 6 | COPY *.py . 7 | COPY requirements.txt . 8 | COPY VERSION . 9 | 10 | # Install dependencies 11 | RUN pip install --no-cache-dir -r requirements.txt 12 | RUN apt-get update && apt-get install -y wkhtmltopdf 13 | 14 | # Make the data and config directory a volume 15 | VOLUME ["/etc/imapbox/"] 16 | VOLUME ["/var/imapbox/"] 17 | 18 | # Set entry point 19 | 20 | ENTRYPOINT ["python", "./imapbox.py"] 21 | -------------------------------------------------------------------------------- /Dockerfile-compose: -------------------------------------------------------------------------------- 1 | FROM python:3.7-alpine 2 | 3 | # Install dependencies 4 | RUN pip install six 5 | RUN pip install chardet 6 | RUN pip install pdfkit 7 | RUN apk add --update wkhtmltopdf 8 | 9 | # Copy source files and set entry point 10 | COPY *.py /opt/bin/ 11 | ENTRYPOINT ["python", "/opt/bin/imapbox.py"] -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | 2 | The MIT License (MIT) 3 | 4 | Permission is hereby granted, free of charge, to any person obtaining a copy 5 | of this software and associated documentation files (the "Software"), to deal 6 | in the Software without restriction, including without limitation the rights 7 | to 
use, copy, modify, merge, publish, distribute, sublicense, and/or sell 8 | copies of the Software, and to permit persons to whom the Software is 9 | furnished to do so, subject to the following conditions: 10 | 11 | The above copyright notice and this permission notice shall be included in all 12 | copies or substantial portions of the Software. 13 | 14 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 15 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 16 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 17 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 18 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 19 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 20 | SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ![IMAPBOX](logo.png) 2 | 3 | Dump an IMAP inbox to a local folder in a regular, backup-friendly format: HTML, PDF, JSON and attachments. 4 | 5 | This program aims to archive a mailbox using files in indexable or searchable formats. The produced files should be readable without external software, for example, to find an email in backups using only the terminal. 6 | 7 | For each email in an IMAP mailbox, a folder is created with the following files: 8 | 9 | File | Description 10 | ------------------|------------------ 11 | __message.html__ | Created if an HTML part exists for the message body. `message.html` is always in UTF-8, and the links to embedded images are rewritten to refer to the attachments subfolder. 12 | __message.pdf__ | This file is optionally created from `message.html` when the `wkhtmltopdf` option is set in the config file. 13 | __attachments__ | The attachments folder contains the attached files and the embedded images. 
14 | __message.txt__ | This file contains the body text if available in the original email, always converted to UTF-8. 15 | __metadata.json__ | Various information in JSON format: date, recipients, body text, etc. This file can be used by external applications or a search engine like [Elasticsearch](http://www.elasticsearch.com/). 16 | __raw.eml.gz__ | A gzipped version of the email in `.eml` format. 17 | 18 | Imapbox was designed to archive multiple mailboxes in one common directory tree; 19 | copies of the same message spread across several accounts are archived only once, using the Message-Id property. 20 | 21 | ## Install 22 | 23 | This script requires Python 3 on the `master` branch (the current code uses f-strings, so Python 3.6 or later) or Python 2 on the `python2` branch, and the following libraries: 24 | * [six](https://pypi.org/project/six) 25 | * [chardet](https://pypi.python.org/pypi/chardet) – required for character encoding detection. 26 | * [pdfkit](https://pypi.python.org/pypi/pdfkit) – optionally required for archiving emails to PDF. 27 | 28 | To install the required dependencies, run: 29 | 30 | ```bash 31 | pip install -r requirements.txt 32 | ``` 33 | 34 | ## Use cases 35 | 36 | * I use the script to merge all my mail accounts in one searchable directory on my NAS server. 37 | * Publishing the content of an email address, like a mailing list, on a website. 38 | * Sharing the addresses of several employees to perform cross-searches on a common database. 39 | * Archiving an IMAP account because of mailbox size restrictions, or to limit the disk space used on the IMAP server. 40 | * Archiving emails to PDF format. 
41 | 42 | ## Config file 43 | 44 | Use `./config.cfg`, `~/.config/imapbox/config.cfg` or `/etc/imapbox/config.cfg`. 45 | 46 | Example: 47 | ```ini 48 | [imapbox] 49 | local_folder=/var/imapbox 50 | days=6 51 | wkhtmltopdf=/opt/bin/wkhtmltopdf 52 | specific_folders=True 53 | 54 | [account1] 55 | host=mail.autistici.org 56 | username=username@domain 57 | password=secret 58 | ssl=True 59 | 60 | [account2] 61 | host=imap.googlemail.com 62 | username=username@gmail.com 63 | password=secret 64 | remote_folder=INBOX 65 | port=993 66 | ``` 67 | 68 | ### The imapbox section 69 | 70 | Possible parameters for the imapbox section: 71 | 72 | Parameter | Description 73 | ----------------|---------------------- 74 | local_folder | The full path to the folder where the emails should be stored. If `local_folder` is not set, imapbox will download the emails into the current directory. This can be overridden with the shell argument `-l`. 75 | days | Number of days back to fetch from the IMAP account. This should be greater than or equal to the interval between cron job runs. If this parameter is not set, imapbox will get all the emails from the IMAP account. This can be overridden with the shell argument `-d`. 76 | wkhtmltopdf | (optional) The location of the `wkhtmltopdf` binary. By default `pdfkit` will attempt to locate this using `which` (on UNIX type systems) or `where` (on Windows). This can be overridden with the shell argument `-w`. 77 | specific_folders| (optional) Back up into specific account subfolders. By default all accounts will be combined into one account folder. This can be overridden with the shell argument `-f`. 78 | test_only | (optional) Only a connection and folder retrieval test will be performed. This can be overridden with the shell argument `-t`. 79 | 80 | ### Other sections 81 | 82 | You can have as many configured accounts as you want, one per section. Section names may contain the account name. 
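
The "one section per account" layout can be sketched with Python's standard `configparser`. This is a minimal illustration, assuming the defaults documented in this README (`remote_folder` is `INBOX`, `port` is `993`, `ssl` is `False`); the function name `read_accounts`, the `SAMPLE` text and the example hostnames are hypothetical and not part of imapbox itself.

```python
import configparser

# Hypothetical sample configuration, shaped like the example above.
SAMPLE = """
[imapbox]
local_folder=/var/imapbox
days=6

[account1]
host=mail.example.org
username=user@example.org
password=secret
ssl=True

[account2]
host=imap.example.com
username=other@example.com
password=secret
remote_folder=INBOX
port=993
"""


def read_accounts(text):
    """Return one dict per account section, skipping the [imapbox] section."""
    config = configparser.ConfigParser(allow_no_value=True)
    config.read_string(text)
    accounts = []
    for section in config.sections():
        if section == 'imapbox':
            continue  # global options, not an account
        accounts.append({
            'name': section,
            'host': config.get(section, 'host', fallback=None),
            'username': config.get(section, 'username', fallback=None),
            'password': config.get(section, 'password', fallback=None),
            'remote_folder': config.get(section, 'remote_folder', fallback='INBOX'),
            'port': config.getint(section, 'port', fallback=993),
            'ssl': config.getboolean(section, 'ssl', fallback=False),
        })
    return accounts


if __name__ == '__main__':
    for account in read_accounts(SAMPLE):
        print(account['name'], account['host'], account['port'], account['ssl'])
```

Every section other than `imapbox` is treated as an account, so the section name doubles as the account name.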
83 | 84 | Possible parameters for an account section: 85 | 86 | Parameter | Description 87 | ----------------|---------------------- 88 | host | IMAP server hostname. 89 | username | Login id for the IMAP server. 90 | password | (optional) The password is stored in cleartext. For security reasons, run the imapbox script as a regular user and restrict access to your `~/.config/imapbox/config.cfg` file with `chmod 700`. The user will be prompted for a password if this parameter is missing. 91 | remote_folder | (optional) IMAP folder name, or a comma-separated list of folder names. Default value is `INBOX`. You can use `__ALL__` to fetch all folders. 92 | port | (optional) Default value is `993`. 93 | ssl | (optional) Default value is `False`. Set to `True` to enable SSL. 94 | dsn | (optional) Use a specific DSN to set account parameters. All other parameters in the account section will override these. A DSN can also be passed with the shell argument `-n`. 95 | 96 | ## Metadata file 97 | 98 | Property | Description 99 | ----------------|---------------------- 100 | Subject | Email subject 101 | Body | A text version of the message 102 | From | Name and email of the sender 103 | To | An array of recipients 104 | Cc | An array of recipients 105 | Attachments | An array of file names 106 | Date | Message date with the timezone included, in the `RFC 2822` format 107 | Utc | Message date converted to UTC, in the `ISO 8601` format. This can be used to sort emails or filter emails by date 108 | WithHtml | Boolean, whether the `message.html` file exists or not 109 | WithText | Boolean, whether the `message.txt` file exists or not 110 | 111 | ## Elasticsearch 112 | 113 | The `metadata.json` file contains the necessary information for a search engine like [Elasticsearch](http://www.elasticsearch.com/). 114 | Populating an Elasticsearch index with the email metadata can be done with a simple script.
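
As a hedged sketch of such a script in Python, the code below walks an archive directory, loads every `metadata.json`, and builds a newline-delimited JSON body with one action line and one document line per email, the general shape expected by the Elasticsearch bulk API. The index name `imapbox` comes from the examples in this section; the function names, the use of the `Id` property as document id and the `emails/` path are assumptions for illustration, and sending the payload is deliberately left out.

```python
import json
import os


def iter_metadata(root):
    """Yield (path, document) for every metadata.json found under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if 'metadata.json' in filenames:
            path = os.path.join(dirpath, 'metadata.json')
            with open(path, encoding='utf8') as handle:
                yield path, json.load(handle)


def build_bulk_payload(root, index='imapbox'):
    """Build a newline-delimited JSON body: an action line, then the document."""
    lines = []
    for path, doc in iter_metadata(root):
        # Assumption: fall back to the folder name when the message has no Id.
        doc_id = doc.get('Id') or os.path.basename(os.path.dirname(path))
        lines.append(json.dumps({'index': {'_index': index, '_id': doc_id}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The body ends with a newline when there is at least one document.
    return '\n'.join(lines) + '\n' if lines else ''


if __name__ == '__main__':
    print(build_bulk_payload('emails/'), end='')
```

The exact action fields accepted depend on the Elasticsearch version in use, so the payload should be checked against that version's documentation before it is posted.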
115 | 116 | Create an index: 117 | 118 | ```bash 119 | curl -XPUT 'localhost:9200/imapbox?pretty' 120 | ``` 121 | 122 | Add all emails to index: 123 | 124 | ```bash 125 | #!/bin/bash 126 | cd emails/ 127 | for ID in */ ; do 128 | curl -XPUT "localhost:9200/imapbox/message/${ID}?pretty" --data-binary "@${ID}/metadata.json" 129 | done 130 | ``` 131 | 132 | A front-end can be used to search in email archives: 133 | 134 | * [Calaca](https://github.com/polo2ro/Calaca) is a beautiful, easy to use, search UI for Elasticsearch. 135 | * [Facetview](https://github.com/okfn/facetview) 136 | 137 | ## Search in emails without indexation process 138 | 139 | [jq](http://stedolan.github.io/jq/) is a lightweight and flexible command-line JSON processor. 140 | 141 | Example command to browse emails: 142 | 143 | ```bash 144 | find . -name "*.json" | xargs cat | jq '[.Date, .Id, .Subject, " ✉ "] + .From | join(" ")' 145 | ``` 146 | 147 | Example with a filter on UTC date: 148 | 149 | ```bash 150 | find . -name "*.json" | xargs cat | jq 'select(.Utc > "20150221T130000Z")' 151 | ``` 152 | 153 | ## Docker compose 154 | 155 | ``` 156 | version: '3' 157 | services: 158 | 159 | imapbox: 160 | image: mauricemueller/imapbox 161 | container_name: imapbox 162 | volumes: 163 | - imapbox_data:/var/imapbox 164 | # change the path to the config 165 | - ./test/config.cfg:/etc/imapbox/config.cfg 166 | 167 | volumes: 168 | imapbox_data: 169 | ``` 170 | 171 | `docker-compose run --rm imapbox` 172 | 173 | ## Build own docker image and push to dockerhub 174 | 1. `docker login` 175 | 1. `docker-compose build` 176 | 1. `docker tag imapbox:latest [USERNAME]/imapbox:latest` 177 | 1. `docker push [USERNAME]/imapbox:latest` 178 | 179 | 180 | ## Similar projects 181 | 182 | [NoPriv](https://github.com/RaymiiOrg/NoPriv) is a python script to backup any IMAP capable email account to a browsable HTML archive and a Maildir folder. 
183 | 184 | ## License 185 | 186 | The MIT License (MIT) 187 | -------------------------------------------------------------------------------- /VERSION: -------------------------------------------------------------------------------- 1 | v1.0.0 -------------------------------------------------------------------------------- /docker-compose.yml: -------------------------------------------------------------------------------- 1 | version: '3' 2 | services: 3 | 4 | imapbox: 5 | image: mauricemueller/imapbox 6 | # enable this if you want to build your own docker image 7 | #build: 8 | # context: ./ 9 | # dockerfile: Dockerfile 10 | image: imapbox:latest 11 | container_name: imapbox 12 | volumes: 13 | # change the path './tmp/backup' to your back up folder 14 | - ./tmp/backup/:/var/imapbox/ 15 | # change the path './tmp/config.cfg' to the config 16 | - ./tmp/config.cfg:/etc/imapbox/config.cfg 17 | 18 | -------------------------------------------------------------------------------- /example-config.cfg: -------------------------------------------------------------------------------- 1 | [imapbox] 2 | local_folder= 3 | days=10 4 | wkhtmltopdf=/usr/bin/wkhtmltopdf 5 | 6 | [account1] 7 | host=imap.googlemail.com 8 | username= 9 | password= 10 | remote_folder= 11 | exclude_folders= 12 | port=993 13 | -------------------------------------------------------------------------------- /imapbox.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | #-*- coding:utf-8 -*- 3 | 4 | from mailboxresource import save_emails, get_folder_fist, get_account 5 | import argparse 6 | from six.moves import configparser 7 | import os 8 | import getpass 9 | 10 | 11 | def load_configuration(args): 12 | config = configparser.ConfigParser(allow_no_value=True) 13 | config.read(['./config.cfg', '/etc/imapbox/config.cfg', os.path.expanduser('~/.config/imapbox/config.cfg')]) 14 | 15 | options = { 16 | 'days': None, 17 | 'local_folder': '.', 18 | 
'wkhtmltopdf': None, 19 | 'specific_folders': False, 20 | 'test_only': False, 21 | 'accounts': [] 22 | } 23 | 24 | if (config.has_section('imapbox')): 25 | if config.has_option('imapbox', 'days'): 26 | options['days'] = config.getint('imapbox', 'days') 27 | 28 | if config.has_option('imapbox', 'local_folder'): 29 | options['local_folder'] = os.path.expanduser(config.get('imapbox', 'local_folder')) 30 | 31 | if config.has_option('imapbox', 'wkhtmltopdf'): 32 | options['wkhtmltopdf'] = os.path.expanduser(config.get('imapbox', 'wkhtmltopdf')) 33 | 34 | if config.has_option('imapbox', 'specific_folders'): 35 | options['specific_folders'] = config.getboolean('imapbox', 'specific_folders') 36 | 37 | if config.has_option('imapbox', 'test_only'): 38 | options['test_only'] = config.getboolean('imapbox', 'test_only') 39 | 40 | if args.specific_dsn: 41 | account = get_account(args.specific_dsn) 42 | if (None == account['host'] or None == account['username'] or None == account['password']): 43 | print('Invalid DSN: ' + args.specific_dsn) 44 | else: 45 | options['accounts'].append(account) 46 | 47 | else: 48 | for section in config.sections(): 49 | 50 | if ('imapbox' == section): 51 | continue 52 | 53 | if (args.specific_account and (args.specific_account != section)): 54 | continue 55 | 56 | account = { 57 | 'name': section, 58 | 'remote_folder': 'INBOX', 59 | 'username': None, 60 | 'password': None, 61 | 'host': None, 62 | 'port': 993, 63 | 'ssl': False 64 | } 65 | 66 | if config.has_option(section, 'dsn'): 67 | account = get_account(config.get(section, 'dsn'), account['name']) 68 | 69 | if config.has_option(section, 'host'): 70 | account['host'] = config.get(section, 'host') 71 | 72 | if config.has_option(section, 'port'): 73 | account['port'] = config.get(section, 'port') 74 | 75 | if config.has_option(section, 'username'): 76 | account['username'] = config.get(section, 'username') 77 | 78 | if config.has_option(section, 'password'): 79 | account['password'] = 
config.get(section, 'password') 80 | elif not account['password']: 81 | prompt=('Password for ' + account['username'] + ':' + account['host'] + ': ') 82 | account['password'] = getpass.getpass(prompt=prompt) 83 | 84 | if config.has_option(section, 'ssl'): 85 | if config.get(section, 'ssl').lower() == "true": 86 | account['ssl'] = True 87 | 88 | if config.has_option(section, 'remote_folder'): 89 | account['remote_folder'] = config.get(section, 'remote_folder') 90 | 91 | if config.has_option(section, 'exclude_folders'): 92 | exclude_folders_str = config.get(section, 'exclude_folders') 93 | account['exclude_folders'] = [folder.strip() for folder in exclude_folders_str.split(',')] 94 | else: 95 | account['exclude_folders'] = [] # Empty list if no exclude_folders are specified 96 | 97 | if (None == account['host'] or None == account['username'] or None == account['password']): 98 | print('Invalid account: ' + section) 99 | continue 100 | 101 | options['accounts'].append(account) 102 | 103 | if (args.local_folder): 104 | options['local_folder'] = args.local_folder 105 | 106 | if (args.days): 107 | options['days'] = args.days 108 | 109 | if (args.wkhtmltopdf): 110 | options['wkhtmltopdf'] = args.wkhtmltopdf 111 | 112 | if (args.specific_folders): 113 | options['specific_folders'] = True 114 | 115 | if (args.test_only): 116 | options['test_only'] = True 117 | 118 | if (args.show_version): 119 | with open(os.path.join(os.path.dirname(os.path.abspath(__file__)), 'VERSION'), 'r') as version_file: 120 | print(version_file.read()) 121 | exit(0) 122 | 123 | return options 124 | 125 | 126 | 127 | 128 | def main(): 129 | argparser = argparse.ArgumentParser(description="Dump an IMAP folder into .eml files") 130 | argparser.add_argument('-l', dest='local_folder', help="Local folder where to create the email folders") 131 | argparser.add_argument('-d', dest='days', help="Number of days back to get in the IMAP account", type=int) 132 | argparser.add_argument('-w', 
dest='wkhtmltopdf', help="The location of the wkhtmltopdf binary") 133 | argparser.add_argument('-a', dest='specific_account', help="Select a specific account to backup") 134 | argparser.add_argument('-f', dest='specific_folders', help="Backup into specific account subfolders", action='store_true') 135 | argparser.add_argument('-t', dest='test_only', help="Only a connection and folder retrieval test will be performed", action='store_true') 136 | argparser.add_argument('-n', dest='specific_dsn', help="Use a specific DSN as account") 137 | argparser.add_argument('-v', '--version', dest='show_version', help="Show the current version", action="store_true") 138 | args = argparser.parse_args() 139 | options = load_configuration(args) 140 | rootDir = options['local_folder'] 141 | 142 | if not options['accounts']: 143 | argparser.print_help() 144 | 145 | for account in options['accounts']: 146 | 147 | print('{}/{} (on {})'.format(account['name'], account['remote_folder'], account['host'])) 148 | 149 | if options['test_only']: 150 | try: 151 | get_folder_fist(account) 152 | print(" - SUCCESS: Login and folder retrieval") 153 | except Exception: 154 | print("\x1b[31;20m" + " - FAILED: Login and folder retrieval" + "\x1b[0m") 155 | continue 156 | 157 | if options['specific_folders']: 158 | basedir = os.path.join(rootDir, account['name']) 159 | else: 160 | basedir = rootDir 161 | 162 | if account['remote_folder'] == "__ALL__": 163 | folders = [] 164 | for folder_entry in get_folder_fist(account): 165 | folder_name = folder_entry.decode().replace("/", ".").split(' "." 
')[1] 166 | if folder_name not in account['exclude_folders']: 167 | folders.append(folder_name) 168 | # Remove Gmail parent folder from array otherwise the script fails: 169 | if '"[Gmail]"' in folders: folders.remove('"[Gmail]"') 170 | # Remove Gmail "All Mail" folder which just duplicates emails: 171 | if '"[Gmail].All Mail"' in folders: folders.remove('"[Gmail].All Mail"') 172 | else: 173 | folders = str.split(account['remote_folder'], ',') 174 | for folder_entry in folders: 175 | print("Saving folder: " + folder_entry) 176 | account['remote_folder'] = folder_entry 177 | options['local_folder'] = os.path.join(basedir, folder_entry.replace('"', '')) 178 | save_emails(account, options) 179 | 180 | 181 | if __name__ == '__main__': 182 | main() 183 | -------------------------------------------------------------------------------- /logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/polo2ro/imapbox/5ec0193d59b77268b74787bdc1ab84938871a32f/logo.png -------------------------------------------------------------------------------- /mailboxresource.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | #-*- coding:utf-8 -*- 3 | 4 | from __future__ import print_function 5 | 6 | import imaplib, email 7 | import re 8 | import os 9 | import hashlib 10 | from message import Message 11 | import datetime 12 | import urllib.parse 13 | 14 | MAX_RETRIES = 5 15 | 16 | class MailboxClient: 17 | """Operations on a mailbox""" 18 | 19 | def __init__(self, host, port, username, password, remote_folder, ssl): 20 | 21 | self.host = host 22 | self.port = port 23 | self.username = username 24 | self.password = password 25 | self.remote_folder = remote_folder 26 | self.ssl = ssl 27 | 28 | self.connect_to_imap() 29 | 30 | def connect_to_imap(self): 31 | retries = 0 32 | while retries < MAX_RETRIES: 33 | try: 34 | if not self.ssl: # Use the stored value 35 | 
self.mailbox = imaplib.IMAP4(self.host, self.port) # Use the stored values 36 | else: 37 | self.mailbox = imaplib.IMAP4_SSL(self.host, self.port) 38 | self.mailbox.login(self.username, self.password) 39 | typ, data = self.mailbox.select(self.remote_folder, readonly=True) 40 | if typ != 'OK': 41 | # Handle case where Exchange/Outlook uses '.' path separator when 42 | # reporting subfolders. Adjust to use '/' on remote. 43 | adjust_remote_folder = re.sub(r'\.', '/', self.remote_folder) 44 | typ, data = self.mailbox.select(adjust_remote_folder, readonly=True) 45 | if typ != 'OK': 46 | print("MailboxClient: Could not select remote folder '%s'" % self.remote_folder) 47 | break # Successful connection and folder selection 48 | except ConnectionResetError as e: 49 | print(f"Connection error: {e}. Will retry...") 50 | retries += 1 51 | except Exception as e: 52 | print(f"MailboxClient: The following error happened: {e}. Will NOT retry...") 53 | exit(1) 54 | 55 | if retries == MAX_RETRIES: 56 | print("Maximum retries reached. 
Exiting...") 57 | exit(1) 58 | 59 | def search_emails(self, criterion, batch_size=5000): 60 | all_uids = [] 61 | last_num = 0 62 | 63 | while True: 64 | typ, data = self.mailbox.search(None, criterion, f'{last_num+1}:{last_num + batch_size}') 65 | if typ != 'OK': 66 | raise imaplib.IMAP4.error(f"Error on searching emails: {data}") 67 | 68 | if data and len(data) > 0 and data[0]: 69 | batch_uids = data[0].split() 70 | else: 71 | batch_uids = [] 72 | 73 | if not batch_uids: 74 | break 75 | 76 | all_uids.extend(batch_uids) 77 | last_num = last_num + batch_size 78 | 79 | return all_uids 80 | 81 | def copy_emails(self, days, local_folder, wkhtmltopdf): 82 | 83 | n_saved = 0 84 | n_exists = 0 85 | 86 | self.local_folder = local_folder 87 | self.wkhtmltopdf = wkhtmltopdf 88 | criterion = 'ALL' 89 | 90 | if days: 91 | date = (datetime.date.today() - datetime.timedelta(days)).strftime("%d-%b-%Y") 92 | criterion = '(SENTSINCE {date})'.format(date=date) 93 | 94 | uids = self.search_emails(criterion) 95 | if uids: 96 | print("Copying emails...") 97 | total = len(uids) 98 | for idx, num in enumerate(uids): 99 | fetch_retries = 0 100 | while fetch_retries < MAX_RETRIES: 101 | try: 102 | typ, data = self.mailbox.fetch(num, '(BODY.PEEK[])') 103 | print('\r{0:.2f}%'.format(idx*100/total), end='') 104 | if self.saveEmail(data): 105 | n_saved += 1 106 | else: 107 | n_exists += 1 108 | break 109 | except ConnectionResetError as e: 110 | print(f"Connection error while fetching email: {e}. Retrying...") 111 | self.connect_to_imap() 112 | fetch_retries += 1 113 | except imaplib.IMAP4.abort as e: 114 | print(f"Abort error while fetching email: {e}. Skipping...") 115 | self.connect_to_imap() 116 | break 117 | except Exception as e: 118 | print(f"Error while fetching email: {e}. Skipping...") 119 | break 120 | if fetch_retries == MAX_RETRIES: 121 | print("\nMaximum retries reached. 
Exiting...") 122 | exit(1) 123 | print("\rDone.") 124 | return (n_saved, n_exists) 125 | 126 | def cleanup(self): 127 | self.mailbox.close() 128 | self.mailbox.logout() 129 | 130 | 131 | def getEmailFolder(self, msg, data): 132 | # 255 is the max filename length on all systems 133 | if msg['Message-Id'] and len(msg['Message-Id']) < 255: 134 | foldername = re.sub(r'[^a-zA-Z0-9_\-\.() ]+', '', msg['Message-Id']) 135 | else: 136 | foldername = hashlib.sha224(data).hexdigest() 137 | 138 | year = 'None' 139 | if msg['Date']: 140 | match = re.search(r'\d{1,2}\s\w{3}\s(\d{4})', msg['Date']) 141 | if match: 142 | year = match.group(1) 143 | 144 | 145 | return os.path.join(self.local_folder, year, foldername) 146 | 147 | 148 | 149 | def saveEmail(self, data): 150 | for response_part in data: 151 | if isinstance(response_part, tuple): 152 | msg = "" 153 | # Handle Python version differences: 154 | # Python 2 imaplib returns str, Python 3 imaplib 155 | # returns bytes. 156 | if isinstance(response_part[1], str): 157 | msg = email.message_from_string(response_part[1]) 158 | else: 159 | try: 160 | msg = email.message_from_string(response_part[1].decode("utf-8")) 161 | except UnicodeDecodeError: 162 | # print("couldn't decode message with utf-8 - trying 'ISO-8859-1'") 163 | msg = email.message_from_string(response_part[1].decode("ISO-8859-1")) 164 | 165 | directory = self.getEmailFolder(msg, data[0][1]) 166 | 167 | if os.path.exists(directory): 168 | return False 169 | 170 | os.makedirs(directory) 171 | 172 | try: 173 | message = Message(directory, msg) 174 | message.createRawFile(data[0][1]) 175 | message.createMetaFile() 176 | message.extractAttachments() 177 | 178 | if self.wkhtmltopdf: 179 | message.createPdfFile(self.wkhtmltopdf) 180 | 181 | except Exception as e: 182 | # ex: Unsupported charset on decode 183 | if hasattr(e, 'strerror'): 184 | if e.strerror is not None: 185 | print(directory) 186 | print("MailboxClient.saveEmail() failed:", e.strerror) 187 | else: 188 | 
print("MailboxClient.saveEmail() failed") 189 | print(e) 190 | 191 | return True 192 | 193 | 194 | def save_emails(account, options): 195 | mailbox = MailboxClient(account['host'], account['port'], account['username'], account['password'], account['remote_folder'], account['ssl']) 196 | stats = mailbox.copy_emails(options['days'], options['local_folder'], options['wkhtmltopdf']) 197 | mailbox.cleanup() 198 | if stats[0] == 0 and stats[1] == 0: 199 | print('Folder {} is empty'.format(account['remote_folder'])) 200 | else: 201 | print('{} emails created, {} emails already exist'.format(stats[0], stats[1])) 202 | 203 | 204 | def get_folder_fist(account): 205 | if not account['ssl']: 206 | mailbox = imaplib.IMAP4(account['host'], account['port']) 207 | else: 208 | mailbox = imaplib.IMAP4_SSL(account['host'], account['port']) 209 | mailbox.login(account['username'], account['password']) 210 | folder_list = mailbox.list()[1] 211 | mailbox.logout() 212 | return folder_list 213 | 214 | # DSN: 215 | # defaults to INBOX, path represents a single folder: 216 | # imap://username:password@imap.gmail.com:993/ 217 | # imap://username:password@imap.gmail.com:993/INBOX 218 | # 219 | # get all folders 220 | # imap://username:password@imap.gmail.com:993/__ALL__ 221 | # 222 | # single folder with ssl, both are the same: 223 | # imaps://username:password@imap.gmail.com:993/INBOX 224 | # imap://username:password@imap.gmail.com:993/INBOX?ssl=true 225 | # 226 | # folder provided as path or as query param "remote_folder" with comma separated list 227 | # imap://username:password@imap.gmail.com:993/INBOX.Drafts 228 | # imap://username:password@imap.gmail.com:993/?remote_folder=INBOX.Drafts 229 | # 230 | # combined list of folders with path and ?remote_folder 231 | # imap://username:password@imap.gmail.com:993/INBOX.Drafts?remote_folder=INBOX.Sent 232 | # 233 | # with multiple remote_folder: 234 | # imap://username:password@imap.gmail.com:993/?remote_folder=INBOX.Drafts 235 | # 
imap://username:password@imap.gmail.com:993/?remote_folder=INBOX.Drafts,INBOX.Sent 236 | # 237 | # setting other parameters 238 | # imap://username:password@imap.gmail.com:993/?name=Account1 239 | def get_account(dsn, name=None): 240 | account = { 241 | 'name': 'account', 242 | 'host': None, 243 | 'port': 993, 244 | 'username': None, 245 | 'password': None, 246 | 'remote_folder': 'INBOX', # String (might contain a comma separated list of folders) 247 | 'ssl': False, 248 | } 249 | 250 | parsed_url = urllib.parse.urlparse(dsn) 251 | 252 | if parsed_url.scheme.lower() not in ['imap', 'imaps']: 253 | raise ValueError('Scheme must be "imap" or "imaps"') 254 | 255 | account['ssl'] = parsed_url.scheme.lower() == 'imaps' 256 | 257 | if parsed_url.hostname: 258 | account['host'] = parsed_url.hostname 259 | 260 | if parsed_url.port: 261 | account['port'] = parsed_url.port 262 | if parsed_url.username: 263 | account['username'] = urllib.parse.unquote(parsed_url.username) 264 | if parsed_url.password: 265 | account['password'] = urllib.parse.unquote(parsed_url.password) 266 | 267 | # prefill account name, if none was provided (by config.cfg) in case of calling it from commandline. 
can be overwritten by the query param 'name' 268 | if name: 269 | account['name'] = name 270 | 271 | else: 272 | if (account['username']): 273 | account['name'] = account['username'] 274 | 275 | if (account['host']): 276 | account['name'] += '@' + account['host'] 277 | 278 | if parsed_url.path != '': 279 | account['remote_folder'] = parsed_url.path.lstrip('/').rstrip('/') 280 | 281 | if parsed_url.query != '': 282 | query_params = urllib.parse.parse_qs(parsed_url.query) 283 | 284 | # merge query params into account 285 | for key, value in query_params.items(): 286 | 287 | if key == 'remote_folder': 288 | if account['remote_folder'] is not None: 289 | account['remote_folder'] += ',' + value[0] 290 | else: 291 | account['remote_folder'] = value[0] 292 | 293 | elif key == 'ssl': 294 | account['ssl'] = value[0].lower() == 'true' 295 | 296 | # merge all others params, to be able to overwrite username, password, ... and future account options 297 | else: 298 | account[key] = value[0] if len(value) == 1 else value 299 | 300 | return account 301 | -------------------------------------------------------------------------------- /message.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | #-*- coding:utf-8 -*- 3 | 4 | 5 | import email 6 | from email.utils import parseaddr 7 | from email.header import decode_header 8 | import re 9 | import os 10 | import signal 11 | import posixpath 12 | import sys 13 | import json 14 | import io 15 | import mimetypes 16 | import chardet 17 | import gzip 18 | import html 19 | import time 20 | import pkgutil 21 | 22 | from six.moves import html_parser 23 | 24 | # import pdfkit if its loader is available 25 | has_pdfkit = pkgutil.find_loader('pdfkit') is not None 26 | if has_pdfkit: import pdfkit 27 | 28 | TIMEOUT_SECONDS = 120 29 | 30 | # email address REGEX matching the RFC 2822 spec 31 | # from perlfaq9 32 | # my $atom = qr{[a-zA-Z0-9_!#\$\%&'*+/=?\^`{}~|\-]+}; 33 | # my 
$dot_atom = qr{$atom(?:\.$atom)*}; 34 | # my $quoted = qr{"(?:\\[^\r\n]|[^\\"])*"}; 35 | # my $local = qr{(?:$dot_atom|$quoted)}; 36 | # my $domain_lit = qr{\[(?:\\\S|[\x21-\x5a\x5e-\x7e])*\]}; 37 | # my $domain = qr{(?:$dot_atom|$domain_lit)}; 38 | # my $addr_spec = qr{$local\@$domain}; 39 | # 40 | # Python translation 41 | 42 | atom_rfc2822=r"[a-zA-Z0-9_!#\$\%&'*+/=?\^`{}~|\-]+" 43 | atom_posfix_restricted=r"[a-zA-Z0-9_#\$&'*+/=?\^`{}~|\-]+" # without '!' and '%' 44 | atom=atom_rfc2822 45 | dot_atom=atom + r"(?:\." + atom + ")*" 46 | quoted=r'"(?:\\[^\r\n]|[^\\"])*"' 47 | local="(?:" + dot_atom + "|" + quoted + ")" 48 | domain_lit=r"\[(?:\\\S|[\x21-\x5a\x5e-\x7e])*\]" 49 | domain="(?:" + dot_atom + "|" + domain_lit + ")" 50 | addr_spec=local + r"\@" + domain 51 | 52 | email_address_re=re.compile('^'+addr_spec+'$') 53 | 54 | 55 | 56 | 57 | class MLStripper(html_parser.HTMLParser): 58 | def __init__(self): 59 | self.reset() 60 | self.fed = [] 61 | def convert_charrefs(x): 62 | return x 63 | def handle_data(self, d): 64 | self.fed.append(d) 65 | def get_data(self): 66 | return ''.join(self.fed) 67 | 68 | def strip_tags(html): 69 | s = MLStripper() 70 | s.feed(html) 71 | return s.get_data() 72 | 73 | 74 | 75 | class Message: 76 | """Operation on a message""" 77 | 78 | def __init__(self, directory, msg): 79 | self.msg = msg 80 | self.directory = directory 81 | 82 | def getmailheader(self, header_text, default="ascii"): 83 | """Decode header_text if needed""" 84 | try: 85 | headers=decode_header(header_text) 86 | except email.errors.HeaderParseError: 87 | # This already happened in email.base64mime.decode() 88 | # instead return a sanitized ascii string 89 | return header_text.encode('ascii', 'replace').decode('ascii') 90 | else: 91 | for i, (text, charset) in enumerate(headers): 92 | headers[i]=text 93 | if charset: 94 | headers[i]=str(text, charset) 95 | else: 96 | headers[i]=str(text) 97 | return u"".join(headers) 98 | 99 | 100 | def getmailaddresses(self, prop): 101 | 
"""retrieve From:, To: and Cc: addresses""" 102 | addrs=email.utils.getaddresses(self.msg.get_all(prop, [])) 103 | for i, (name, addr) in enumerate(addrs): 104 | if not name and addr: 105 | # only one string! Is it the address or is it the name ? 106 | # use the same for both and see later 107 | name=addr 108 | 109 | try: 110 | # address must be ascii only 111 | addr=addr.encode('ascii') 112 | except UnicodeError: 113 | addr='' 114 | else: 115 | # address must match adress regex 116 | if not email_address_re.match(addr.decode("utf-8")): 117 | addr='' 118 | if not isinstance(addr, str): 119 | # Python 2 imaplib returns a bytearray, 120 | # Python 3 imaplib returns a str. 121 | addrs[i]=(self.getmailheader(name), addr.decode("utf-8")) 122 | return addrs 123 | 124 | def getSubject(self): 125 | if not hasattr(self, 'subject'): 126 | self.subject = self.getmailheader(self.msg.get('Subject', '')) 127 | return self.subject 128 | 129 | def getFrom(self): 130 | if not hasattr(self, 'from_'): 131 | self.from_ = self.getmailaddresses('from') 132 | self.from_ = ('', '') if not self.from_ else self.from_[0] 133 | return self.from_ 134 | 135 | def normalizeDate(self, datestr): 136 | if not datestr: 137 | print("No date for '%s'. Using Unix Epoch instead." 
% self.directory) 138 | datestr="Thu, 1 Jan 1970 00:00:00 +0000" 139 | t = email.utils.parsedate_tz(datestr) 140 | timeval = time.mktime(t[:-1]) 141 | date = email.utils.formatdate(timeval, True) 142 | utc = time.gmtime(email.utils.mktime_tz(t)) 143 | rfc2822 = '{} {:+03d}00'.format(date[:-6], t[9]//3600) 144 | iso8601 = time.strftime('%Y%m%dT%H%M%SZ', utc) 145 | 146 | return (rfc2822, iso8601) 147 | 148 | def createMetaFile(self): 149 | tos=self.getmailaddresses('to') 150 | ccs=self.getmailaddresses('cc') 151 | 152 | parts = self.getParts() 153 | attachments = [] 154 | for afile in parts['files']: 155 | attachments.append(afile[1]) 156 | 157 | text_content = '' 158 | 159 | if parts['text']: 160 | text_content = self.getTextContent(parts['text']) 161 | else: 162 | if parts['html']: 163 | text_content = strip_tags(self.getHtmlContent(parts['html'])) 164 | 165 | rfc2822, iso8601 = self.normalizeDate(self.msg['Date']) 166 | 167 | with io.open('%s/metadata.json' %(self.directory), 'w', encoding='utf8') as json_file: 168 | data = json.dumps({ 169 | 'Id': self.msg['Message-Id'], 170 | 'Subject' : self.getSubject(), 171 | 'From' : self.getFrom(), 172 | 'To' : tos, 173 | 'Cc' : ccs, 174 | 'Date' : rfc2822, 175 | 'Utc' : iso8601, 176 | 'Attachments': attachments, 177 | 'WithHtml': len(parts['html']) > 0, 178 | 'WithText': len(parts['text']) > 0, 179 | 'Body': text_content 180 | }, indent=4, ensure_ascii=False) 181 | 182 | json_file.write(data) 183 | 184 | json_file.close() 185 | 186 | 187 | 188 | 189 | def createRawFile(self, data): 190 | f = gzip.open('%s/raw.eml.gz' %(self.directory), 'wb') 191 | f.write(data) 192 | f.close() 193 | 194 | 195 | def getPartCharset(self, part): 196 | if part.get_content_charset() is None: 197 | # Python 2 chardet expects a string, 198 | # Python 3 chardet expects a bytearray. 
199 | if sys.version_info[0] < 3: 200 | return chardet.detect(part.as_string())['encoding'] 201 | else: 202 | try: 203 | return chardet.detect(part.as_bytes())['encoding'] 204 | except UnicodeEncodeError: 205 | string = part.as_string() 206 | array = bytearray(string, 'utf-8') 207 | return chardet.detect(array)['encoding'] 208 | return part.get_content_charset() 209 | 210 | 211 | def getTextContent(self, parts): 212 | if not hasattr(self, 'text_content'): 213 | self.text_content = '' 214 | for part in parts: 215 | raw_content = part.get_payload(decode=True) 216 | charset = self.getPartCharset(part) 217 | self.text_content += raw_content.decode(charset, "replace") 218 | return self.text_content 219 | 220 | 221 | def createTextFile(self, parts): 222 | utf8_content = self.getTextContent(parts) 223 | with open(os.path.join(self.directory, 'message.txt'), 'wb') as fp: 224 | fp.write(bytearray(utf8_content, 'utf-8')) 225 | 226 | def getHtmlContent(self, parts): 227 | if not hasattr(self, 'html_content'): 228 | self.html_content = '' 229 | 230 | for part in parts: 231 | raw_content = part.get_payload(decode=True) 232 | charset = self.getPartCharset(part) 233 | self.html_content += raw_content.decode(charset, "replace") 234 | 235 | m = re.search(r'<body[^>]*>(.+)<\/body>', self.html_content, re.S | re.I) 236 | if m is not None: 237 | self.html_content = m.group(1) 238 | 239 | return self.html_content 240 | 241 | 242 | def createHtmlFile(self, parts, embed): 243 | utf8_content = self.getHtmlContent(parts) 244 | for img in embed: 245 | pattern = r'src=["\']cid:%s["\']' % (re.escape(img[0])) 246 | path = posixpath.join('attachments', img[1]) 247 | utf8_content = re.sub(pattern, 'src="%s"' % (path), utf8_content, count=0, flags=re.S | re.I) 248 | 249 | 250 | subject = self.getSubject() 251 | fromname = self.getFrom()[0] 252 | 253 | utf8_content = """<!doctype html> 254 | <html> 255 | <head> 256 | <meta charset="utf-8"> 257 | <meta name="author" content="%s"> 258 | <title>%s</title> 259 | </head> 260 | <body> 261 | %s 262 | </body> 263 | </html>""" % (html.escape(fromname), html.escape(subject), utf8_content) 264 | 265 | with 
open(os.path.join(self.directory, 'message.html'), 'wb') as fp: 266 | fp.write(bytearray(utf8_content, 'utf-8')) 267 | 268 | 269 | def sanitizeFilename(self, filename): 270 | keepcharacters = (' ','.','_','-') 271 | return "".join(c for c in filename if c.isalnum() or c in keepcharacters).rstrip() 272 | 273 | 274 | def getParts(self): 275 | if not hasattr(self, 'message_parts'): 276 | counter = 1 277 | message_parts = { 278 | 'text': [], 279 | 'html': [], 280 | 'embed_images': [], 281 | 'files': [] 282 | } 283 | 284 | 285 | 286 | for part in self.msg.walk(): 287 | # multipart/* are just containers 288 | if part.get_content_maintype() == 'multipart': 289 | continue 290 | 291 | # Applications should really sanitize the given filename so that an 292 | # email message can't be used to overwrite important files 293 | filename = part.get_filename() 294 | if not filename: 295 | if part.get_content_type() == 'text/plain': 296 | message_parts['text'].append(part) 297 | continue 298 | 299 | if part.get_content_type() == 'text/html': 300 | message_parts['html'].append(part) 301 | continue 302 | 303 | ext = mimetypes.guess_extension(part.get_content_type()) 304 | if not ext: 305 | # Use a generic bag-of-bits extension 306 | ext = '.bin' 307 | filename = 'part-%03d%s' % (counter, ext) 308 | 309 | filename = self.sanitizeFilename(filename) 310 | 311 | content_id =part.get('Content-Id') 312 | if (content_id): 313 | content_id = content_id[1:][:-1] 314 | message_parts['embed_images'].append((content_id, filename)) 315 | 316 | counter += 1 317 | message_parts['files'].append((part, filename)) 318 | self.message_parts = message_parts 319 | return self.message_parts 320 | 321 | 322 | def extractAttachments(self): 323 | message_parts = self.getParts() 324 | 325 | if message_parts['text']: 326 | self.createTextFile(message_parts['text']) 327 | 328 | if message_parts['html']: 329 | self.createHtmlFile(message_parts['html'], message_parts['embed_images']) 330 | 331 | if 
message_parts['files']: 332 | attdir = os.path.join(self.directory, 'attachments') 333 | if not os.path.exists(attdir): 334 | os.makedirs(attdir) 335 | for afile in message_parts['files']: 336 | with open(os.path.join(attdir, afile[1]), 'wb') as fp: 337 | payload = afile[0].get_payload(decode=True) 338 | if payload: 339 | fp.write(payload) 340 | 341 | 342 | def createPdfFile(self, wkhtmltopdf): 343 | if has_pdfkit: 344 | html_path = os.path.join(self.directory, 'message.html') 345 | pdf_path = os.path.join(self.directory, 'message.pdf') 346 | config = pdfkit.configuration(wkhtmltopdf=wkhtmltopdf) 347 | 348 | def timeout_handler(signum, frame): 349 | raise TimeoutError("PDF creation timed out.") 350 | 351 | signal.signal(signal.SIGALRM, timeout_handler) 352 | signal.alarm(TIMEOUT_SECONDS) 353 | 354 | try: 355 | pdfkit.from_file(html_path, pdf_path, configuration=config) 356 | except TimeoutError: 357 | print("Timeout while creating PDF. wkhtmltopdf was terminated.") 358 | finally: 359 | signal.alarm(0) 360 | else: 361 | print("Couldn't create PDF message, since \"pdfkit\" module isn't installed.") 362 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | chardet 2 | pdfkit 3 | six 4 | --------------------------------------------------------------------------------