├── README.md
└── lambda_email_extractor.py
/README.md:
--------------------------------------------------------------------------------
AWS Lambda Attachment Extractor
===============================

This is a simple reference project for creating Lambda functions that manipulate email
files stored in an Amazon S3 bucket, typically by an inbound SES receipt rule. See the
[AWS documentation](http://docs.aws.amazon.com/ses/latest/DeveloperGuide/receiving-email.html) for more information.


Use cases
---------

The original use for this project was to process DMARC reports: fetching .ZIP and .GZIP attachments from emails stored in S3 and extracting the archives, resulting in .XML files, which are then uploaded to another S3 bucket for processing by another service.

The Python file doesn't perform security checks on attachments and will ignore anything that does not
result in an .XML file. The original email is deleted after processing takes place.

This type of processing may be very useful for extracting DMARC reports, which are sent by email service providers
to an email address specified in the _rua_ section of your DMARC DNS record; see https://dmarc.org/overview/. You
will then be left with a bucket of XML DMARC reports, which you may wish to process manually or upload to a service such
as https://dmarcian.com/dmarc-xml/.

Getting Started
---------------

To get started, simply upload the Python file to Lambda, changing the variables if required. This Lambda function expects an S3 put event to trigger it; see [this tutorial](http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html) for more information.
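For reference, the handler pulls the bucket name and the URL-decoded object key out of the S3 put event. A minimal sketch of such an event is shown below; the bucket and key names are placeholders, not real resources:

```python
from urllib.parse import unquote_plus

# A minimal sketch of the S3 put event the handler receives.
# The bucket and key names here are placeholders for illustration.
event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "my-inbound-mail-bucket"},
                "object": {"key": "inbound/report%40example.com"},
            }
        }
    ]
}

# The handler reads the bucket and URL-decoded key like this:
bucket = event["Records"][0]["s3"]["bucket"]["name"]
key = unquote_plus(event["Records"][0]["s3"]["object"]["key"])
print(bucket, key)  # my-inbound-mail-bucket inbound/report@example.com
```

Note that S3 URL-encodes object keys in event payloads, which is why the key must be decoded before it is used in API calls.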
--------------------------------------------------------------------------------
/lambda_email_extractor.py:
--------------------------------------------------------------------------------
import email
import gzip
import os
import zipfile
from urllib.parse import unquote_plus

import boto3

print('Loading function')

s3 = boto3.client('s3')
s3r = boto3.resource('s3')
xmlDir = "/tmp/output/"

outputBucket = ""  # Set here for a separate bucket, otherwise the event's bucket is used
outputPrefix = "xml/"  # Should end with /


def lambda_handler(event, context):
    global outputBucket

    bucket = event['Records'][0]['s3']['bucket']['name']
    key = unquote_plus(event['Records'][0]['s3']['object']['key'])

    try:
        # Set outputBucket if required
        if not outputBucket:
            outputBucket = bucket

        # Use a waiter to ensure the object is persisted
        waiter = s3.get_waiter('object_exists')
        waiter.wait(Bucket=bucket, Key=key)

        response = s3r.Bucket(bucket).Object(key)

        # Read the raw message body into an email Message object
        msg = email.message_from_bytes(response.get()["Body"].read())

        if len(msg.get_payload()) == 2:

            # Create a directory for the XML files (makes debugging easier)
            if not os.path.isdir(xmlDir):
                os.mkdir(xmlDir)

            # The first attachment
            attachment = msg.get_payload()[1]

            # Extract the attachment into /tmp/output
            extract_attachment(attachment)

            # Upload the XML files to S3
            upload_resulting_files_to_s3()

            # Delete the original email now that processing is complete
            delete_file(key, bucket)

        else:
            print("Could not see file/attachment.")

        return 0
    except Exception as e:
        print(e)
        print('Error getting object {} from bucket {}. Make sure they exist '
              'and your bucket is in the same region as this '
              'function.'.format(key, bucket))
        raise e


def extract_attachment(attachment):
    # Process filename.xml.gz attachments
    if "gzip" in attachment.get_content_type():
        contentdisp = attachment.get('Content-Disposition').split('=')
        fname = contentdisp[1].replace('"', '')
        with open('/tmp/' + fname, 'wb') as f:
            f.write(attachment.get_payload(decode=True))
        # This assumes we have filename.xml.gz; if we get this wrong, we will
        # just ignore the report
        xmlname = fname[:-3]
        with gzip.open('/tmp/' + fname, 'rb') as gz, open(xmlDir + xmlname, 'wb') as out:
            out.write(gz.read())

    # Process filename.zip attachments (providers not complying with the standard)
    elif "zip" in attachment.get_content_type():
        with open('/tmp/attachment.zip', 'wb') as f:
            f.write(attachment.get_payload(decode=True))
        with zipfile.ZipFile('/tmp/attachment.zip', "r") as z:
            z.extractall(xmlDir)

    else:
        print('Skipping ' + attachment.get_content_type())


def upload_resulting_files_to_s3():
    # Put all XML back into S3 (covers non-compliant cases where a ZIP contains multiple results)
    for fileName in os.listdir(xmlDir):
        if fileName.endswith(".xml"):
            print("Uploading: " + fileName)  # File name to upload
            s3r.meta.client.upload_file(xmlDir + fileName, outputBucket, outputPrefix + fileName)


# Delete the file in the source bucket
def delete_file(key, bucket):
    s3.delete_object(Bucket=bucket, Key=key)
    print("%s deleted from %s" % (key, bucket))

--------------------------------------------------------------------------------
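The gzip path in `extract_attachment` can be exercised locally without S3 or Lambda. The sketch below builds a two-part message with a gzipped XML attachment (the filename and XML payload are made up for illustration), re-parses it from bytes as the handler does with the S3 object body, and recovers the XML:

```python
import email
import gzip
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Hypothetical DMARC-style payload, made up for this example.
xml_body = b"<feedback><report_metadata/></feedback>"

# Build a two-part message: a text body plus a gzipped attachment,
# mirroring the shape the handler expects (len(get_payload()) == 2).
msg = MIMEMultipart()
msg.attach(MIMEText("DMARC aggregate report attached."))
part = MIMEApplication(gzip.compress(xml_body), "gzip")
part.add_header("Content-Disposition", "attachment", filename="report.xml.gz")
msg.attach(part)

# Re-parse from bytes, as the handler does with the S3 object body.
parsed = email.message_from_bytes(msg.as_bytes())
attachment = parsed.get_payload()[1]

fname = attachment.get_filename()  # "report.xml.gz"
xmlname = fname[:-3]               # strip ".gz" -> "report.xml"
data = gzip.decompress(attachment.get_payload(decode=True))
print(xmlname, data == xml_body)   # report.xml True
```

Here `attachment.get_filename()` is used instead of hand-splitting the `Content-Disposition` header; both yield the attachment filename, but `get_filename()` also handles quoting for you.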