├── README.md
└── lambda_email_extractor.py
/README.md:
--------------------------------------------------------------------------------
AWS Lambda Attachment Extractor
===============================

This is a simple reference project for creating Lambda functions that manipulate email
files stored in an Amazon S3 bucket, typically by an inbound SES receipt rule. See the
[AWS documentation](http://docs.aws.amazon.com/ses/latest/DeveloperGuide/receiving-email.html) for more information.


Use cases
---------

The original use for this project was to process DMARC reports: fetching .ZIP and .GZIP attachments from emails stored in S3 and extracting the archives, resulting in .XML files, which are then uploaded to another S3 bucket for processing by another service.

The Python file doesn't perform security checks on attachments and will ignore anything that does not
result in an .XML file. The original email is deleted after processing takes place.

This type of processing may be very useful for extracting DMARC reports, which are sent by email service providers
to an email address specified in the _rua_ section of your DMARC DNS record; see https://dmarc.org/overview/. You
will then be left with a bucket of XML DMARC reports, which you may wish to process manually or upload to a service such
as https://dmarcian.com/dmarc-xml/.

Getting Started
---------------

To get started, simply upload the Python file to Lambda, changing the variables if required. This Lambda function expects an S3 put event to trigger it; see [this tutorial](http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html) for more information.
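For reference, the handler pulls the bucket name and the URL-decoded object key out of the S3 put event. A minimal sketch of such an event is shown below; the bucket and key names are placeholders, not real resources:

```python
from urllib.parse import unquote_plus

# A minimal sketch of the S3 put event the handler receives.
# The bucket and key names here are placeholders for illustration.
event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "my-inbound-mail-bucket"},
                "object": {"key": "inbound/report%40example.com"},
            }
        }
    ]
}

# The handler reads the bucket and URL-decoded key like this:
bucket = event["Records"][0]["s3"]["bucket"]["name"]
key = unquote_plus(event["Records"][0]["s3"]["object"]["key"])
print(bucket, key)  # my-inbound-mail-bucket inbound/report@example.com
```

Note that S3 URL-encodes object keys in event payloads, which is why the key must be decoded before it is used in API calls.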
--------------------------------------------------------------------------------
/lambda_email_extractor.py:
--------------------------------------------------------------------------------
import email
import gzip
import os
import zipfile
from urllib.parse import unquote_plus

import boto3

print('Loading function')

s3 = boto3.client('s3')
s3r = boto3.resource('s3')
xmlDir = "/tmp/output/"

outputBucket = ""  # Set here for a separate bucket, otherwise the event's bucket is used
outputPrefix = "xml/"  # Should end with /


def lambda_handler(event, context):
    global outputBucket

    bucket = event['Records'][0]['s3']['bucket']['name']
    key = unquote_plus(event['Records'][0]['s3']['object']['key'])

    try:
        # Set outputBucket if required
        if not outputBucket:
            outputBucket = bucket

        # Use a waiter to ensure the object is persisted
        waiter = s3.get_waiter('object_exists')
        waiter.wait(Bucket=bucket, Key=key)

        response = s3r.Bucket(bucket).Object(key)

        # Read the raw message body into an email Message object
        msg = email.message_from_bytes(response.get()["Body"].read())

        if len(msg.get_payload()) == 2:

            # Create a directory for the XML files (makes debugging easier)
            if not os.path.isdir(xmlDir):
                os.mkdir(xmlDir)

            # The first attachment
            attachment = msg.get_payload()[1]

            # Extract the attachment into /tmp/output
            extract_attachment(attachment)

            # Upload the XML files to S3
            upload_resulting_files_to_s3()

            # Delete the original email now that processing is complete
            delete_file(key, bucket)

        else:
            print("Could not see file/attachment.")

        return 0
    except Exception as e:
        print(e)
        print('Error getting object {} from bucket {}. Make sure they exist '
              'and your bucket is in the same region as this '
              'function.'.format(key, bucket))
        raise e


def extract_attachment(attachment):
    # Process filename.xml.gz attachments
    if "gzip" in attachment.get_content_type():
        contentdisp = attachment.get('Content-Disposition').split('=')
        fname = contentdisp[1].replace('"', '')
        with open('/tmp/' + fname, 'wb') as f:
            f.write(attachment.get_payload(decode=True))
        # This assumes we have filename.xml.gz; if we get this wrong, we will
        # just ignore the report
        xmlname = fname[:-3]
        with gzip.open('/tmp/' + fname, 'rb') as gz, open(xmlDir + xmlname, 'wb') as out:
            out.write(gz.read())

    # Process filename.zip attachments (providers not complying with the standard)
    elif "zip" in attachment.get_content_type():
        with open('/tmp/attachment.zip', 'wb') as f:
            f.write(attachment.get_payload(decode=True))
        with zipfile.ZipFile('/tmp/attachment.zip', "r") as z:
            z.extractall(xmlDir)

    else:
        print('Skipping ' + attachment.get_content_type())


def upload_resulting_files_to_s3():
    # Put all XML back into S3 (covers non-compliant cases where a ZIP contains multiple results)
    for fileName in os.listdir(xmlDir):
        if fileName.endswith(".xml"):
            print("Uploading: " + fileName)  # File name to upload
            s3r.meta.client.upload_file(xmlDir + fileName, outputBucket, outputPrefix + fileName)


# Delete the file in the source bucket
def delete_file(key, bucket):
    s3.delete_object(Bucket=bucket, Key=key)
    print("%s deleted from %s" % (key, bucket))

--------------------------------------------------------------------------------
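The gzip path in `extract_attachment` can be exercised locally without S3 or Lambda. The sketch below builds a two-part message with a gzipped XML attachment (the filename and XML payload are made up for illustration), re-parses it from bytes as the handler does with the S3 object body, and recovers the XML:

```python
import email
import gzip
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Hypothetical DMARC-style payload, made up for this example.
xml_body = b"<feedback><report_metadata/></feedback>"

# Build a two-part message: a text body plus a gzipped attachment,
# mirroring the shape the handler expects (len(get_payload()) == 2).
msg = MIMEMultipart()
msg.attach(MIMEText("DMARC aggregate report attached."))
part = MIMEApplication(gzip.compress(xml_body), "gzip")
part.add_header("Content-Disposition", "attachment", filename="report.xml.gz")
msg.attach(part)

# Re-parse from bytes, as the handler does with the S3 object body.
parsed = email.message_from_bytes(msg.as_bytes())
attachment = parsed.get_payload()[1]

fname = attachment.get_filename()  # "report.xml.gz"
xmlname = fname[:-3]               # strip ".gz" -> "report.xml"
data = gzip.decompress(attachment.get_payload(decode=True))
print(xmlname, data == xml_body)   # report.xml True
```

Here `attachment.get_filename()` is used instead of hand-splitting the `Content-Disposition` header; both yield the attachment filename, but `get_filename()` also handles quoting for you.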