├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── architecture.png
├── helper
    └── helper.py
├── lambda_function.py
└── template.yaml


/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
## Code of Conduct
This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
opensource-codeofconduct@amazon.com with any additional questions or comments.

--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
# Contributing Guidelines

Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
documentation, we greatly value feedback and contributions from our community.

Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
information to effectively respond to your bug report or contribution.


## Reporting Bugs/Feature Requests

We welcome you to use the GitHub issue tracker to report bugs or suggest features.

When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already
reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:

* A reproducible test case or series of steps
* The version of our code being used
* Any modifications you've made relevant to the bug
* Anything unusual about your environment or deployment


## Contributing via Pull Requests
Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:

1.
You are working against the latest source on the *main* branch.
2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
3. You open an issue to discuss any significant work - we would hate for your time to be wasted.

To send us a pull request, please:

1. Fork the repository.
2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
3. Ensure local tests pass.
4. Commit to your fork using clear commit messages.
5. Send us a pull request, answering any default questions in the pull request interface.
6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.

GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
[creating a pull request](https://help.github.com/articles/creating-a-pull-request/).


## Finding contributions to work on
Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start.


## Code of Conduct
This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
opensource-codeofconduct@amazon.com with any additional questions or comments.


## Security issue notifications
If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/).
Please do **not** create a public GitHub issue.


## Licensing

See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Synchronously process your receipt and invoice images using Amazon Textract AnalyzeExpense


## Introduction
Receipts and invoices are special types of documents that are critical to small and medium businesses (SMBs), startups, and enterprises for managing their accounts payable process.
These types of documents are difficult to process at scale because they follow no set design rules, yet any individual customer encounters thousands of distinct types of these documents. In this sample, you can use Amazon Textract to extract data from any invoice or receipt (in English) without machine learning (ML) experience, templates, or configuration. Amazon Textract extracts a set of standard fields and related line-item details from your receipts, and this solution provides pretty-printing capabilities to output this data in various formats, including .csv, .txt, and others.

## Architecture

![architecture](architecture.png)

At a high level, the solution architecture includes the following steps:
1. Sets up Amazon S3 source and output buckets for raw expense document images in PNG and JPG (JPEG) formats.
2. Configures an event rule in Amazon EventBridge, based on an event pattern, to match incoming S3 PutObject events in the S3 folder containing the raw expense document images.
3. The configured EventBridge event rule sends the event to an AWS Lambda function for further analysis and processing.
4. The AWS Lambda function reads the images from Amazon S3, calls the Amazon Textract AnalyzeExpense API, uses Amazon Textract Response Parser to deserialize the JSON response, uses Amazon Textract PrettyPrinter to print the parsed response, and stores the results back to Amazon S3 in different formats.

## Prerequisites
1. Python
2. AWS CLI

## Deployment

You can deploy this solution in two ways.

### Option 1 - Custom deployment
1. Download this Git repo to your local machine.
2. Install Amazon Textract Pretty Printer and Response Parser.
```
cd amazon-textract-analyzeexpense

python -m pip install --target=./ amazon-textract-response-parser

python -m pip install --target=./ amazon-textract-prettyprinter
```
3.
Create the Lambda function deployment package (.zip file).
```
zip -r archive.zip .
```
4. Next, copy the archive.zip file to an S3 bucket of your choice.
```
aws s3 cp archive.zip s3://my-source-bucket
```
If you need to create a new bucket in us-east-1:
>`aws s3api create-bucket --bucket my-source-bucket --region us-east-1`
>
For all other regions, add the `--create-bucket-configuration` option:
>`aws s3api create-bucket --bucket my-source-bucket --region us-west-2 --create-bucket-configuration LocationConstraint=us-west-2`

5. Open the template.yaml file, replace the S3 bucket name under the "EventConsumerFunction:" section with your S3 bucket name, and save.
6. Finally, deploy this solution.
```
aws cloudformation deploy --template-file template.yaml --stack-name myTextractAnalyzeExpense --capabilities CAPABILITY_NAMED_IAM
```


### Option 2 - Auto deploy using CloudFormation template
Alternatively, deploy this solution in `us-east-1` using a pre-configured CloudFormation template:
1. Choose Launch Stack to launch the stack in the US East (N. Virginia) Region:

[![cfnlaunchstack](https://s3.amazonaws.com/cloudformation-examples/cloudformation-launch-stack.png)](https://us-east-1.console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/new?stackName=textract-analyzexpense-demo&templateURL=https://aws-blogs-artifacts-public.s3.amazonaws.com/artifacts/ML-3911/Textract-Analyze-Expense-Demo.yaml)

2. Don’t make any changes to stack name or parameters.
3. In the Capabilities section, select I acknowledge that AWS CloudFormation might create IAM resources.
4. Choose Create stack.

## Test
To test the solution, upload receipt or invoice images to the Amazon S3 source bucket created by the CloudFormation template.
This triggers an event that invokes the AWS Lambda function, which calls the Amazon Textract AnalyzeExpense API, parses the response JSON, pretty-prints the parsed response (the provided function uses the `fancy_grid` table format), and stores the result as a .txt object in the output Amazon S3 bucket. You can extend the provided AWS Lambda function based on your requirements and change the output format to other types such as csv, tsv, grid, latex, and many more by setting the `table_format` argument when calling the `get_string` method of `textractprettyprinter.t_pretty_print_expense` in Amazon Textract PrettyPrinter.

## Clean up

Deleting the CloudFormation stack removes the Lambda functions, EventBridge rules, and other resources created by this solution. Ensure the S3 buckets are empty before attempting to delete this stack.

## License

This library is licensed under the MIT-0 License. See the [LICENSE](LICENSE) file.

--------------------------------------------------------------------------------
/architecture.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/amazon-textract-analyze-expense-processing-pipeline/01c9d125458abc38df8a2953e498d9d512a22f61/architecture.png
--------------------------------------------------------------------------------
/helper/helper.py:
--------------------------------------------------------------------------------
"""
Author: Manish Chugh

Helper file to assist with AWS and S3 reusable functions
"""
import boto3
from botocore.client import Config

class AWSHelper:

    def getResource(self, name, awsRegion=None):
        config = Config(
            retries = dict(
                max_attempts = 30
            )
        )

        if(awsRegion):
            return boto3.resource(name, region_name=awsRegion, config=config)
        else:
            return boto3.resource(name, config=config)

class S3Helper:

    @staticmethod
    def writeToS3(content, bucketName,
                  s3FileName, awsRegion=None):
        s3 = AWSHelper().getResource('s3', awsRegion)
        # Avoid shadowing the built-in name `object`
        s3_object = s3.Object(bucketName, s3FileName)
        s3_object.put(Body=content)

--------------------------------------------------------------------------------
/lambda_function.py:
--------------------------------------------------------------------------------
"""
Author: Manish Chugh

Lambda function for Amazon Textract AnalyzeExpense to help analyze expenses. This
function listens to S3 event notifications of every image upload to the source S3
bucket and generates Textract AnalyzeExpense output in easily digestible .txt
or any other output of your choice
"""

import os
import boto3
from helper.helper import S3Helper
from textractprettyprinter.t_pretty_print_expense import get_string
from textractprettyprinter.t_pretty_print_expense import Textract_Expense_Pretty_Print, Pretty_Print_Table_Format

"""
Main method for image processing and pretty printing the AnalyzeExpense Response
"""
def lambda_handler(event, context):

    """
    Initiate boto3 client for Amazon Textract
    """
    textract = boto3.client(service_name='textract')

    """
    Fetch source bucket and object key names from the S3 event notification event
    """
    s3_source_bucket_name = event['detail']['requestParameters']['bucketName']
    s3_request_file_name = event['detail']['requestParameters']['key']

    """
    Fetch the output S3 bucket name created by the CloudFormation template; the function stores Textract output in this bucket
    """
    s3_output_bucket_name = os.environ['outputs3bucketname']

    """
    textract analyze_expense call to synchronously process image file
    """
    try:
        response = textract.analyze_expense(
            Document={
                'S3Object': {
                    'Bucket': s3_source_bucket_name,
                    'Name': s3_request_file_name
                }
            })

        """
        Pretty print textract analyze
        expense response. Valid table_format values are - "csv",
        "plain","simple","github","grid","fancy_grid" ,"pipe","orgtbl","jira","presto",
        "pretty","psql","rst","mediawiki","moinmoin","youtrack","html","unsafehtml",
        "latex","latex_raw","latex_booktabs","latex_longtable","textile","tsv"
        """
        pretty_printed_string = get_string(textract_json=response, output_type=[Textract_Expense_Pretty_Print.SUMMARY, Textract_Expense_Pretty_Print.LINEITEMGROUPS], table_format=Pretty_Print_Table_Format.fancy_grid)

        """
        Save pretty print output to s3 output bucket. This code could be replaced to
        send this output to storage service of your choice too.
        """
        S3Helper.writeToS3(pretty_printed_string, s3_output_bucket_name, s3_request_file_name+"-analyzeexpenseresponse.txt")
        return "200"

    except Exception as e_raise:
        print(e_raise)
        raise e_raise

--------------------------------------------------------------------------------
/template.yaml:
--------------------------------------------------------------------------------
AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31
Description: TextractAnalyzeExpense


Resources:
  CloudTrailBucketPolicy:
    Type: AWS::S3::BucketPolicy
    Properties:
      Bucket:
        Ref: LoggingBucket
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          -
            Sid: "AWSCloudTrailAclCheck"
            Effect: "Allow"
            Principal:
              Service: "cloudtrail.amazonaws.com"
            Action: "s3:GetBucketAcl"
            Resource:
              !Sub |-
                arn:aws:s3:::${LoggingBucket}
          -
            Sid: "AWSCloudTrailWrite"
            Effect: "Allow"
            Principal:
              Service: "cloudtrail.amazonaws.com"
            Action: "s3:PutObject"
            Resource:
              !Sub |-
                arn:aws:s3:::${LoggingBucket}/AWSLogs/${AWS::AccountId}/*
            Condition:
              StringEquals:
                s3:x-amz-acl: "bucket-owner-full-control"
          - Sid: AllowSSLRequestsOnly
            Effect: Deny
            Principal:
"*" 39 | Action: "s3:*" 40 | Resource: 41 | - !GetAtt LoggingBucket.Arn 42 | - !Sub "${LoggingBucket.Arn}/*" 43 | Condition: 44 | Bool: 45 | "aws:SecureTransport": false 46 | 47 | SourceBucket: 48 | Type: AWS::S3::Bucket 49 | Properties: 50 | BucketEncryption: 51 | ServerSideEncryptionConfiguration: 52 | - ServerSideEncryptionByDefault: 53 | SSEAlgorithm: AES256 54 | AccessControl: BucketOwnerFullControl 55 | LifecycleConfiguration: 56 | Rules: 57 | - 58 | AbortIncompleteMultipartUpload: 59 | DaysAfterInitiation: 3 60 | NoncurrentVersionExpirationInDays: 3 61 | Status: Enabled 62 | PublicAccessBlockConfiguration: 63 | BlockPublicAcls: true 64 | BlockPublicPolicy: true 65 | IgnorePublicAcls: true 66 | RestrictPublicBuckets: true 67 | Tags: 68 | - 69 | Key: Description 70 | Value: Textract AnalyzeExpense Demo Bucket 71 | VersioningConfiguration: 72 | Status: Enabled 73 | 74 | SourceBucketPolicy: 75 | Type: AWS::S3::BucketPolicy 76 | Properties: 77 | Bucket: !Ref SourceBucket 78 | PolicyDocument: 79 | Version: "2012-10-17" 80 | Statement: 81 | - Sid: AllowSSLRequestsOnly 82 | Effect: Deny 83 | Principal: "*" 84 | Action: "s3:*" 85 | Resource: 86 | - !GetAtt SourceBucket.Arn 87 | - !Sub "${SourceBucket.Arn}/*" 88 | Condition: 89 | Bool: 90 | "aws:SecureTransport": false 91 | 92 | LoggingBucket: 93 | Type: AWS::S3::Bucket 94 | Properties: 95 | BucketEncryption: 96 | ServerSideEncryptionConfiguration: 97 | - ServerSideEncryptionByDefault: 98 | SSEAlgorithm: AES256 99 | AccessControl: BucketOwnerFullControl 100 | LifecycleConfiguration: 101 | Rules: 102 | - 103 | AbortIncompleteMultipartUpload: 104 | DaysAfterInitiation: 3 105 | NoncurrentVersionExpirationInDays: 3 106 | Status: Enabled 107 | PublicAccessBlockConfiguration: 108 | BlockPublicAcls: true 109 | BlockPublicPolicy: true 110 | IgnorePublicAcls: true 111 | RestrictPublicBuckets: true 112 | Tags: 113 | - 114 | Key: Description 115 | Value: Textract AnalyzeExpense Demo Bucket for CloudTrail and EventBridge 
            Integration with Amazon S3 and AWS Lambda
      VersioningConfiguration:
        Status: Enabled

  OutputBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: AES256
      AccessControl: BucketOwnerFullControl
      LifecycleConfiguration:
        Rules:
          -
            AbortIncompleteMultipartUpload:
              DaysAfterInitiation: 3
            NoncurrentVersionExpirationInDays: 3
            Status: Enabled
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
      Tags:
        -
          Key: Description
          Value: Textract AnalyzeExpense Demo Bucket for storing the Outputs
      VersioningConfiguration:
        Status: Enabled

  OutputBucketPolicy:
    Type: AWS::S3::BucketPolicy
    Properties:
      Bucket: !Ref OutputBucket
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Sid: AllowSSLRequestsOnly
            Effect: Deny
            Principal: "*"
            Action: "s3:*"
            Resource:
              - !GetAtt OutputBucket.Arn
              - !Sub "${OutputBucket.Arn}/*"
            Condition:
              Bool:
                "aws:SecureTransport": false

  EventCloudTrail:
    Type: AWS::CloudTrail::Trail
    DependsOn:
      - CloudTrailBucketPolicy
    Properties:
      TrailName: !Ref LoggingBucket
      S3BucketName:
        Ref: LoggingBucket
      IsLogging: true
      IsMultiRegionTrail: false
      EventSelectors:
        - IncludeManagementEvents: false
          DataResources:
            - Type: AWS::S3::Object
              Values:
                - !Sub |-
                    arn:aws:s3:::${SourceBucket}/
      IncludeGlobalServiceEvents: false

  AWSLambdaAnalyzeExpenseRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Join [ '', [ !Ref 'AWS::StackName', 'textractanalyzeexpense-lambda-role' ]]
      AssumeRolePolicyDocument:
        Version:
"2012-10-17" 189 | Statement: 190 | - Effect: Allow 191 | Principal: 192 | Service: 193 | - lambda.amazonaws.com 194 | - textract.amazonaws.com 195 | Action: 196 | - 'sts:AssumeRole' 197 | ManagedPolicyArns: 198 | - arn:aws:iam::aws:policy/AmazonTextractFullAccess 199 | - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole 200 | - arn:aws:iam::aws:policy/AWSLambdaExecute 201 | - !Ref LambdaPolicy 202 | 203 | LambdaPolicy: 204 | Type: 'AWS::IAM::ManagedPolicy' 205 | Properties: 206 | ManagedPolicyName: !Join [ '', [ !Ref 'AWS::StackName', 'textractanalyzeexpense-lambda-policy' ]] 207 | Description: 'Managed policy for lambdas' 208 | PolicyDocument: 209 | Version: '2012-10-17' 210 | Statement: 211 | - Effect: "Allow" 212 | Action: 213 | - "s3:PutObject*" 214 | - "s3:GetBucket*" 215 | - "s3:GetObject*" 216 | Resource: 217 | !Sub |- 218 | arn:aws:s3:::${SourceBucket}:/* 219 | - Effect: "Allow" 220 | Action: 221 | - "s3:PutObject*" 222 | - "s3:GetBucket*" 223 | - "s3:GetObject*" 224 | Resource: 225 | !Sub |- 226 | arn:aws:s3:::${OutputBucket}:/* 227 | 228 | EventConsumerFunction: 229 | Type: AWS::Lambda::Function 230 | Properties: 231 | Code: 232 | S3Bucket: my-source-bucket 233 | S3Key: archive.zip 234 | Handler: lambda_function.lambda_handler 235 | Runtime: python3.12 236 | Timeout: 20 237 | Environment: 238 | Variables: 239 | outputs3bucketname: !Sub ${OutputBucket} 240 | Role: 241 | Fn::GetAtt: 242 | - AWSLambdaAnalyzeExpenseRole 243 | - Arn 244 | 245 | EventRule: 246 | Type: AWS::Events::Rule 247 | Properties: 248 | Description: "EventRule" 249 | State: "ENABLED" 250 | EventPattern: 251 | source: 252 | - "aws.s3" 253 | detail-type: 254 | - "AWS API Call via CloudTrail" 255 | detail: 256 | eventSource: 257 | - "s3.amazonaws.com" 258 | eventName: 259 | - "PutObject" 260 | requestParameters: 261 | bucketName: 262 | - !Ref SourceBucket 263 | Targets: 264 | - 265 | Arn: 266 | Fn::GetAtt: 267 | - "EventConsumerFunction" 268 | - "Arn" 269 | Id: 
"EventConsumerFunctionTarget" 270 | 271 | PermissionForEventsToInvokeLambda: 272 | Type: AWS::Lambda::Permission 273 | Properties: 274 | FunctionName: 275 | Ref: "EventConsumerFunction" 276 | Action: "lambda:InvokeFunction" 277 | Principal: "events.amazonaws.com" 278 | SourceArn: 279 | Fn::GetAtt: 280 | - "EventRule" 281 | - "Arn" 282 | --------------------------------------------------------------------------------