├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── app.py ├── assets └── images │ └── step-function-flow.png ├── cdk.json ├── lambdas ├── __init__.py ├── analyze_function.py ├── classify_function.py ├── detect_function.py ├── newboto3 │ └── .gitkeep ├── post_analyze.py ├── post_detect.py ├── process_application.py ├── process_bank.py ├── process_document.py ├── process_payslip.py ├── s3_input_handler.py └── shared_constants.py ├── my_project ├── __init__.py └── my_project_stack.py ├── requirements-dev.txt ├── requirements.txt ├── source.bat ├── tests ├── __init__.py └── unit │ ├── __init__.py │ └── test_my_project_stack.py └── training ├── bank-statement.pdf ├── payslip.pdf ├── rental-application.pdf └── trainer.csv /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 
38 | 
39 | GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 
40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 
41 | 
42 | 
43 | ## Finding contributions to work on 
44 | Looking at the existing issues is a great way to find something to contribute to. As our projects use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 
45 | 
46 | 
47 | ## Code of Conduct 
48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 
49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 
50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 
51 | 
52 | 
53 | ## Security issue notifications 
54 | If you discover a potential security issue in this project, we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue. 
55 | 
56 | 
57 | ## Licensing 
58 | 
59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 
2 | 
3 | Permission is hereby granted, free of charge, to any person obtaining a copy of 
4 | this software and associated documentation files (the "Software"), to deal in 
5 | the Software without restriction, including without limitation the rights to 
6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 
7 | the Software, and to permit persons to whom the Software is furnished to do so. 
8 | 
9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 
10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 
11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 
12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 
13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 
14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
15 | 
16 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## AWS Sample: Textract Queries driven intelligent document processing
2 | 
3 | This repository serves as a sample of intelligent document processing using AWS AI services. It covers the following:
4 | 1. Set up the example in your AWS account using Infrastructure as Code (IaC) - the Cloud Development Kit (CDK)
5 | 2. The example uses fully managed serverless components - offloading undifferentiated heavy lifting
6 | 3. It uses the Textract AI service to extract data (including tables and forms) from uploaded multi-page documents such as PDFs and images
7 | 4. It classifies documents using a custom classification model in the Comprehend AI service (for example PAYSLIP, BANK statement, Rental APPLICATION)
8 | 5. Based on the classification, it asks a set of questions in natural language to extract key information. For example, in the case of PAYSLIP (a runnable sketch follows this list):
9 | ```
10 | Q. "What is the company's name"
11 | Q. "What is the company's ABN number"
12 | Q. "What is the employee's name"
13 | Q. "What is the annual salary"
14 | Q. "What is the monthly net pay"
15 | ```
16 | 6. The example is automatically triggered when a file is uploaded to the designated S3 bucket.
17 | 7. It is asynchronous and orchestrated using Step Functions - allowing documents to be processed in parallel.
18 | 8. Both the IaC CDK script and the Lambda handlers are written in Python, making it easy to follow/change/adapt for your needs.
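If you want to see what a single Queries request looks like before deploying anything, here is a minimal sketch using the synchronous `AnalyzeDocument` API against a single-page image (the deployed pipeline uses the asynchronous `StartDocumentAnalysis` variant instead; see `lambdas/analyze_function.py`). The file name `payslip.png` is a placeholder:
```
import boto3

# Minimal sketch: ask one natural-language question of a single-page
# image using the synchronous Textract API. "payslip.png" is a
# placeholder; the deployed stack uses start_document_analysis instead.
textract = boto3.client("textract")
with open("payslip.png", "rb") as document:
    response = textract.analyze_document(
        Document={"Bytes": document.read()},
        FeatureTypes=["QUERIES"],
        QueriesConfig={"Queries": [
            {"Text": "What is the employee's name", "Alias": "PAYSLIP_NAME_EMPLOYEE"}
        ]})

# Answers come back as QUERY_RESULT blocks
for block in response["Blocks"]:
    if block["BlockType"] == "QUERY_RESULT":
        print(block.get("Text", ""))
```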
"What is the company's name" 11 | Q. "What is the company's ABN number" 12 | Q. "What is the employee's name" 13 | Q. "What is the annual salary" 14 | Q. "What is the monthly net pay" 15 | ``` 16 | 5. The example is automatically trigger when a file is uploaded to the designated S3 bucket. 17 | 6. It is asynchronous and orchestrated using Step Functions - allowing for documents to be processed in parallel. 18 | 7. Both the IaC CDK script and the Lambda handlers are written in Python, making it easy to follow/change/adapt for your needs. 19 | 20 | ## Getting Started 21 | 1. It is recommended to use a Cloud9 environment as it comes bundled with AWS CLI and CDK pre-installed 22 | 2. Increase the size of the EBS volume attached to Cloud9 EC2 instance to 20GiB 23 | 24 | Once the Cloud9 environment is setup as above, clone the repository: 25 | ``` 26 | 'git clone https://github.com/aws-samples/amazon-textract-queries-example.git' 27 | ``` 28 | Setup a virtual environment so that all python dependencies are stored within the project directory: 29 | ``` 30 | Admin:~/environment/amazon-textract-queries-example (main) $ virtualenv .venv 31 | Admin:~/environment/amazon-textract-queries-example (main) $ source .venv/bin/activate 32 | (.venv) Admin:~/environment/amazon-textract-queries-example (main) $ pip install -r requirements.txt 33 | ``` 34 | To use the newly released Textract Queries feature, we need the bundled Lambda functions to use a new version of Python boto3 library, so install that: 35 | ``` 36 | (.venv) Admin:~/environment/amazon-textract-queries-example (main) $ pip install boto3 --target=lambdas/newboto3 37 | ``` 38 | 39 | ## Train a sample classifier 40 | 41 | Now we train a custom classification model in Comprehend. There are 3 sample files provided (payslip, bank-statement and rental-application): 42 | ``` 43 | (.venv) Admin:~/environment/amazon-textract-queries-example (main) $ ls training 44 | bank-statement.pdf payslip.pdf rental-application.pdf trainer.csv 45 | ``` 46 | The text from these files has been extracted and placed as a flattened single-line string along with its corresponding label in "trainer.csv". Lines have been duplicated in "trainer.csv" to have minimum entries required to train the model. 47 | 48 | To train the classifier: 49 | 1. Visit the [Comprehend console](https://console.aws.amazon.com/comprehend/v2/home#classification) (switch to the region you wish to use) 50 | 2. Click *Train classifier* 51 | 3. Give it a name and check the other details (the defaults are fine to start - use a Multi-class classifier) 52 | 4. Specify the S3 location of the training file (upload it first) 53 | 5. Click *Train classifier* 54 | 55 | Once it's trained (it will take a few minutes), start an endpoint by clicking *Create endpoint* from the classifier's console page. Note the ARN of the endpoint. Export it into your environment using the following variable: 56 | ``` 57 | (.venv) Admin:~/environment/amazon-textract-queries-example (main) $ export COMPREHEND_EP_ARN=arn:aws:comprehend:ap-southeast-2:NNNNNNNNNNNN:document-classifier-endpoint/classify-ep 58 | ``` 59 | 60 | ## Deploy the Sample 61 | 62 | We will be using CDK to deploy the application into your account. This is done using two simple commands: 63 | ``` 64 | 1. (.venv) Admin:~/environment/amazon-textract-queries-example (main) $ cdk bootstrap 65 | 2. (.venv) Admin:~/environment/amazon-textract-queries-example (main) $ cdk deploy 66 | ``` 67 | Note that `cdk bootstrap` is required just once for your AWS account. 
59 | 
60 | ## Deploy the Sample
61 | 
62 | We will use CDK to deploy the application into your account. This is done with two simple commands:
63 | ```
64 | 1. (.venv) Admin:~/environment/amazon-textract-queries-example (main) $ cdk bootstrap
65 | 2. (.venv) Admin:~/environment/amazon-textract-queries-example (main) $ cdk deploy
66 | ```
67 | Note that `cdk bootstrap` is required just once for your AWS account. After subsequent changes to the logic or IaC, running `cdk deploy` is sufficient.
68 | 
69 | ## Processing PDF files
70 | 
71 | After deployment is complete (typically 4 - 5 minutes) and the status of the Comprehend classification endpoint is "Active", upload one of the provided sample PDF files (for example, `payslip.pdf`) into the input S3 bucket (`myprojectstack-s3inputbucketNNNNN`) that has been created by CDK.
72 | 
73 | The set of questions that will be asked against a given document can be inspected here:
74 | ```
75 | (.venv) Admin:~/environment/amazon-textract-queries-example (main) $ grep -A 15 Payslip_Queries lambdas/shared_constants.py
76 | Payslip_Queries = [
77 |     {
78 |         "Text": "What is the company's name",
79 |         "Alias": "PAYSLIP_NAME_COMPANY"
80 |     },
81 |     {
82 |         "Text": "What is the company's ABN number",
83 |         "Alias": "PAYSLIP_COMPANY_ABN"
84 |     },
85 |     {
86 |         "Text": "What is the employee's name",
87 |         "Alias": "PAYSLIP_NAME_EMPLOYEE"
88 |     },
89 |     {
90 |         "Text": "What is the annual salary",
91 |         "Alias": "PAYSLIP_SALARY_ANNUAL"
92 | 
93 | (.venv) Admin:~/environment/amazon-textract-queries-example (main) $ grep -A 15 Bank_Queries lambdas/shared_constants.py
94 | Bank_Queries = [
95 |     {
96 |         "Text": "What is the name of the banking institution",
97 |         "Alias": "BANK_NAME"
98 |     },
99 |     {
100 |         "Text": "What is the name of the account holder",
101 |         "Alias": "BANK_HOLDER_NAME"
102 |     },
103 |     {
104 |         "Text": "What is the account number",
105 |         "Alias": "BANK_ACCOUNT_NUMBER"
106 |     },
107 |     {
108 |         "Text": "What is the current balance of the bank account",
109 |         "Alias": "BANK_ACCOUNT_BALANCE"
110 | ```
111 | When you upload the PDF, a [Step Functions](https://console.aws.amazon.com/states/home) execution will be triggered. View it to see the flow of the document, and check the [CloudWatch](https://console.aws.amazon.com/cloudwatch/home) log groups for the Lambda logs.
112 | 
113 | The answers to the questions for a given document will show up in the output S3 bucket (`myprojectstack-s3outputbucketNNNNN`) under the "_answers" folder when the corresponding Step Functions execution is complete.
114 | 
115 | The Step Functions flow that is executed is depicted below:
116 | ![Step Functions flow](/assets/images/step-function-flow.png)
117 | 
118 | You have now successfully run a sample of a scalable, serverless application stack, using CDK as IaC, that intelligently extracts answers from a document for questions expressed in natural language.
119 | 
120 | ## Resources and pricing
121 | 
122 | Note that for as long as you have the stack deployed, charges may apply to your account. You should delete the resources when you are done with the sample. WARNING: all the files in the S3 input (`myprojectstack-s3inputbucketNNNNN`) and output (`myprojectstack-s3outputbucketNNNNN`) buckets will be automatically deleted:
123 | ```
124 | (.venv) Admin:~/environment/amazon-textract-queries-example (main) $ cdk destroy
125 | ```
126 | You will also need to manually terminate the Comprehend endpoint and delete the Comprehend classifier, as well as the Cloud9 environment if you used one to run through the sample.
127 | 
128 | ## Security
129 | 
130 | See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.
131 | 
132 | ## License
133 | 
134 | This library is licensed under the MIT-0 License. See the LICENSE file.
135 | 136 | -------------------------------------------------------------------------------- /app.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | import os 3 | 4 | import aws_cdk as cdk 5 | 6 | from my_project.my_project_stack import MyProjectStack 7 | 8 | 9 | app = cdk.App() 10 | MyProjectStack(app, "MyProjectStack", 11 | # If you don't specify 'env', this stack will be environment-agnostic. 12 | # Account/Region-dependent features and context lookups will not work, 13 | # but a single synthesized template can be deployed anywhere. 14 | 15 | # Uncomment the next line to specialize this stack for the AWS Account 16 | # and Region that are implied by the current CLI configuration. 17 | 18 | #env=cdk.Environment(account=os.getenv('CDK_DEFAULT_ACCOUNT'), region=os.getenv('CDK_DEFAULT_REGION')), 19 | 20 | # Uncomment the next line if you know exactly what Account and Region you 21 | # want to deploy the stack to. */ 22 | 23 | #env=cdk.Environment(account='123456789012', region='us-east-1'), 24 | 25 | # For more information, see https://docs.aws.amazon.com/cdk/latest/guide/environments.html 26 | ) 27 | 28 | app.synth() 29 | -------------------------------------------------------------------------------- /assets/images/step-function-flow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-textract-queries-example/c7ee158086bbd2cad3d0b92d2a83895d1cb712c3/assets/images/step-function-flow.png -------------------------------------------------------------------------------- /cdk.json: -------------------------------------------------------------------------------- 1 | { 2 | "app": "python3 app.py", 3 | "watch": { 4 | "include": [ 5 | "**" 6 | ], 7 | "exclude": [ 8 | "README.md", 9 | "cdk*.json", 10 | "requirements*.txt", 11 | "source.bat", 12 | "**/__init__.py", 13 | "python/__pycache__", 14 | "tests" 15 | ] 16 | }, 17 | "context": { 18 | "@aws-cdk/aws-apigateway:usagePlanKeyOrderInsensitiveId": true, 19 | "@aws-cdk/core:stackRelativeExports": true, 20 | "@aws-cdk/aws-rds:lowercaseDbIdentifier": true, 21 | "@aws-cdk/aws-lambda:recognizeVersionProps": true, 22 | "@aws-cdk/aws-cloudfront:defaultSecurityPolicyTLSv1.2_2021": true 23 | } 24 | } 25 | -------------------------------------------------------------------------------- /lambdas/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-textract-queries-example/c7ee158086bbd2cad3d0b92d2a83895d1cb712c3/lambdas/__init__.py -------------------------------------------------------------------------------- /lambdas/analyze_function.py: -------------------------------------------------------------------------------- 1 | import os 2 | import boto3 3 | import shared_constants as sc 4 | import logging 5 | 6 | logger = logging.getLogger() 7 | logger.setLevel(logging.INFO) 8 | 9 | def get_task_token_path(key): 10 | prefix = sc.OUT_TASK_TOKEN_PREFIX 11 | return prefix + "/" + key + ".analyzeTaskToken"; 12 | 13 | def lookup_query_config(classification): 14 | if classification == "PAYSLIP": 15 | return sc.Payslip_Queries 16 | elif classification == "BANK": 17 | return sc.Bank_Queries 18 | elif classification == "APPLICATION": 19 | return sc.Application_Queries 20 | else: 21 | return [] 22 | 23 | 24 | # process a file upload 25 | def process_record(rec, event): 26 | token = event['token'] 27 | classification = 
event['input']['Classify']['classification'] 28 | query_config = lookup_query_config(classification) 29 | logger.info(query_config) 30 | input_bucket = rec['s3']['bucket']['name'] 31 | input_file = rec['s3']['object']['key'] 32 | request_id = rec['responseElements']['x-amz-request-id'] 33 | analyze_arn = os.environ[sc.ANALYZE_TOPIC_ARN] 34 | analyze_role = os.environ[sc.TEXTRACT_PUBLISH_ROLE] 35 | output_bucket = os.environ[sc.OUTPUT_BKT] 36 | output_prefix = sc.OUT_ANALYZE_PREFIX 37 | 38 | client = boto3.client('textract') 39 | response = client.start_document_analysis( 40 | DocumentLocation={ 41 | 'S3Object': { 42 | 'Bucket': input_bucket, 43 | 'Name': input_file 44 | } 45 | }, 46 | FeatureTypes=[ 47 | 'QUERIES' 48 | ], 49 | ClientRequestToken=request_id, 50 | NotificationChannel={ 51 | 'SNSTopicArn': analyze_arn, 52 | 'RoleArn': analyze_role 53 | }, 54 | OutputConfig={ 55 | 'S3Bucket': output_bucket, 56 | 'S3Prefix': output_prefix 57 | }, 58 | QueriesConfig={ 59 | 'Queries': query_config 60 | } 61 | ) 62 | s3_client = boto3.client('s3') 63 | s3_client.put_object(Bucket=output_bucket, Key=get_task_token_path(input_file), Body=token); 64 | 65 | # event handler 66 | def lambda_handler(event, context): 67 | # lets just dump it 68 | logger.info(event) 69 | 70 | for rec in event['input']['Records']: 71 | process_record(rec, event) 72 | 73 | return {"submit_status": "SUCCEEDED"} -------------------------------------------------------------------------------- /lambdas/classify_function.py: -------------------------------------------------------------------------------- 1 | import os 2 | import boto3 3 | import shared_constants as sc 4 | import logging 5 | 6 | logger = logging.getLogger() 7 | logger.setLevel(logging.INFO) 8 | 9 | def get_dump_path(key): 10 | prefix = sc.OUT_DUMP_PREFIX 11 | return prefix + "/" + key + ".dump.txt"; 12 | 13 | def lambda_handler(event, context): 14 | logger.info(event) 15 | output_bkt = os.environ[sc.OUTPUT_BKT] 16 | 17 | s3_obj_name = event['Detection']['DocumentLocation']['S3ObjectName'] 18 | job_id = event['Detection']['JobId'] 19 | textract = boto3.client('textract') 20 | response = textract.get_document_text_detection(JobId=job_id) 21 | 22 | # generate flat dump 23 | dump = '' 24 | for block in response['Blocks']: 25 | if block['BlockType'] == "LINE": 26 | dump = dump + block['Text'] + ' ' 27 | logger.info(dump) 28 | 29 | # dump to S3 30 | s3 = boto3.client('s3') 31 | s3.put_object(Bucket=output_bkt, Key=get_dump_path(s3_obj_name), Body=dump); 32 | 33 | # do sync classify 34 | comprehend = boto3.client('comprehend') 35 | comprehend_arn = os.environ[sc.COMPREHEND_EP_ARN] 36 | response = comprehend.classify_document(Text=dump, EndpointArn=comprehend_arn) 37 | classification = 'UNKNOWN' 38 | if len(response['Classes']) > 0: 39 | name = response['Classes'][0]['Name'] 40 | score = response['Classes'][0]['Score'] 41 | if score > 0.5: 42 | classification = name 43 | 44 | logger.info(classification) 45 | return {"classification" : classification} -------------------------------------------------------------------------------- /lambdas/detect_function.py: -------------------------------------------------------------------------------- 1 | import os 2 | import boto3 3 | import shared_constants as sc 4 | import logging 5 | 6 | logger = logging.getLogger() 7 | logger.setLevel(logging.INFO) 8 | 9 | def get_task_token_path(key): 10 | prefix = sc.OUT_TASK_TOKEN_PREFIX 11 | return prefix + "/" + key + ".taskToken"; 12 | 13 | # process a file upload 14 | def 
process_record(rec, token): 15 | logger.info(__name__) 16 | logger.info(rec) 17 | input_bucket = rec['s3']['bucket']['name'] 18 | input_file = rec['s3']['object']['key'] 19 | request_id = rec['responseElements']['x-amz-request-id'] 20 | detect_arn = os.environ[sc.DETECT_TOPIC_ARN] 21 | detect_role = os.environ[sc.TEXTRACT_PUBLISH_ROLE] 22 | output_bucket = os.environ[sc.OUTPUT_BKT] 23 | output_prefix = sc.OUT_DETECT_PREFIX 24 | 25 | client = boto3.client('textract') 26 | response = client.start_document_text_detection( 27 | DocumentLocation={ 28 | 'S3Object': { 29 | 'Bucket': input_bucket, 30 | 'Name': input_file 31 | } 32 | }, 33 | ClientRequestToken=request_id, 34 | NotificationChannel={ 35 | 'SNSTopicArn': detect_arn, 36 | 'RoleArn': detect_role 37 | }, 38 | OutputConfig={ 39 | 'S3Bucket': output_bucket, 40 | 'S3Prefix': output_prefix 41 | }) 42 | s3_client = boto3.client('s3') 43 | s3_client.put_object(Bucket=output_bucket, Key=get_task_token_path(input_file), Body=token); 44 | 45 | # event handler 46 | def lambda_handler(event, context): 47 | # lets just dump it 48 | logger.info(event) 49 | 50 | for rec in event['input']['Records']: 51 | process_record(rec, event['token']) 52 | 53 | return { 54 | "submit_status": "SUCCEEDED", 55 | } -------------------------------------------------------------------------------- /lambdas/newboto3/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-textract-queries-example/c7ee158086bbd2cad3d0b92d2a83895d1cb712c3/lambdas/newboto3/.gitkeep -------------------------------------------------------------------------------- /lambdas/post_analyze.py: -------------------------------------------------------------------------------- 1 | import os 2 | import boto3 3 | import json 4 | import shared_constants as sc 5 | import logging 6 | 7 | logger = logging.getLogger() 8 | logger.setLevel(logging.INFO) 9 | 10 | def get_task_token_path(key): 11 | prefix = sc.OUT_TASK_TOKEN_PREFIX 12 | return prefix + "/" + key + ".analyzeTaskToken"; 13 | 14 | def process_record(record): 15 | stepFunction = boto3.client('stepfunctions') 16 | s3_client = boto3.resource('s3') 17 | msg = record['Sns']['Message'] 18 | logger.info(msg) 19 | msg_dict = json.loads(msg) 20 | bucket = msg_dict['DocumentLocation']['S3Bucket'] 21 | key = msg_dict['DocumentLocation']['S3ObjectName'] 22 | logger.info("Completing analyze for " + bucket + "/" + key) 23 | output_bkt = os.environ[sc.OUTPUT_BKT] 24 | output_key = get_task_token_path(key) 25 | s3_object = s3_client.Object(output_bkt,output_key) 26 | token = s3_object.get()['Body'].read().decode("utf-8") 27 | logger.info(token) 28 | status = msg_dict['Status'] 29 | if status == "SUCCEEDED": 30 | logger.info("Success, resume SFSM") 31 | response = stepFunction.send_task_success(taskToken=token, 32 | output=msg) 33 | else: 34 | logger.info("Failed with " + status + ", failing SFSM") 35 | response = stepFunction.send_task_failure(taskToken=token, 36 | error=status, 37 | cause=msg) 38 | 39 | def lambda_handler(event, context): 40 | logger.info(event) 41 | for rec in event['Records']: 42 | process_record(rec) 43 | return event -------------------------------------------------------------------------------- /lambdas/post_detect.py: -------------------------------------------------------------------------------- 1 | import os 2 | import boto3 3 | import json 4 | import shared_constants as sc 5 | import logging 6 | 7 | logger = logging.getLogger() 8 | 
logger.setLevel(logging.INFO) 9 | 10 | def get_task_token_path(key): 11 | prefix = sc.OUT_TASK_TOKEN_PREFIX 12 | return prefix + "/" + key + ".taskToken"; 13 | 14 | def process_record(record): 15 | stepFunction = boto3.client('stepfunctions') 16 | s3_client = boto3.resource('s3') 17 | msg = record['Sns']['Message'] 18 | logger.info(msg) 19 | msg_dict = json.loads(msg) 20 | bucket = msg_dict['DocumentLocation']['S3Bucket'] 21 | key = msg_dict['DocumentLocation']['S3ObjectName'] 22 | logger.info("Completing detect for " + bucket + "/" + key) 23 | output_bkt = os.environ[sc.OUTPUT_BKT] 24 | output_key = get_task_token_path(key) 25 | s3_object = s3_client.Object(output_bkt,output_key) 26 | token = s3_object.get()['Body'].read().decode("utf-8") 27 | logger.info(token) 28 | status = msg_dict['Status'] 29 | if status == "SUCCEEDED": 30 | logger.info("Success, resume SFSM") 31 | response = stepFunction.send_task_success(taskToken=token, 32 | output=msg) 33 | else: 34 | logger.info("Failed with " + status + ", failing SFSM") 35 | response = stepFunction.send_task_failure(taskToken=token, 36 | error=status, 37 | cause=msg) 38 | 39 | def lambda_handler(event, context): 40 | logger.info(event) 41 | for rec in event['Records']: 42 | process_record(rec) 43 | return event -------------------------------------------------------------------------------- /lambdas/process_application.py: -------------------------------------------------------------------------------- 1 | import os 2 | import boto3 3 | import shared_constants as sc 4 | import logging 5 | 6 | logger = logging.getLogger() 7 | logger.setLevel(logging.INFO) 8 | 9 | def lambda_handler(event, context): 10 | # Do any further processing for rental-application here 11 | logger.info(event) 12 | return { "status" : "OK" } -------------------------------------------------------------------------------- /lambdas/process_bank.py: -------------------------------------------------------------------------------- 1 | import os 2 | import boto3 3 | import shared_constants as sc 4 | import logging 5 | 6 | logger = logging.getLogger() 7 | logger.setLevel(logging.INFO) 8 | 9 | def lambda_handler(event, context): 10 | # Do any further processing for bank-statement here 11 | logger.info(event) 12 | return { "status" : "OK" } -------------------------------------------------------------------------------- /lambdas/process_document.py: -------------------------------------------------------------------------------- 1 | import os 2 | import boto3 3 | import shared_constants as sc 4 | import logging 5 | 6 | logger = logging.getLogger() 7 | logger.setLevel(logging.INFO) 8 | 9 | def dump_to_s3(event, answers): 10 | output_bkt = os.environ[sc.OUTPUT_BKT] 11 | output_prefix = sc.OUT_ANSWERS_PREFIX 12 | key = event['Detection']['DocumentLocation']['S3ObjectName'] 13 | s3_obj_name = output_prefix + "/" + key + ".txt" 14 | s3 = boto3.client('s3') 15 | s3.put_object(Bucket=output_bkt, Key=s3_obj_name, Body=answers); 16 | 17 | def lambda_handler(event, context): 18 | logger.info(event) 19 | job_id = event['Analyze']['JobId'] 20 | textract = boto3.client('textract') 21 | response = textract.get_document_analysis(JobId=job_id) 22 | logger.info(response) 23 | 24 | # FIXME: This is a big hack 25 | # perhaps use a hash-table on IDs to store query-result 26 | answers = '' 27 | for block in response['Blocks']: 28 | if block["BlockType"] == "QUERY": 29 | if 'Relationships' not in block.keys(): 30 | continue 31 | logger.info(block["Query"]["Text"]) 32 | answers = answers + 
block["Query"]["Text"] + "|" + block["Query"]["Alias"] + "|" 33 | for result in response['Blocks']: 34 | if result["BlockType"] == "QUERY_RESULT": 35 | if block['Relationships'][0]['Ids'][0] == result['Id']: 36 | logger.info(result['Text'] + "\n") 37 | answers = answers + result['Text'] + "\n" 38 | break; 39 | # dump to S3 40 | dump_to_s3(event, answers) 41 | return { "status" : "OK" } -------------------------------------------------------------------------------- /lambdas/process_payslip.py: -------------------------------------------------------------------------------- 1 | import os 2 | import boto3 3 | import shared_constants as sc 4 | import logging 5 | 6 | logger = logging.getLogger() 7 | logger.setLevel(logging.INFO) 8 | 9 | def lambda_handler(event, context): 10 | # Do any further processing for payslip here 11 | logger.info(event) 12 | return { "status" : "OK" } -------------------------------------------------------------------------------- /lambdas/s3_input_handler.py: -------------------------------------------------------------------------------- 1 | import os 2 | import boto3 3 | import json 4 | import shared_constants as sc 5 | import logging 6 | 7 | logger = logging.getLogger() 8 | logger.setLevel(logging.INFO) 9 | 10 | def main(event, context): 11 | logger.info(event) 12 | logger.info(context) 13 | sm_arn = os.environ[sc.SF_SM_ARN] 14 | logger.info(sm_arn) 15 | stepFunction = boto3.client('stepfunctions') 16 | 17 | # Bail out for any non-supported formats 18 | for rec in event['Records']: 19 | input_file = rec['s3']['object']['key'] 20 | filename, file_extension = os.path.splitext(input_file) 21 | logger.info ("filename: " + filename + ". extension: " + file_extension) 22 | if file_extension not in sc.SUPPORTED_FILES: 23 | logger.warn("Unsupported file type: " + input_file) 24 | return { 'statusCode': 200, 'body': event } 25 | 26 | response = stepFunction.start_execution( 27 | stateMachineArn=sm_arn, 28 | input = json.dumps(event, indent=4), 29 | name=context.aws_request_id 30 | ) 31 | return { 32 | 'statusCode': 200, 33 | 'body': event 34 | } -------------------------------------------------------------------------------- /lambdas/shared_constants.py: -------------------------------------------------------------------------------- 1 | SF_SM_ARN = 'SF_SM_ARN' 2 | INPUT_BKT = 'INPUT_BKT' 3 | OUTPUT_BKT = 'OUTPUT_BKT' 4 | DETECT_TOPIC_ARN = 'DETECT_TOPIC_ARN' 5 | TEXTRACT_PUBLISH_ROLE = 'TEXTRACT_PUBLISH_ROLE' 6 | ANALYZE_TOPIC_ARN = 'ANALYZE_TOPIC_ARN' 7 | OUT_DETECT_PREFIX = '_detectText' 8 | OUT_ANALYZE_PREFIX = '_analyzeText' 9 | OUT_TASK_TOKEN_PREFIX = '_tasks' 10 | OUT_DUMP_PREFIX ='_dump' 11 | OUT_ANSWERS_PREFIX='_answers' 12 | COMPREHEND_EP_ARN = 'COMPREHEND_EP_ARN' 13 | 14 | SUPPORTED_FILES = [".pdf", ".png", ".jpg", ".jpeg", ".tiff"] 15 | 16 | Payslip_Queries = [ 17 | { 18 | "Text": "What is the company's name", 19 | "Alias": "PAYSLIP_NAME_COMPANY" 20 | }, 21 | { 22 | "Text": "What is the company's ABN number", 23 | "Alias": "PAYSLIP_COMPANY_ABN" 24 | }, 25 | { 26 | "Text": "What is the employee's name", 27 | "Alias": "PAYSLIP_NAME_EMPLOYEE" 28 | }, 29 | { 30 | "Text": "What is the annual salary", 31 | "Alias": "PAYSLIP_SALARY_ANNUAL" 32 | }, 33 | { 34 | "Text": "What is the monthly net pay", 35 | "Alias": "PAYSLIP_SALARY_MONTHLY" 36 | } 37 | ] 38 | 39 | Bank_Queries = [ 40 | { 41 | "Text": "What is the name of the banking institution", 42 | "Alias": "BANK_NAME" 43 | }, 44 | { 45 | "Text": "What is the name of the account holder", 46 | "Alias": "BANK_HOLDER_NAME" 
47 |     },
48 |     {
49 |         "Text": "What is the account number",
50 |         "Alias": "BANK_ACCOUNT_NUMBER"
51 |     },
52 |     {
53 |         "Text": "What is the current balance of the bank account",
54 |         "Alias": "BANK_ACCOUNT_BALANCE"
55 |     },
56 |     {
57 |         "Text": "What is the period for the bank statement",
58 |         "Alias": "BANK_STATEMENT_PERIOD"
59 |     },
60 |     {
61 |         "Text": "What is the largest credit transaction in the statement",
62 |         "Alias": "BANK_LARGEST_CREDIT"
63 |     },
64 |     {
65 |         "Text": "What is the largest debit transaction in the statement",
66 |         "Alias": "BANK_LARGEST_DEBIT"
67 |     },
68 | ]
69 | 
70 | Application_Queries = [
71 |     {
72 |         "Text": "What is the name of the real estate company",
73 |         "Alias": "APP_COMPANY_NAME"
74 |     },
75 |     {
76 |         "Text": "What is the name of the applicant or the prospective tenant",
77 |         "Alias": "APP_APPLICANT_NAME"
78 |     },
79 |     {
80 |         "Text": "What is the monthly rental amount",
81 |         "Alias": "APP_RENTAL_AMOUNT_MONTHLY"
82 |     },
83 |     {
84 |         "Text": "What is the weekly rental amount",
85 |         "Alias": "APP_RENTAL_AMOUNT_WEEKLY"
86 |     },
87 |     {
88 |         "Text": "What is the address of the property",
89 |         "Alias": "APP_PROPERTY_ADDRESS"
90 |     },
91 |     {
92 |         "Text": "What is the applicant's monthly net pay",
93 |         "Alias": "APP_MONTHLY_NET_PAY"
94 |     },
95 | ]
--------------------------------------------------------------------------------
/my_project/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/amazon-textract-queries-example/c7ee158086bbd2cad3d0b92d2a83895d1cb712c3/my_project/__init__.py
--------------------------------------------------------------------------------
/my_project/my_project_stack.py:
--------------------------------------------------------------------------------
1 | from aws_cdk.aws_lambda_python_alpha import PythonLayerVersion
2 | import sys
3 | sys.path.append('../')
4 | from lambdas import shared_constants as sc
5 | import os
6 | from aws_cdk import (
7 |     aws_lambda as _lambda,
8 |     RemovalPolicy,
9 |     aws_s3 as _s3,
10 |     aws_s3_notifications,
11 |     aws_iam as _iam,
12 |     aws_stepfunctions as _sfn,
13 |     aws_stepfunctions_tasks as _sfn_tasks,
14 |     aws_sns as _sns,
15 |     aws_sns_subscriptions as _sns_subscriptions,
16 |     Duration, Stack
17 | )
18 | from constructs import Construct
19 | 
20 | 
21 | class MyProjectStack(Stack):
22 | 
23 |     def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
24 |         super().__init__(scope, construct_id, **kwargs)
25 | 
26 | 
27 |         # Define a Lambda layer carrying the newer boto3 library
28 |         boto3_lambda_layer = PythonLayerVersion(
29 |             self, 'Boto3LambdaLayer',
30 |             entry='lambdas/newboto3',
31 |             compatible_runtimes=[_lambda.Runtime.PYTHON_3_9],
32 |             description='Boto3 Library'
33 |         )
34 | 
35 |         lambdaEnv = {}
36 |         comprehend_ep_arn = os.environ[sc.COMPREHEND_EP_ARN]
37 |         lambdaEnv[sc.COMPREHEND_EP_ARN] = comprehend_ep_arn
38 | 
39 |         # create s3 input bucket
40 |         s3InputBucket = _s3.Bucket(
41 |             self, "s3InputBucket", auto_delete_objects=True,
42 |             removal_policy=RemovalPolicy.DESTROY)
43 | 
44 |         # create s3 output bucket
45 |         s3OutputBucket = _s3.Bucket(
46 |             self, "s3OutputBucket", auto_delete_objects=True,
47 |             removal_policy=RemovalPolicy.DESTROY)
48 | 
49 |         lambdaEnv[sc.INPUT_BKT] = s3InputBucket.bucket_name
50 |         lambdaEnv[sc.OUTPUT_BKT] = s3OutputBucket.bucket_name
51 | 
52 |         # SNS topics to notify when text detection / analysis completes:
53 |         detect_topic = _sns.Topic(self, "textDetectTopic")
54 |         analyze_topic = _sns.Topic(self, "textAnalyzeTopic")
55 |
lambdaEnv[sc.DETECT_TOPIC_ARN] = detect_topic.topic_arn 56 | lambdaEnv[sc.ANALYZE_TOPIC_ARN] = analyze_topic.topic_arn 57 | 58 | # permissions for Textract to publish on SNS topic 59 | textract_role = _iam.Role(self, "TextractSNSPublishRole", 60 | assumed_by=_iam.ServicePrincipal("textract.amazonaws.com")) 61 | textract_role.add_to_policy( 62 | _iam.PolicyStatement(actions=["sts:AssumeRole"], 63 | resources=["*"])) 64 | textract_role.add_to_policy( 65 | _iam.PolicyStatement(actions=["sns:Publish"], 66 | resources=[detect_topic.topic_arn])) 67 | textract_role.add_to_policy( 68 | _iam.PolicyStatement(actions=["sns:Publish"], 69 | resources=[analyze_topic.topic_arn])) 70 | 71 | lambdaEnv[sc.TEXTRACT_PUBLISH_ROLE] = textract_role.role_arn 72 | 73 | # lambda to trigger the text detection 74 | detect_lambda = _lambda.Function(self, 'detectLambda', 75 | handler='detect_function.lambda_handler', 76 | runtime=_lambda.Runtime.PYTHON_3_9, 77 | environment=lambdaEnv, 78 | timeout=Duration.minutes(1), 79 | code=_lambda.Code.from_asset('./lambdas')) 80 | detect_lambda.role.add_managed_policy(_iam.ManagedPolicy.from_aws_managed_policy_name("AmazonTextractFullAccess")) 81 | s3InputBucket.grant_read(detect_lambda) 82 | s3OutputBucket.grant_read_write(detect_lambda) 83 | 84 | # watch detect-topic for post detection processing 85 | post_detect_lambda = _lambda.Function(self, 'postDetectLambda', 86 | handler='post_detect.lambda_handler', 87 | runtime=_lambda.Runtime.PYTHON_3_9, 88 | environment=lambdaEnv, 89 | timeout=Duration.minutes(1), 90 | code=_lambda.Code.from_asset('./lambdas')) 91 | detect_topic.add_subscription(_sns_subscriptions.LambdaSubscription(post_detect_lambda)); 92 | s3OutputBucket.grant_read_write(post_detect_lambda); 93 | 94 | # classify then analyze the document 95 | classify_lambda = _lambda.Function(self, 'classifyLambda', 96 | handler='classify_function.lambda_handler', 97 | runtime=_lambda.Runtime.PYTHON_3_9, 98 | environment=lambdaEnv, 99 | timeout=Duration.minutes(1), 100 | code=_lambda.Code.from_asset('./lambdas')) 101 | classify_lambda.role.add_managed_policy(_iam.ManagedPolicy.from_aws_managed_policy_name("AmazonTextractFullAccess")) 102 | classify_lambda.role.add_managed_policy(_iam.ManagedPolicy.from_aws_managed_policy_name("ComprehendReadOnly")) 103 | s3OutputBucket.grant_read_write(classify_lambda) 104 | 105 | # analyze the document 106 | analyze_lambda = _lambda.Function(self, 'analyzeLambda', 107 | handler='analyze_function.lambda_handler', 108 | runtime=_lambda.Runtime.PYTHON_3_9, 109 | layers=[boto3_lambda_layer], 110 | environment=lambdaEnv, 111 | timeout=Duration.minutes(1), 112 | code=_lambda.Code.from_asset('./lambdas')) 113 | analyze_lambda.role.add_managed_policy(_iam.ManagedPolicy.from_aws_managed_policy_name("AmazonTextractFullAccess")) 114 | s3InputBucket.grant_read(analyze_lambda) 115 | s3OutputBucket.grant_read_write(analyze_lambda) 116 | 117 | # watch analyze-topic for post analyze processing 118 | post_analyze_lambda = _lambda.Function(self, 'postAnalyzeLambda', 119 | handler='post_analyze.lambda_handler', 120 | runtime=_lambda.Runtime.PYTHON_3_9, 121 | environment=lambdaEnv, 122 | timeout=Duration.minutes(1), 123 | code=_lambda.Code.from_asset('./lambdas')) 124 | analyze_topic.add_subscription(_sns_subscriptions.LambdaSubscription(post_analyze_lambda)); 125 | s3OutputBucket.grant_read_write(post_analyze_lambda); 126 | 127 | 128 | process_document_l = _lambda.Function(self, 'processDocument', 129 | handler='process_document.lambda_handler', 130 | 
runtime=_lambda.Runtime.PYTHON_3_9, 131 | environment=lambdaEnv, 132 | layers=[boto3_lambda_layer], 133 | timeout=Duration.minutes(1), 134 | code=_lambda.Code.from_asset('./lambdas')) 135 | process_document_l.role.add_managed_policy(_iam.ManagedPolicy.from_aws_managed_policy_name("AmazonTextractFullAccess")) 136 | s3OutputBucket.grant_read_write(process_document_l) 137 | 138 | process_payslip_l = _lambda.Function(self, 'processPayslip', 139 | handler='process_payslip.lambda_handler', 140 | runtime=_lambda.Runtime.PYTHON_3_9, 141 | environment=lambdaEnv, 142 | timeout=Duration.minutes(1), 143 | code=_lambda.Code.from_asset('./lambdas')) 144 | process_bank_l = _lambda.Function(self, 'processBank', 145 | handler='process_bank.lambda_handler', 146 | runtime=_lambda.Runtime.PYTHON_3_9, 147 | environment=lambdaEnv, 148 | timeout=Duration.minutes(1), 149 | code=_lambda.Code.from_asset('./lambdas')) 150 | process_application_l = _lambda.Function(self, 'processApplication', 151 | handler='process_application.lambda_handler', 152 | runtime=_lambda.Runtime.PYTHON_3_9, 153 | environment=lambdaEnv, 154 | timeout=Duration.minutes(1), 155 | code=_lambda.Code.from_asset('./lambdas')) 156 | 157 | 158 | # Step functions Definition 159 | detect_job = _sfn_tasks.LambdaInvoke( 160 | self, "Detect Job", 161 | lambda_function=detect_lambda, 162 | result_path="$.Detection", 163 | integration_pattern=_sfn.IntegrationPattern.WAIT_FOR_TASK_TOKEN, 164 | payload=_sfn.TaskInput.from_object({ 165 | "token": _sfn.JsonPath.task_token, 166 | "input": _sfn.JsonPath.entire_payload})) 167 | 168 | classify_job = _sfn_tasks.LambdaInvoke( 169 | self, "Classify Job", 170 | lambda_function=classify_lambda, 171 | result_selector = { 172 | "classification.$":"$.Payload.classification"}, 173 | result_path="$.Classify") 174 | 175 | analyze_job = _sfn_tasks.LambdaInvoke( 176 | self, "Analyze Job", 177 | lambda_function=analyze_lambda, 178 | integration_pattern=_sfn.IntegrationPattern.WAIT_FOR_TASK_TOKEN, 179 | payload=_sfn.TaskInput.from_object({ 180 | "token": _sfn.JsonPath.task_token, 181 | "input": _sfn.JsonPath.entire_payload}), 182 | result_path="$.Analyze") 183 | 184 | fail_job = _sfn.Fail( 185 | self, "Fail", 186 | cause='AWS Batch Job Failed', 187 | error='DescribeJob returned FAILED') 188 | 189 | succeed_job = _sfn.Succeed( 190 | self, "Succeeded", 191 | comment='AWS Batch Job succeeded') 192 | 193 | 194 | process_document_task = _sfn_tasks.LambdaInvoke( 195 | self, "Process Document", 196 | lambda_function=process_document_l, 197 | result_selector = { 198 | "tq_status.$":"$.Payload.status" 199 | }, 200 | result_path="$.Process") 201 | 202 | process_payslip_task = _sfn_tasks.LambdaInvoke( 203 | self, "Process Payslip", 204 | lambda_function=process_payslip_l, 205 | result_selector = { 206 | "status.$":"$.Payload.status" 207 | }, 208 | result_path="$.Process").next(succeed_job) 209 | 210 | process_bank_task = _sfn_tasks.LambdaInvoke( 211 | self, "Process Bank", 212 | lambda_function=process_bank_l, 213 | result_selector = { 214 | "status.$":"$.Payload.status" 215 | }, 216 | result_path="$.Process").next(succeed_job) 217 | 218 | process_application_task = _sfn_tasks.LambdaInvoke( 219 | self, "Process Application", 220 | lambda_function=process_application_l, 221 | result_selector = { 222 | "status.$":"$.Payload.status" 223 | }, 224 | result_path="$.Process").next(succeed_job) 225 | 226 | # Create Chain 227 | definition = detect_job.next(classify_job).next( 228 | _sfn.Choice(self, 'Known Document Type?') 229 | 
.when(_sfn.Condition.string_equals('$.Classify.classification', 'UNKNOWN'), fail_job) 230 | .otherwise(analyze_job)) 231 | 232 | analyze_job.next(process_document_task).next( 233 | _sfn.Choice(self, 'Analyze Complete?') 234 | .when(_sfn.Condition.string_equals('$.Classify.classification', 'PAYSLIP'), process_payslip_task) 235 | .when(_sfn.Condition.string_equals('$.Classify.classification', 'BANK'), process_bank_task) 236 | .when(_sfn.Condition.string_equals('$.Classify.classification', 'APPLICATION'), process_application_task) 237 | .otherwise(fail_job)) 238 | 239 | # Create the state machine 240 | sm = _sfn.StateMachine( 241 | self, "StateMachine", 242 | definition=definition, 243 | timeout=Duration.minutes(60), 244 | ) 245 | lambdaEnv[sc.SF_SM_ARN] = sm.state_machine_arn 246 | 247 | # watch s3 input-bkt, function triggered at file upload 248 | s3_function = _lambda.Function( 249 | self, "s3_input_function", 250 | runtime=_lambda.Runtime.PYTHON_3_9, 251 | handler="s3_input_handler.main", 252 | environment=lambdaEnv, 253 | code=_lambda.Code.from_asset("./lambdas")) 254 | 255 | # create s3 notification that triggers lambda function 256 | notification = aws_s3_notifications.LambdaDestination(s3_function) 257 | 258 | # assign notification for the s3 event type (ex: OBJECT_CREATED) 259 | # the corresponding lambda function will kick start the step function 260 | s3InputBucket.add_event_notification(_s3.EventType.OBJECT_CREATED, notification) 261 | 262 | sm.grant_start_execution(s3_function) 263 | sm.grant_task_response(post_detect_lambda) 264 | sm.grant_task_response(post_analyze_lambda) -------------------------------------------------------------------------------- /requirements-dev.txt: -------------------------------------------------------------------------------- 1 | pytest==6.2.5 2 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | aws-cdk-lib==2.82.0 2 | constructs>=10.0.0,<11.0.0 3 | aws_cdk.aws_lambda_python_alpha==2.24.1a0 4 | -------------------------------------------------------------------------------- /source.bat: -------------------------------------------------------------------------------- 1 | @echo off 2 | 3 | rem The sole purpose of this script is to make the command 4 | rem 5 | rem source .venv/bin/activate 6 | rem 7 | rem (which activates a Python virtualenv on Linux or Mac OS X) work on Windows. 8 | rem On Windows, this command just runs this batch file (the argument is ignored). 9 | rem 10 | rem Now we don't need to document a Windows command for activating a virtualenv. 
11 | 12 | echo Executing .venv\Scripts\activate.bat for you 13 | .venv\Scripts\activate.bat 14 | -------------------------------------------------------------------------------- /tests/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-textract-queries-example/c7ee158086bbd2cad3d0b92d2a83895d1cb712c3/tests/__init__.py -------------------------------------------------------------------------------- /tests/unit/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-textract-queries-example/c7ee158086bbd2cad3d0b92d2a83895d1cb712c3/tests/unit/__init__.py -------------------------------------------------------------------------------- /tests/unit/test_my_project_stack.py: -------------------------------------------------------------------------------- 1 | import aws_cdk as core 2 | import aws_cdk.assertions as assertions 3 | 4 | from my_project.my_project_stack import MyProjectStack 5 | 6 | # example tests. To run these tests, uncomment this file along with the example 7 | # resource in my_project/my_project_stack.py 8 | def test_sqs_queue_created(): 9 | app = core.App() 10 | stack = MyProjectStack(app, "my-project") 11 | template = assertions.Template.from_stack(stack) 12 | 13 | # template.has_resource_properties("AWS::SQS::Queue", { 14 | # "VisibilityTimeout": 300 15 | # }) 16 | -------------------------------------------------------------------------------- /training/bank-statement.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-textract-queries-example/c7ee158086bbd2cad3d0b92d2a83895d1cb712c3/training/bank-statement.pdf -------------------------------------------------------------------------------- /training/payslip.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-textract-queries-example/c7ee158086bbd2cad3d0b92d2a83895d1cb712c3/training/payslip.pdf -------------------------------------------------------------------------------- /training/rental-application.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-textract-queries-example/c7ee158086bbd2cad3d0b92d2a83895d1cb712c3/training/rental-application.pdf -------------------------------------------------------------------------------- /training/trainer.csv: -------------------------------------------------------------------------------- 1 | PAYSLIP,"AnyCompany Heavy Industries PAYSLIP AnyCompany Heavy Industries, AU A.B.N 8347289294759 Employee Nikki Wolf Pay Date Jan 28 2021 Location Syd HQ Period Jan 1 2021 - Jan 31 2021 Occupation Snr Lab Tech Paycycle Monthly Hourly Rate $50.6073 Base Annual Salary $100,000.00 Description Rate Unit Amount Salary $50.6073 152 $ 7,692.31 Gross Package $ 7,692.31 Gross Pay $ 7,692.31 Tax $ (2,208.08) Net Pay $ 5,484.22 Payment Method Amount Bank BSB Account Number Ref Details EFT $ 5,484.22 032-701 618720910 YTD Details Leave Entitlements Gross Package $ 53,846.15 Accrued AL (Hours) 38.5 Gross Pay $ 53,846.15 Accrued PL (Hours) 19 Tax $ (15,456.58) Deductions After Tax Net Pay $ 38,389.57 Employer Super $ 3,500.00 Recent Superannuation Contribution Made by Aussiepay Date Paid For the month of Amount Example Corp. 
Super Fund Dec 22 2020 Nov 2020 $3,500.00 " 2 | BANK,"AnyCompany Fiduciary Every Day Savings Account Statement Account Holder Ms Nikki Wolf Customer ID BSB Account 7463278 032-701 618720910 Statement Summary Statement Period Dec 15 2020 - Jan 14 2021 Opening Balance $ 32,310.00 Total Credits $ 5,484.22 Total Debits $ (4,500.00) Closing Balance $ 33,294.22 Transactions Date Transaction Details Amount ($) Balance carried forward $ 32,310.00 28/12/2020 Deposit from AnyCompany Heavy Industries Corp AU $ 5,484.22 28/12/2020 TFR to Every Day Credit Card $ (2,000.00) 29/12/2020 The AnyCompany Rental Agency $ (2,500.00) " 3 | APPLICATION,"AnyCompany Realty Agent Details Property Manager Richard Roe Contact number 0425 555 296 Email rroe@example.anycompany-realty.com Fax Property Details Address 23 Operator St Willoughby NSW Postcode 2076 Property Rental Amount ($) $3,000 [ ] week [ X] monthly Tenancy details Property bond amount ($) $3,000 Tenancy start date 12th February, 2021 Tenancy term [X ] fixed [ ] periodic Fixed term in months 18 months Applicant Details (to be completed by applicant) Please complete 1 application per applicant [ ] Mr [ X ] Ms [ ] Mrs [ ] Dr [ ] Other ____________ Last Name Wolf Given Names Nikki Have you ever been known by any other name? [ ] Yes [ X ] No If Yes, what other names have you been known by? Driver's Licence Number 1237618 State NSW Passport Number 48762134 Country Australia Do you have any Dependents? [ X ] Yes [ ] No If yes, please provide details Name Age Meteo Wolf 9 Mary Wolf 7 Jorge Wolf 3 Do you have any Pets? [ X ] Yes [ ] No If yes, please provide details Type and Breed Registered? Dog - Labradoodle Yes Contact Details email pattismith@theinternet.com phone number m 04108756235 h NA w 04108576235 date of birth (dd/mm/yyyy) 18/09/1973 Current address 14 Edgar Cres Mona Vale postcode 2876 how long at this address 2 years 6 months Current landlord/agent details landlord/agent name Maria Garcia agency name (if applicable) The AnyCompany Rental Agency phone number 98341234 email address maria@example.anycompanyrentalagency.com.au Monthly rent $ $2,500 reason for leaving current address Need larger house for family Previous address 3/73 Broadside Ave Cremorne postcode 2761 how long at this address 8 years months landlord/agent details landlord/agent name Richard Roe agency name (if applicable) The AnyCompany Rental Agency phone number 98341234 email address jeromep@example.anycompanyrentalagency.com.au Monthly rent $ $1,500 reason for leaving this address Move from apartment to house required Employment History Are you employed? [ X ] Yes [ ] No Occupation Snr Research Lab technican Employment type [ X ] Full Time [ ] Part Time [ ] Casual Net weekly income $ $1,432.00 Name of Employer AnyCompany Heavy Industries Length of Employment 14 years Contact at Employer John Stiles, Head of R&D Phone number john@example.anycompanyheavyindustries.com.au Address of Employer 100 Old Big Road Willoughby NSW If Self Employed Business Name ABN How long have you been self-employed? Address of Business Accountant's Details Name phone number email address Students Are you a student? [ ] Yes [X ] No School/University/TAFE name Student ID Number Personal References Please do not list relatives or partners Name Martha Rivera Relationship Friend Contact details m 0410 55987 631 h w Name Shirley Rodriguez Relationship Friend Contact details m 0413 55498 419 h w Other History Have you ever been evicted by any agent/lessor? 
[ ] Yes [ X ] No Was your rental bond at your last address refunded in full? [ X ] Yes [ ] No If No, what deductions were made? Are you in debt to another agent/lessor? [ ] Yes [ X ] No If Yes, why are you in debt to another agent/lessor? Terms and Conditions I wish to undertake a tenancy for a period of 18 months to start on 12 Feb 2021 at a rental price of $3,000 I understand that I am to pay a rental bond of $3,000 to take possession of the presmises and sign a tenancy agreement The customer acknowledges that one application form has to be completed per person [ X ] Yes [ ] No The customer acknowledges that they have received the Privacy Policy of the agent [X ] Yes [ ] No Name of Applicant Signature Date " 4 | PAYSLIP,"AnyCompany Heavy Industries PAYSLIP AnyCompany Heavy Industries, AU A.B.N 8347289294759 Employee Nikki Wolf Pay Date Jan 28 2021 Location Syd HQ Period Jan 1 2021 - Jan 31 2021 Occupation Snr Lab Tech Paycycle Monthly Hourly Rate $50.6073 Base Annual Salary $100,000.00 Description Rate Unit Amount Salary $50.6073 152 $ 7,692.31 Gross Package $ 7,692.31 Gross Pay $ 7,692.31 Tax $ (2,208.08) Net Pay $ 5,484.22 Payment Method Amount Bank BSB Account Number Ref Details EFT $ 5,484.22 032-701 618720910 YTD Details Leave Entitlements Gross Package $ 53,846.15 Accrued AL (Hours) 38.5 Gross Pay $ 53,846.15 Accrued PL (Hours) 19 Tax $ (15,456.58) Deductions After Tax Net Pay $ 38,389.57 Employer Super $ 3,500.00 Recent Superannuation Contribution Made by Aussiepay Date Paid For the month of Amount Example Corp. Super Fund Dec 22 2020 Nov 2020 $3,500.00 " 5 | BANK,"AnyCompany Fiduciary Every Day Savings Account Statement Account Holder Ms Nikki Wolf Customer ID BSB Account 7463278 032-701 618720910 Statement Summary Statement Period Dec 15 2020 - Jan 14 2021 Opening Balance $ 32,310.00 Total Credits $ 5,484.22 Total Debits $ (4,500.00) Closing Balance $ 33,294.22 Transactions Date Transaction Details Amount ($) Balance carried forward $ 32,310.00 28/12/2020 Deposit from AnyCompany Heavy Industries Corp AU $ 5,484.22 28/12/2020 TFR to Every Day Credit Card $ (2,000.00) 29/12/2020 The AnyCompany Rental Agency $ (2,500.00) " 6 | APPLICATION,"AnyCompany Realty Agent Details Property Manager Richard Roe Contact number 0425 555 296 Email rroe@example.anycompany-realty.com Fax Property Details Address 23 Operator St Willoughby NSW Postcode 2076 Property Rental Amount ($) $3,000 [ ] week [ X] monthly Tenancy details Property bond amount ($) $3,000 Tenancy start date 12th February, 2021 Tenancy term [X ] fixed [ ] periodic Fixed term in months 18 months Applicant Details (to be completed by applicant) Please complete 1 application per applicant [ ] Mr [ X ] Ms [ ] Mrs [ ] Dr [ ] Other ____________ Last Name Wolf Given Names Nikki Have you ever been known by any other name? [ ] Yes [ X ] No If Yes, what other names have you been known by? Driver's Licence Number 1237618 State NSW Passport Number 48762134 Country Australia Do you have any Dependents? [ X ] Yes [ ] No If yes, please provide details Name Age Meteo Wolf 9 Mary Wolf 7 Jorge Wolf 3 Do you have any Pets? [ X ] Yes [ ] No If yes, please provide details Type and Breed Registered? 
Dog - Labradoodle Yes Contact Details email pattismith@theinternet.com phone number m 04108756235 h NA w 04108576235 date of birth (dd/mm/yyyy) 18/09/1973 Current address 14 Edgar Cres Mona Vale postcode 2876 how long at this address 2 years 6 months Current landlord/agent details landlord/agent name Maria Garcia agency name (if applicable) The AnyCompany Rental Agency phone number 98341234 email address maria@example.anycompanyrentalagency.com.au Monthly rent $ $2,500 reason for leaving current address Need larger house for family Previous address 3/73 Broadside Ave Cremorne postcode 2761 how long at this address 8 years months landlord/agent details landlord/agent name Richard Roe agency name (if applicable) The AnyCompany Rental Agency phone number 98341234 email address jeromep@example.anycompanyrentalagency.com.au Monthly rent $ $1,500 reason for leaving this address Move from apartment to house required Employment History Are you employed? [ X ] Yes [ ] No Occupation Snr Research Lab technican Employment type [ X ] Full Time [ ] Part Time [ ] Casual Net weekly income $ $1,432.00 Name of Employer AnyCompany Heavy Industries Length of Employment 14 years Contact at Employer John Stiles, Head of R&D Phone number john@example.anycompanyheavyindustries.com.au Address of Employer 100 Old Big Road Willoughby NSW If Self Employed Business Name ABN How long have you been self-employed? Address of Business Accountant's Details Name phone number email address Students Are you a student? [ ] Yes [X ] No School/University/TAFE name Student ID Number Personal References Please do not list relatives or partners Name Martha Rivera Relationship Friend Contact details m 0410 55987 631 h w Name Shirley Rodriguez Relationship Friend Contact details m 0413 55498 419 h w Other History Have you ever been evicted by any agent/lessor? [ ] Yes [ X ] No Was your rental bond at your last address refunded in full? [ X ] Yes [ ] No If No, what deductions were made? Are you in debt to another agent/lessor? [ ] Yes [ X ] No If Yes, why are you in debt to another agent/lessor? Terms and Conditions I wish to undertake a tenancy for a period of 18 months to start on 12 Feb 2021 at a rental price of $3,000 I understand that I am to pay a rental bond of $3,000 to take possession of the presmises and sign a tenancy agreement The customer acknowledges that one application form has to be completed per person [ X ] Yes [ ] No The customer acknowledges that they have received the Privacy Policy of the agent [X ] Yes [ ] No Name of Applicant Signature Date " 7 | --------------------------------------------------------------------------------