├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── app.py ├── assets └── images │ └── step-function-flow.png ├── cdk.json ├── lambdas ├── __init__.py ├── analyze_function.py ├── classify_function.py ├── detect_function.py ├── newboto3 │ └── .gitkeep ├── post_analyze.py ├── post_detect.py ├── process_application.py ├── process_bank.py ├── process_document.py ├── process_payslip.py ├── s3_input_handler.py └── shared_constants.py ├── my_project ├── __init__.py └── my_project_stack.py ├── requirements-dev.txt ├── requirements.txt ├── source.bat ├── tests ├── __init__.py └── unit │ ├── __init__.py │ └── test_my_project_stack.py └── training ├── bank-statement.pdf ├── payslip.pdf ├── rental-application.pdf └── trainer.csv /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 
38 | 
39 | GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 
40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 
41 | 
42 | 
43 | ## Finding contributions to work on 
44 | Looking at the existing issues is a great way to find something to contribute to. As our projects use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 
45 | 
46 | 
47 | ## Code of Conduct 
48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 
49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 
50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 
51 | 
52 | 
53 | ## Security issue notifications 
54 | If you discover a potential security issue in this project, we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue. 
55 | 
56 | 
57 | ## Licensing 
58 | 
59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 
2 | 
3 | Permission is hereby granted, free of charge, to any person obtaining a copy of 
4 | this software and associated documentation files (the "Software"), to deal in 
5 | the Software without restriction, including without limitation the rights to 
6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 
7 | the Software, and to permit persons to whom the Software is furnished to do so. 
8 | 
9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 
10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 
11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 
12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 
13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 
14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
15 | 
16 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## AWS Sample: Textract Queries driven intelligent document processing
2 | 
3 | This repository serves as a sample of intelligent document processing using AWS AI services. It covers the following:
4 | 1. Set up the example in your AWS account using Infrastructure as Code (IaC) - the Cloud Development Kit (CDK)
5 | 2. The example uses fully managed serverless components - offloading undifferentiated heavy lifting
6 | 3. It uses the Textract AI service to extract data (including tables and forms) from uploaded multi-page documents such as PDFs and images
7 | 4. It classifies documents using a custom classification model in the Comprehend AI service (for example PAYSLIP, BANK statement, Rental APPLICATION)
8 | 5. Based on the classification, it asks a set of questions in natural language to extract key information. For example, in the case of PAYSLIP (a runnable sketch follows this list):
9 | ```
10 | Q. "What is the company's name"
11 | Q. "What is the company's ABN number"
12 | Q. "What is the employee's name"
13 | Q. "What is the annual salary"
14 | Q. "What is the monthly net pay"
15 | ```
16 | 6. The example is automatically triggered when a file is uploaded to the designated S3 bucket.
17 | 7. It is asynchronous and orchestrated using Step Functions - allowing documents to be processed in parallel.
18 | 8. Both the IaC CDK script and the Lambda handlers are written in Python, making it easy to follow/change/adapt for your needs.
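If you want to see what a single Queries request looks like before deploying anything, here is a minimal sketch using the synchronous `AnalyzeDocument` API against a single-page image (the deployed pipeline uses the asynchronous `StartDocumentAnalysis` variant instead; see `lambdas/analyze_function.py`). The file name `payslip.png` is a placeholder:
```
import boto3

# Minimal sketch: ask one natural-language question of a single-page
# image using the synchronous Textract API. "payslip.png" is a
# placeholder; the deployed stack uses start_document_analysis instead.
textract = boto3.client("textract")
with open("payslip.png", "rb") as document:
    response = textract.analyze_document(
        Document={"Bytes": document.read()},
        FeatureTypes=["QUERIES"],
        QueriesConfig={"Queries": [
            {"Text": "What is the employee's name", "Alias": "PAYSLIP_NAME_EMPLOYEE"}
        ]})

# Answers come back as QUERY_RESULT blocks
for block in response["Blocks"]:
    if block["BlockType"] == "QUERY_RESULT":
        print(block.get("Text", ""))
```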
"What is the company's name" 11 | Q. "What is the company's ABN number" 12 | Q. "What is the employee's name" 13 | Q. "What is the annual salary" 14 | Q. "What is the monthly net pay" 15 | ``` 16 | 5. The example is automatically trigger when a file is uploaded to the designated S3 bucket. 17 | 6. It is asynchronous and orchestrated using Step Functions - allowing for documents to be processed in parallel. 18 | 7. Both the IaC CDK script and the Lambda handlers are written in Python, making it easy to follow/change/adapt for your needs. 19 | 20 | ## Getting Started 21 | 1. It is recommended to use a Cloud9 environment as it comes bundled with AWS CLI and CDK pre-installed 22 | 2. Increase the size of the EBS volume attached to Cloud9 EC2 instance to 20GiB 23 | 24 | Once the Cloud9 environment is setup as above, clone the repository: 25 | ``` 26 | 'git clone https://github.com/aws-samples/amazon-textract-queries-example.git' 27 | ``` 28 | Setup a virtual environment so that all python dependencies are stored within the project directory: 29 | ``` 30 | Admin:~/environment/amazon-textract-queries-example (main) $ virtualenv .venv 31 | Admin:~/environment/amazon-textract-queries-example (main) $ source .venv/bin/activate 32 | (.venv) Admin:~/environment/amazon-textract-queries-example (main) $ pip install -r requirements.txt 33 | ``` 34 | To use the newly released Textract Queries feature, we need the bundled Lambda functions to use a new version of Python boto3 library, so install that: 35 | ``` 36 | (.venv) Admin:~/environment/amazon-textract-queries-example (main) $ pip install boto3 --target=lambdas/newboto3 37 | ``` 38 | 39 | ## Train a sample classifier 40 | 41 | Now we train a custom classification model in Comprehend. There are 3 sample files provided (payslip, bank-statement and rental-application): 42 | ``` 43 | (.venv) Admin:~/environment/amazon-textract-queries-example (main) $ ls training 44 | bank-statement.pdf payslip.pdf rental-application.pdf trainer.csv 45 | ``` 46 | The text from these files has been extracted and placed as a flattened single-line string along with its corresponding label in "trainer.csv". Lines have been duplicated in "trainer.csv" to have minimum entries required to train the model. 47 | 48 | To train the classifier: 49 | 1. Visit the [Comprehend console](https://console.aws.amazon.com/comprehend/v2/home#classification) (switch to the region you wish to use) 50 | 2. Click *Train classifier* 51 | 3. Give it a name and check the other details (the defaults are fine to start - use a Multi-class classifier) 52 | 4. Specify the S3 location of the training file (upload it first) 53 | 5. Click *Train classifier* 54 | 55 | Once it's trained (it will take a few minutes), start an endpoint by clicking *Create endpoint* from the classifier's console page. Note the ARN of the endpoint. Export it into your environment using the following variable: 56 | ``` 57 | (.venv) Admin:~/environment/amazon-textract-queries-example (main) $ export COMPREHEND_EP_ARN=arn:aws:comprehend:ap-southeast-2:NNNNNNNNNNNN:document-classifier-endpoint/classify-ep 58 | ``` 59 | 60 | ## Deploy the Sample 61 | 62 | We will be using CDK to deploy the application into your account. This is done using two simple commands: 63 | ``` 64 | 1. (.venv) Admin:~/environment/amazon-textract-queries-example (main) $ cdk bootstrap 65 | 2. (.venv) Admin:~/environment/amazon-textract-queries-example (main) $ cdk deploy 66 | ``` 67 | Note that `cdk bootstrap` is required just once for your AWS account. 
59 | 
60 | ## Deploy the Sample
61 | 
62 | We will use CDK to deploy the application into your account. This is done with two simple commands:
63 | ```
64 | 1. (.venv) Admin:~/environment/amazon-textract-queries-example (main) $ cdk bootstrap
65 | 2. (.venv) Admin:~/environment/amazon-textract-queries-example (main) $ cdk deploy
66 | ```
67 | Note that `cdk bootstrap` is required just once for your AWS account. After subsequent changes to the logic or IaC, running `cdk deploy` is sufficient.
68 | 
69 | ## Processing PDF files
70 | 
71 | After deployment is complete (typically 4 - 5 minutes) and the status of the Comprehend classification endpoint is "Active", upload one of the provided sample PDF files (for example, `payslip.pdf`) into the input S3 bucket (`myprojectstack-s3inputbucketNNNNN`) that has been created by CDK.
72 | 
73 | The set of questions that will be asked against a given document can be inspected here:
74 | ```
75 | (.venv) Admin:~/environment/amazon-textract-queries-example (main) $ grep -A 15 Payslip_Queries lambdas/shared_constants.py
76 | Payslip_Queries = [
77 |     {
78 |         "Text": "What is the company's name",
79 |         "Alias": "PAYSLIP_NAME_COMPANY"
80 |     },
81 |     {
82 |         "Text": "What is the company's ABN number",
83 |         "Alias": "PAYSLIP_COMPANY_ABN"
84 |     },
85 |     {
86 |         "Text": "What is the employee's name",
87 |         "Alias": "PAYSLIP_NAME_EMPLOYEE"
88 |     },
89 |     {
90 |         "Text": "What is the annual salary",
91 |         "Alias": "PAYSLIP_SALARY_ANNUAL"
92 | 
93 | (.venv) Admin:~/environment/amazon-textract-queries-example (main) $ grep -A 15 Bank_Queries lambdas/shared_constants.py
94 | Bank_Queries = [
95 |     {
96 |         "Text": "What is the name of the banking institution",
97 |         "Alias": "BANK_NAME"
98 |     },
99 |     {
100 |         "Text": "What is the name of the account holder",
101 |         "Alias": "BANK_HOLDER_NAME"
102 |     },
103 |     {
104 |         "Text": "What is the account number",
105 |         "Alias": "BANK_ACCOUNT_NUMBER"
106 |     },
107 |     {
108 |         "Text": "What is the current balance of the bank account",
109 |         "Alias": "BANK_ACCOUNT_BALANCE"
110 | ```
111 | When you upload the PDF, a [Step Functions](https://console.aws.amazon.com/states/home) execution will be triggered. View it to see the flow of the document, and check the [CloudWatch](https://console.aws.amazon.com/cloudwatch/home) log groups for the Lambda logs.
112 | 
113 | The answers to the questions for a given document will show up in the output S3 bucket (`myprojectstack-s3outputbucketNNNNN`) under the "_answers" folder when the corresponding Step Functions execution is complete.
114 | 
115 | The Step Functions flow that is executed is depicted below:
116 | ![Step Functions flow](/assets/images/step-function-flow.png)
117 | 
118 | You have now successfully run a sample of a scalable, serverless application stack, using CDK as IaC, that intelligently extracts answers from a document for questions expressed in natural language.
119 | 
120 | ## Resources and pricing
121 | 
122 | Note that for as long as you have the stack deployed, charges may apply to your account. You should delete the resources when you are done with the sample. WARNING: all the files in the S3 input (`myprojectstack-s3inputbucketNNNNN`) and output (`myprojectstack-s3outputbucketNNNNN`) buckets will be automatically deleted:
123 | ```
124 | (.venv) Admin:~/environment/amazon-textract-queries-example (main) $ cdk destroy
125 | ```
126 | You will also need to manually terminate the Comprehend endpoint and delete the Comprehend classifier, as well as the Cloud9 environment if you used one to run through the sample.
127 | 
128 | ## Security
129 | 
130 | See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.
131 | 
132 | ## License
133 | 
134 | This library is licensed under the MIT-0 License. See the LICENSE file.
135 | 136 | -------------------------------------------------------------------------------- /app.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | import os 3 | 4 | import aws_cdk as cdk 5 | 6 | from my_project.my_project_stack import MyProjectStack 7 | 8 | 9 | app = cdk.App() 10 | MyProjectStack(app, "MyProjectStack", 11 | # If you don't specify 'env', this stack will be environment-agnostic. 12 | # Account/Region-dependent features and context lookups will not work, 13 | # but a single synthesized template can be deployed anywhere. 14 | 15 | # Uncomment the next line to specialize this stack for the AWS Account 16 | # and Region that are implied by the current CLI configuration. 17 | 18 | #env=cdk.Environment(account=os.getenv('CDK_DEFAULT_ACCOUNT'), region=os.getenv('CDK_DEFAULT_REGION')), 19 | 20 | # Uncomment the next line if you know exactly what Account and Region you 21 | # want to deploy the stack to. */ 22 | 23 | #env=cdk.Environment(account='123456789012', region='us-east-1'), 24 | 25 | # For more information, see https://docs.aws.amazon.com/cdk/latest/guide/environments.html 26 | ) 27 | 28 | app.synth() 29 | -------------------------------------------------------------------------------- /assets/images/step-function-flow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-textract-queries-example/c7ee158086bbd2cad3d0b92d2a83895d1cb712c3/assets/images/step-function-flow.png -------------------------------------------------------------------------------- /cdk.json: -------------------------------------------------------------------------------- 1 | { 2 | "app": "python3 app.py", 3 | "watch": { 4 | "include": [ 5 | "**" 6 | ], 7 | "exclude": [ 8 | "README.md", 9 | "cdk*.json", 10 | "requirements*.txt", 11 | "source.bat", 12 | "**/__init__.py", 13 | "python/__pycache__", 14 | "tests" 15 | ] 16 | }, 17 | "context": { 18 | "@aws-cdk/aws-apigateway:usagePlanKeyOrderInsensitiveId": true, 19 | "@aws-cdk/core:stackRelativeExports": true, 20 | "@aws-cdk/aws-rds:lowercaseDbIdentifier": true, 21 | "@aws-cdk/aws-lambda:recognizeVersionProps": true, 22 | "@aws-cdk/aws-cloudfront:defaultSecurityPolicyTLSv1.2_2021": true 23 | } 24 | } 25 | -------------------------------------------------------------------------------- /lambdas/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-textract-queries-example/c7ee158086bbd2cad3d0b92d2a83895d1cb712c3/lambdas/__init__.py -------------------------------------------------------------------------------- /lambdas/analyze_function.py: -------------------------------------------------------------------------------- 1 | import os 2 | import boto3 3 | import shared_constants as sc 4 | import logging 5 | 6 | logger = logging.getLogger() 7 | logger.setLevel(logging.INFO) 8 | 9 | def get_task_token_path(key): 10 | prefix = sc.OUT_TASK_TOKEN_PREFIX 11 | return prefix + "/" + key + ".analyzeTaskToken"; 12 | 13 | def lookup_query_config(classification): 14 | if classification == "PAYSLIP": 15 | return sc.Payslip_Queries 16 | elif classification == "BANK": 17 | return sc.Bank_Queries 18 | elif classification == "APPLICATION": 19 | return sc.Application_Queries 20 | else: 21 | return [] 22 | 23 | 24 | # process a file upload 25 | def process_record(rec, event): 26 | token = event['token'] 27 | classification = 
event['input']['Classify']['classification'] 28 | query_config = lookup_query_config(classification) 29 | logger.info(query_config) 30 | input_bucket = rec['s3']['bucket']['name'] 31 | input_file = rec['s3']['object']['key'] 32 | request_id = rec['responseElements']['x-amz-request-id'] 33 | analyze_arn = os.environ[sc.ANALYZE_TOPIC_ARN] 34 | analyze_role = os.environ[sc.TEXTRACT_PUBLISH_ROLE] 35 | output_bucket = os.environ[sc.OUTPUT_BKT] 36 | output_prefix = sc.OUT_ANALYZE_PREFIX 37 | 38 | client = boto3.client('textract') 39 | response = client.start_document_analysis( 40 | DocumentLocation={ 41 | 'S3Object': { 42 | 'Bucket': input_bucket, 43 | 'Name': input_file 44 | } 45 | }, 46 | FeatureTypes=[ 47 | 'QUERIES' 48 | ], 49 | ClientRequestToken=request_id, 50 | NotificationChannel={ 51 | 'SNSTopicArn': analyze_arn, 52 | 'RoleArn': analyze_role 53 | }, 54 | OutputConfig={ 55 | 'S3Bucket': output_bucket, 56 | 'S3Prefix': output_prefix 57 | }, 58 | QueriesConfig={ 59 | 'Queries': query_config 60 | } 61 | ) 62 | s3_client = boto3.client('s3') 63 | s3_client.put_object(Bucket=output_bucket, Key=get_task_token_path(input_file), Body=token); 64 | 65 | # event handler 66 | def lambda_handler(event, context): 67 | # lets just dump it 68 | logger.info(event) 69 | 70 | for rec in event['input']['Records']: 71 | process_record(rec, event) 72 | 73 | return {"submit_status": "SUCCEEDED"} -------------------------------------------------------------------------------- /lambdas/classify_function.py: -------------------------------------------------------------------------------- 1 | import os 2 | import boto3 3 | import shared_constants as sc 4 | import logging 5 | 6 | logger = logging.getLogger() 7 | logger.setLevel(logging.INFO) 8 | 9 | def get_dump_path(key): 10 | prefix = sc.OUT_DUMP_PREFIX 11 | return prefix + "/" + key + ".dump.txt"; 12 | 13 | def lambda_handler(event, context): 14 | logger.info(event) 15 | output_bkt = os.environ[sc.OUTPUT_BKT] 16 | 17 | s3_obj_name = event['Detection']['DocumentLocation']['S3ObjectName'] 18 | job_id = event['Detection']['JobId'] 19 | textract = boto3.client('textract') 20 | response = textract.get_document_text_detection(JobId=job_id) 21 | 22 | # generate flat dump 23 | dump = '' 24 | for block in response['Blocks']: 25 | if block['BlockType'] == "LINE": 26 | dump = dump + block['Text'] + ' ' 27 | logger.info(dump) 28 | 29 | # dump to S3 30 | s3 = boto3.client('s3') 31 | s3.put_object(Bucket=output_bkt, Key=get_dump_path(s3_obj_name), Body=dump); 32 | 33 | # do sync classify 34 | comprehend = boto3.client('comprehend') 35 | comprehend_arn = os.environ[sc.COMPREHEND_EP_ARN] 36 | response = comprehend.classify_document(Text=dump, EndpointArn=comprehend_arn) 37 | classification = 'UNKNOWN' 38 | if len(response['Classes']) > 0: 39 | name = response['Classes'][0]['Name'] 40 | score = response['Classes'][0]['Score'] 41 | if score > 0.5: 42 | classification = name 43 | 44 | logger.info(classification) 45 | return {"classification" : classification} -------------------------------------------------------------------------------- /lambdas/detect_function.py: -------------------------------------------------------------------------------- 1 | import os 2 | import boto3 3 | import shared_constants as sc 4 | import logging 5 | 6 | logger = logging.getLogger() 7 | logger.setLevel(logging.INFO) 8 | 9 | def get_task_token_path(key): 10 | prefix = sc.OUT_TASK_TOKEN_PREFIX 11 | return prefix + "/" + key + ".taskToken"; 12 | 13 | # process a file upload 14 | def 
process_record(rec, token): 15 | logger.info(__name__) 16 | logger.info(rec) 17 | input_bucket = rec['s3']['bucket']['name'] 18 | input_file = rec['s3']['object']['key'] 19 | request_id = rec['responseElements']['x-amz-request-id'] 20 | detect_arn = os.environ[sc.DETECT_TOPIC_ARN] 21 | detect_role = os.environ[sc.TEXTRACT_PUBLISH_ROLE] 22 | output_bucket = os.environ[sc.OUTPUT_BKT] 23 | output_prefix = sc.OUT_DETECT_PREFIX 24 | 25 | client = boto3.client('textract') 26 | response = client.start_document_text_detection( 27 | DocumentLocation={ 28 | 'S3Object': { 29 | 'Bucket': input_bucket, 30 | 'Name': input_file 31 | } 32 | }, 33 | ClientRequestToken=request_id, 34 | NotificationChannel={ 35 | 'SNSTopicArn': detect_arn, 36 | 'RoleArn': detect_role 37 | }, 38 | OutputConfig={ 39 | 'S3Bucket': output_bucket, 40 | 'S3Prefix': output_prefix 41 | }) 42 | s3_client = boto3.client('s3') 43 | s3_client.put_object(Bucket=output_bucket, Key=get_task_token_path(input_file), Body=token); 44 | 45 | # event handler 46 | def lambda_handler(event, context): 47 | # lets just dump it 48 | logger.info(event) 49 | 50 | for rec in event['input']['Records']: 51 | process_record(rec, event['token']) 52 | 53 | return { 54 | "submit_status": "SUCCEEDED", 55 | } -------------------------------------------------------------------------------- /lambdas/newboto3/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-textract-queries-example/c7ee158086bbd2cad3d0b92d2a83895d1cb712c3/lambdas/newboto3/.gitkeep -------------------------------------------------------------------------------- /lambdas/post_analyze.py: -------------------------------------------------------------------------------- 1 | import os 2 | import boto3 3 | import json 4 | import shared_constants as sc 5 | import logging 6 | 7 | logger = logging.getLogger() 8 | logger.setLevel(logging.INFO) 9 | 10 | def get_task_token_path(key): 11 | prefix = sc.OUT_TASK_TOKEN_PREFIX 12 | return prefix + "/" + key + ".analyzeTaskToken"; 13 | 14 | def process_record(record): 15 | stepFunction = boto3.client('stepfunctions') 16 | s3_client = boto3.resource('s3') 17 | msg = record['Sns']['Message'] 18 | logger.info(msg) 19 | msg_dict = json.loads(msg) 20 | bucket = msg_dict['DocumentLocation']['S3Bucket'] 21 | key = msg_dict['DocumentLocation']['S3ObjectName'] 22 | logger.info("Completing analyze for " + bucket + "/" + key) 23 | output_bkt = os.environ[sc.OUTPUT_BKT] 24 | output_key = get_task_token_path(key) 25 | s3_object = s3_client.Object(output_bkt,output_key) 26 | token = s3_object.get()['Body'].read().decode("utf-8") 27 | logger.info(token) 28 | status = msg_dict['Status'] 29 | if status == "SUCCEEDED": 30 | logger.info("Success, resume SFSM") 31 | response = stepFunction.send_task_success(taskToken=token, 32 | output=msg) 33 | else: 34 | logger.info("Failed with " + status + ", failing SFSM") 35 | response = stepFunction.send_task_failure(taskToken=token, 36 | error=status, 37 | cause=msg) 38 | 39 | def lambda_handler(event, context): 40 | logger.info(event) 41 | for rec in event['Records']: 42 | process_record(rec) 43 | return event -------------------------------------------------------------------------------- /lambdas/post_detect.py: -------------------------------------------------------------------------------- 1 | import os 2 | import boto3 3 | import json 4 | import shared_constants as sc 5 | import logging 6 | 7 | logger = logging.getLogger() 8 | 
logger.setLevel(logging.INFO) 9 | 10 | def get_task_token_path(key): 11 | prefix = sc.OUT_TASK_TOKEN_PREFIX 12 | return prefix + "/" + key + ".taskToken"; 13 | 14 | def process_record(record): 15 | stepFunction = boto3.client('stepfunctions') 16 | s3_client = boto3.resource('s3') 17 | msg = record['Sns']['Message'] 18 | logger.info(msg) 19 | msg_dict = json.loads(msg) 20 | bucket = msg_dict['DocumentLocation']['S3Bucket'] 21 | key = msg_dict['DocumentLocation']['S3ObjectName'] 22 | logger.info("Completing detect for " + bucket + "/" + key) 23 | output_bkt = os.environ[sc.OUTPUT_BKT] 24 | output_key = get_task_token_path(key) 25 | s3_object = s3_client.Object(output_bkt,output_key) 26 | token = s3_object.get()['Body'].read().decode("utf-8") 27 | logger.info(token) 28 | status = msg_dict['Status'] 29 | if status == "SUCCEEDED": 30 | logger.info("Success, resume SFSM") 31 | response = stepFunction.send_task_success(taskToken=token, 32 | output=msg) 33 | else: 34 | logger.info("Failed with " + status + ", failing SFSM") 35 | response = stepFunction.send_task_failure(taskToken=token, 36 | error=status, 37 | cause=msg) 38 | 39 | def lambda_handler(event, context): 40 | logger.info(event) 41 | for rec in event['Records']: 42 | process_record(rec) 43 | return event -------------------------------------------------------------------------------- /lambdas/process_application.py: -------------------------------------------------------------------------------- 1 | import os 2 | import boto3 3 | import shared_constants as sc 4 | import logging 5 | 6 | logger = logging.getLogger() 7 | logger.setLevel(logging.INFO) 8 | 9 | def lambda_handler(event, context): 10 | # Do any further processing for rental-application here 11 | logger.info(event) 12 | return { "status" : "OK" } -------------------------------------------------------------------------------- /lambdas/process_bank.py: -------------------------------------------------------------------------------- 1 | import os 2 | import boto3 3 | import shared_constants as sc 4 | import logging 5 | 6 | logger = logging.getLogger() 7 | logger.setLevel(logging.INFO) 8 | 9 | def lambda_handler(event, context): 10 | # Do any further processing for bank-statement here 11 | logger.info(event) 12 | return { "status" : "OK" } -------------------------------------------------------------------------------- /lambdas/process_document.py: -------------------------------------------------------------------------------- 1 | import os 2 | import boto3 3 | import shared_constants as sc 4 | import logging 5 | 6 | logger = logging.getLogger() 7 | logger.setLevel(logging.INFO) 8 | 9 | def dump_to_s3(event, answers): 10 | output_bkt = os.environ[sc.OUTPUT_BKT] 11 | output_prefix = sc.OUT_ANSWERS_PREFIX 12 | key = event['Detection']['DocumentLocation']['S3ObjectName'] 13 | s3_obj_name = output_prefix + "/" + key + ".txt" 14 | s3 = boto3.client('s3') 15 | s3.put_object(Bucket=output_bkt, Key=s3_obj_name, Body=answers); 16 | 17 | def lambda_handler(event, context): 18 | logger.info(event) 19 | job_id = event['Analyze']['JobId'] 20 | textract = boto3.client('textract') 21 | response = textract.get_document_analysis(JobId=job_id) 22 | logger.info(response) 23 | 24 | # FIXME: This is a big hack 25 | # perhaps use a hash-table on IDs to store query-result 26 | answers = '' 27 | for block in response['Blocks']: 28 | if block["BlockType"] == "QUERY": 29 | if 'Relationships' not in block.keys(): 30 | continue 31 | logger.info(block["Query"]["Text"]) 32 | answers = answers + 
block["Query"]["Text"] + "|" + block["Query"]["Alias"] + "|" 33 | for result in response['Blocks']: 34 | if result["BlockType"] == "QUERY_RESULT": 35 | if block['Relationships'][0]['Ids'][0] == result['Id']: 36 | logger.info(result['Text'] + "\n") 37 | answers = answers + result['Text'] + "\n" 38 | break; 39 | # dump to S3 40 | dump_to_s3(event, answers) 41 | return { "status" : "OK" } -------------------------------------------------------------------------------- /lambdas/process_payslip.py: -------------------------------------------------------------------------------- 1 | import os 2 | import boto3 3 | import shared_constants as sc 4 | import logging 5 | 6 | logger = logging.getLogger() 7 | logger.setLevel(logging.INFO) 8 | 9 | def lambda_handler(event, context): 10 | # Do any further processing for payslip here 11 | logger.info(event) 12 | return { "status" : "OK" } -------------------------------------------------------------------------------- /lambdas/s3_input_handler.py: -------------------------------------------------------------------------------- 1 | import os 2 | import boto3 3 | import json 4 | import shared_constants as sc 5 | import logging 6 | 7 | logger = logging.getLogger() 8 | logger.setLevel(logging.INFO) 9 | 10 | def main(event, context): 11 | logger.info(event) 12 | logger.info(context) 13 | sm_arn = os.environ[sc.SF_SM_ARN] 14 | logger.info(sm_arn) 15 | stepFunction = boto3.client('stepfunctions') 16 | 17 | # Bail out for any non-supported formats 18 | for rec in event['Records']: 19 | input_file = rec['s3']['object']['key'] 20 | filename, file_extension = os.path.splitext(input_file) 21 | logger.info ("filename: " + filename + ". extension: " + file_extension) 22 | if file_extension not in sc.SUPPORTED_FILES: 23 | logger.warn("Unsupported file type: " + input_file) 24 | return { 'statusCode': 200, 'body': event } 25 | 26 | response = stepFunction.start_execution( 27 | stateMachineArn=sm_arn, 28 | input = json.dumps(event, indent=4), 29 | name=context.aws_request_id 30 | ) 31 | return { 32 | 'statusCode': 200, 33 | 'body': event 34 | } -------------------------------------------------------------------------------- /lambdas/shared_constants.py: -------------------------------------------------------------------------------- 1 | SF_SM_ARN = 'SF_SM_ARN' 2 | INPUT_BKT = 'INPUT_BKT' 3 | OUTPUT_BKT = 'OUTPUT_BKT' 4 | DETECT_TOPIC_ARN = 'DETECT_TOPIC_ARN' 5 | TEXTRACT_PUBLISH_ROLE = 'TEXTRACT_PUBLISH_ROLE' 6 | ANALYZE_TOPIC_ARN = 'ANALYZE_TOPIC_ARN' 7 | OUT_DETECT_PREFIX = '_detectText' 8 | OUT_ANALYZE_PREFIX = '_analyzeText' 9 | OUT_TASK_TOKEN_PREFIX = '_tasks' 10 | OUT_DUMP_PREFIX ='_dump' 11 | OUT_ANSWERS_PREFIX='_answers' 12 | COMPREHEND_EP_ARN = 'COMPREHEND_EP_ARN' 13 | 14 | SUPPORTED_FILES = [".pdf", ".png", ".jpg", ".jpeg", ".tiff"] 15 | 16 | Payslip_Queries = [ 17 | { 18 | "Text": "What is the company's name", 19 | "Alias": "PAYSLIP_NAME_COMPANY" 20 | }, 21 | { 22 | "Text": "What is the company's ABN number", 23 | "Alias": "PAYSLIP_COMPANY_ABN" 24 | }, 25 | { 26 | "Text": "What is the employee's name", 27 | "Alias": "PAYSLIP_NAME_EMPLOYEE" 28 | }, 29 | { 30 | "Text": "What is the annual salary", 31 | "Alias": "PAYSLIP_SALARY_ANNUAL" 32 | }, 33 | { 34 | "Text": "What is the monthly net pay", 35 | "Alias": "PAYSLIP_SALARY_MONTHLY" 36 | } 37 | ] 38 | 39 | Bank_Queries = [ 40 | { 41 | "Text": "What is the name of the banking institution", 42 | "Alias": "BANK_NAME" 43 | }, 44 | { 45 | "Text": "What is the name of the account holder", 46 | "Alias": "BANK_HOLDER_NAME" 
47 |     },
48 |     {
49 |         "Text": "What is the account number",
50 |         "Alias": "BANK_ACCOUNT_NUMBER"
51 |     },
52 |     {
53 |         "Text": "What is the current balance of the bank account",
54 |         "Alias": "BANK_ACCOUNT_BALANCE"
55 |     },
56 |     {
57 |         "Text": "What is the period for the bank statement",
58 |         "Alias": "BANK_STATEMENT_PERIOD"
59 |     },
60 |     {
61 |         "Text": "What is the largest credit transaction in the statement",
62 |         "Alias": "BANK_LARGEST_CREDIT"
63 |     },
64 |     {
65 |         "Text": "What is the largest debit transaction in the statement",
66 |         "Alias": "BANK_LARGEST_DEBIT"
67 |     },
68 | ]
69 | 
70 | Application_Queries = [
71 |     {
72 |         "Text": "What is the name of the real estate company",
73 |         "Alias": "APP_COMPANY_NAME"
74 |     },
75 |     {
76 |         "Text": "What is the name of the applicant or the prospective tenant",
77 |         "Alias": "APP_APPLICANT_NAME"
78 |     },
79 |     {
80 |         "Text": "What is the monthly rental amount",
81 |         "Alias": "APP_RENTAL_AMOUNT_MONTHLY"
82 |     },
83 |     {
84 |         "Text": "What is the weekly rental amount",
85 |         "Alias": "APP_RENTAL_AMOUNT_WEEKLY"
86 |     },
87 |     {
88 |         "Text": "What is the address of the property",
89 |         "Alias": "APP_PROPERTY_ADDRESS"
90 |     },
91 |     {
92 |         "Text": "What is the applicant's monthly net pay",
93 |         "Alias": "APP_MONTHLY_NET_PAY"
94 |     },
95 | ]
--------------------------------------------------------------------------------
/my_project/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/amazon-textract-queries-example/c7ee158086bbd2cad3d0b92d2a83895d1cb712c3/my_project/__init__.py
--------------------------------------------------------------------------------
/my_project/my_project_stack.py:
--------------------------------------------------------------------------------
1 | from aws_cdk.aws_lambda_python_alpha import PythonLayerVersion
2 | import sys
3 | sys.path.append('../')
4 | from lambdas import shared_constants as sc
5 | import os
6 | from aws_cdk import (
7 |     aws_lambda as _lambda,
8 |     RemovalPolicy,
9 |     aws_s3 as _s3,
10 |     aws_s3_notifications,
11 |     aws_iam as _iam,
12 |     aws_stepfunctions as _sfn,
13 |     aws_stepfunctions_tasks as _sfn_tasks,
14 |     aws_sns as _sns,
15 |     aws_sns_subscriptions as _sns_subscriptions,
16 |     Duration, Stack
17 | )
18 | from constructs import Construct
19 | 
20 | 
21 | class MyProjectStack(Stack):
22 | 
23 |     def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
24 |         super().__init__(scope, construct_id, **kwargs)
25 | 
26 | 
27 |         # Define a Lambda layer carrying the newer boto3 library
28 |         boto3_lambda_layer = PythonLayerVersion(
29 |             self, 'Boto3LambdaLayer',
30 |             entry='lambdas/newboto3',
31 |             compatible_runtimes=[_lambda.Runtime.PYTHON_3_9],
32 |             description='Boto3 Library'
33 |         )
34 | 
35 |         lambdaEnv = {}
36 |         comprehend_ep_arn = os.environ[sc.COMPREHEND_EP_ARN]
37 |         lambdaEnv[sc.COMPREHEND_EP_ARN] = comprehend_ep_arn
38 | 
39 |         # create s3 input bucket
40 |         s3InputBucket = _s3.Bucket(
41 |             self, "s3InputBucket", auto_delete_objects=True,
42 |             removal_policy=RemovalPolicy.DESTROY)
43 | 
44 |         # create s3 output bucket
45 |         s3OutputBucket = _s3.Bucket(
46 |             self, "s3OutputBucket", auto_delete_objects=True,
47 |             removal_policy=RemovalPolicy.DESTROY)
48 | 
49 |         lambdaEnv[sc.INPUT_BKT] = s3InputBucket.bucket_name
50 |         lambdaEnv[sc.OUTPUT_BKT] = s3OutputBucket.bucket_name
51 | 
52 |         # SNS topics to notify when text detection / analysis completes:
53 |         detect_topic = _sns.Topic(self, "textDetectTopic")
54 |         analyze_topic = _sns.Topic(self, "textAnalyzeTopic")
55 |
lambdaEnv[sc.DETECT_TOPIC_ARN] = detect_topic.topic_arn 56 | lambdaEnv[sc.ANALYZE_TOPIC_ARN] = analyze_topic.topic_arn 57 | 58 | # permissions for Textract to publish on SNS topic 59 | textract_role = _iam.Role(self, "TextractSNSPublishRole", 60 | assumed_by=_iam.ServicePrincipal("textract.amazonaws.com")) 61 | textract_role.add_to_policy( 62 | _iam.PolicyStatement(actions=["sts:AssumeRole"], 63 | resources=["*"])) 64 | textract_role.add_to_policy( 65 | _iam.PolicyStatement(actions=["sns:Publish"], 66 | resources=[detect_topic.topic_arn])) 67 | textract_role.add_to_policy( 68 | _iam.PolicyStatement(actions=["sns:Publish"], 69 | resources=[analyze_topic.topic_arn])) 70 | 71 | lambdaEnv[sc.TEXTRACT_PUBLISH_ROLE] = textract_role.role_arn 72 | 73 | # lambda to trigger the text detection 74 | detect_lambda = _lambda.Function(self, 'detectLambda', 75 | handler='detect_function.lambda_handler', 76 | runtime=_lambda.Runtime.PYTHON_3_9, 77 | environment=lambdaEnv, 78 | timeout=Duration.minutes(1), 79 | code=_lambda.Code.from_asset('./lambdas')) 80 | detect_lambda.role.add_managed_policy(_iam.ManagedPolicy.from_aws_managed_policy_name("AmazonTextractFullAccess")) 81 | s3InputBucket.grant_read(detect_lambda) 82 | s3OutputBucket.grant_read_write(detect_lambda) 83 | 84 | # watch detect-topic for post detection processing 85 | post_detect_lambda = _lambda.Function(self, 'postDetectLambda', 86 | handler='post_detect.lambda_handler', 87 | runtime=_lambda.Runtime.PYTHON_3_9, 88 | environment=lambdaEnv, 89 | timeout=Duration.minutes(1), 90 | code=_lambda.Code.from_asset('./lambdas')) 91 | detect_topic.add_subscription(_sns_subscriptions.LambdaSubscription(post_detect_lambda)); 92 | s3OutputBucket.grant_read_write(post_detect_lambda); 93 | 94 | # classify then analyze the document 95 | classify_lambda = _lambda.Function(self, 'classifyLambda', 96 | handler='classify_function.lambda_handler', 97 | runtime=_lambda.Runtime.PYTHON_3_9, 98 | environment=lambdaEnv, 99 | timeout=Duration.minutes(1), 100 | code=_lambda.Code.from_asset('./lambdas')) 101 | classify_lambda.role.add_managed_policy(_iam.ManagedPolicy.from_aws_managed_policy_name("AmazonTextractFullAccess")) 102 | classify_lambda.role.add_managed_policy(_iam.ManagedPolicy.from_aws_managed_policy_name("ComprehendReadOnly")) 103 | s3OutputBucket.grant_read_write(classify_lambda) 104 | 105 | # analyze the document 106 | analyze_lambda = _lambda.Function(self, 'analyzeLambda', 107 | handler='analyze_function.lambda_handler', 108 | runtime=_lambda.Runtime.PYTHON_3_9, 109 | layers=[boto3_lambda_layer], 110 | environment=lambdaEnv, 111 | timeout=Duration.minutes(1), 112 | code=_lambda.Code.from_asset('./lambdas')) 113 | analyze_lambda.role.add_managed_policy(_iam.ManagedPolicy.from_aws_managed_policy_name("AmazonTextractFullAccess")) 114 | s3InputBucket.grant_read(analyze_lambda) 115 | s3OutputBucket.grant_read_write(analyze_lambda) 116 | 117 | # watch analyze-topic for post analyze processing 118 | post_analyze_lambda = _lambda.Function(self, 'postAnalyzeLambda', 119 | handler='post_analyze.lambda_handler', 120 | runtime=_lambda.Runtime.PYTHON_3_9, 121 | environment=lambdaEnv, 122 | timeout=Duration.minutes(1), 123 | code=_lambda.Code.from_asset('./lambdas')) 124 | analyze_topic.add_subscription(_sns_subscriptions.LambdaSubscription(post_analyze_lambda)); 125 | s3OutputBucket.grant_read_write(post_analyze_lambda); 126 | 127 | 128 | process_document_l = _lambda.Function(self, 'processDocument', 129 | handler='process_document.lambda_handler', 130 | 
runtime=_lambda.Runtime.PYTHON_3_9, 131 | environment=lambdaEnv, 132 | layers=[boto3_lambda_layer], 133 | timeout=Duration.minutes(1), 134 | code=_lambda.Code.from_asset('./lambdas')) 135 | process_document_l.role.add_managed_policy(_iam.ManagedPolicy.from_aws_managed_policy_name("AmazonTextractFullAccess")) 136 | s3OutputBucket.grant_read_write(process_document_l) 137 | 138 | process_payslip_l = _lambda.Function(self, 'processPayslip', 139 | handler='process_payslip.lambda_handler', 140 | runtime=_lambda.Runtime.PYTHON_3_9, 141 | environment=lambdaEnv, 142 | timeout=Duration.minutes(1), 143 | code=_lambda.Code.from_asset('./lambdas')) 144 | process_bank_l = _lambda.Function(self, 'processBank', 145 | handler='process_bank.lambda_handler', 146 | runtime=_lambda.Runtime.PYTHON_3_9, 147 | environment=lambdaEnv, 148 | timeout=Duration.minutes(1), 149 | code=_lambda.Code.from_asset('./lambdas')) 150 | process_application_l = _lambda.Function(self, 'processApplication', 151 | handler='process_application.lambda_handler', 152 | runtime=_lambda.Runtime.PYTHON_3_9, 153 | environment=lambdaEnv, 154 | timeout=Duration.minutes(1), 155 | code=_lambda.Code.from_asset('./lambdas')) 156 | 157 | 158 | # Step functions Definition 159 | detect_job = _sfn_tasks.LambdaInvoke( 160 | self, "Detect Job", 161 | lambda_function=detect_lambda, 162 | result_path="$.Detection", 163 | integration_pattern=_sfn.IntegrationPattern.WAIT_FOR_TASK_TOKEN, 164 | payload=_sfn.TaskInput.from_object({ 165 | "token": _sfn.JsonPath.task_token, 166 | "input": _sfn.JsonPath.entire_payload})) 167 | 168 | classify_job = _sfn_tasks.LambdaInvoke( 169 | self, "Classify Job", 170 | lambda_function=classify_lambda, 171 | result_selector = { 172 | "classification.$":"$.Payload.classification"}, 173 | result_path="$.Classify") 174 | 175 | analyze_job = _sfn_tasks.LambdaInvoke( 176 | self, "Analyze Job", 177 | lambda_function=analyze_lambda, 178 | integration_pattern=_sfn.IntegrationPattern.WAIT_FOR_TASK_TOKEN, 179 | payload=_sfn.TaskInput.from_object({ 180 | "token": _sfn.JsonPath.task_token, 181 | "input": _sfn.JsonPath.entire_payload}), 182 | result_path="$.Analyze") 183 | 184 | fail_job = _sfn.Fail( 185 | self, "Fail", 186 | cause='AWS Batch Job Failed', 187 | error='DescribeJob returned FAILED') 188 | 189 | succeed_job = _sfn.Succeed( 190 | self, "Succeeded", 191 | comment='AWS Batch Job succeeded') 192 | 193 | 194 | process_document_task = _sfn_tasks.LambdaInvoke( 195 | self, "Process Document", 196 | lambda_function=process_document_l, 197 | result_selector = { 198 | "tq_status.$":"$.Payload.status" 199 | }, 200 | result_path="$.Process") 201 | 202 | process_payslip_task = _sfn_tasks.LambdaInvoke( 203 | self, "Process Payslip", 204 | lambda_function=process_payslip_l, 205 | result_selector = { 206 | "status.$":"$.Payload.status" 207 | }, 208 | result_path="$.Process").next(succeed_job) 209 | 210 | process_bank_task = _sfn_tasks.LambdaInvoke( 211 | self, "Process Bank", 212 | lambda_function=process_bank_l, 213 | result_selector = { 214 | "status.$":"$.Payload.status" 215 | }, 216 | result_path="$.Process").next(succeed_job) 217 | 218 | process_application_task = _sfn_tasks.LambdaInvoke( 219 | self, "Process Application", 220 | lambda_function=process_application_l, 221 | result_selector = { 222 | "status.$":"$.Payload.status" 223 | }, 224 | result_path="$.Process").next(succeed_job) 225 | 226 | # Create Chain 227 | definition = detect_job.next(classify_job).next( 228 | _sfn.Choice(self, 'Known Document Type?') 229 | 
.when(_sfn.Condition.string_equals('$.Classify.classification', 'UNKNOWN'), fail_job) 230 | .otherwise(analyze_job)) 231 | 232 | analyze_job.next(process_document_task).next( 233 | _sfn.Choice(self, 'Analyze Complete?') 234 | .when(_sfn.Condition.string_equals('$.Classify.classification', 'PAYSLIP'), process_payslip_task) 235 | .when(_sfn.Condition.string_equals('$.Classify.classification', 'BANK'), process_bank_task) 236 | .when(_sfn.Condition.string_equals('$.Classify.classification', 'APPLICATION'), process_application_task) 237 | .otherwise(fail_job)) 238 | 239 | # Create the state machine 240 | sm = _sfn.StateMachine( 241 | self, "StateMachine", 242 | definition=definition, 243 | timeout=Duration.minutes(60), 244 | ) 245 | lambdaEnv[sc.SF_SM_ARN] = sm.state_machine_arn 246 | 247 | # watch s3 input-bkt, function triggered at file upload 248 | s3_function = _lambda.Function( 249 | self, "s3_input_function", 250 | runtime=_lambda.Runtime.PYTHON_3_9, 251 | handler="s3_input_handler.main", 252 | environment=lambdaEnv, 253 | code=_lambda.Code.from_asset("./lambdas")) 254 | 255 | # create s3 notification that triggers lambda function 256 | notification = aws_s3_notifications.LambdaDestination(s3_function) 257 | 258 | # assign notification for the s3 event type (ex: OBJECT_CREATED) 259 | # the corresponding lambda function will kick start the step function 260 | s3InputBucket.add_event_notification(_s3.EventType.OBJECT_CREATED, notification) 261 | 262 | sm.grant_start_execution(s3_function) 263 | sm.grant_task_response(post_detect_lambda) 264 | sm.grant_task_response(post_analyze_lambda) -------------------------------------------------------------------------------- /requirements-dev.txt: -------------------------------------------------------------------------------- 1 | pytest==6.2.5 2 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | aws-cdk-lib==2.82.0 2 | constructs>=10.0.0,<11.0.0 3 | aws_cdk.aws_lambda_python_alpha==2.24.1a0 4 | -------------------------------------------------------------------------------- /source.bat: -------------------------------------------------------------------------------- 1 | @echo off 2 | 3 | rem The sole purpose of this script is to make the command 4 | rem 5 | rem source .venv/bin/activate 6 | rem 7 | rem (which activates a Python virtualenv on Linux or Mac OS X) work on Windows. 8 | rem On Windows, this command just runs this batch file (the argument is ignored). 9 | rem 10 | rem Now we don't need to document a Windows command for activating a virtualenv. 
11 | 12 | echo Executing .venv\Scripts\activate.bat for you 13 | .venv\Scripts\activate.bat 14 | -------------------------------------------------------------------------------- /tests/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-textract-queries-example/c7ee158086bbd2cad3d0b92d2a83895d1cb712c3/tests/__init__.py -------------------------------------------------------------------------------- /tests/unit/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-textract-queries-example/c7ee158086bbd2cad3d0b92d2a83895d1cb712c3/tests/unit/__init__.py -------------------------------------------------------------------------------- /tests/unit/test_my_project_stack.py: -------------------------------------------------------------------------------- 1 | import aws_cdk as core 2 | import aws_cdk.assertions as assertions 3 | 4 | from my_project.my_project_stack import MyProjectStack 5 | 6 | # example tests. To run these tests, uncomment this file along with the example 7 | # resource in my_project/my_project_stack.py 8 | def test_sqs_queue_created(): 9 | app = core.App() 10 | stack = MyProjectStack(app, "my-project") 11 | template = assertions.Template.from_stack(stack) 12 | 13 | # template.has_resource_properties("AWS::SQS::Queue", { 14 | # "VisibilityTimeout": 300 15 | # }) 16 | -------------------------------------------------------------------------------- /training/bank-statement.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-textract-queries-example/c7ee158086bbd2cad3d0b92d2a83895d1cb712c3/training/bank-statement.pdf -------------------------------------------------------------------------------- /training/payslip.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-textract-queries-example/c7ee158086bbd2cad3d0b92d2a83895d1cb712c3/training/payslip.pdf -------------------------------------------------------------------------------- /training/rental-application.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-textract-queries-example/c7ee158086bbd2cad3d0b92d2a83895d1cb712c3/training/rental-application.pdf -------------------------------------------------------------------------------- /training/trainer.csv: -------------------------------------------------------------------------------- 1 | PAYSLIP,"AnyCompany Heavy Industries PAYSLIP AnyCompany Heavy Industries, AU A.B.N 8347289294759 Employee Nikki Wolf Pay Date Jan 28 2021 Location Syd HQ Period Jan 1 2021 - Jan 31 2021 Occupation Snr Lab Tech Paycycle Monthly Hourly Rate $50.6073 Base Annual Salary $100,000.00 Description Rate Unit Amount Salary $50.6073 152 $ 7,692.31 Gross Package $ 7,692.31 Gross Pay $ 7,692.31 Tax $ (2,208.08) Net Pay $ 5,484.22 Payment Method Amount Bank BSB Account Number Ref Details EFT $ 5,484.22 032-701 618720910 YTD Details Leave Entitlements Gross Package $ 53,846.15 Accrued AL (Hours) 38.5 Gross Pay $ 53,846.15 Accrued PL (Hours) 19 Tax $ (15,456.58) Deductions After Tax Net Pay $ 38,389.57 Employer Super $ 3,500.00 Recent Superannuation Contribution Made by Aussiepay Date Paid For the month of Amount Example Corp. 
Super Fund Dec 22 2020 Nov 2020 $3,500.00 " 2 | BANK,"AnyCompany Fiduciary Every Day Savings Account Statement Account Holder Ms Nikki Wolf Customer ID BSB Account 7463278 032-701 618720910 Statement Summary Statement Period Dec 15 2020 - Jan 14 2021 Opening Balance $ 32,310.00 Total Credits $ 5,484.22 Total Debits $ (4,500.00) Closing Balance $ 33,294.22 Transactions Date Transaction Details Amount ($) Balance carried forward $ 32,310.00 28/12/2020 Deposit from AnyCompany Heavy Industries Corp AU $ 5,484.22 28/12/2020 TFR to Every Day Credit Card $ (2,000.00) 29/12/2020 The AnyCompany Rental Agency $ (2,500.00) " 3 | APPLICATION,"AnyCompany Realty Agent Details Property Manager Richard Roe Contact number 0425 555 296 Email rroe@example.anycompany-realty.com Fax Property Details Address 23 Operator St Willoughby NSW Postcode 2076 Property Rental Amount ($) $3,000 [ ] week [ X] monthly Tenancy details Property bond amount ($) $3,000 Tenancy start date 12th February, 2021 Tenancy term [X ] fixed [ ] periodic Fixed term in months 18 months Applicant Details (to be completed by applicant) Please complete 1 application per applicant [ ] Mr [ X ] Ms [ ] Mrs [ ] Dr [ ] Other ____________ Last Name Wolf Given Names Nikki Have you ever been known by any other name? [ ] Yes [ X ] No If Yes, what other names have you been known by? Driver's Licence Number 1237618 State NSW Passport Number 48762134 Country Australia Do you have any Dependents? [ X ] Yes [ ] No If yes, please provide details Name Age Meteo Wolf 9 Mary Wolf 7 Jorge Wolf 3 Do you have any Pets? [ X ] Yes [ ] No If yes, please provide details Type and Breed Registered? Dog - Labradoodle Yes Contact Details email pattismith@theinternet.com phone number m 04108756235 h NA w 04108576235 date of birth (dd/mm/yyyy) 18/09/1973 Current address 14 Edgar Cres Mona Vale postcode 2876 how long at this address 2 years 6 months Current landlord/agent details landlord/agent name Maria Garcia agency name (if applicable) The AnyCompany Rental Agency phone number 98341234 email address maria@example.anycompanyrentalagency.com.au Monthly rent $ $2,500 reason for leaving current address Need larger house for family Previous address 3/73 Broadside Ave Cremorne postcode 2761 how long at this address 8 years months landlord/agent details landlord/agent name Richard Roe agency name (if applicable) The AnyCompany Rental Agency phone number 98341234 email address jeromep@example.anycompanyrentalagency.com.au Monthly rent $ $1,500 reason for leaving this address Move from apartment to house required Employment History Are you employed? [ X ] Yes [ ] No Occupation Snr Research Lab technican Employment type [ X ] Full Time [ ] Part Time [ ] Casual Net weekly income $ $1,432.00 Name of Employer AnyCompany Heavy Industries Length of Employment 14 years Contact at Employer John Stiles, Head of R&D Phone number john@example.anycompanyheavyindustries.com.au Address of Employer 100 Old Big Road Willoughby NSW If Self Employed Business Name ABN How long have you been self-employed? Address of Business Accountant's Details Name phone number email address Students Are you a student? [ ] Yes [X ] No School/University/TAFE name Student ID Number Personal References Please do not list relatives or partners Name Martha Rivera Relationship Friend Contact details m 0410 55987 631 h w Name Shirley Rodriguez Relationship Friend Contact details m 0413 55498 419 h w Other History Have you ever been evicted by any agent/lessor? 
[ ] Yes [ X ] No Was your rental bond at your last address refunded in full? [ X ] Yes [ ] No If No, what deductions were made? Are you in debt to another agent/lessor? [ ] Yes [ X ] No If Yes, why are you in debt to another agent/lessor? Terms and Conditions I wish to undertake a tenancy for a period of 18 months to start on 12 Feb 2021 at a rental price of $3,000 I understand that I am to pay a rental bond of $3,000 to take possession of the presmises and sign a tenancy agreement The customer acknowledges that one application form has to be completed per person [ X ] Yes [ ] No The customer acknowledges that they have received the Privacy Policy of the agent [X ] Yes [ ] No Name of Applicant Signature Date " 4 | PAYSLIP,"AnyCompany Heavy Industries PAYSLIP AnyCompany Heavy Industries, AU A.B.N 8347289294759 Employee Nikki Wolf Pay Date Jan 28 2021 Location Syd HQ Period Jan 1 2021 - Jan 31 2021 Occupation Snr Lab Tech Paycycle Monthly Hourly Rate $50.6073 Base Annual Salary $100,000.00 Description Rate Unit Amount Salary $50.6073 152 $ 7,692.31 Gross Package $ 7,692.31 Gross Pay $ 7,692.31 Tax $ (2,208.08) Net Pay $ 5,484.22 Payment Method Amount Bank BSB Account Number Ref Details EFT $ 5,484.22 032-701 618720910 YTD Details Leave Entitlements Gross Package $ 53,846.15 Accrued AL (Hours) 38.5 Gross Pay $ 53,846.15 Accrued PL (Hours) 19 Tax $ (15,456.58) Deductions After Tax Net Pay $ 38,389.57 Employer Super $ 3,500.00 Recent Superannuation Contribution Made by Aussiepay Date Paid For the month of Amount Example Corp. Super Fund Dec 22 2020 Nov 2020 $3,500.00 " 5 | BANK,"AnyCompany Fiduciary Every Day Savings Account Statement Account Holder Ms Nikki Wolf Customer ID BSB Account 7463278 032-701 618720910 Statement Summary Statement Period Dec 15 2020 - Jan 14 2021 Opening Balance $ 32,310.00 Total Credits $ 5,484.22 Total Debits $ (4,500.00) Closing Balance $ 33,294.22 Transactions Date Transaction Details Amount ($) Balance carried forward $ 32,310.00 28/12/2020 Deposit from AnyCompany Heavy Industries Corp AU $ 5,484.22 28/12/2020 TFR to Every Day Credit Card $ (2,000.00) 29/12/2020 The AnyCompany Rental Agency $ (2,500.00) " 6 | APPLICATION,"AnyCompany Realty Agent Details Property Manager Richard Roe Contact number 0425 555 296 Email rroe@example.anycompany-realty.com Fax Property Details Address 23 Operator St Willoughby NSW Postcode 2076 Property Rental Amount ($) $3,000 [ ] week [ X] monthly Tenancy details Property bond amount ($) $3,000 Tenancy start date 12th February, 2021 Tenancy term [X ] fixed [ ] periodic Fixed term in months 18 months Applicant Details (to be completed by applicant) Please complete 1 application per applicant [ ] Mr [ X ] Ms [ ] Mrs [ ] Dr [ ] Other ____________ Last Name Wolf Given Names Nikki Have you ever been known by any other name? [ ] Yes [ X ] No If Yes, what other names have you been known by? Driver's Licence Number 1237618 State NSW Passport Number 48762134 Country Australia Do you have any Dependents? [ X ] Yes [ ] No If yes, please provide details Name Age Meteo Wolf 9 Mary Wolf 7 Jorge Wolf 3 Do you have any Pets? [ X ] Yes [ ] No If yes, please provide details Type and Breed Registered? 
Dog - Labradoodle Yes Contact Details email pattismith@theinternet.com phone number m 04108756235 h NA w 04108576235 date of birth (dd/mm/yyyy) 18/09/1973 Current address 14 Edgar Cres Mona Vale postcode 2876 how long at this address 2 years 6 months Current landlord/agent details landlord/agent name Maria Garcia agency name (if applicable) The AnyCompany Rental Agency phone number 98341234 email address maria@example.anycompanyrentalagency.com.au Monthly rent $ $2,500 reason for leaving current address Need larger house for family Previous address 3/73 Broadside Ave Cremorne postcode 2761 how long at this address 8 years months landlord/agent details landlord/agent name Richard Roe agency name (if applicable) The AnyCompany Rental Agency phone number 98341234 email address jeromep@example.anycompanyrentalagency.com.au Monthly rent $ $1,500 reason for leaving this address Move from apartment to house required Employment History Are you employed? [ X ] Yes [ ] No Occupation Snr Research Lab technican Employment type [ X ] Full Time [ ] Part Time [ ] Casual Net weekly income $ $1,432.00 Name of Employer AnyCompany Heavy Industries Length of Employment 14 years Contact at Employer John Stiles, Head of R&D Phone number john@example.anycompanyheavyindustries.com.au Address of Employer 100 Old Big Road Willoughby NSW If Self Employed Business Name ABN How long have you been self-employed? Address of Business Accountant's Details Name phone number email address Students Are you a student? [ ] Yes [X ] No School/University/TAFE name Student ID Number Personal References Please do not list relatives or partners Name Martha Rivera Relationship Friend Contact details m 0410 55987 631 h w Name Shirley Rodriguez Relationship Friend Contact details m 0413 55498 419 h w Other History Have you ever been evicted by any agent/lessor? [ ] Yes [ X ] No Was your rental bond at your last address refunded in full? [ X ] Yes [ ] No If No, what deductions were made? Are you in debt to another agent/lessor? [ ] Yes [ X ] No If Yes, why are you in debt to another agent/lessor? Terms and Conditions I wish to undertake a tenancy for a period of 18 months to start on 12 Feb 2021 at a rental price of $3,000 I understand that I am to pay a rental bond of $3,000 to take possession of the presmises and sign a tenancy agreement The customer acknowledges that one application form has to be completed per person [ X ] Yes [ ] No The customer acknowledges that they have received the Privacy Policy of the agent [X ] Yes [ ] No Name of Applicant Signature Date " 7 | --------------------------------------------------------------------------------