├── .gitignore ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── requirements.txt ├── workflow1_endpointbuilder └── sam-app │ ├── functions │ ├── function1_createtableandfunction3trigger │ │ └── index.py │ ├── function2_inserttablerowswithobjectkeys │ │ └── index.py │ ├── function3_processobjectsandupdaterows │ │ ├── Dockerfile │ │ ├── index.py │ │ └── requirements.txt │ ├── function4_checktablestatus │ │ └── index.py │ ├── function5_createcsv │ │ └── index.py │ ├── function6_trainclassifier │ │ └── index.py │ ├── function7_checkclassifierstatus │ │ └── index.py │ ├── function8_buildendpoint │ │ └── index.py │ └── function9_checkendpointstatus │ │ └── index.py │ ├── statemachine │ └── endpointbuilder.asl.json │ └── template.yaml ├── workflow2_docsplitter ├── sam-app │ ├── functions │ │ ├── docsplitter_function │ │ │ ├── Dockerfile │ │ │ ├── index.py │ │ │ └── requirements.txt │ │ └── getalbinfo_function │ │ │ └── index.py │ └── template.yaml └── sample_request_folder │ ├── sample_local_request.py │ └── sample_s3_request.py └── workflow3_local ├── local_docsplitter.py └── local_endpointbuilder.py /.gitignore: -------------------------------------------------------------------------------- 1 | 2 | .aws-sam/ 3 | .python-version 4 | samconfig.toml 5 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 
29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | this software and associated documentation files (the "Software"), to deal in 5 | the Software without restriction, including without limitation the rights to 6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | the Software, and to permit persons to whom the Software is furnished to do so. 8 | 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
15 | 
16 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Intelligently Split Multi-Form Document Packages with Amazon Textract and Amazon Comprehend
2 | 
3 | ## Workflow 1: Endpoint Builder
4 | 
5 | 
6 | Blog: https://aws.amazon.com/blogs/machine-learning/intelligently-split-multi-form-document-packages-with-amazon-textract-and-amazon-comprehend/
7 | 
8 | Authors: Aditi Rajnish, Raj Pathak
9 | 
10 | #### Description
11 | Workflow 1 will take documents stored on Amazon S3 and send them through a series of steps to extract the data from the documents via Amazon Textract.
12 | Then, the extracted data will be used to create an Amazon Comprehend Custom Classification Endpoint.
13 | 
14 | #### Input:
15 | The Amazon S3 URI of the **folder** containing the training dataset files (these can be images, single-page PDFs, and/or multi-page PDFs).
16 | The structure of the folder must be:
17 | 
18 |     root dataset directory
19 |     ---- class directory
20 |     -------- files
21 | 
22 | or
23 | 
24 |     root dataset directory
25 |     ---- class directory
26 |     -------- nested subdirectories
27 |     ------------ files
28 | 
29 | Note that the names of the class subdirectories (2nd directory level) become the names of the classes used in the Amazon Comprehend custom classification model. For example, see the file structure below; the class for "form123.pdf" would be "tax_forms".
30 | 
31 |     training_dataset
32 |     ---- tax_forms
33 |     -------- page_1
34 |     ------------ form123.pdf
35 | 
36 | 
37 | #### Output:
38 | The ARN (Amazon Resource Name) of the model's endpoint, which can be used to classify documents in Workflow 2.
39 | 
40 | #### Usage:
41 | You must have Docker installed, configured and running. You must also have the AWS CLI and AWS SAM CLI installed and configured with your AWS credentials.
42 | Navigate to the workflow1_endpointbuilder/sam-app directory.
43 | Run 'sam build' and 'sam deploy --guided' in your CLI and provide the Stack Name and AWS Region (ensure that this is the same region as your Amazon S3 dataset folder). You must respond with "y" for the "Allow SAM CLI IAM role creation" and "Create managed ECR repositories for all functions?" prompts.
44 | This will deploy CloudFormation stacks that create the infrastructure for the Endpoint Builder state machine in AWS Step Functions.
45 | For example, see the CLI input below:
46 | 
47 |     (venv) Lambda_Topic_Classifier % cd workflow1_endpointbuilder/sam-app
48 |     (venv) sam-app % sam build
49 |     Build Succeeded
50 | 
51 |     (venv) sam-app % sam deploy --guided
52 |     Configuring SAM deploy
53 |     =========================================
54 |     Stack Name [sam-app]: endpointbuilder
55 |     AWS Region []: us-east-1
56 |     #Shows you resources changes to be deployed and require a 'Y' to initiate deploy
57 |     Confirm changes before deploy [y/N]: n
58 |     #SAM needs permission to be able to create roles to connect to the resources in your template
59 |     Allow SAM CLI IAM role creation [Y/n]: y
60 |     Save arguments to configuration file [Y/n]: n
61 | 
62 |     Looking for resources needed for deployment:
63 |     Creating the required resources...
64 |     Successfully created!
65 |     Managed S3 bucket: {your_bucket}
66 |     #Managed repositories will be deleted when their functions are removed from the template and deployed
67 |     Create managed ECR repositories for all functions? [Y/n]: y
68 | 
69 | and CLI output:
70 | 
71 |     CloudFormation outputs from deployed stack
72 |     -------------------------------------------------------------------------------------------------
73 |     Outputs
74 |     -------------------------------------------------------------------------------------------------
75 |     Key                 StateMachineArn
76 |     Description         Arn of the EndpointBuilder State Machine
77 |     Value               {state_machine_arn}
78 |     -------------------------------------------------------------------------------------------------
79 | 
80 | Find the corresponding state machine in AWS Step Functions and start a new execution, providing your Amazon S3 folder URI as a string in the `folder_uri` key-value pair.
81 | 
82 |     {
83 |       "folder_uri": "s3://{your_bucket_name}"
84 |     }
85 | 
86 | The state machine returns the Amazon Comprehend classifier's endpoint ARN as a string in the `endpoint_arn` key-value pair:
87 | 
88 |     {
89 |       "endpoint_arn": "{your_end_point_ARN}"
90 |     }
91 | 
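If you prefer to start the execution from code rather than the AWS Step Functions console, the snippet below is a minimal illustrative sketch (not part of the deployed stack) using boto3. Replace `{state_machine_arn}` with the StateMachineArn stack output and `{your_bucket_name}` with your dataset bucket, and make sure your AWS credentials are configured locally.

    import json
    import time
    import boto3

    sfn = boto3.client("stepfunctions")

    # Start the EndpointBuilder state machine with the dataset folder URI
    execution = sfn.start_execution(
        stateMachineArn="{state_machine_arn}",
        input=json.dumps({"folder_uri": "s3://{your_bucket_name}"}),
    )

    # Training a Comprehend classifier takes a while, so poll until the execution stops
    while True:
        result = sfn.describe_execution(executionArn=execution["executionArn"])
        if result["status"] != "RUNNING":
            break
        time.sleep(60)

    # On success, the execution output contains the endpoint ARN
    if result["status"] == "SUCCEEDED":
        print(json.loads(result["output"])["endpoint_arn"])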
92 | ## Workflow 2: Document Splitter API
93 | 
94 | #### Description
95 | Workflow 2 will take the endpoint you created in Workflow 1 and split the documents based on the classes with which the model has been trained.
96 | 
97 | #### Input:
98 | The endpoint ARN of the Amazon Comprehend custom classifier, the name of an existing Amazon S3 bucket you have access to, and one of the following:
99 | - Amazon S3 URI of the multi-page PDF document that needs to be split
100 | - local multi-page PDF file that needs to be split
101 | 
102 | #### Output:
103 | The Amazon S3 URI of a ZIP file of multi-class PDFs, each named after its class and containing the pages from the input PDF that belong to that class.
104 | 
105 | #### Usage:
106 | You must have Docker installed, configured and running. You must also have the AWS CLI and AWS SAM CLI installed and configured with your AWS credentials.
107 | Navigate to the workflow2_docsplitter/sam-app directory.
108 | Run 'sam build' and 'sam deploy --guided' in your CLI and provide the Stack Name and AWS Region (ensure that this is the same region as your Amazon S3 bucket). You must respond with "y" for the "Allow SAM CLI IAM role creation" and "Create managed ECR repositories for all functions?" prompts.
109 | This will deploy CloudFormation stacks that create the infrastructure for the Application Load Balancer and its Lambda function target, where you can send document splitting requests.
110 | For example, see the CLI input below:
111 | 
112 |     (venv) Lambda_Topic_Classifier % cd workflow2_docsplitter/sam-app
113 |     (venv) sam-app % sam build
114 |     Build Succeeded
115 | 
116 |     (venv) sam-app % sam deploy --guided
117 |     Configuring SAM deploy
118 |     =========================================
119 |     Stack Name [sam-app]: docsplitter
120 |     AWS Region []: us-east-1
121 |     #Shows you resources changes to be deployed and require a 'Y' to initiate deploy
122 |     Confirm changes before deploy [y/N]: n
123 |     #SAM needs permission to be able to create roles to connect to the resources in your template
124 |     Allow SAM CLI IAM role creation [Y/n]: y
125 |     Save arguments to configuration file [Y/n]: n
126 | 
127 |     Looking for resources needed for deployment:
128 |     Managed S3 bucket: {bucket_name}
129 |     #Managed repositories will be deleted when their functions are removed from the template and deployed
130 |     Create managed ECR repositories for all functions? [Y/n]: y
131 | 
132 | and CLI output:
133 | 
134 |     CloudFormation outputs from deployed stack
135 |     -------------------------------------------------------------------------------------------------
136 |     Outputs
137 |     -------------------------------------------------------------------------------------------------
138 |     Key                 LoadBalancerDnsName
139 |     Description         DNS Name of the DocSplitter application load balancer with the Lambda target group.
140 |     Value               {ALB_DNS}
141 |     -------------------------------------------------------------------------------------------------
142 | 
143 | Send a POST request to the outputted LoadBalancerDnsName, providing your Amazon Comprehend endpoint ARN, Amazon S3 bucket name, and either of the following:
144 | - Amazon S3 URI of the multi-page PDF document that needs to be split
145 | - local multi-page PDF file that needs to be split
146 | 
147 | See below for the POST request input format.
148 | 
149 |     # sample_s3_request.py
150 |     dns_name = "{ALB_DNS}"
151 |     bucket_name = "{Bucket_name}"
152 |     endpoint_arn = "{Comprehend ARN}"
153 |     input_pdf_uri = "{document_s3_uri}"
154 | 
155 |     # sample_local_request.py
156 |     dns_name = "{ALB_DNS}"
157 |     bucket_name = "{Bucket_name}"
158 |     endpoint_arn = "{Comprehend ARN}"
159 |     input_pdf_path = "{local_doc_path}"
160 | 
161 | The Load Balancer's response is the output ZIP file's Amazon S3 URI as a string in the `output_zip_file_s3_uri` key-value pair:
162 | 
163 |     {
164 |       "output_zip_file_s3_uri": "s3://{file_of_split_docs}.zip"
165 |     }
166 | 
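The full request scripts are provided in workflow2_docsplitter/sample_request_folder. As a condensed sketch, the S3 variant reduces to a single `requests.post` call with the parameters passed as a query string; it assumes `dns_name` includes the scheme (for example `http://{ALB_DNS}`) and has no trailing slash.

    import requests

    dns_name = "http://{ALB_DNS}"        # include the scheme, no trailing slash
    bucket_name = "{Bucket_name}"
    endpoint_arn = "{Comprehend ARN}"
    input_pdf_uri = "{document_s3_uri}"

    # The /s3_file path splits a PDF that is already stored in S3
    url = (f"{dns_name}/s3_file?bucket_name={bucket_name}"
           f"&endpoint_arn={endpoint_arn}&input_pdf_uri={input_pdf_uri}")
    response = requests.post(url)
    print(response.text)  # JSON body containing output_zip_file_s3_uri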
167 | ## Workflow 3: Local Endpoint Builder and Doc Splitter
168 | 
169 | ### Description
170 | 
171 | Workflow 3 serves a similar purpose to Workflow 1 and Workflow 2 in generating an Amazon Comprehend endpoint; however, all processing is done on the user’s local machine to generate an Amazon Comprehend-compatible CSV file.
172 | This workflow was created for customers in highly regulated industries where persisting PDF documents on Amazon S3 may not be possible.
173 | 
174 | ### Local Endpoint Builder
175 | 
176 | #### Input:
177 | The absolute path of the local **folder** containing the training dataset files (these can be images, single-page PDFs, and/or multi-page PDFs).
178 | The structure of the folder must be the same as that shown in the "Input" section of "Workflow 1: Endpoint Builder".
179 | The name of an existing Amazon S3 bucket you have access to must also be provided, as this is where the training dataset CSV file is uploaded and accessed by Amazon Comprehend.
180 | 
181 | #### Output:
182 | The ARN (Amazon Resource Name) of the model's endpoint, which can be used to classify documents in Workflow 2.
183 | 
184 | #### Usage:
185 | Navigate to the local_endpointbuilder.py file in the workflow3_local directory.
186 | Run the file and follow the prompts to input your local dataset folder's absolute path and your Amazon S3 bucket's name.
187 | 
188 | When the program has finished running, the model's endpoint ARN is printed to the console.
189 | See the final lines in local_endpointbuilder.py:
190 | 
191 |     print("Welcome to the local endpoint builder!\n")
192 |     existing_bucket_name = input("Please enter the name of an existing S3 bucket you are able to access: ")
193 |     local_dataset_path = input("Please enter the absolute local file path of the dataset folder: ")
194 |     print("\nComprehend model endpoint ARN: " + main(existing_bucket_name, local_dataset_path))
195 | 
196 | ### Local Doc Splitter
197 | 
198 | #### Input:
199 | The endpoint ARN of the Amazon Comprehend custom classifier and one of the following:
200 | - Amazon S3 URI of the multi-page PDF document that needs to be split
201 | - local multi-page PDF file that needs to be split
202 | 
203 | #### Output:
204 | A local output folder is created. It contains the multi-class PDFs, each named after its class and containing the pages from the input PDF that belong to that class.
205 | 
206 | #### Usage:
207 | Navigate to the local_docsplitter.py file in the workflow3_local directory.
208 | Run the file and follow the prompts to input your endpoint ARN and either your input PDF's S3 URI or absolute local path.
209 | 
210 | When the program has finished running, the console displays the path of the output folder containing the multi-class PDFs.
211 | See the final lines in local_docsplitter.py:
212 | 
213 |     print("Welcome to the local document splitter!")
214 |     endpoint_arn = input("Please enter the ARN of the Comprehend classification endpoint: ")
215 | 
216 |     print("Would you like to split a local file or a file stored in S3?")
217 |     choice = input("Please enter 'local' or 's3': ")
218 |     if choice == 'local':
219 |         input_pdf_info = input("Please enter the absolute local file path of the input PDF: ")
220 |     elif choice == 's3':
221 |         input_pdf_info = input("Please enter the S3 URI of the input PDF: ")
222 | 
223 |     main(endpoint_arn, choice, input_pdf_info)
224 | 
225 | ## Security
226 | 
227 | See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.
228 | 
229 | ## License
230 | 
231 | This library is licensed under the MIT-0 License. See the LICENSE file.
232 | 233 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | amazon-textract-caller==0.0.13 2 | amazon-textract-prettyprinter==0.0.9 3 | amazon-textract-response-parser==0.1.14 4 | awslambdaric==2.0.0 5 | boto3==1.18.11 6 | botocore==1.21.11 7 | certifi==2021.5.30 8 | charset-normalizer==2.0.6 9 | filetype==1.0.8 10 | idna==3.2 11 | jmespath==0.10.0 12 | marshmallow==3.11.1 13 | pdf2image==1.16.0 14 | Pillow==9.0.1 15 | PyPDF2==1.27.5 16 | python-dateutil==2.8.2 17 | requests==2.26.0 18 | s3transfer==0.5.0 19 | simplejson==3.17.2 20 | six==1.16.0 21 | tabulate==0.8.9 22 | urllib3==1.26.6 23 | -------------------------------------------------------------------------------- /workflow1_endpointbuilder/sam-app/functions/function1_createtableandfunction3trigger/index.py: -------------------------------------------------------------------------------- 1 | import boto3 2 | import datetime 3 | from urllib.parse import urlparse 4 | 5 | 6 | def create_dataset_table(datetime_id, bucket_name): 7 | dynamodb_resource = boto3.resource('dynamodb') 8 | dynamodb_resource.create_table( 9 | TableName=f'DatasetCSVTable_{datetime_id}_{bucket_name}_', 10 | KeySchema=[{ 11 | 'AttributeName': 'objectKey', 12 | 'KeyType': 'HASH' 13 | }], 14 | AttributeDefinitions=[{ 15 | 'AttributeName': 'objectKey', 16 | 'AttributeType': 'S' 17 | }], 18 | BillingMode='PAY_PER_REQUEST', 19 | StreamSpecification={ 20 | 'StreamEnabled': True, 21 | 'StreamViewType': 'NEW_AND_OLD_IMAGES' 22 | }) 23 | 24 | 25 | def create_function3_trigger(stream_arn): 26 | lambda_client = boto3.client('lambda') 27 | lambda_client.create_event_source_mapping( 28 | EventSourceArn=stream_arn, 29 | FunctionName='EndpointBuilder-Function3-ProcessObjects', 30 | Enabled=True, 31 | BatchSize=1000, 32 | ParallelizationFactor=10, 33 | StartingPosition='TRIM_HORIZON', 34 | BisectBatchOnFunctionError=True, 35 | MaximumRetryAttempts=3, 36 | FunctionResponseTypes=[ 37 | 'ReportBatchItemFailures', 38 | ]) 39 | 40 | 41 | def lambda_handler(event, context): 42 | folder_uri = event["folder_uri"] 43 | bucket_name = urlparse(folder_uri).hostname 44 | datetime_id = datetime.datetime.now().strftime("%Y%m%d%H%M%S") 45 | 46 | create_dataset_table(datetime_id, bucket_name) 47 | 48 | dynamodb_client = boto3.client('dynamodb') 49 | while True: 50 | response = dynamodb_client.describe_table( 51 | TableName=f'DatasetCSVTable_{datetime_id}_{bucket_name}_') 52 | status = response["Table"]["TableStatus"] 53 | if status == "ACTIVE": 54 | stream_arn = response['Table']['LatestStreamArn'] 55 | break 56 | 57 | create_function3_trigger(stream_arn) 58 | 59 | return {'datetime_id': datetime_id, 'folder_uri': folder_uri} 60 | -------------------------------------------------------------------------------- /workflow1_endpointbuilder/sam-app/functions/function2_inserttablerowswithobjectkeys/index.py: -------------------------------------------------------------------------------- 1 | 2 | import boto3 3 | from urllib.parse import urlparse 4 | 5 | 6 | def insert_table_rows(folder_uri, datetime_id, bucket_name): 7 | s3_resource = boto3.resource('s3') 8 | dynamodb = boto3.client("dynamodb") 9 | folder_name = folder_uri.split(bucket_name + "/", 1)[1] 10 | bucket = s3_resource.Bucket(bucket_name) 11 | s3_client = boto3.client("s3") 12 | 13 | for obj in bucket.objects.filter(Prefix=folder_name): 14 | key = obj.key 15 | # Skip files that are not PDFs, JPEGs, or PNGs 
16 | object_content_type = s3_client.get_object( 17 | Bucket=bucket_name, 18 | Key=key 19 | )['ContentType'] 20 | valid_content_types = {'application/pdf', 'image/jpeg', 'image/png'} 21 | if object_content_type in valid_content_types: 22 | dynamodb.put_item( 23 | TableName=f'DatasetCSVTable_{datetime_id}_{bucket_name}_', 24 | Item={ 25 | 'objectKey': { 26 | 'S': key 27 | } 28 | } 29 | ) 30 | 31 | 32 | def lambda_handler(event, context): 33 | folder_uri = event["folder_uri"] 34 | datetime_id = event["datetime_id"] 35 | bucket_name = urlparse(folder_uri).hostname 36 | 37 | insert_table_rows(folder_uri, datetime_id, bucket_name) 38 | 39 | return { 40 | 'datetime_id': datetime_id, 41 | 'bucket_name': bucket_name, 42 | } 43 | -------------------------------------------------------------------------------- /workflow1_endpointbuilder/sam-app/functions/function3_processobjectsandupdaterows/Dockerfile: -------------------------------------------------------------------------------- 1 | 2 | # Define global args 3 | ARG FUNCTION_DIR="/" 4 | ARG RUNTIME_VERSION="3.8" 5 | ARG DISTRO_VERSION="3.12" 6 | 7 | # Stage 1 - bundle base image + runtime 8 | # Grab a fresh copy of the image and install GCC 9 | FROM python:${RUNTIME_VERSION}-alpine${DISTRO_VERSION} AS python-alpine 10 | 11 | # Install GCC for compiling and linking dependencies 12 | RUN apk add --no-cache \ 13 | libstdc++ 14 | 15 | 16 | # Stage 2 - build function and dependencies 17 | FROM python-alpine AS build-image 18 | 19 | # Install Poppler 20 | RUN apk add poppler-utils 21 | 22 | # Install Pillow dependencies 23 | RUN apk --no-cache add \ 24 | freetype-dev \ 25 | fribidi-dev \ 26 | harfbuzz-dev \ 27 | jpeg-dev \ 28 | lcms2-dev \ 29 | openjpeg-dev \ 30 | tcl-dev \ 31 | tiff-dev \ 32 | tk-dev \ 33 | zlib-dev 34 | 35 | # Install aws-lambda-cpp build dependencies 36 | RUN apk add --no-cache \ 37 | build-base \ 38 | libtool \ 39 | autoconf \ 40 | automake \ 41 | libexecinfo-dev \ 42 | make \ 43 | cmake \ 44 | libcurl 45 | 46 | # Include global args in this stage of the build 47 | ARG FUNCTION_DIR 48 | ARG RUNTIME_VERSION 49 | 50 | # Create function directory 51 | RUN mkdir -p ${FUNCTION_DIR} 52 | 53 | # Copy handler function 54 | COPY index.py requirements.txt ${FUNCTION_DIR} 55 | 56 | # Install the function's dependencies 57 | RUN python${RUNTIME_VERSION} -m pip install -r ${FUNCTION_DIR}requirements.txt --target ${FUNCTION_DIR} 58 | 59 | 60 | # Stage 3 - final runtime image 61 | 62 | # Grab a fresh copy of the Python image 63 | FROM python-alpine 64 | 65 | # Include global arg in this stage of the build 66 | ARG FUNCTION_DIR 67 | 68 | # Set working directory to function root directory 69 | WORKDIR ${FUNCTION_DIR} 70 | 71 | # Copy in the built dependencies 72 | COPY --from=build-image ${FUNCTION_DIR} ${FUNCTION_DIR} 73 | 74 | ENTRYPOINT [ "/usr/local/bin/python", "-m", "awslambdaric" ] 75 | CMD [ "index.lambda_handler" ] 76 | -------------------------------------------------------------------------------- /workflow1_endpointbuilder/sam-app/functions/function3_processobjectsandupdaterows/index.py: -------------------------------------------------------------------------------- 1 | 2 | import boto3 3 | from io import BytesIO 4 | from pdf2image import convert_from_bytes 5 | from textractcaller.t_call import call_textract 6 | from textractprettyprinter.t_pretty_print import get_lines_string 7 | 8 | 9 | def process_file(key, datetime_id, bucket_name): 10 | _class = key.split("/")[1] 11 | s3 = boto3.client('s3') 12 | dynamodb = 
boto3.client('dynamodb') 13 | 14 | s3_response_object = s3.get_object(Bucket=bucket_name, Key=key) 15 | object_content = s3_response_object['Body'].read() 16 | 17 | if key.endswith(".pdf"): 18 | images = convert_from_bytes(object_content) 19 | all_raw_text = "" 20 | for i, image in enumerate(images): 21 | image_text = call_textract_for_pdf(image) 22 | all_raw_text += image_text 23 | update_dynamodb_row(key, _class, all_raw_text, datetime_id, bucket_name, dynamodb) 24 | 25 | else: 26 | image_text = call_textract_for_image(object_content) 27 | update_dynamodb_row(key, _class, image_text, datetime_id, bucket_name, dynamodb) 28 | 29 | 30 | def call_textract_for_pdf(image): 31 | buf = BytesIO() 32 | image.save(buf, format='JPEG') 33 | byte_string = buf.getvalue() 34 | return call_textract_for_image(byte_string) 35 | 36 | 37 | def call_textract_for_image(object_content): 38 | textract_json = call_textract(input_document=object_content) 39 | return get_lines_string(textract_json=textract_json) 40 | 41 | 42 | def update_dynamodb_row(key, _class, raw_text, datetime_id, bucket_name, dynamodb): 43 | dynamodb.update_item( 44 | TableName=f'DatasetCSVTable_{datetime_id}_{bucket_name}_', 45 | Key={ 46 | 'objectKey': { 47 | 'S': key 48 | } 49 | }, 50 | ExpressionAttributeNames={ 51 | '#cls': 'class', 52 | '#txt': 'text', 53 | }, 54 | ExpressionAttributeValues={ 55 | ':c': { 56 | 'S': _class, 57 | }, 58 | ':t': { 59 | 'S': raw_text, 60 | }, 61 | }, 62 | UpdateExpression='SET #cls = :c, #txt = :t' 63 | ) 64 | 65 | 66 | def parse_info(event_source_arn): 67 | info = event_source_arn.split("_") 68 | datetime_id = info[1] 69 | bucket_name = info[2] 70 | return datetime_id, bucket_name 71 | 72 | 73 | def lambda_handler(event, context): 74 | 75 | for record in event["Records"]: 76 | if record["eventName"] == "INSERT": 77 | new_image = record["dynamodb"]["NewImage"] 78 | key = new_image["objectKey"]["S"] 79 | datetime_id, bucket_name = parse_info(record["eventSourceARN"]) 80 | 81 | process_file(key, datetime_id, bucket_name) 82 | -------------------------------------------------------------------------------- /workflow1_endpointbuilder/sam-app/functions/function3_processobjectsandupdaterows/requirements.txt: -------------------------------------------------------------------------------- 1 | amazon-textract-caller==0.0.13 2 | amazon-textract-prettyprinter==0.0.9 3 | awslambdaric==2.0.0 4 | certifi==2021.5.30 5 | charset-normalizer==2.0.6 6 | idna==3.2 7 | jmespath==0.10.0 8 | marshmallow==3.11.1 9 | pdf2image==1.16.0 10 | Pillow==9.0.1 11 | python-dateutil==2.8.2 12 | requests==2.26.0 13 | s3transfer==0.5.0 14 | six==1.16.0 15 | tabulate==0.8.9 16 | urllib3==1.26.6 17 | -------------------------------------------------------------------------------- /workflow1_endpointbuilder/sam-app/functions/function4_checktablestatus/index.py: -------------------------------------------------------------------------------- 1 | 2 | import boto3 3 | 4 | 5 | def check_table(datetime_id, bucket_name): 6 | dynamodb_resource = boto3.resource('dynamodb') 7 | table = dynamodb_resource.Table(f'DatasetCSVTable_{datetime_id}_{bucket_name}_') 8 | 9 | scan_kwargs = { 10 | 'Select': 'COUNT', 11 | 'FilterExpression': 'attribute_not_exists(#cls)', 12 | 'ExpressionAttributeNames': {'#cls': 'class'} 13 | } 14 | 15 | done = False 16 | start_key = None 17 | rows_not_filled = 0 18 | while not done: 19 | if start_key: 20 | scan_kwargs['ExclusiveStartKey'] = start_key 21 | response = table.scan(**scan_kwargs) 22 | rows_not_filled += response['Count'] 
23 | start_key = response.get('LastEvaluatedKey', None) 24 | done = start_key is None 25 | 26 | if rows_not_filled == 0: 27 | return True 28 | else: 29 | return False 30 | 31 | 32 | def lambda_handler(event, context): 33 | datetime_id = event["datetime_id"] 34 | bucket_name = event["bucket_name"] 35 | 36 | is_complete = check_table(datetime_id, bucket_name) 37 | 38 | return { 39 | 'is_complete': is_complete, 40 | 'datetime_id': datetime_id, 41 | 'bucket_name': bucket_name 42 | } 43 | -------------------------------------------------------------------------------- /workflow1_endpointbuilder/sam-app/functions/function5_createcsv/index.py: -------------------------------------------------------------------------------- 1 | 2 | import csv 3 | import boto3 4 | from io import StringIO 5 | 6 | 7 | def create_csv(datetime_id, bucket_name): 8 | dynamodb = boto3.resource('dynamodb') 9 | table = dynamodb.Table(f'DatasetCSVTable_{datetime_id}_{bucket_name}_') 10 | 11 | scan_kwargs = { 12 | 'ProjectionExpression': "#cls, #txt", 13 | 'ExpressionAttributeNames': {"#cls": "class", "#txt": "text"} 14 | } 15 | 16 | f = StringIO() 17 | writer = csv.writer(f) 18 | done = False 19 | start_key = None 20 | while not done: 21 | if start_key: 22 | scan_kwargs['ExclusiveStartKey'] = start_key 23 | response = table.scan(**scan_kwargs) 24 | documents = response.get('Items', []) 25 | for doc in documents: 26 | writer.writerow([doc['class'], doc['text']]) 27 | start_key = response.get('LastEvaluatedKey', None) 28 | done = start_key is None 29 | return f 30 | 31 | 32 | def lambda_handler(event, context): 33 | datetime_id = event["datetime_id"] 34 | bucket_name = event["bucket_name"] 35 | 36 | f = create_csv(datetime_id, bucket_name) 37 | s3 = boto3.client("s3") 38 | s3.put_object(Body=f.getvalue(), Bucket=bucket_name, Key=f"comprehend_dataset_{datetime_id}.csv") 39 | 40 | return { 41 | "datetime_id": datetime_id, 42 | "bucket_name": bucket_name 43 | } 44 | -------------------------------------------------------------------------------- /workflow1_endpointbuilder/sam-app/functions/function6_trainclassifier/index.py: -------------------------------------------------------------------------------- 1 | 2 | import os 3 | import boto3 4 | 5 | 6 | def train_classifier(datetime_id, bucket_name, role_arn, classifier_name): 7 | comprehend = boto3.client("comprehend") 8 | 9 | # create classifier with the CSV training dataset's S3 URI and the ARN of the IAM role 10 | create_response = comprehend.create_document_classifier( 11 | InputDataConfig={ 12 | 'S3Uri': f"s3://{bucket_name}/comprehend_dataset_{datetime_id}.csv" 13 | }, 14 | DataAccessRoleArn=role_arn, 15 | DocumentClassifierName=classifier_name, 16 | LanguageCode='en' 17 | ) 18 | 19 | classifier_arn = create_response['DocumentClassifierArn'] 20 | return classifier_arn 21 | 22 | 23 | def lambda_handler(event, context): 24 | datetime_id = event["datetime_id"] 25 | bucket_name = event["bucket_name"] 26 | 27 | classifier_name = f"Classifier-{datetime_id}" 28 | 29 | role_arn = os.environ.get('FUNCTION6_TRAIN_CLASSIFIER_COMPREHEND_ROLE_ARN') 30 | classifier_arn = train_classifier(datetime_id, bucket_name, role_arn, classifier_name) 31 | 32 | return { 33 | 'datetime_id': datetime_id, 34 | 'classifier_arn': classifier_arn 35 | } -------------------------------------------------------------------------------- /workflow1_endpointbuilder/sam-app/functions/function7_checkclassifierstatus/index.py: -------------------------------------------------------------------------------- 1 | 2 | 
import boto3 3 | 4 | 5 | def check_classifier(classifier_arn): 6 | comprehend = boto3.client("comprehend") 7 | describe_response = comprehend.describe_document_classifier( 8 | DocumentClassifierArn=classifier_arn 9 | ) 10 | 11 | is_trained = False 12 | status = describe_response['DocumentClassifierProperties']['Status'] 13 | if status != "TRAINED": 14 | if status == "IN_ERROR": 15 | message = describe_response["DocumentClassifierProperties"]["Message"] 16 | raise ValueError(f"The classifier is in error:", message) 17 | else: 18 | is_trained = True 19 | 20 | return is_trained 21 | 22 | 23 | def lambda_handler(event, context): 24 | datetime_id = event["datetime_id"] 25 | classifier_arn = event["classifier_arn"] 26 | 27 | is_trained = check_classifier(classifier_arn) 28 | 29 | return { 30 | 'is_trained': is_trained, 31 | 'datetime_id': datetime_id, 32 | 'classifier_arn': classifier_arn 33 | } 34 | -------------------------------------------------------------------------------- /workflow1_endpointbuilder/sam-app/functions/function8_buildendpoint/index.py: -------------------------------------------------------------------------------- 1 | 2 | import boto3 3 | 4 | 5 | def build_endpoint(classifier_arn, datetime_id): 6 | comprehend = boto3.client("comprehend") 7 | endpoint_response = comprehend.create_endpoint( 8 | EndpointName=f"Classifier-{datetime_id}", 9 | ModelArn=classifier_arn, 10 | DesiredInferenceUnits=10, 11 | ) 12 | endpoint_arn = endpoint_response["EndpointArn"] 13 | return endpoint_arn 14 | 15 | 16 | def lambda_handler(event, context): 17 | datetime_id = event["datetime_id"] 18 | classifier_arn = event["classifier_arn"] 19 | 20 | endpoint_arn = build_endpoint(classifier_arn, datetime_id) 21 | 22 | return { 23 | 'endpoint_arn': endpoint_arn 24 | } 25 | -------------------------------------------------------------------------------- /workflow1_endpointbuilder/sam-app/functions/function9_checkendpointstatus/index.py: -------------------------------------------------------------------------------- 1 | 2 | import boto3 3 | 4 | 5 | def check_endpoint(endpoint_arn): 6 | comprehend = boto3.client("comprehend") 7 | describe_response = comprehend.describe_endpoint( 8 | EndpointArn=endpoint_arn 9 | ) 10 | 11 | endpoint_is_complete = False 12 | status = describe_response["EndpointProperties"]["Status"] 13 | if status != "IN_SERVICE": 14 | if status == "FAILED": 15 | message = describe_response["EndpointProperties"]["Message"] 16 | raise ValueError(f"The endpoint is in error:", message) 17 | else: 18 | endpoint_is_complete = True 19 | 20 | return endpoint_is_complete 21 | 22 | 23 | def lambda_handler(event, context): 24 | endpoint_arn = event["endpoint_arn"] 25 | 26 | endpoint_is_complete = check_endpoint(endpoint_arn) 27 | 28 | return { 29 | 'endpoint_is_complete': endpoint_is_complete, 30 | 'endpoint_arn': endpoint_arn 31 | } 32 | -------------------------------------------------------------------------------- /workflow1_endpointbuilder/sam-app/statemachine/endpointbuilder.asl.json: -------------------------------------------------------------------------------- 1 | { 2 | "Comment": "This is your state machine", 3 | "StartAt": "Create Table & DynamoDB Stream", 4 | "States": { 5 | "Create Table & DynamoDB Stream": { 6 | "Type": "Task", 7 | "Resource": "arn:aws:states:::lambda:invoke", 8 | "OutputPath": "$.Payload", 9 | "Parameters": { 10 | "Payload.$": "$", 11 | "FunctionName": "${CreateTableAndFunction3TriggerFunction}" 12 | }, 13 | "Retry": [ 14 | { 15 | "ErrorEquals": [ 16 | 
"Lambda.ServiceException", 17 | "Lambda.AWSLambdaException", 18 | "Lambda.SdkClientException" 19 | ], 20 | "IntervalSeconds": 2, 21 | "MaxAttempts": 6, 22 | "BackoffRate": 2 23 | } 24 | ], 25 | "Next": "Insert Table Rows with Object Keys" 26 | }, 27 | "Insert Table Rows with Object Keys": { 28 | "Type": "Task", 29 | "Resource": "arn:aws:states:::lambda:invoke", 30 | "OutputPath": "$.Payload", 31 | "Parameters": { 32 | "Payload.$": "$", 33 | "FunctionName": "${InsertTableRowsWithObjectKeysFunction2}" 34 | }, 35 | "Retry": [ 36 | { 37 | "ErrorEquals": [ 38 | "Lambda.ServiceException", 39 | "Lambda.AWSLambdaException", 40 | "Lambda.SdkClientException" 41 | ], 42 | "IntervalSeconds": 2, 43 | "MaxAttempts": 6, 44 | "BackoffRate": 2 45 | } 46 | ], 47 | "Next": "Check Table Status" 48 | }, 49 | "Check Table Status": { 50 | "Type": "Task", 51 | "Resource": "arn:aws:states:::lambda:invoke", 52 | "OutputPath": "$.Payload", 53 | "Parameters": { 54 | "Payload.$": "$", 55 | "FunctionName": "${CheckTableStatusFunction4}" 56 | }, 57 | "Retry": [ 58 | { 59 | "ErrorEquals": [ 60 | "Lambda.ServiceException", 61 | "Lambda.AWSLambdaException", 62 | "Lambda.SdkClientException" 63 | ], 64 | "IntervalSeconds": 2, 65 | "MaxAttempts": 6, 66 | "BackoffRate": 2 67 | } 68 | ], 69 | "Next": "Is Table Complete?" 70 | }, 71 | "Is Table Complete?": { 72 | "Type": "Choice", 73 | "Choices": [ 74 | { 75 | "Variable": "$.is_complete", 76 | "BooleanEquals": true, 77 | "Next": "Create CSV" 78 | } 79 | ], 80 | "Default": "Wait for Object Processing" 81 | }, 82 | "Wait for Object Processing": { 83 | "Type": "Wait", 84 | "Seconds": 300, 85 | "Next": "Check Table Status" 86 | }, 87 | "Create CSV": { 88 | "Type": "Task", 89 | "Resource": "arn:aws:states:::lambda:invoke", 90 | "OutputPath": "$.Payload", 91 | "Parameters": { 92 | "Payload.$": "$", 93 | "FunctionName": "${CreateCSVFunction5}" 94 | }, 95 | "Retry": [ 96 | { 97 | "ErrorEquals": [ 98 | "Lambda.ServiceException", 99 | "Lambda.AWSLambdaException", 100 | "Lambda.SdkClientException" 101 | ], 102 | "IntervalSeconds": 2, 103 | "MaxAttempts": 6, 104 | "BackoffRate": 2 105 | } 106 | ], 107 | "Next": "Train Classifier" 108 | }, 109 | "Train Classifier": { 110 | "Type": "Task", 111 | "Resource": "arn:aws:states:::lambda:invoke", 112 | "OutputPath": "$.Payload", 113 | "Parameters": { 114 | "Payload.$": "$", 115 | "FunctionName": "${TrainClassifierFunction6}" 116 | }, 117 | "Retry": [ 118 | { 119 | "ErrorEquals": [ 120 | "Lambda.ServiceException", 121 | "Lambda.AWSLambdaException", 122 | "Lambda.SdkClientException" 123 | ], 124 | "IntervalSeconds": 2, 125 | "MaxAttempts": 6, 126 | "BackoffRate": 2 127 | } 128 | ], 129 | "Next": "Check Classifier Status" 130 | }, 131 | "Check Classifier Status": { 132 | "Type": "Task", 133 | "Resource": "arn:aws:states:::lambda:invoke", 134 | "OutputPath": "$.Payload", 135 | "Parameters": { 136 | "Payload.$": "$", 137 | "FunctionName": "${CheckClassifierStatusFunction7}" 138 | }, 139 | "Retry": [ 140 | { 141 | "ErrorEquals": [ 142 | "Lambda.ServiceException", 143 | "Lambda.AWSLambdaException", 144 | "Lambda.SdkClientException" 145 | ], 146 | "IntervalSeconds": 2, 147 | "MaxAttempts": 6, 148 | "BackoffRate": 2 149 | } 150 | ], 151 | "Next": "Is Training Complete?" 
152 | }, 153 | "Is Training Complete?": { 154 | "Type": "Choice", 155 | "Choices": [ 156 | { 157 | "Variable": "$.is_trained", 158 | "BooleanEquals": true, 159 | "Next": "Build Endpoint" 160 | } 161 | ], 162 | "Default": "Wait for Training" 163 | }, 164 | "Build Endpoint": { 165 | "Type": "Task", 166 | "Resource": "arn:aws:states:::lambda:invoke", 167 | "OutputPath": "$.Payload", 168 | "Parameters": { 169 | "Payload.$": "$", 170 | "FunctionName": "${BuildEndpointFunction8}" 171 | }, 172 | "Retry": [ 173 | { 174 | "ErrorEquals": [ 175 | "Lambda.ServiceException", 176 | "Lambda.AWSLambdaException", 177 | "Lambda.SdkClientException" 178 | ], 179 | "IntervalSeconds": 2, 180 | "MaxAttempts": 6, 181 | "BackoffRate": 2 182 | } 183 | ], 184 | "Next": "Check Endpoint Status" 185 | }, 186 | "Check Endpoint Status": { 187 | "Type": "Task", 188 | "Resource": "arn:aws:states:::lambda:invoke", 189 | "OutputPath": "$.Payload", 190 | "Parameters": { 191 | "Payload.$": "$", 192 | "FunctionName": "${CheckEndpointStatusFunction9}" 193 | }, 194 | "Retry": [ 195 | { 196 | "ErrorEquals": [ 197 | "Lambda.ServiceException", 198 | "Lambda.AWSLambdaException", 199 | "Lambda.SdkClientException" 200 | ], 201 | "IntervalSeconds": 2, 202 | "MaxAttempts": 6, 203 | "BackoffRate": 2 204 | } 205 | ], 206 | "Next": "Is Endpoint Ready?" 207 | }, 208 | "Is Endpoint Ready?": { 209 | "Type": "Choice", 210 | "Choices": [ 211 | { 212 | "Variable": "$.endpoint_is_complete", 213 | "BooleanEquals": true, 214 | "Next": "Success" 215 | } 216 | ], 217 | "Default": "Wait for Endpoint" 218 | }, 219 | "Success": { 220 | "Type": "Succeed" 221 | }, 222 | "Wait for Endpoint": { 223 | "Type": "Wait", 224 | "Seconds": 300, 225 | "Next": "Check Endpoint Status" 226 | }, 227 | "Wait for Training": { 228 | "Type": "Wait", 229 | "Seconds": 600, 230 | "Next": "Check Classifier Status" 231 | } 232 | } 233 | } -------------------------------------------------------------------------------- /workflow1_endpointbuilder/sam-app/template.yaml: -------------------------------------------------------------------------------- 1 | Transform: 'AWS::Serverless-2016-10-31' 2 | 3 | Description: SAM template for endpointbuilder 4 | 5 | Resources: 6 | EndpointBuilderStateMachine: 7 | Type: AWS::Serverless::StateMachine 8 | Properties: 9 | DefinitionUri: statemachine/endpointbuilder.asl.json 10 | DefinitionSubstitutions: 11 | CreateTableAndFunction3TriggerFunction: !GetAtt EndpointBuilderFunction1CreateTableAndFunction3Trigger.Arn 12 | InsertTableRowsWithObjectKeysFunction2: !GetAtt EndpointBuilderFunction2InsertTableRowsWithObjectKeys.Arn 13 | CheckTableStatusFunction4: !GetAtt EndpointBuilderFunction4CheckTableStatus.Arn 14 | CreateCSVFunction5: !GetAtt EndpointBuilderFunction5CreateCSV.Arn 15 | TrainClassifierFunction6: !GetAtt EndpointBuilderFunction6TrainClassifier.Arn 16 | CheckClassifierStatusFunction7: !GetAtt EndpointBuilderFunction7CheckClassifierStatus.Arn 17 | BuildEndpointFunction8: !GetAtt EndpointBuilderFunction8BuildEndpoint.Arn 18 | CheckEndpointStatusFunction9: !GetAtt EndpointBuilderFunction9CheckEndpointStatus.Arn 19 | Role: !GetAtt EndpointBuilderRole.Arn 20 | 21 | EndpointBuilderFunction1CreateTableAndFunction3Trigger: 22 | Type: AWS::Serverless::Function 23 | Properties: 24 | CodeUri: functions/function1_createtableandfunction3trigger/ 25 | Role: !GetAtt EndpointBuilderRole.Arn 26 | Description: Creates DynamoDB table and trigger for EndpointBuilderFunction3 27 | Timeout: 180 28 | MemorySize: 128 29 | Runtime: python3.8 30 | Handler: 
index.lambda_handler 31 | 32 | EndpointBuilderFunction2InsertTableRowsWithObjectKeys: 33 | Type: 'AWS::Serverless::Function' 34 | Properties: 35 | CodeUri: functions/function2_inserttablerowswithobjectkeys/ 36 | Role: !GetAtt EndpointBuilderRole.Arn 37 | Description: Inserts rows into DynamoDB table using the S3 object keys as the primary key 38 | Timeout: 900 39 | MemorySize: 10240 40 | Runtime: python3.8 41 | Handler: index.lambda_handler 42 | 43 | EndpointBuilderFunction3ProcessObjectsAndUpdateRows: 44 | Type: 'AWS::Serverless::Function' 45 | Properties: 46 | FunctionName: EndpointBuilder-Function3-ProcessObjects 47 | Description: Processes S3 objects via DynamoDB insert row trigger; uses object key to extract file text and class; updates table row; its FunctionName is used inside Function1 code to create the DynamoDB event source mapping 48 | Role: !GetAtt EndpointBuilderRole.Arn 49 | Timeout: 900 50 | MemorySize: 10240 51 | PackageType: Image 52 | ImageConfig: 53 | Command: [ "index.lambda_handler" ] 54 | Metadata: 55 | Dockerfile: Dockerfile 56 | DockerContext: functions/function3_processobjectsandupdaterows/ 57 | DockerTag: v1 58 | 59 | EndpointBuilderFunction4CheckTableStatus: 60 | Type: 'AWS::Serverless::Function' 61 | Properties: 62 | CodeUri: functions/function4_checktablestatus/ 63 | Role: !GetAtt EndpointBuilderRole.Arn 64 | Description: Check DynamoDB table status--all rows must have existing class and text attributes (provided by EndpointBuilderFunction3ProcessObjectsAndUpdateRows) 65 | Timeout: 900 66 | MemorySize: 10240 67 | Runtime: python3.8 68 | Handler: index.lambda_handler 69 | 70 | EndpointBuilderFunction5CreateCSV: 71 | Type: 'AWS::Serverless::Function' 72 | Properties: 73 | CodeUri: functions/function5_createcsv/ 74 | Role: !GetAtt EndpointBuilderRole.Arn 75 | Description: Creates CSV file from DynamoDB table and uploads to S3 bucket 76 | Timeout: 900 77 | MemorySize: 10240 78 | Runtime: python3.8 79 | Handler: index.lambda_handler 80 | 81 | EndpointBuilderFunction6TrainClassifier: 82 | Type: 'AWS::Serverless::Function' 83 | Properties: 84 | CodeUri: functions/function6_trainclassifier/ 85 | Role: !GetAtt EndpointBuilderRole.Arn 86 | Description: Trains Comprehend classifier using the CSV created in EndpointBuilderFunction5CreateCSV 87 | Timeout: 300 88 | MemorySize: 2048 89 | Runtime: python3.8 90 | Handler: index.lambda_handler 91 | Environment: 92 | Variables: 93 | FUNCTION6_TRAIN_CLASSIFIER_COMPREHEND_ROLE_ARN: !GetAtt Function6ComprehendRole.Arn 94 | 95 | EndpointBuilderFunction7CheckClassifierStatus: 96 | Type: 'AWS::Serverless::Function' 97 | Properties: 98 | CodeUri: functions/function7_checkclassifierstatus/ 99 | Role: !GetAtt EndpointBuilderRole.Arn 100 | Description: Checks Comprehend classifier status to verify if it has finished training 101 | Timeout: 60 102 | MemorySize: 128 103 | Runtime: python3.8 104 | Handler: index.lambda_handler 105 | 106 | EndpointBuilderFunction8BuildEndpoint: 107 | Type: 'AWS::Serverless::Function' 108 | Properties: 109 | CodeUri: functions/function8_buildendpoint/ 110 | Role: !GetAtt EndpointBuilderRole.Arn 111 | Description: Creates Comprehend endpoint from trained classifier 112 | Timeout: 60 113 | MemorySize: 128 114 | Runtime: python3.8 115 | Handler: index.lambda_handler 116 | 117 | EndpointBuilderFunction9CheckEndpointStatus: 118 | Type: 'AWS::Serverless::Function' 119 | Properties: 120 | CodeUri: functions/function9_checkendpointstatus/ 121 | Role: !GetAtt EndpointBuilderRole.Arn 122 | Description: Checks Comprehend 
endpoint status to verify if it has been created 123 | Timeout: 60 124 | MemorySize: 128 125 | Runtime: python3.8 126 | Handler: index.lambda_handler 127 | 128 | EndpointBuilderRole: 129 | Type: 'AWS::IAM::Role' 130 | Properties: 131 | AssumeRolePolicyDocument: 132 | Statement: 133 | - Effect: Allow 134 | Principal: 135 | Service: 136 | - lambda.amazonaws.com 137 | - states.amazonaws.com 138 | Action: 139 | - 'sts:AssumeRole' 140 | Description: Role executes Lambda functions 141 | MaxSessionDuration: 43200 142 | Policies: 143 | - PolicyName: root 144 | PolicyDocument: 145 | Version: 2012-10-17 146 | Statement: 147 | - Effect: Allow 148 | Action: 149 | - 's3:PutObject' 150 | - 's3:GetObject' 151 | - 's3:ListBucket' 152 | - 'iam:PassRole' 153 | - 'textract:*' 154 | - 'dynamodb:*' 155 | - 'lambda:*' 156 | - 'comprehend:*' 157 | - 'states:*' 158 | - 'logs:*' 159 | Resource: '*' 160 | 161 | Function6ComprehendRole: 162 | Type: 'AWS::IAM::Role' 163 | Properties: 164 | AssumeRolePolicyDocument: 165 | Statement: 166 | - Effect: Allow 167 | Principal: 168 | Service: 169 | - comprehend.amazonaws.com 170 | Action: 171 | - 'sts:AssumeRole' 172 | Description: Service role for Comprehend classifier 173 | MaxSessionDuration: 43200 174 | Policies: 175 | - PolicyName: root 176 | PolicyDocument: 177 | Version: 2012-10-17 178 | Statement: 179 | - Effect: Allow 180 | Action: 181 | - 's3:PutObject' 182 | - 's3:GetObject' 183 | - 's3:ListBucket' 184 | Resource: '*' 185 | 186 | Outputs: 187 | StateMachineArn: 188 | Description: Arn of the EndpointBuilder State Machine 189 | Value: !GetAtt EndpointBuilderStateMachine.Arn 190 | Export: 191 | Name: State-Machine-Arn 192 | -------------------------------------------------------------------------------- /workflow2_docsplitter/sam-app/functions/docsplitter_function/Dockerfile: -------------------------------------------------------------------------------- 1 | 2 | FROM public.ecr.aws/lambda/python:3.8 3 | 4 | RUN yum update -y 5 | RUN yum install poppler-utils -y 6 | RUN yum install -y \ 7 | libtiff-devel libjpeg-devel openjpeg2-devel zlib-devel \ 8 | freetype-devel lcms2-devel libwebp-devel tcl-devel tk-devel \ 9 | harfbuzz-devel fribidi-devel libraqm-devel libimagequant-devel libxcb-devel 10 | 11 | COPY index.py ${LAMBDA_TASK_ROOT} 12 | COPY requirements.txt . 
13 | RUN pip3 install -r requirements.txt --target "${LAMBDA_TASK_ROOT}" 14 | 15 | CMD [ "index.lambda_handler" ] 16 | -------------------------------------------------------------------------------- /workflow2_docsplitter/sam-app/functions/docsplitter_function/index.py: -------------------------------------------------------------------------------- 1 | 2 | import json 3 | import base64 4 | import boto3 5 | import zipfile 6 | from io import BytesIO 7 | from datetime import datetime 8 | from pdf2image import convert_from_bytes 9 | from PyPDF2 import PdfFileReader, PdfFileWriter 10 | from textractcaller.t_call import call_textract 11 | from textractprettyprinter.t_pretty_print import get_lines_string 12 | 13 | 14 | def call_textract_on_image(textract, image, i): 15 | # return JSON response containing the text extracted from the image 16 | print(f"Inputting image {i} into Textract") 17 | buf = BytesIO() 18 | image.save(buf, format='JPEG') 19 | byte_string = buf.getvalue() 20 | return call_textract(input_document=byte_string, boto3_textract_client=textract) 21 | 22 | 23 | def call_comprehend(text, comprehend, endpoint_arn): 24 | # send in raw text for a document as well as the Comprehend custom classification model ARN 25 | # returns JSON response containing the document's predicted class 26 | print("Inputting text into Comprehend model") 27 | return comprehend.classify_document( 28 | Text=text, 29 | EndpointArn=endpoint_arn 30 | ) 31 | 32 | 33 | def add_page_to_class(i, _class, pages_by_class): 34 | # appends a page number (integer) to list of page numbers (value in key-value pair) 35 | # stores information on how the original multi-page input PDF file is divided by class 36 | # stores page numbers in order, so the final outputted multi-class PDF pages will be in order 37 | if _class in pages_by_class: 38 | pages_by_class[_class].append(i) 39 | else: 40 | pages_by_class[_class] = [i] 41 | print(f"Added page {i} to {_class}\n") 42 | 43 | 44 | def create_output_pdfs(input_pdf_content, pages_by_class): 45 | # loops through each class in the pages_by_class dictionary to get all of the input PDF page numbers 46 | # creates new PDF for each class using the corresponding input PDF's pages 47 | 48 | input_pdf_buffer = BytesIO(input_pdf_content) 49 | input_pdf = PdfFileReader(input_pdf_buffer, strict=False) 50 | output_zip_buffer = BytesIO() 51 | 52 | with zipfile.ZipFile(output_zip_buffer, "w") as zip_archive: 53 | for _class in pages_by_class: 54 | output = PdfFileWriter() 55 | page_numbers = pages_by_class[_class] 56 | 57 | for page_num in page_numbers: 58 | output.addPage(input_pdf.getPage(page_num)) 59 | 60 | output_buffer = BytesIO() 61 | output.write(output_buffer) 62 | 63 | with zip_archive.open(f"{_class}.pdf", 'w') as output_pdf: 64 | output_pdf.write(output_buffer.getvalue()) 65 | 66 | print(f"Created PDF for {_class}") 67 | 68 | return output_zip_buffer 69 | 70 | 71 | def split_input_pdf_by_class(input_pdf_content, endpoint_arn, _id): 72 | # loops through each page of the inputted multi-page PDF 73 | # converts single-page PDF into an image and uploads it to the S3 bucket 74 | # image in S3 is inputted into the Textract API; text is extracted, JSON is parsed 75 | # raw text is inputted into the Comprehend model API using its endpoint ARN 76 | # JSON response is parsed to find the predicted class 77 | # the input PDF's page number is assigned to the predicted class in the pages_by_class dictionary 78 | textract = boto3.client('textract') 79 | comprehend = boto3.client('comprehend') 80 | 
pages_by_class = {} 81 | 82 | # converts PDF into images 83 | images = convert_from_bytes(input_pdf_content) 84 | 85 | # process each image 86 | for i, image in enumerate(images): 87 | textract_response = call_textract_on_image(textract, image, i) 88 | raw_text = get_lines_string(textract_json=textract_response) 89 | comprehend_response = call_comprehend(raw_text, comprehend, endpoint_arn) 90 | _class = comprehend_response['Classes'][0]['Name'] 91 | add_page_to_class(i, _class, pages_by_class) 92 | 93 | print("Input PDF has been split up and classified\n") 94 | return pages_by_class 95 | 96 | 97 | def lambda_handler(event, context): 98 | if event['path'] == '/': 99 | return { 100 | "statusCode": 200, 101 | "statusDescription": "200 OK", 102 | "isBase64Encoded": False, 103 | "headers": { 104 | "Content-Type": "text/html" 105 | }, 106 | "body": "This is the Document Splitter API." 107 | } 108 | 109 | else: 110 | request_body = event['queryStringParameters'] 111 | endpoint_arn = request_body['endpoint_arn'] 112 | bucket_name = request_body['bucket_name'] 113 | 114 | _id = datetime.now().strftime("%Y%m%d%H%M%S") 115 | s3 = boto3.client('s3') 116 | 117 | if event['path'] == '/s3_file': 118 | input_pdf_uri = request_body['input_pdf_uri'] 119 | input_pdf_key = input_pdf_uri.split(bucket_name + "/", 1)[1] 120 | s3_response_object = s3.get_object(Bucket=bucket_name, Key=input_pdf_key) 121 | input_pdf_content = s3_response_object['Body'].read() 122 | 123 | elif event['path'] == '/local_file': 124 | encoded_data = event['body'] 125 | decoded_data = base64.standard_b64decode(encoded_data) 126 | input_pdf_content = b"%PDF" + decoded_data.split(b"\r\n\r\n%PDF", 1)[1] 127 | 128 | # pages_by_class is a dictionary 129 | # key is class name; value is list of page numbers belonging to the key class 130 | pages_by_class = split_input_pdf_by_class(input_pdf_content, endpoint_arn, _id) 131 | 132 | output_zip_buffer = create_output_pdfs(input_pdf_content, pages_by_class) 133 | 134 | output_key_name = f"workflow2_output_documents_{_id}.zip" 135 | s3.put_object(Body=output_zip_buffer.getvalue(), Bucket=bucket_name, Key=output_key_name, ContentType='application/zip') 136 | 137 | output_zip_file_s3_uri = f"s3://{bucket_name}/{output_key_name}" 138 | return { 139 | "statusCode": 200, 140 | "statusDescription": "200 OK", 141 | "isBase64Encoded": False, 142 | "headers": { 143 | "Content-Type": "text/html" 144 | }, 145 | "body": json.dumps( 146 | { 147 | 'output_zip_file_s3_uri': output_zip_file_s3_uri 148 | } 149 | ) 150 | } 151 | -------------------------------------------------------------------------------- /workflow2_docsplitter/sam-app/functions/docsplitter_function/requirements.txt: -------------------------------------------------------------------------------- 1 | amazon-textract-caller==0.0.13 2 | amazon-textract-prettyprinter==0.0.9 3 | awslambdaric==2.0.0 4 | certifi==2021.5.30 5 | charset-normalizer==2.0.6 6 | filetype==1.0.8 7 | idna==3.2 8 | jmespath==0.10.0 9 | marshmallow==3.11.1 10 | pdf2image==1.16.0 11 | Pillow==9.3.0 12 | PyPDF2==1.27.5 13 | python-dateutil==2.8.2 14 | requests==2.26.0 15 | s3transfer==0.5.0 16 | simplejson==3.17.2 17 | six==1.16.0 18 | tabulate==0.8.9 19 | urllib3==1.26.6 20 | -------------------------------------------------------------------------------- /workflow2_docsplitter/sam-app/functions/getalbinfo_function/index.py: -------------------------------------------------------------------------------- 1 | 2 | from __future__ import print_function 3 | import urllib3 4 | import 
json 5 | import boto3 6 | 7 | 8 | http = urllib3.PoolManager() 9 | 10 | def send(event, context, responseStatus, responseData, physicalResourceId=None, noEcho=False, reason=None): 11 | responseUrl = event['ResponseURL'] 12 | 13 | print(responseUrl) 14 | 15 | responseBody = { 16 | 'Status': responseStatus, 17 | 'Reason': reason or "See the details in CloudWatch Log Stream: {}".format(context.log_stream_name), 18 | 'PhysicalResourceId': physicalResourceId or context.log_stream_name, 19 | 'StackId': event['StackId'], 20 | 'RequestId': event['RequestId'], 21 | 'LogicalResourceId': event['LogicalResourceId'], 22 | 'NoEcho': noEcho, 23 | 'Data': responseData 24 | } 25 | 26 | json_responseBody = json.dumps(responseBody) 27 | 28 | print("Response body:") 29 | print(json_responseBody) 30 | 31 | headers = { 32 | 'content-type': '', 33 | 'content-length': str(len(json_responseBody)) 34 | } 35 | 36 | try: 37 | response = http.request('PUT', responseUrl, headers=headers, body=json_responseBody) 38 | print("Status code:", response.status) 39 | 40 | 41 | except Exception as e: 42 | 43 | print("send(..) failed executing http.request(..):", e) 44 | 45 | 46 | def lambda_handler(event, context): 47 | ec2 = boto3.client('ec2') 48 | vpc_response = ec2.describe_vpcs( 49 | Filters=[ 50 | { 51 | 'Name': 'is-default', 52 | 'Values': [ 53 | 'true' 54 | ] 55 | } 56 | ] 57 | ) 58 | vpc_id = vpc_response['Vpcs'][0]['VpcId'] 59 | 60 | subnet_response = ec2.describe_subnets( 61 | Filters=[ 62 | { 63 | 'Name': 'vpc-id', 64 | 'Values': [ 65 | vpc_id 66 | ] 67 | } 68 | ] 69 | ) 70 | subnet_ids = [subnet['SubnetId'] for subnet in subnet_response['Subnets']] 71 | 72 | response_data = { 73 | 'VpcId': vpc_id, 74 | 'Subnets': subnet_ids 75 | } 76 | 77 | return send(event, context, "SUCCESS", response_data) 78 | -------------------------------------------------------------------------------- /workflow2_docsplitter/sam-app/template.yaml: -------------------------------------------------------------------------------- 1 | Transform: 'AWS::Serverless-2016-10-31' 2 | 3 | Description: SAM template for docsplitter 4 | 5 | Resources: 6 | GetALBInfoFunction: 7 | Type: 'AWS::Serverless::Function' 8 | Properties: 9 | CodeUri: functions/getalbinfo_function/ 10 | Role: !GetAtt DocSplitterRole.Arn 11 | Description: Uses AWS EC2 API to obtain VpcId and Subnet Ids required for creating the ALB 12 | Runtime: python3.8 13 | Handler: index.lambda_handler 14 | Timeout: 120 15 | MemorySize: 256 16 | 17 | GetALBInfoFunctionInvoke: 18 | Type: AWS::CloudFormation::CustomResource 19 | Version: "1.0" 20 | Properties: 21 | ServiceToken: !GetAtt GetALBInfoFunction.Arn 22 | 23 | DocSplitterFunction: 24 | Type: 'AWS::Serverless::Function' 25 | Properties: 26 | Description: Splits input PDF based on each page's class; used as ALB target 27 | Role: !GetAtt DocSplitterRole.Arn 28 | Timeout: 900 29 | MemorySize: 10240 30 | PackageType: Image 31 | ImageConfig: 32 | Command: [ "index.lambda_handler" ] 33 | Metadata: 34 | Dockerfile: Dockerfile 35 | DockerContext: functions/docsplitter_function/ 36 | DockerTag: v1 37 | 38 | LoadBalancer: 39 | Type: AWS::ElasticLoadBalancingV2::LoadBalancer 40 | Properties: 41 | SecurityGroups: 42 | - !GetAtt SecurityGroup.GroupId 43 | Subnets: !GetAtt GetALBInfoFunctionInvoke.Subnets 44 | Type: application 45 | 46 | Listener: 47 | Type: AWS::ElasticLoadBalancingV2::Listener 48 | Properties: 49 | DefaultActions: 50 | - Type: forward 51 | TargetGroupArn: !Ref TargetGroup 52 | LoadBalancerArn: !Ref LoadBalancer 53 | Port: 80 54 | 
Protocol: HTTP 55 | 56 | LambdaInvokePermission: 57 | Type: AWS::Lambda::Permission 58 | Properties: 59 | FunctionName: !GetAtt DocSplitterFunction.Arn 60 | Action: 'lambda:InvokeFunction' 61 | Principal: elasticloadbalancing.amazonaws.com 62 | 63 | TargetGroup: 64 | Type: AWS::ElasticLoadBalancingV2::TargetGroup 65 | DependsOn: 66 | - LambdaInvokePermission 67 | Properties: 68 | HealthCheckEnabled: false 69 | Name: DocSplitterTargetGroup 70 | TargetType: lambda 71 | Targets: 72 | - Id: !GetAtt DocSplitterFunction.Arn 73 | 74 | SecurityGroup: 75 | Type: AWS::EC2::SecurityGroup 76 | Properties: 77 | VpcId: !GetAtt GetALBInfoFunctionInvoke.VpcId 78 | GroupDescription: security group for DocSplitter application load balancer 79 | SecurityGroupIngress: 80 | - IpProtocol: tcp 81 | FromPort: 80 82 | ToPort: 80 83 | CidrIp: 0.0.0.0/0 84 | 85 | DocSplitterRole: 86 | Type: 'AWS::IAM::Role' 87 | Properties: 88 | AssumeRolePolicyDocument: 89 | Statement: 90 | - Effect: Allow 91 | Principal: 92 | Service: 93 | - lambda.amazonaws.com 94 | Action: 95 | - 'sts:AssumeRole' 96 | Description: Role executes Lambda functions 97 | MaxSessionDuration: 43200 98 | Policies: 99 | - PolicyName: root 100 | PolicyDocument: 101 | Version: 2012-10-17 102 | Statement: 103 | - Effect: Allow 104 | Action: 105 | - 's3:PutObject' 106 | - 's3:GetObject' 107 | - 's3:ListBucket' 108 | - 'textract:*' 109 | - 'lambda:*' 110 | - 'comprehend:*' 111 | - 'logs:*' 112 | - 'ec2:DescribeSubnets' 113 | - 'ec2:DescribeVpcs' 114 | Resource: '*' 115 | 116 | Outputs: 117 | LoadBalancerDnsName: 118 | Description: DNS Name of the DocSplitter application load balancer with the Lambda target group. 119 | Value: !GetAtt LoadBalancer.DNSName 120 | -------------------------------------------------------------------------------- /workflow2_docsplitter/sample_request_folder/sample_local_request.py: -------------------------------------------------------------------------------- 1 | 2 | import requests 3 | 4 | def split_local_pdf(dns_name, input_pdf_path, bucket_name, endpoint_arn): 5 | if dns_name.endswith("/"): 6 | path = "local_file" 7 | else: 8 | path = "/local_file" 9 | 10 | with open(input_pdf_path, "rb") as file: 11 | files = { 12 | "input_pdf_file": file 13 | } 14 | url = dns_name + f"{path}?bucket_name={bucket_name}&endpoint_arn={endpoint_arn}" 15 | response = requests.post(url, files=files) 16 | print(response.text) 17 | 18 | 19 | if __name__ == "__main__": 20 | dns_name = "Enter your dns_name" 21 | bucket_name = "Enter your bucket_name " 22 | endpoint_arn = "Enter your endpoint arn" 23 | 24 | input_pdf_path = "~/sample.pdf" 25 | 26 | split_local_pdf(dns_name, input_pdf_path, bucket_name, endpoint_arn) 27 | -------------------------------------------------------------------------------- /workflow2_docsplitter/sample_request_folder/sample_s3_request.py: -------------------------------------------------------------------------------- 1 | 2 | import requests 3 | 4 | def split_s3_pdf(dns_name, input_pdf_uri, bucket_name, endpoint_arn): 5 | if dns_name.endswith("/"): 6 | path = "s3_file" 7 | else: 8 | path = "/s3_file" 9 | 10 | queries = f"?bucket_name={bucket_name}&endpoint_arn={endpoint_arn}&input_pdf_uri={input_pdf_uri}" 11 | url = dns_name + path + queries 12 | response = requests.post(url) 13 | print(response.text) 14 | 15 | 16 | if __name__ == "__main__": 17 | dns_name = "Enter your dns_name" 18 | bucket_name = "Enter your bucket_name " 19 | endpoint_arn = "Enter your endpoint arn" 20 | 21 | input_pdf_path = "s3://sample.pdf" 22 | 23 | 
split_s3_pdf(dns_name, input_pdf_path, bucket_name, endpoint_arn)
24 |
--------------------------------------------------------------------------------
/workflow3_local/local_docsplitter.py:
--------------------------------------------------------------------------------
1 |
2 | import os
3 | import boto3
4 | from shutil import rmtree
5 | from datetime import datetime
6 | from pdf2image import convert_from_path
7 | from PyPDF2 import PdfFileReader, PdfFileWriter
8 | from textractcaller.t_call import call_textract
9 | from textractprettyprinter.t_pretty_print import get_lines_string
10 |
11 |
12 | def call_textract_on_image(textract, image_path):
13 |     # return JSON response containing the text extracted from the image
14 |     print(f"Inputting {image_path} into Textract")
15 |     # send in local image file content (is base64-encoded by the API)
16 |     with open(image_path, "rb") as f:
17 |         return call_textract(input_document=f.read(), boto3_textract_client=textract)
18 |
19 |
20 | def call_comprehend(text, comprehend, endpoint_arn):
21 |     # send in raw text for a document as well as the Comprehend custom classification model ARN
22 |     # returns JSON response containing the document's predicted class
23 |     print("Inputting text into Comprehend model")
24 |     return comprehend.classify_document(
25 |         Text=text,
26 |         EndpointArn=endpoint_arn
27 |     )
28 |
29 |
30 | def add_page_to_class(i, _class, pages_by_class):
31 |     # appends a page number (integer) to the list of page numbers (value in key-value pair)
32 |     # stores information on how the original multi-page input PDF file is divided by class
33 |     # stores page numbers in order, so the pages of the final multi-class PDFs will be in order
34 |     if _class in pages_by_class:
35 |         pages_by_class[_class].append(i)
36 |     else:
37 |         pages_by_class[_class] = [i]
38 |     print(f"Added page {i + 1} to {_class}\n")
39 |
40 |
41 | def create_directories(dirs_list):
42 |     for dir_name in dirs_list:
43 |         try:
44 |             os.mkdir(dir_name)
45 |         except FileExistsError:
46 |             pass
47 |     print("Created directories")
48 |
49 |
50 | def create_output_pdfs(input_pdf_path, pages_by_class, output_dir_path, output_dir_name):
51 |     # loops through each class in the pages_by_class dictionary to get all of the input PDF page numbers
52 |     # creates a new PDF for each class using the corresponding input PDF pages
53 |     # the outputted multi-class PDFs are located in the output folder
54 |
55 |     with open(input_pdf_path, "rb") as f:
56 |         for _class in pages_by_class:
57 |             output = PdfFileWriter()
58 |             page_numbers = pages_by_class[_class]
59 |             input_pdf = PdfFileReader(f)
60 |
61 |             for page_num in page_numbers:
62 |                 output.addPage(input_pdf.getPage(page_num))
63 |             with open(f"{output_dir_path}/{_class}.pdf", "wb") as output_stream:
64 |                 output.write(output_stream)
65 |
66 |             print(f"Created PDF for {_class}")
67 |
68 |
69 | def split_input_pdf_by_class(input_pdf_path, temp_dir_path, endpoint_arn, _id):
70 |     # loops through each page of the inputted multi-page PDF
71 |     # converts each page into an image stored in the local temporary folder
72 |     # each image is inputted into the Textract API; text is extracted and the JSON response is parsed
73 |     # raw text is inputted into the Comprehend model API using its endpoint ARN
74 |     # the JSON response is parsed to find the predicted class
75 |     # the input PDF's page number is assigned to the predicted class in the pages_by_class dictionary
76 |     textract = boto3.client('textract')
77 |     comprehend = boto3.client('comprehend')
78 |     pages_by_class = {}
79 |
80 |     # converts PDF into images
81
| image_paths = convert_from_path( 82 | pdf_path=input_pdf_path, 83 | fmt='jpeg', 84 | paths_only=True, 85 | output_folder=temp_dir_path 86 | ) 87 | 88 | # process each image 89 | for i, image_path in enumerate(image_paths): 90 | textract_response = call_textract_on_image(textract, image_path) 91 | raw_text = get_lines_string(textract_json=textract_response) 92 | comprehend_response = call_comprehend(raw_text, comprehend, endpoint_arn) 93 | _class = comprehend_response['Classes'][0]['Name'] 94 | add_page_to_class(i, _class, pages_by_class) 95 | 96 | print("Input PDF has been split up and classified\n") 97 | return pages_by_class 98 | 99 | 100 | def main(endpoint_arn, choice, input_pdf_info): 101 | root_path = os.path.dirname(os.path.abspath(__file__)) 102 | _id = datetime.now().strftime("%Y%m%d%H%M%S") 103 | temp_dir_name = f"workflow2_temp_documents-{_id}" 104 | temp_dir_path = f"{root_path}/{temp_dir_name}" 105 | output_dir_name = f"workflow2_output_documents-{_id}" 106 | output_dir_path = f"{root_path}/{output_dir_name}" 107 | create_directories([temp_dir_path, output_dir_path]) 108 | s3 = boto3.client('s3') 109 | 110 | if choice == "s3": 111 | input_pdf_uri = input_pdf_info 112 | bucket_name = input_pdf_uri.split("/")[2] 113 | input_pdf_key = input_pdf_uri.split(bucket_name + "/", 1)[1] 114 | input_pdf_path = f"{temp_dir_path}/input.pdf" 115 | with open(input_pdf_path, "wb") as data: 116 | s3.download_fileobj(bucket_name, input_pdf_key, data) 117 | elif choice == "local": 118 | input_pdf_path = input_pdf_info 119 | 120 | # pages_by_class is a dictionary 121 | # key is class name; value is list of page numbers belonging to the key class 122 | pages_by_class = split_input_pdf_by_class(input_pdf_path, temp_dir_path, endpoint_arn, _id) 123 | 124 | create_output_pdfs(input_pdf_path, pages_by_class, output_dir_path, output_dir_name) 125 | rmtree(temp_dir_path) 126 | print("Multi-class PDFs have been created in the output folder, " + output_dir_path) 127 | 128 | 129 | print("Welcome to the local document splitter!") 130 | endpoint_arn = input("Please enter the ARN of the Comprehend classification endpoint: ") 131 | 132 | print("Would you like to split a local file or a file stored in S3?") 133 | choice = input("Please enter 'local' or 's3': ") 134 | if choice == 'local': 135 | input_pdf_info = input("Please enter the absolute local file path of the input PDF: ") 136 | elif choice == 's3': 137 | input_pdf_info = input("Please enter the S3 URI of the input PDF: ") 138 | 139 | main(endpoint_arn, choice, input_pdf_info) 140 | -------------------------------------------------------------------------------- /workflow3_local/local_endpointbuilder.py: -------------------------------------------------------------------------------- 1 | 2 | import os 3 | import csv 4 | import boto3 5 | import botocore 6 | import filetype 7 | from json import dumps 8 | from time import sleep 9 | from itertools import repeat 10 | from datetime import datetime 11 | from shutil import rmtree, copytree 12 | from pdf2image import convert_from_path 13 | from concurrent.futures import ThreadPoolExecutor 14 | from textractcaller.t_call import call_textract 15 | from textractprettyprinter.t_pretty_print import get_lines_string 16 | 17 | 18 | def create_training_dataset(dataset_path, csv_name, bucket_name, _id): 19 | root_path = os.path.dirname(os.path.abspath(__file__)) 20 | temp_dir_path = root_path + "/workflow1.5_local_temp" 21 | csv_file_path = temp_dir_path + "/" + csv_name 22 | 23 | # create local directories that store 
the files generated by this script 24 | create_temp_directories(temp_dir_path, dataset_path) 25 | 26 | # create training dataset CSV 27 | with open(csv_file_path, "w", newline="") as file: 28 | new_image_info = get_processed_images(temp_dir_path) 29 | items = enumerate(new_image_info) 30 | with ThreadPoolExecutor() as executor: 31 | rows = executor.map(add_image_to_csv, items, repeat(temp_dir_path)) 32 | 33 | # write rows to CSV 34 | writer = csv.writer(file) 35 | writer.writerows(rows) 36 | 37 | # upload CSV to S3 bucket 38 | s3 = boto3.client("s3") 39 | s3.upload_file(csv_file_path, bucket_name, csv_name) 40 | print("Training dataset CSV file has been uploaded to S3") 41 | 42 | # delete the local directories containing the CSV, images, and PDFs 43 | rmtree(temp_dir_path) 44 | print("Deleted local CSV, PDFs, and images") 45 | 46 | 47 | def add_image_to_csv(item, temp_dir_path): 48 | i = item[0] 49 | image_info = item[1] 50 | _class = image_info[1] 51 | 52 | if len(image_info) == 2: 53 | # extract text from an image that was originally an image 54 | rel_path = image_info[0] 55 | image_path = f"{temp_dir_path}/local_dataset_docs/{rel_path}" 56 | textract_json = call_textract_on_image(image_path) 57 | raw_text = get_lines_string(textract_json=textract_json) 58 | 59 | elif len(image_info) == 3: 60 | # extract text from PDF that has been converted into one or more images 61 | file_name = image_info[0] 62 | pdf_num_pages = image_info[2] 63 | # raw_text will store text extracted from all of the images that compose the PDF 64 | raw_text = "" 65 | image_path_start = f"{temp_dir_path}/images_processed/{file_name}" 66 | 67 | for page_num in range(pdf_num_pages): 68 | path = f"{image_path_start}-img-{page_num}.jpg" 69 | textract_json = call_textract_on_image(path) 70 | # parse JSON response to get raw text 71 | image_text = get_lines_string(textract_json=textract_json) 72 | raw_text += image_text 73 | 74 | # row is returned; contains document's class and the text within it 75 | print(f"Created row {i + 1}") 76 | return [_class, raw_text] 77 | 78 | 79 | def call_textract_on_image(image_path): 80 | # return JSON response containing the text extracted from the image 81 | print(f"Inputting {image_path} into Textract") 82 | textract = boto3.client("textract") 83 | # send in local image file content (is base64-encoded by API) 84 | with open(image_path, "rb") as f: 85 | return call_textract(input_document=f.read(), boto3_textract_client=textract) 86 | 87 | 88 | def process_file(file_info, temp_dir_path): 89 | file_path = file_info[0] 90 | file_extension = file_info[1] 91 | 92 | rel_path = os.path.relpath(file_path, temp_dir_path + "/local_dataset_docs").lstrip("./") 93 | _class = rel_path.split("/", 1)[0] 94 | 95 | if file_extension == "pdf": 96 | # create images from PDFs 97 | # new images, located in the images_processed directory, will be used 98 | file_name = file_path.split("/")[-1] 99 | pdf_num_pages = pdf_to_images(file_path, temp_dir_path, file_name) 100 | image_info = (file_name, _class, pdf_num_pages) 101 | else: 102 | # the image stored in the local_dataset_docs directory will be used 103 | image_info = (rel_path, _class) 104 | 105 | return image_info 106 | 107 | 108 | def get_processed_images(temp_dir_path): 109 | valid_file_extensions = {"pdf", "jpg", "png"} 110 | # file_info is a list of tuples with 2 elements: (file_path, file_extension) 111 | file_info = [] 112 | 113 | # loop through all files in local_dataset_docs 114 | for root, dirs, files in os.walk(temp_dir_path + 
"/local_dataset_docs"): 115 | for file in files: 116 | file_path = os.path.join(root, file) 117 | file_type = filetype.guess(file_path) 118 | if file_type is None: 119 | # filter out folders 120 | continue 121 | file_extension = file_type.extension 122 | if file_extension in valid_file_extensions: 123 | # only accept PDF, JPEG, and PNG files 124 | file_info.append((file_path, file_extension)) 125 | 126 | with ThreadPoolExecutor() as executor: 127 | # new_image_info is a generator of tuples 128 | # if the file is an image, the tuple is (rel_path, _class) 129 | # if the file is a PDF, the tuple is (file_name, _class, pdf_num_pages) 130 | new_image_info = executor.map(process_file, file_info, repeat(temp_dir_path)) 131 | return list(new_image_info) 132 | 133 | 134 | def create_image(image, image_path_start, i): 135 | new_image_path = f"{image_path_start}-img-{i}.jpg" 136 | image.save(new_image_path, "JPEG") 137 | 138 | 139 | def pdf_to_images(pdf_path, temp_dir_path, file_name): 140 | # convert PDF into images 141 | # return number of pages in PDF as an integer 142 | images = convert_from_path(pdf_path) 143 | pdf_num_pages = len(images) 144 | image_path_start = f"{temp_dir_path}/images_processed/{file_name}" 145 | with ThreadPoolExecutor() as executor: 146 | executor.map(create_image, images, repeat(image_path_start), range(pdf_num_pages)) 147 | return pdf_num_pages 148 | 149 | 150 | def create_temp_directories(temp_path, dataset_path): 151 | try: 152 | os.mkdir(temp_path) 153 | os.mkdir(temp_path + "/images_processed") 154 | copytree(dataset_path, temp_path + "/local_dataset_docs") 155 | except FileExistsError: 156 | pass 157 | print("Created temporary local directories") 158 | 159 | 160 | def create_comprehend_role(bucket_name, role_name, iam_comprehend_policy_name): 161 | iam = boto3.client("iam") 162 | try: 163 | # create IAM role with trust policy 164 | iam_assume_role_policy = dumps({ 165 | "Version": "2012-10-17", 166 | "Statement": { 167 | "Effect": "Allow", 168 | "Principal": 169 | {"Service": "comprehend.amazonaws.com"}, 170 | "Action": "sts:AssumeRole" 171 | } 172 | }) 173 | iam_create_response = iam.create_role( 174 | RoleName=role_name, 175 | AssumeRolePolicyDocument=iam_assume_role_policy, 176 | MaxSessionDuration=21600 177 | ) 178 | 179 | role_arn = iam_create_response['Role']['Arn'] 180 | print("IAM role created") 181 | 182 | except botocore.exceptions.ClientError as error: 183 | # if role already exists 184 | if error.response["Error"]["Code"] == "EntityAlreadyExists": 185 | iam_get_role_response = iam.get_role( 186 | RoleName=role_name 187 | ) 188 | role_arn = iam_get_role_response["Role"]["Arn"] 189 | print("IAM role already exists") 190 | else: 191 | raise error 192 | 193 | try: 194 | # create policy that allows role to access the CSV training dataset in S3 195 | iam_comprehend_policy_document = dumps({ 196 | "Version": "2012-10-17", 197 | "Statement": [ 198 | { 199 | "Action": [ 200 | "s3:GetObject" 201 | ], 202 | "Resource": [ 203 | f"arn:aws:s3:::{bucket_name}/*" 204 | ], 205 | "Effect": "Allow" 206 | }, 207 | { 208 | "Action": [ 209 | "s3:ListBucket" 210 | ], 211 | "Resource": [ 212 | f"arn:aws:s3:::{bucket_name}" 213 | ], 214 | "Effect": "Allow" 215 | }, 216 | { 217 | "Action": [ 218 | "s3:PutObject" 219 | ], 220 | "Resource": [ 221 | f"arn:aws:s3:::{bucket_name}/*" 222 | ], 223 | "Effect": "Allow" 224 | } 225 | ] 226 | }) 227 | iam_create_policy_response = iam.create_policy( 228 | PolicyName=iam_comprehend_policy_name, 229 | 
PolicyDocument=iam_comprehend_policy_document,
230 |         )
231 |
232 |         # attach S3 access policy to role
233 |         policy_arn = iam_create_policy_response["Policy"]["Arn"]
234 |         iam.attach_role_policy(
235 |             RoleName=role_name,
236 |             PolicyArn=policy_arn
237 |         )
238 |         print("IAM policy created and attached to role. Waiting to configure")
239 |
240 |         # wait for a minute before configuring the Comprehend model
241 |         # the IAM role and policy need time to propagate; without the wait, model creation throws an error
242 |         sleep(60)
243 |
244 |     except botocore.exceptions.ClientError as error:
245 |         # if the policy already exists
246 |         if error.response["Error"]["Code"] == "EntityAlreadyExists":
247 |             print("IAM policy already exists")
248 |         else:
249 |             raise error
250 |
251 |     return role_arn
252 |
253 |
254 | def get_endpoint_arn(csv_file_name, bucket_name, role_arn, model_name):
255 |     comprehend = boto3.client("comprehend")
256 |     region = boto3.session.Session().region_name
257 |     account_number = role_arn.split(":")[4]  # account ID is the fifth ':'-separated field of the role ARN
258 |
259 |     try:
260 |         # create the model with the CSV training dataset's S3 URI and the ARN of the IAM role
261 |         create_response = comprehend.create_document_classifier(
262 |             InputDataConfig={
263 |                 "S3Uri": f"s3://{bucket_name}/{csv_file_name}"
264 |             },
265 |             DataAccessRoleArn=role_arn,
266 |             DocumentClassifierName=model_name,
267 |             LanguageCode="en"
268 |         )
269 |
270 |         model_arn = create_response['DocumentClassifierArn']
271 |         print("Created Comprehend model")
272 |
273 |     except botocore.exceptions.ClientError as error:
274 |         # if the model already exists
275 |         if error.response["Error"]["Code"] == "ResourceInUseException":
276 |             model_arn = f"arn:aws:comprehend:{region}:{account_number}:document-classifier/{model_name}"
277 |             print("Model has already been created")
278 |         else:
279 |             raise error
280 |
281 |     describe_response = comprehend.describe_document_classifier(
282 |         DocumentClassifierArn=model_arn)
283 |     status = describe_response['DocumentClassifierProperties']['Status']
284 |
285 |     print("Model training...")
286 |     while status != "TRAINED":
287 |         if status == "IN_ERROR":
288 |             message = describe_response["DocumentClassifierProperties"]["Message"]
289 |             raise ValueError(f"The classifier is in error: {message}")
290 |         # update the model's status every 5 minutes if it has not finished training
291 |         sleep(300)
292 |         describe_response = comprehend.describe_document_classifier(
293 |             DocumentClassifierArn=model_arn)
294 |         status = describe_response["DocumentClassifierProperties"]["Status"]
295 |
296 |     print("Model trained. Creating endpoint")
297 |     try:
298 |         endpoint_response = comprehend.create_endpoint(
299 |             EndpointName=model_name,
300 |             ModelArn=model_arn,
301 |             DesiredInferenceUnits=10,
302 |         )
303 |         endpoint_arn = endpoint_response["EndpointArn"]
304 |
305 |         describe_response = comprehend.describe_endpoint(
306 |             EndpointArn=endpoint_arn
307 |         )
308 |         status = describe_response["EndpointProperties"]["Status"]
309 |         while status == "CREATING":
310 |             # update the endpoint's status every 3 minutes until creation finishes
311 |             sleep(180)
312 |             describe_response = comprehend.describe_endpoint(
313 |                 EndpointArn=endpoint_arn
314 |             )
315 |             status = describe_response["EndpointProperties"]["Status"]
316 |         if status != "IN_SERVICE":
317 |             message = describe_response["EndpointProperties"].get("Message", "")
318 |             raise ValueError(f"The endpoint is in error: {message}")
319 |
320 |     except botocore.exceptions.ClientError as error:
321 |         # if the endpoint already exists
322 |         if error.response["Error"]["Code"] == "ResourceInUseException":
323 |             endpoint_arn = f"arn:aws:comprehend:{region}:{account_number}:document-classifier-endpoint/{model_name}"
324 |             print("Endpoint has already been created")
325 |         else:
326 |             raise error
327 |     # the model's endpoint ARN is returned as a string
328 |     return endpoint_arn
329 |
330 |
331 | def main(bucket_name, dataset_path):
332 |     _id = datetime.now().strftime("%Y%m%d%H%M%S")
333 |     csv_file_name = f"comprehend_dataset_{_id}.csv"
334 |     model_name = f"DatasetBuilderModel-{_id}"
335 |     role_name = "AmazonComprehendServiceRole-" + model_name
336 |     policy_name = model_name + "AccessS3Policy"
337 |
338 |     create_training_dataset(dataset_path, csv_file_name, bucket_name, _id)
339 |     role_arn = create_comprehend_role(bucket_name, role_name, policy_name)
340 |     comprehend_endpoint_arn = get_endpoint_arn(csv_file_name, bucket_name, role_arn, model_name)
341 |     return comprehend_endpoint_arn
342 |
343 |
344 | print("Welcome to the local endpoint builder!\n")
345 | existing_bucket_name = input("Please enter the name of an existing S3 bucket you are able to access: ")
346 | local_dataset_path = input("Please enter the absolute local file path of the dataset folder: ")
347 | print("\nComprehend model endpoint ARN: " + main(existing_bucket_name, local_dataset_path))
348 |
--------------------------------------------------------------------------------