├── .gitignore ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── requirements.txt ├── workflow1_endpointbuilder └── sam-app │ ├── functions │ ├── function1_createtableandfunction3trigger │ │ └── index.py │ ├── function2_inserttablerowswithobjectkeys │ │ └── index.py │ ├── function3_processobjectsandupdaterows │ │ ├── Dockerfile │ │ ├── index.py │ │ └── requirements.txt │ ├── function4_checktablestatus │ │ └── index.py │ ├── function5_createcsv │ │ └── index.py │ ├── function6_trainclassifier │ │ └── index.py │ ├── function7_checkclassifierstatus │ │ └── index.py │ ├── function8_buildendpoint │ │ └── index.py │ └── function9_checkendpointstatus │ │ └── index.py │ ├── statemachine │ └── endpointbuilder.asl.json │ └── template.yaml ├── workflow2_docsplitter ├── sam-app │ ├── functions │ │ ├── docsplitter_function │ │ │ ├── Dockerfile │ │ │ ├── index.py │ │ │ └── requirements.txt │ │ └── getalbinfo_function │ │ │ └── index.py │ └── template.yaml └── sample_request_folder │ ├── sample_local_request.py │ └── sample_s3_request.py └── workflow3_local ├── local_docsplitter.py └── local_endpointbuilder.py /.gitignore: -------------------------------------------------------------------------------- 1 | 2 | .aws-sam/ 3 | .python-version 4 | samconfig.toml 5 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 
29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | this software and associated documentation files (the "Software"), to deal in 5 | the Software without restriction, including without limitation the rights to 6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | the Software, and to permit persons to whom the Software is furnished to do so. 8 | 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
15 | 
16 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Intelligently Split Multi-Form Document Packages with Amazon Textract and Amazon Comprehend
2 | 
3 | ## Workflow 1: Endpoint Builder
4 | 
5 | 
6 | Blog: https://aws.amazon.com/blogs/machine-learning/intelligently-split-multi-form-document-packages-with-amazon-textract-and-amazon-comprehend/
7 | 
8 | Authors: Aditi Rajnish, Raj Pathak
9 | 
10 | #### Description
11 | Workflow 1 will take documents stored on Amazon S3 and send them through a series of steps to extract the data from the documents via Amazon Textract.
12 | Then, the extracted data will be used to create an Amazon Comprehend Custom Classification Endpoint.
13 | 
14 | #### Input:
15 | The Amazon S3 URI of the **folder** containing the training dataset files (these can be images, single-page PDFs, and/or multi-page PDFs).
16 | The structure of the folder must be:
17 | 
18 |     root dataset directory
19 |     ---- class directory
20 |     -------- files
21 | 
22 | or
23 | 
24 |     root dataset directory
25 |     ---- class directory
26 |     -------- nested subdirectories
27 |     ------------ files
28 | 
29 | Note that the names of the class subdirectories (2nd directory level) become the names of the classes used in the Amazon Comprehend custom classification model. For example, see the file structure below; the class for "form123.pdf" would be "tax_forms".
30 | 
31 |     training_dataset
32 |     ---- tax_forms
33 |     -------- page_1
34 |     ------------ form123.pdf
35 | 
36 | 
37 | #### Output:
38 | The ARN (Amazon Resource Name) of the model's endpoint, which can be used to classify documents in Workflow 2.
39 | 
40 | #### Usage:
41 | You must have Docker installed, configured and running. You must also have the AWS CLI and AWS SAM CLI installed and configured with your AWS credentials.
42 | Navigate to the workflow1_endpointbuilder/sam-app directory.
43 | Run 'sam build' and 'sam deploy --guided' in your CLI and provide the Stack Name and AWS Region (ensure that this is the same region as your Amazon S3 dataset folder). You must respond with "y" for the "Allow SAM CLI IAM role creation" and "Create managed ECR repositories for all functions?" prompts.
44 | This will deploy CloudFormation stacks that create the infrastructure for the Endpoint Builder state machine in AWS Step Functions.
45 | For example, see the CLI input below:
46 | 
47 |     (venv) Lambda_Topic_Classifier % cd workflow1_endpointbuilder/sam-app
48 |     (venv) sam-app % sam build
49 |     Build Succeeded
50 | 
51 |     (venv) sam-app % sam deploy --guided
52 |     Configuring SAM deploy
53 |     =========================================
54 |     Stack Name [sam-app]: endpointbuilder
55 |     AWS Region []: us-east-1
56 |     #Shows you resources changes to be deployed and require a 'Y' to initiate deploy
57 |     Confirm changes before deploy [y/N]: n
58 |     #SAM needs permission to be able to create roles to connect to the resources in your template
59 |     Allow SAM CLI IAM role creation [Y/n]: y
60 |     Save arguments to configuration file [Y/n]: n
61 | 
62 |     Looking for resources needed for deployment:
63 |     Creating the required resources...
64 |     Successfully created!
65 |     Managed S3 bucket: {your_bucket}
66 |     #Managed repositories will be deleted when their functions are removed from the template and deployed
67 |     Create managed ECR repositories for all functions? [Y/n]: y
68 | 
69 | and CLI output:
70 | 
71 |     CloudFormation outputs from deployed stack
72 |     -------------------------------------------------------------------------------------------------
73 |     Outputs
74 |     -------------------------------------------------------------------------------------------------
75 |     Key                 StateMachineArn
76 |     Description         Arn of the EndpointBuilder State Machine
77 |     Value               {state_machine_arn}
78 |     -------------------------------------------------------------------------------------------------
79 | 
80 | Find the corresponding state machine in AWS Step Functions and start a new execution, providing your Amazon S3 folder URI as a string in the `folder_uri` key-value pair.
81 | 
82 |     {
83 |       "folder_uri": "s3://{your_bucket_name}"
84 |     }
85 | 
86 | The state machine returns the Amazon Comprehend classifier's endpoint ARN as a string in the `endpoint_arn` key-value pair:
87 | 
88 |     {
89 |       "endpoint_arn": "{your_end_point_ARN}"
90 |     }
91 | 
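If you prefer to start the execution from code rather than the AWS Step Functions console, the snippet below is a minimal illustrative sketch (not part of the deployed stack) using boto3. Replace `{state_machine_arn}` with the StateMachineArn stack output and `{your_bucket_name}` with your dataset bucket, and make sure your AWS credentials are configured locally.

    import json
    import time
    import boto3

    sfn = boto3.client("stepfunctions")

    # Start the EndpointBuilder state machine with the dataset folder URI
    execution = sfn.start_execution(
        stateMachineArn="{state_machine_arn}",
        input=json.dumps({"folder_uri": "s3://{your_bucket_name}"}),
    )

    # Training a Comprehend classifier takes a while, so poll until the execution stops
    while True:
        result = sfn.describe_execution(executionArn=execution["executionArn"])
        if result["status"] != "RUNNING":
            break
        time.sleep(60)

    # On success, the execution output contains the endpoint ARN
    if result["status"] == "SUCCEEDED":
        print(json.loads(result["output"])["endpoint_arn"])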
92 | ## Workflow 2: Document Splitter API
93 | 
94 | #### Description
95 | Workflow 2 will take the endpoint you created in Workflow 1 and split the documents based on the classes with which the model has been trained.
96 | 
97 | #### Input:
98 | The endpoint ARN of the Amazon Comprehend custom classifier, the name of an existing Amazon S3 bucket you have access to, and one of the following:
99 | - Amazon S3 URI of the multi-page PDF document that needs to be split
100 | - local multi-page PDF file that needs to be split
101 | 
102 | #### Output:
103 | The Amazon S3 URI of a ZIP file of multi-class PDFs, each named after its class and containing the pages from the input PDF that belong to that class.
104 | 
105 | #### Usage:
106 | You must have Docker installed, configured and running. You must also have the AWS CLI and AWS SAM CLI installed and configured with your AWS credentials.
107 | Navigate to the workflow2_docsplitter/sam-app directory.
108 | Run 'sam build' and 'sam deploy --guided' in your CLI and provide the Stack Name and AWS Region (ensure that this is the same region as your Amazon S3 bucket). You must respond with "y" for the "Allow SAM CLI IAM role creation" and "Create managed ECR repositories for all functions?" prompts.
109 | This will deploy CloudFormation stacks that create the infrastructure for the Application Load Balancer and its Lambda function target, where you can send document splitting requests.
110 | For example, see the CLI input below:
111 | 
112 |     (venv) Lambda_Topic_Classifier % cd workflow2_docsplitter/sam-app
113 |     (venv) sam-app % sam build
114 |     Build Succeeded
115 | 
116 |     (venv) sam-app % sam deploy --guided
117 |     Configuring SAM deploy
118 |     =========================================
119 |     Stack Name [sam-app]: docsplitter
120 |     AWS Region []: us-east-1
121 |     #Shows you resources changes to be deployed and require a 'Y' to initiate deploy
122 |     Confirm changes before deploy [y/N]: n
123 |     #SAM needs permission to be able to create roles to connect to the resources in your template
124 |     Allow SAM CLI IAM role creation [Y/n]: y
125 |     Save arguments to configuration file [Y/n]: n
126 | 
127 |     Looking for resources needed for deployment:
128 |     Managed S3 bucket: {bucket_name}
129 |     #Managed repositories will be deleted when their functions are removed from the template and deployed
130 |     Create managed ECR repositories for all functions? [Y/n]: y
131 | 
132 | and CLI output:
133 | 
134 |     CloudFormation outputs from deployed stack
135 |     -------------------------------------------------------------------------------------------------
136 |     Outputs
137 |     -------------------------------------------------------------------------------------------------
138 |     Key                 LoadBalancerDnsName
139 |     Description         DNS Name of the DocSplitter application load balancer with the Lambda target group.
140 |     Value               {ALB_DNS}
141 |     -------------------------------------------------------------------------------------------------
142 | 
143 | Send a POST request to the outputted LoadBalancerDnsName, providing your Amazon Comprehend endpoint ARN, Amazon S3 bucket name, and either of the following:
144 | - Amazon S3 URI of the multi-page PDF document that needs to be split
145 | - local multi-page PDF file that needs to be split
146 | 
147 | See below for the POST request input format.
148 | 
149 |     # sample_s3_request.py
150 |     dns_name = "{ALB_DNS}"
151 |     bucket_name = "{Bucket_name}"
152 |     endpoint_arn = "{Comprehend ARN}"
153 |     input_pdf_uri = "{document_s3_uri}"
154 | 
155 |     # sample_local_request.py
156 |     dns_name = "{ALB_DNS}"
157 |     bucket_name = "{Bucket_name}"
158 |     endpoint_arn = "{Comprehend ARN}"
159 |     input_pdf_path = "{local_doc_path}"
160 | 
161 | The Load Balancer's response is the output ZIP file's Amazon S3 URI as a string in the `output_zip_file_s3_uri` key-value pair:
162 | 
163 |     {
164 |       "output_zip_file_s3_uri": "s3://{file_of_split_docs}.zip"
165 |     }
166 | 
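The full request scripts are provided in workflow2_docsplitter/sample_request_folder. As a condensed sketch, the S3 variant reduces to a single `requests.post` call with the parameters passed as a query string; it assumes `dns_name` includes the scheme (for example `http://{ALB_DNS}`) and has no trailing slash.

    import requests

    dns_name = "http://{ALB_DNS}"        # include the scheme, no trailing slash
    bucket_name = "{Bucket_name}"
    endpoint_arn = "{Comprehend ARN}"
    input_pdf_uri = "{document_s3_uri}"

    # The /s3_file path splits a PDF that is already stored in S3
    url = (f"{dns_name}/s3_file?bucket_name={bucket_name}"
           f"&endpoint_arn={endpoint_arn}&input_pdf_uri={input_pdf_uri}")
    response = requests.post(url)
    print(response.text)  # JSON body containing output_zip_file_s3_uri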
167 | ## Workflow 3: Local Endpoint Builder and Doc Splitter
168 | 
169 | ### Description
170 | 
171 | Workflow 3 serves a similar purpose to Workflow 1 and Workflow 2 in generating an Amazon Comprehend endpoint; however, all processing is done on the user’s local machine to generate an Amazon Comprehend-compatible CSV file.
172 | This workflow was created for customers in highly regulated industries where persisting PDF documents on Amazon S3 may not be possible.
173 | 
174 | ### Local Endpoint Builder
175 | 
176 | #### Input:
177 | The absolute path of the local **folder** containing the training dataset files (these can be images, single-page PDFs, and/or multi-page PDFs).
178 | The structure of the folder must be the same as that shown in the "Input" section of "Workflow 1: Endpoint Builder".
179 | The name of an existing Amazon S3 bucket you have access to must also be provided, as this is where the training dataset CSV file is uploaded and accessed by Amazon Comprehend.
180 | 
181 | #### Output:
182 | The ARN (Amazon Resource Name) of the model's endpoint, which can be used to classify documents in Workflow 2.
183 | 
184 | #### Usage:
185 | Navigate to the local_endpointbuilder.py file in the workflow3_local directory.
186 | Run the file and follow the prompts to input your local dataset folder's absolute path and your Amazon S3 bucket's name.
187 | 
188 | When the program has finished running, the model's endpoint ARN is printed to the console.
189 | See the final lines in local_endpointbuilder.py:
190 | 
191 |     print("Welcome to the local endpoint builder!\n")
192 |     existing_bucket_name = input("Please enter the name of an existing S3 bucket you are able to access: ")
193 |     local_dataset_path = input("Please enter the absolute local file path of the dataset folder: ")
194 |     print("\nComprehend model endpoint ARN: " + main(existing_bucket_name, local_dataset_path))
195 | 
196 | ### Local Doc Splitter
197 | 
198 | #### Input:
199 | The endpoint ARN of the Amazon Comprehend custom classifier and one of the following:
200 | - Amazon S3 URI of the multi-page PDF document that needs to be split
201 | - local multi-page PDF file that needs to be split
202 | 
203 | #### Output:
204 | A local output folder is created. It contains the multi-class PDFs, each named after its class and containing the pages from the input PDF that belong to that class.
205 | 
206 | #### Usage:
207 | Navigate to the local_docsplitter.py file in the workflow3_local directory.
208 | Run the file and follow the prompts to input your endpoint ARN and either your input PDF's S3 URI or absolute local path.
209 | 
210 | When the program has finished running, the console displays the path of the output folder containing the multi-class PDFs.
211 | See the final lines in local_docsplitter.py:
212 | 
213 |     print("Welcome to the local document splitter!")
214 |     endpoint_arn = input("Please enter the ARN of the Comprehend classification endpoint: ")
215 | 
216 |     print("Would you like to split a local file or a file stored in S3?")
217 |     choice = input("Please enter 'local' or 's3': ")
218 |     if choice == 'local':
219 |         input_pdf_info = input("Please enter the absolute local file path of the input PDF: ")
220 |     elif choice == 's3':
221 |         input_pdf_info = input("Please enter the S3 URI of the input PDF: ")
222 | 
223 |     main(endpoint_arn, choice, input_pdf_info)
224 | 
225 | ## Security
226 | 
227 | See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.
228 | 
229 | ## License
230 | 
231 | This library is licensed under the MIT-0 License. See the LICENSE file.
232 | 233 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | amazon-textract-caller==0.0.13 2 | amazon-textract-prettyprinter==0.0.9 3 | amazon-textract-response-parser==0.1.14 4 | awslambdaric==2.0.0 5 | boto3==1.18.11 6 | botocore==1.21.11 7 | certifi==2021.5.30 8 | charset-normalizer==2.0.6 9 | filetype==1.0.8 10 | idna==3.2 11 | jmespath==0.10.0 12 | marshmallow==3.11.1 13 | pdf2image==1.16.0 14 | Pillow==9.0.1 15 | PyPDF2==1.27.5 16 | python-dateutil==2.8.2 17 | requests==2.26.0 18 | s3transfer==0.5.0 19 | simplejson==3.17.2 20 | six==1.16.0 21 | tabulate==0.8.9 22 | urllib3==1.26.6 23 | -------------------------------------------------------------------------------- /workflow1_endpointbuilder/sam-app/functions/function1_createtableandfunction3trigger/index.py: -------------------------------------------------------------------------------- 1 | import boto3 2 | import datetime 3 | from urllib.parse import urlparse 4 | 5 | 6 | def create_dataset_table(datetime_id, bucket_name): 7 | dynamodb_resource = boto3.resource('dynamodb') 8 | dynamodb_resource.create_table( 9 | TableName=f'DatasetCSVTable_{datetime_id}_{bucket_name}_', 10 | KeySchema=[{ 11 | 'AttributeName': 'objectKey', 12 | 'KeyType': 'HASH' 13 | }], 14 | AttributeDefinitions=[{ 15 | 'AttributeName': 'objectKey', 16 | 'AttributeType': 'S' 17 | }], 18 | BillingMode='PAY_PER_REQUEST', 19 | StreamSpecification={ 20 | 'StreamEnabled': True, 21 | 'StreamViewType': 'NEW_AND_OLD_IMAGES' 22 | }) 23 | 24 | 25 | def create_function3_trigger(stream_arn): 26 | lambda_client = boto3.client('lambda') 27 | lambda_client.create_event_source_mapping( 28 | EventSourceArn=stream_arn, 29 | FunctionName='EndpointBuilder-Function3-ProcessObjects', 30 | Enabled=True, 31 | BatchSize=1000, 32 | ParallelizationFactor=10, 33 | StartingPosition='TRIM_HORIZON', 34 | BisectBatchOnFunctionError=True, 35 | MaximumRetryAttempts=3, 36 | FunctionResponseTypes=[ 37 | 'ReportBatchItemFailures', 38 | ]) 39 | 40 | 41 | def lambda_handler(event, context): 42 | folder_uri = event["folder_uri"] 43 | bucket_name = urlparse(folder_uri).hostname 44 | datetime_id = datetime.datetime.now().strftime("%Y%m%d%H%M%S") 45 | 46 | create_dataset_table(datetime_id, bucket_name) 47 | 48 | dynamodb_client = boto3.client('dynamodb') 49 | while True: 50 | response = dynamodb_client.describe_table( 51 | TableName=f'DatasetCSVTable_{datetime_id}_{bucket_name}_') 52 | status = response["Table"]["TableStatus"] 53 | if status == "ACTIVE": 54 | stream_arn = response['Table']['LatestStreamArn'] 55 | break 56 | 57 | create_function3_trigger(stream_arn) 58 | 59 | return {'datetime_id': datetime_id, 'folder_uri': folder_uri} 60 | -------------------------------------------------------------------------------- /workflow1_endpointbuilder/sam-app/functions/function2_inserttablerowswithobjectkeys/index.py: -------------------------------------------------------------------------------- 1 | 2 | import boto3 3 | from urllib.parse import urlparse 4 | 5 | 6 | def insert_table_rows(folder_uri, datetime_id, bucket_name): 7 | s3_resource = boto3.resource('s3') 8 | dynamodb = boto3.client("dynamodb") 9 | folder_name = folder_uri.split(bucket_name + "/", 1)[1] 10 | bucket = s3_resource.Bucket(bucket_name) 11 | s3_client = boto3.client("s3") 12 | 13 | for obj in bucket.objects.filter(Prefix=folder_name): 14 | key = obj.key 15 | # Skip files that are not PDFs, JPEGs, or PNGs 
16 | object_content_type = s3_client.get_object( 17 | Bucket=bucket_name, 18 | Key=key 19 | )['ContentType'] 20 | valid_content_types = {'application/pdf', 'image/jpeg', 'image/png'} 21 | if object_content_type in valid_content_types: 22 | dynamodb.put_item( 23 | TableName=f'DatasetCSVTable_{datetime_id}_{bucket_name}_', 24 | Item={ 25 | 'objectKey': { 26 | 'S': key 27 | } 28 | } 29 | ) 30 | 31 | 32 | def lambda_handler(event, context): 33 | folder_uri = event["folder_uri"] 34 | datetime_id = event["datetime_id"] 35 | bucket_name = urlparse(folder_uri).hostname 36 | 37 | insert_table_rows(folder_uri, datetime_id, bucket_name) 38 | 39 | return { 40 | 'datetime_id': datetime_id, 41 | 'bucket_name': bucket_name, 42 | } 43 | -------------------------------------------------------------------------------- /workflow1_endpointbuilder/sam-app/functions/function3_processobjectsandupdaterows/Dockerfile: -------------------------------------------------------------------------------- 1 | 2 | # Define global args 3 | ARG FUNCTION_DIR="/" 4 | ARG RUNTIME_VERSION="3.8" 5 | ARG DISTRO_VERSION="3.12" 6 | 7 | # Stage 1 - bundle base image + runtime 8 | # Grab a fresh copy of the image and install GCC 9 | FROM python:${RUNTIME_VERSION}-alpine${DISTRO_VERSION} AS python-alpine 10 | 11 | # Install GCC for compiling and linking dependencies 12 | RUN apk add --no-cache \ 13 | libstdc++ 14 | 15 | 16 | # Stage 2 - build function and dependencies 17 | FROM python-alpine AS build-image 18 | 19 | # Install Poppler 20 | RUN apk add poppler-utils 21 | 22 | # Install Pillow dependencies 23 | RUN apk --no-cache add \ 24 | freetype-dev \ 25 | fribidi-dev \ 26 | harfbuzz-dev \ 27 | jpeg-dev \ 28 | lcms2-dev \ 29 | openjpeg-dev \ 30 | tcl-dev \ 31 | tiff-dev \ 32 | tk-dev \ 33 | zlib-dev 34 | 35 | # Install aws-lambda-cpp build dependencies 36 | RUN apk add --no-cache \ 37 | build-base \ 38 | libtool \ 39 | autoconf \ 40 | automake \ 41 | libexecinfo-dev \ 42 | make \ 43 | cmake \ 44 | libcurl 45 | 46 | # Include global args in this stage of the build 47 | ARG FUNCTION_DIR 48 | ARG RUNTIME_VERSION 49 | 50 | # Create function directory 51 | RUN mkdir -p ${FUNCTION_DIR} 52 | 53 | # Copy handler function 54 | COPY index.py requirements.txt ${FUNCTION_DIR} 55 | 56 | # Install the function's dependencies 57 | RUN python${RUNTIME_VERSION} -m pip install -r ${FUNCTION_DIR}requirements.txt --target ${FUNCTION_DIR} 58 | 59 | 60 | # Stage 3 - final runtime image 61 | 62 | # Grab a fresh copy of the Python image 63 | FROM python-alpine 64 | 65 | # Include global arg in this stage of the build 66 | ARG FUNCTION_DIR 67 | 68 | # Set working directory to function root directory 69 | WORKDIR ${FUNCTION_DIR} 70 | 71 | # Copy in the built dependencies 72 | COPY --from=build-image ${FUNCTION_DIR} ${FUNCTION_DIR} 73 | 74 | ENTRYPOINT [ "/usr/local/bin/python", "-m", "awslambdaric" ] 75 | CMD [ "index.lambda_handler" ] 76 | -------------------------------------------------------------------------------- /workflow1_endpointbuilder/sam-app/functions/function3_processobjectsandupdaterows/index.py: -------------------------------------------------------------------------------- 1 | 2 | import boto3 3 | from io import BytesIO 4 | from pdf2image import convert_from_bytes 5 | from textractcaller.t_call import call_textract 6 | from textractprettyprinter.t_pretty_print import get_lines_string 7 | 8 | 9 | def process_file(key, datetime_id, bucket_name): 10 | _class = key.split("/")[1] 11 | s3 = boto3.client('s3') 12 | dynamodb = 
boto3.client('dynamodb') 13 | 14 | s3_response_object = s3.get_object(Bucket=bucket_name, Key=key) 15 | object_content = s3_response_object['Body'].read() 16 | 17 | if key.endswith(".pdf"): 18 | images = convert_from_bytes(object_content) 19 | all_raw_text = "" 20 | for i, image in enumerate(images): 21 | image_text = call_textract_for_pdf(image) 22 | all_raw_text += image_text 23 | update_dynamodb_row(key, _class, all_raw_text, datetime_id, bucket_name, dynamodb) 24 | 25 | else: 26 | image_text = call_textract_for_image(object_content) 27 | update_dynamodb_row(key, _class, image_text, datetime_id, bucket_name, dynamodb) 28 | 29 | 30 | def call_textract_for_pdf(image): 31 | buf = BytesIO() 32 | image.save(buf, format='JPEG') 33 | byte_string = buf.getvalue() 34 | return call_textract_for_image(byte_string) 35 | 36 | 37 | def call_textract_for_image(object_content): 38 | textract_json = call_textract(input_document=object_content) 39 | return get_lines_string(textract_json=textract_json) 40 | 41 | 42 | def update_dynamodb_row(key, _class, raw_text, datetime_id, bucket_name, dynamodb): 43 | dynamodb.update_item( 44 | TableName=f'DatasetCSVTable_{datetime_id}_{bucket_name}_', 45 | Key={ 46 | 'objectKey': { 47 | 'S': key 48 | } 49 | }, 50 | ExpressionAttributeNames={ 51 | '#cls': 'class', 52 | '#txt': 'text', 53 | }, 54 | ExpressionAttributeValues={ 55 | ':c': { 56 | 'S': _class, 57 | }, 58 | ':t': { 59 | 'S': raw_text, 60 | }, 61 | }, 62 | UpdateExpression='SET #cls = :c, #txt = :t' 63 | ) 64 | 65 | 66 | def parse_info(event_source_arn): 67 | info = event_source_arn.split("_") 68 | datetime_id = info[1] 69 | bucket_name = info[2] 70 | return datetime_id, bucket_name 71 | 72 | 73 | def lambda_handler(event, context): 74 | 75 | for record in event["Records"]: 76 | if record["eventName"] == "INSERT": 77 | new_image = record["dynamodb"]["NewImage"] 78 | key = new_image["objectKey"]["S"] 79 | datetime_id, bucket_name = parse_info(record["eventSourceARN"]) 80 | 81 | process_file(key, datetime_id, bucket_name) 82 | -------------------------------------------------------------------------------- /workflow1_endpointbuilder/sam-app/functions/function3_processobjectsandupdaterows/requirements.txt: -------------------------------------------------------------------------------- 1 | amazon-textract-caller==0.0.13 2 | amazon-textract-prettyprinter==0.0.9 3 | awslambdaric==2.0.0 4 | certifi==2021.5.30 5 | charset-normalizer==2.0.6 6 | idna==3.2 7 | jmespath==0.10.0 8 | marshmallow==3.11.1 9 | pdf2image==1.16.0 10 | Pillow==9.0.1 11 | python-dateutil==2.8.2 12 | requests==2.26.0 13 | s3transfer==0.5.0 14 | six==1.16.0 15 | tabulate==0.8.9 16 | urllib3==1.26.6 17 | -------------------------------------------------------------------------------- /workflow1_endpointbuilder/sam-app/functions/function4_checktablestatus/index.py: -------------------------------------------------------------------------------- 1 | 2 | import boto3 3 | 4 | 5 | def check_table(datetime_id, bucket_name): 6 | dynamodb_resource = boto3.resource('dynamodb') 7 | table = dynamodb_resource.Table(f'DatasetCSVTable_{datetime_id}_{bucket_name}_') 8 | 9 | scan_kwargs = { 10 | 'Select': 'COUNT', 11 | 'FilterExpression': 'attribute_not_exists(#cls)', 12 | 'ExpressionAttributeNames': {'#cls': 'class'} 13 | } 14 | 15 | done = False 16 | start_key = None 17 | rows_not_filled = 0 18 | while not done: 19 | if start_key: 20 | scan_kwargs['ExclusiveStartKey'] = start_key 21 | response = table.scan(**scan_kwargs) 22 | rows_not_filled += response['Count'] 
23 | start_key = response.get('LastEvaluatedKey', None) 24 | done = start_key is None 25 | 26 | if rows_not_filled == 0: 27 | return True 28 | else: 29 | return False 30 | 31 | 32 | def lambda_handler(event, context): 33 | datetime_id = event["datetime_id"] 34 | bucket_name = event["bucket_name"] 35 | 36 | is_complete = check_table(datetime_id, bucket_name) 37 | 38 | return { 39 | 'is_complete': is_complete, 40 | 'datetime_id': datetime_id, 41 | 'bucket_name': bucket_name 42 | } 43 | -------------------------------------------------------------------------------- /workflow1_endpointbuilder/sam-app/functions/function5_createcsv/index.py: -------------------------------------------------------------------------------- 1 | 2 | import csv 3 | import boto3 4 | from io import StringIO 5 | 6 | 7 | def create_csv(datetime_id, bucket_name): 8 | dynamodb = boto3.resource('dynamodb') 9 | table = dynamodb.Table(f'DatasetCSVTable_{datetime_id}_{bucket_name}_') 10 | 11 | scan_kwargs = { 12 | 'ProjectionExpression': "#cls, #txt", 13 | 'ExpressionAttributeNames': {"#cls": "class", "#txt": "text"} 14 | } 15 | 16 | f = StringIO() 17 | writer = csv.writer(f) 18 | done = False 19 | start_key = None 20 | while not done: 21 | if start_key: 22 | scan_kwargs['ExclusiveStartKey'] = start_key 23 | response = table.scan(**scan_kwargs) 24 | documents = response.get('Items', []) 25 | for doc in documents: 26 | writer.writerow([doc['class'], doc['text']]) 27 | start_key = response.get('LastEvaluatedKey', None) 28 | done = start_key is None 29 | return f 30 | 31 | 32 | def lambda_handler(event, context): 33 | datetime_id = event["datetime_id"] 34 | bucket_name = event["bucket_name"] 35 | 36 | f = create_csv(datetime_id, bucket_name) 37 | s3 = boto3.client("s3") 38 | s3.put_object(Body=f.getvalue(), Bucket=bucket_name, Key=f"comprehend_dataset_{datetime_id}.csv") 39 | 40 | return { 41 | "datetime_id": datetime_id, 42 | "bucket_name": bucket_name 43 | } 44 | -------------------------------------------------------------------------------- /workflow1_endpointbuilder/sam-app/functions/function6_trainclassifier/index.py: -------------------------------------------------------------------------------- 1 | 2 | import os 3 | import boto3 4 | 5 | 6 | def train_classifier(datetime_id, bucket_name, role_arn, classifier_name): 7 | comprehend = boto3.client("comprehend") 8 | 9 | # create classifier with the CSV training dataset's S3 URI and the ARN of the IAM role 10 | create_response = comprehend.create_document_classifier( 11 | InputDataConfig={ 12 | 'S3Uri': f"s3://{bucket_name}/comprehend_dataset_{datetime_id}.csv" 13 | }, 14 | DataAccessRoleArn=role_arn, 15 | DocumentClassifierName=classifier_name, 16 | LanguageCode='en' 17 | ) 18 | 19 | classifier_arn = create_response['DocumentClassifierArn'] 20 | return classifier_arn 21 | 22 | 23 | def lambda_handler(event, context): 24 | datetime_id = event["datetime_id"] 25 | bucket_name = event["bucket_name"] 26 | 27 | classifier_name = f"Classifier-{datetime_id}" 28 | 29 | role_arn = os.environ.get('FUNCTION6_TRAIN_CLASSIFIER_COMPREHEND_ROLE_ARN') 30 | classifier_arn = train_classifier(datetime_id, bucket_name, role_arn, classifier_name) 31 | 32 | return { 33 | 'datetime_id': datetime_id, 34 | 'classifier_arn': classifier_arn 35 | } -------------------------------------------------------------------------------- /workflow1_endpointbuilder/sam-app/functions/function7_checkclassifierstatus/index.py: -------------------------------------------------------------------------------- 1 | 2 | 
import boto3 3 | 4 | 5 | def check_classifier(classifier_arn): 6 | comprehend = boto3.client("comprehend") 7 | describe_response = comprehend.describe_document_classifier( 8 | DocumentClassifierArn=classifier_arn 9 | ) 10 | 11 | is_trained = False 12 | status = describe_response['DocumentClassifierProperties']['Status'] 13 | if status != "TRAINED": 14 | if status == "IN_ERROR": 15 | message = describe_response["DocumentClassifierProperties"]["Message"] 16 | raise ValueError(f"The classifier is in error:", message) 17 | else: 18 | is_trained = True 19 | 20 | return is_trained 21 | 22 | 23 | def lambda_handler(event, context): 24 | datetime_id = event["datetime_id"] 25 | classifier_arn = event["classifier_arn"] 26 | 27 | is_trained = check_classifier(classifier_arn) 28 | 29 | return { 30 | 'is_trained': is_trained, 31 | 'datetime_id': datetime_id, 32 | 'classifier_arn': classifier_arn 33 | } 34 | -------------------------------------------------------------------------------- /workflow1_endpointbuilder/sam-app/functions/function8_buildendpoint/index.py: -------------------------------------------------------------------------------- 1 | 2 | import boto3 3 | 4 | 5 | def build_endpoint(classifier_arn, datetime_id): 6 | comprehend = boto3.client("comprehend") 7 | endpoint_response = comprehend.create_endpoint( 8 | EndpointName=f"Classifier-{datetime_id}", 9 | ModelArn=classifier_arn, 10 | DesiredInferenceUnits=10, 11 | ) 12 | endpoint_arn = endpoint_response["EndpointArn"] 13 | return endpoint_arn 14 | 15 | 16 | def lambda_handler(event, context): 17 | datetime_id = event["datetime_id"] 18 | classifier_arn = event["classifier_arn"] 19 | 20 | endpoint_arn = build_endpoint(classifier_arn, datetime_id) 21 | 22 | return { 23 | 'endpoint_arn': endpoint_arn 24 | } 25 | -------------------------------------------------------------------------------- /workflow1_endpointbuilder/sam-app/functions/function9_checkendpointstatus/index.py: -------------------------------------------------------------------------------- 1 | 2 | import boto3 3 | 4 | 5 | def check_endpoint(endpoint_arn): 6 | comprehend = boto3.client("comprehend") 7 | describe_response = comprehend.describe_endpoint( 8 | EndpointArn=endpoint_arn 9 | ) 10 | 11 | endpoint_is_complete = False 12 | status = describe_response["EndpointProperties"]["Status"] 13 | if status != "IN_SERVICE": 14 | if status == "FAILED": 15 | message = describe_response["EndpointProperties"]["Message"] 16 | raise ValueError(f"The endpoint is in error:", message) 17 | else: 18 | endpoint_is_complete = True 19 | 20 | return endpoint_is_complete 21 | 22 | 23 | def lambda_handler(event, context): 24 | endpoint_arn = event["endpoint_arn"] 25 | 26 | endpoint_is_complete = check_endpoint(endpoint_arn) 27 | 28 | return { 29 | 'endpoint_is_complete': endpoint_is_complete, 30 | 'endpoint_arn': endpoint_arn 31 | } 32 | -------------------------------------------------------------------------------- /workflow1_endpointbuilder/sam-app/statemachine/endpointbuilder.asl.json: -------------------------------------------------------------------------------- 1 | { 2 | "Comment": "This is your state machine", 3 | "StartAt": "Create Table & DynamoDB Stream", 4 | "States": { 5 | "Create Table & DynamoDB Stream": { 6 | "Type": "Task", 7 | "Resource": "arn:aws:states:::lambda:invoke", 8 | "OutputPath": "$.Payload", 9 | "Parameters": { 10 | "Payload.$": "$", 11 | "FunctionName": "${CreateTableAndFunction3TriggerFunction}" 12 | }, 13 | "Retry": [ 14 | { 15 | "ErrorEquals": [ 16 | 
"Lambda.ServiceException", 17 | "Lambda.AWSLambdaException", 18 | "Lambda.SdkClientException" 19 | ], 20 | "IntervalSeconds": 2, 21 | "MaxAttempts": 6, 22 | "BackoffRate": 2 23 | } 24 | ], 25 | "Next": "Insert Table Rows with Object Keys" 26 | }, 27 | "Insert Table Rows with Object Keys": { 28 | "Type": "Task", 29 | "Resource": "arn:aws:states:::lambda:invoke", 30 | "OutputPath": "$.Payload", 31 | "Parameters": { 32 | "Payload.$": "$", 33 | "FunctionName": "${InsertTableRowsWithObjectKeysFunction2}" 34 | }, 35 | "Retry": [ 36 | { 37 | "ErrorEquals": [ 38 | "Lambda.ServiceException", 39 | "Lambda.AWSLambdaException", 40 | "Lambda.SdkClientException" 41 | ], 42 | "IntervalSeconds": 2, 43 | "MaxAttempts": 6, 44 | "BackoffRate": 2 45 | } 46 | ], 47 | "Next": "Check Table Status" 48 | }, 49 | "Check Table Status": { 50 | "Type": "Task", 51 | "Resource": "arn:aws:states:::lambda:invoke", 52 | "OutputPath": "$.Payload", 53 | "Parameters": { 54 | "Payload.$": "$", 55 | "FunctionName": "${CheckTableStatusFunction4}" 56 | }, 57 | "Retry": [ 58 | { 59 | "ErrorEquals": [ 60 | "Lambda.ServiceException", 61 | "Lambda.AWSLambdaException", 62 | "Lambda.SdkClientException" 63 | ], 64 | "IntervalSeconds": 2, 65 | "MaxAttempts": 6, 66 | "BackoffRate": 2 67 | } 68 | ], 69 | "Next": "Is Table Complete?" 70 | }, 71 | "Is Table Complete?": { 72 | "Type": "Choice", 73 | "Choices": [ 74 | { 75 | "Variable": "$.is_complete", 76 | "BooleanEquals": true, 77 | "Next": "Create CSV" 78 | } 79 | ], 80 | "Default": "Wait for Object Processing" 81 | }, 82 | "Wait for Object Processing": { 83 | "Type": "Wait", 84 | "Seconds": 300, 85 | "Next": "Check Table Status" 86 | }, 87 | "Create CSV": { 88 | "Type": "Task", 89 | "Resource": "arn:aws:states:::lambda:invoke", 90 | "OutputPath": "$.Payload", 91 | "Parameters": { 92 | "Payload.$": "$", 93 | "FunctionName": "${CreateCSVFunction5}" 94 | }, 95 | "Retry": [ 96 | { 97 | "ErrorEquals": [ 98 | "Lambda.ServiceException", 99 | "Lambda.AWSLambdaException", 100 | "Lambda.SdkClientException" 101 | ], 102 | "IntervalSeconds": 2, 103 | "MaxAttempts": 6, 104 | "BackoffRate": 2 105 | } 106 | ], 107 | "Next": "Train Classifier" 108 | }, 109 | "Train Classifier": { 110 | "Type": "Task", 111 | "Resource": "arn:aws:states:::lambda:invoke", 112 | "OutputPath": "$.Payload", 113 | "Parameters": { 114 | "Payload.$": "$", 115 | "FunctionName": "${TrainClassifierFunction6}" 116 | }, 117 | "Retry": [ 118 | { 119 | "ErrorEquals": [ 120 | "Lambda.ServiceException", 121 | "Lambda.AWSLambdaException", 122 | "Lambda.SdkClientException" 123 | ], 124 | "IntervalSeconds": 2, 125 | "MaxAttempts": 6, 126 | "BackoffRate": 2 127 | } 128 | ], 129 | "Next": "Check Classifier Status" 130 | }, 131 | "Check Classifier Status": { 132 | "Type": "Task", 133 | "Resource": "arn:aws:states:::lambda:invoke", 134 | "OutputPath": "$.Payload", 135 | "Parameters": { 136 | "Payload.$": "$", 137 | "FunctionName": "${CheckClassifierStatusFunction7}" 138 | }, 139 | "Retry": [ 140 | { 141 | "ErrorEquals": [ 142 | "Lambda.ServiceException", 143 | "Lambda.AWSLambdaException", 144 | "Lambda.SdkClientException" 145 | ], 146 | "IntervalSeconds": 2, 147 | "MaxAttempts": 6, 148 | "BackoffRate": 2 149 | } 150 | ], 151 | "Next": "Is Training Complete?" 
152 | }, 153 | "Is Training Complete?": { 154 | "Type": "Choice", 155 | "Choices": [ 156 | { 157 | "Variable": "$.is_trained", 158 | "BooleanEquals": true, 159 | "Next": "Build Endpoint" 160 | } 161 | ], 162 | "Default": "Wait for Training" 163 | }, 164 | "Build Endpoint": { 165 | "Type": "Task", 166 | "Resource": "arn:aws:states:::lambda:invoke", 167 | "OutputPath": "$.Payload", 168 | "Parameters": { 169 | "Payload.$": "$", 170 | "FunctionName": "${BuildEndpointFunction8}" 171 | }, 172 | "Retry": [ 173 | { 174 | "ErrorEquals": [ 175 | "Lambda.ServiceException", 176 | "Lambda.AWSLambdaException", 177 | "Lambda.SdkClientException" 178 | ], 179 | "IntervalSeconds": 2, 180 | "MaxAttempts": 6, 181 | "BackoffRate": 2 182 | } 183 | ], 184 | "Next": "Check Endpoint Status" 185 | }, 186 | "Check Endpoint Status": { 187 | "Type": "Task", 188 | "Resource": "arn:aws:states:::lambda:invoke", 189 | "OutputPath": "$.Payload", 190 | "Parameters": { 191 | "Payload.$": "$", 192 | "FunctionName": "${CheckEndpointStatusFunction9}" 193 | }, 194 | "Retry": [ 195 | { 196 | "ErrorEquals": [ 197 | "Lambda.ServiceException", 198 | "Lambda.AWSLambdaException", 199 | "Lambda.SdkClientException" 200 | ], 201 | "IntervalSeconds": 2, 202 | "MaxAttempts": 6, 203 | "BackoffRate": 2 204 | } 205 | ], 206 | "Next": "Is Endpoint Ready?" 207 | }, 208 | "Is Endpoint Ready?": { 209 | "Type": "Choice", 210 | "Choices": [ 211 | { 212 | "Variable": "$.endpoint_is_complete", 213 | "BooleanEquals": true, 214 | "Next": "Success" 215 | } 216 | ], 217 | "Default": "Wait for Endpoint" 218 | }, 219 | "Success": { 220 | "Type": "Succeed" 221 | }, 222 | "Wait for Endpoint": { 223 | "Type": "Wait", 224 | "Seconds": 300, 225 | "Next": "Check Endpoint Status" 226 | }, 227 | "Wait for Training": { 228 | "Type": "Wait", 229 | "Seconds": 600, 230 | "Next": "Check Classifier Status" 231 | } 232 | } 233 | } -------------------------------------------------------------------------------- /workflow1_endpointbuilder/sam-app/template.yaml: -------------------------------------------------------------------------------- 1 | Transform: 'AWS::Serverless-2016-10-31' 2 | 3 | Description: SAM template for endpointbuilder 4 | 5 | Resources: 6 | EndpointBuilderStateMachine: 7 | Type: AWS::Serverless::StateMachine 8 | Properties: 9 | DefinitionUri: statemachine/endpointbuilder.asl.json 10 | DefinitionSubstitutions: 11 | CreateTableAndFunction3TriggerFunction: !GetAtt EndpointBuilderFunction1CreateTableAndFunction3Trigger.Arn 12 | InsertTableRowsWithObjectKeysFunction2: !GetAtt EndpointBuilderFunction2InsertTableRowsWithObjectKeys.Arn 13 | CheckTableStatusFunction4: !GetAtt EndpointBuilderFunction4CheckTableStatus.Arn 14 | CreateCSVFunction5: !GetAtt EndpointBuilderFunction5CreateCSV.Arn 15 | TrainClassifierFunction6: !GetAtt EndpointBuilderFunction6TrainClassifier.Arn 16 | CheckClassifierStatusFunction7: !GetAtt EndpointBuilderFunction7CheckClassifierStatus.Arn 17 | BuildEndpointFunction8: !GetAtt EndpointBuilderFunction8BuildEndpoint.Arn 18 | CheckEndpointStatusFunction9: !GetAtt EndpointBuilderFunction9CheckEndpointStatus.Arn 19 | Role: !GetAtt EndpointBuilderRole.Arn 20 | 21 | EndpointBuilderFunction1CreateTableAndFunction3Trigger: 22 | Type: AWS::Serverless::Function 23 | Properties: 24 | CodeUri: functions/function1_createtableandfunction3trigger/ 25 | Role: !GetAtt EndpointBuilderRole.Arn 26 | Description: Creates DynamoDB table and trigger for EndpointBuilderFunction3 27 | Timeout: 180 28 | MemorySize: 128 29 | Runtime: python3.8 30 | Handler: 
index.lambda_handler 31 | 32 | EndpointBuilderFunction2InsertTableRowsWithObjectKeys: 33 | Type: 'AWS::Serverless::Function' 34 | Properties: 35 | CodeUri: functions/function2_inserttablerowswithobjectkeys/ 36 | Role: !GetAtt EndpointBuilderRole.Arn 37 | Description: Inserts rows into DynamoDB table using the S3 object keys as the primary key 38 | Timeout: 900 39 | MemorySize: 10240 40 | Runtime: python3.8 41 | Handler: index.lambda_handler 42 | 43 | EndpointBuilderFunction3ProcessObjectsAndUpdateRows: 44 | Type: 'AWS::Serverless::Function' 45 | Properties: 46 | FunctionName: EndpointBuilder-Function3-ProcessObjects 47 | Description: Processes S3 objects via DynamoDB insert row trigger; uses object key to extract file text and class; updates table row; its FunctionName is used inside Function1 code to create the DynamoDB event source mapping 48 | Role: !GetAtt EndpointBuilderRole.Arn 49 | Timeout: 900 50 | MemorySize: 10240 51 | PackageType: Image 52 | ImageConfig: 53 | Command: [ "index.lambda_handler" ] 54 | Metadata: 55 | Dockerfile: Dockerfile 56 | DockerContext: functions/function3_processobjectsandupdaterows/ 57 | DockerTag: v1 58 | 59 | EndpointBuilderFunction4CheckTableStatus: 60 | Type: 'AWS::Serverless::Function' 61 | Properties: 62 | CodeUri: functions/function4_checktablestatus/ 63 | Role: !GetAtt EndpointBuilderRole.Arn 64 | Description: Check DynamoDB table status--all rows must have existing class and text attributes (provided by EndpointBuilderFunction3ProcessObjectsAndUpdateRows) 65 | Timeout: 900 66 | MemorySize: 10240 67 | Runtime: python3.8 68 | Handler: index.lambda_handler 69 | 70 | EndpointBuilderFunction5CreateCSV: 71 | Type: 'AWS::Serverless::Function' 72 | Properties: 73 | CodeUri: functions/function5_createcsv/ 74 | Role: !GetAtt EndpointBuilderRole.Arn 75 | Description: Creates CSV file from DynamoDB table and uploads to S3 bucket 76 | Timeout: 900 77 | MemorySize: 10240 78 | Runtime: python3.8 79 | Handler: index.lambda_handler 80 | 81 | EndpointBuilderFunction6TrainClassifier: 82 | Type: 'AWS::Serverless::Function' 83 | Properties: 84 | CodeUri: functions/function6_trainclassifier/ 85 | Role: !GetAtt EndpointBuilderRole.Arn 86 | Description: Trains Comprehend classifier using the CSV created in EndpointBuilderFunction5CreateCSV 87 | Timeout: 300 88 | MemorySize: 2048 89 | Runtime: python3.8 90 | Handler: index.lambda_handler 91 | Environment: 92 | Variables: 93 | FUNCTION6_TRAIN_CLASSIFIER_COMPREHEND_ROLE_ARN: !GetAtt Function6ComprehendRole.Arn 94 | 95 | EndpointBuilderFunction7CheckClassifierStatus: 96 | Type: 'AWS::Serverless::Function' 97 | Properties: 98 | CodeUri: functions/function7_checkclassifierstatus/ 99 | Role: !GetAtt EndpointBuilderRole.Arn 100 | Description: Checks Comprehend classifier status to verify if it has finished training 101 | Timeout: 60 102 | MemorySize: 128 103 | Runtime: python3.8 104 | Handler: index.lambda_handler 105 | 106 | EndpointBuilderFunction8BuildEndpoint: 107 | Type: 'AWS::Serverless::Function' 108 | Properties: 109 | CodeUri: functions/function8_buildendpoint/ 110 | Role: !GetAtt EndpointBuilderRole.Arn 111 | Description: Creates Comprehend endpoint from trained classifier 112 | Timeout: 60 113 | MemorySize: 128 114 | Runtime: python3.8 115 | Handler: index.lambda_handler 116 | 117 | EndpointBuilderFunction9CheckEndpointStatus: 118 | Type: 'AWS::Serverless::Function' 119 | Properties: 120 | CodeUri: functions/function9_checkendpointstatus/ 121 | Role: !GetAtt EndpointBuilderRole.Arn 122 | Description: Checks Comprehend 
endpoint status to verify if it has been created 123 | Timeout: 60 124 | MemorySize: 128 125 | Runtime: python3.8 126 | Handler: index.lambda_handler 127 | 128 | EndpointBuilderRole: 129 | Type: 'AWS::IAM::Role' 130 | Properties: 131 | AssumeRolePolicyDocument: 132 | Statement: 133 | - Effect: Allow 134 | Principal: 135 | Service: 136 | - lambda.amazonaws.com 137 | - states.amazonaws.com 138 | Action: 139 | - 'sts:AssumeRole' 140 | Description: Role executes Lambda functions 141 | MaxSessionDuration: 43200 142 | Policies: 143 | - PolicyName: root 144 | PolicyDocument: 145 | Version: 2012-10-17 146 | Statement: 147 | - Effect: Allow 148 | Action: 149 | - 's3:PutObject' 150 | - 's3:GetObject' 151 | - 's3:ListBucket' 152 | - 'iam:PassRole' 153 | - 'textract:*' 154 | - 'dynamodb:*' 155 | - 'lambda:*' 156 | - 'comprehend:*' 157 | - 'states:*' 158 | - 'logs:*' 159 | Resource: '*' 160 | 161 | Function6ComprehendRole: 162 | Type: 'AWS::IAM::Role' 163 | Properties: 164 | AssumeRolePolicyDocument: 165 | Statement: 166 | - Effect: Allow 167 | Principal: 168 | Service: 169 | - comprehend.amazonaws.com 170 | Action: 171 | - 'sts:AssumeRole' 172 | Description: Service role for Comprehend classifier 173 | MaxSessionDuration: 43200 174 | Policies: 175 | - PolicyName: root 176 | PolicyDocument: 177 | Version: 2012-10-17 178 | Statement: 179 | - Effect: Allow 180 | Action: 181 | - 's3:PutObject' 182 | - 's3:GetObject' 183 | - 's3:ListBucket' 184 | Resource: '*' 185 | 186 | Outputs: 187 | StateMachineArn: 188 | Description: Arn of the EndpointBuilder State Machine 189 | Value: !GetAtt EndpointBuilderStateMachine.Arn 190 | Export: 191 | Name: State-Machine-Arn 192 | -------------------------------------------------------------------------------- /workflow2_docsplitter/sam-app/functions/docsplitter_function/Dockerfile: -------------------------------------------------------------------------------- 1 | 2 | FROM public.ecr.aws/lambda/python:3.8 3 | 4 | RUN yum update -y 5 | RUN yum install poppler-utils -y 6 | RUN yum install -y \ 7 | libtiff-devel libjpeg-devel openjpeg2-devel zlib-devel \ 8 | freetype-devel lcms2-devel libwebp-devel tcl-devel tk-devel \ 9 | harfbuzz-devel fribidi-devel libraqm-devel libimagequant-devel libxcb-devel 10 | 11 | COPY index.py ${LAMBDA_TASK_ROOT} 12 | COPY requirements.txt . 
13 | RUN pip3 install -r requirements.txt --target "${LAMBDA_TASK_ROOT}" 14 | 15 | CMD [ "index.lambda_handler" ] 16 | -------------------------------------------------------------------------------- /workflow2_docsplitter/sam-app/functions/docsplitter_function/index.py: -------------------------------------------------------------------------------- 1 | 2 | import json 3 | import base64 4 | import boto3 5 | import zipfile 6 | from io import BytesIO 7 | from datetime import datetime 8 | from pdf2image import convert_from_bytes 9 | from PyPDF2 import PdfFileReader, PdfFileWriter 10 | from textractcaller.t_call import call_textract 11 | from textractprettyprinter.t_pretty_print import get_lines_string 12 | 13 | 14 | def call_textract_on_image(textract, image, i): 15 | # return JSON response containing the text extracted from the image 16 | print(f"Inputting image {i} into Textract") 17 | buf = BytesIO() 18 | image.save(buf, format='JPEG') 19 | byte_string = buf.getvalue() 20 | return call_textract(input_document=byte_string, boto3_textract_client=textract) 21 | 22 | 23 | def call_comprehend(text, comprehend, endpoint_arn): 24 | # send in raw text for a document as well as the Comprehend custom classification model ARN 25 | # returns JSON response containing the document's predicted class 26 | print("Inputting text into Comprehend model") 27 | return comprehend.classify_document( 28 | Text=text, 29 | EndpointArn=endpoint_arn 30 | ) 31 | 32 | 33 | def add_page_to_class(i, _class, pages_by_class): 34 | # appends a page number (integer) to list of page numbers (value in key-value pair) 35 | # stores information on how the original multi-page input PDF file is divided by class 36 | # stores page numbers in order, so the final outputted multi-class PDF pages will be in order 37 | if _class in pages_by_class: 38 | pages_by_class[_class].append(i) 39 | else: 40 | pages_by_class[_class] = [i] 41 | print(f"Added page {i} to {_class}\n") 42 | 43 | 44 | def create_output_pdfs(input_pdf_content, pages_by_class): 45 | # loops through each class in the pages_by_class dictionary to get all of the input PDF page numbers 46 | # creates new PDF for each class using the corresponding input PDF's pages 47 | 48 | input_pdf_buffer = BytesIO(input_pdf_content) 49 | input_pdf = PdfFileReader(input_pdf_buffer, strict=False) 50 | output_zip_buffer = BytesIO() 51 | 52 | with zipfile.ZipFile(output_zip_buffer, "w") as zip_archive: 53 | for _class in pages_by_class: 54 | output = PdfFileWriter() 55 | page_numbers = pages_by_class[_class] 56 | 57 | for page_num in page_numbers: 58 | output.addPage(input_pdf.getPage(page_num)) 59 | 60 | output_buffer = BytesIO() 61 | output.write(output_buffer) 62 | 63 | with zip_archive.open(f"{_class}.pdf", 'w') as output_pdf: 64 | output_pdf.write(output_buffer.getvalue()) 65 | 66 | print(f"Created PDF for {_class}") 67 | 68 | return output_zip_buffer 69 | 70 | 71 | def split_input_pdf_by_class(input_pdf_content, endpoint_arn, _id): 72 | # loops through each page of the inputted multi-page PDF 73 | # converts single-page PDF into an image and uploads it to the S3 bucket 74 | # image in S3 is inputted into the Textract API; text is extracted, JSON is parsed 75 | # raw text is inputted into the Comprehend model API using its endpoint ARN 76 | # JSON response is parsed to find the predicted class 77 | # the input PDF's page number is assigned to the predicted class in the pages_by_class dictionary 78 | textract = boto3.client('textract') 79 | comprehend = boto3.client('comprehend') 80 | 
pages_by_class = {} 81 | 82 | # converts PDF into images 83 | images = convert_from_bytes(input_pdf_content) 84 | 85 | # process each image 86 | for i, image in enumerate(images): 87 | textract_response = call_textract_on_image(textract, image, i) 88 | raw_text = get_lines_string(textract_json=textract_response) 89 | comprehend_response = call_comprehend(raw_text, comprehend, endpoint_arn) 90 | _class = comprehend_response['Classes'][0]['Name'] 91 | add_page_to_class(i, _class, pages_by_class) 92 | 93 | print("Input PDF has been split up and classified\n") 94 | return pages_by_class 95 | 96 | 97 | def lambda_handler(event, context): 98 | if event['path'] == '/': 99 | return { 100 | "statusCode": 200, 101 | "statusDescription": "200 OK", 102 | "isBase64Encoded": False, 103 | "headers": { 104 | "Content-Type": "text/html" 105 | }, 106 | "body": "This is the Document Splitter API." 107 | } 108 | 109 | else: 110 | request_body = event['queryStringParameters'] 111 | endpoint_arn = request_body['endpoint_arn'] 112 | bucket_name = request_body['bucket_name'] 113 | 114 | _id = datetime.now().strftime("%Y%m%d%H%M%S") 115 | s3 = boto3.client('s3') 116 | 117 | if event['path'] == '/s3_file': 118 | input_pdf_uri = request_body['input_pdf_uri'] 119 | input_pdf_key = input_pdf_uri.split(bucket_name + "/", 1)[1] 120 | s3_response_object = s3.get_object(Bucket=bucket_name, Key=input_pdf_key) 121 | input_pdf_content = s3_response_object['Body'].read() 122 | 123 | elif event['path'] == '/local_file': 124 | encoded_data = event['body'] 125 | decoded_data = base64.standard_b64decode(encoded_data) 126 | input_pdf_content = b"%PDF" + decoded_data.split(b"\r\n\r\n%PDF", 1)[1] 127 | 128 | # pages_by_class is a dictionary 129 | # key is class name; value is list of page numbers belonging to the key class 130 | pages_by_class = split_input_pdf_by_class(input_pdf_content, endpoint_arn, _id) 131 | 132 | output_zip_buffer = create_output_pdfs(input_pdf_content, pages_by_class) 133 | 134 | output_key_name = f"workflow2_output_documents_{_id}.zip" 135 | s3.put_object(Body=output_zip_buffer.getvalue(), Bucket=bucket_name, Key=output_key_name, ContentType='application/zip') 136 | 137 | output_zip_file_s3_uri = f"s3://{bucket_name}/{output_key_name}" 138 | return { 139 | "statusCode": 200, 140 | "statusDescription": "200 OK", 141 | "isBase64Encoded": False, 142 | "headers": { 143 | "Content-Type": "text/html" 144 | }, 145 | "body": json.dumps( 146 | { 147 | 'output_zip_file_s3_uri': output_zip_file_s3_uri 148 | } 149 | ) 150 | } 151 | -------------------------------------------------------------------------------- /workflow2_docsplitter/sam-app/functions/docsplitter_function/requirements.txt: -------------------------------------------------------------------------------- 1 | amazon-textract-caller==0.0.13 2 | amazon-textract-prettyprinter==0.0.9 3 | awslambdaric==2.0.0 4 | certifi==2021.5.30 5 | charset-normalizer==2.0.6 6 | filetype==1.0.8 7 | idna==3.2 8 | jmespath==0.10.0 9 | marshmallow==3.11.1 10 | pdf2image==1.16.0 11 | Pillow==9.3.0 12 | PyPDF2==1.27.5 13 | python-dateutil==2.8.2 14 | requests==2.26.0 15 | s3transfer==0.5.0 16 | simplejson==3.17.2 17 | six==1.16.0 18 | tabulate==0.8.9 19 | urllib3==1.26.6 20 | -------------------------------------------------------------------------------- /workflow2_docsplitter/sam-app/functions/getalbinfo_function/index.py: -------------------------------------------------------------------------------- 1 | 2 | from __future__ import print_function 3 | import urllib3 4 | import 
json 5 | import boto3 6 | 7 | 8 | http = urllib3.PoolManager() 9 | 10 | def send(event, context, responseStatus, responseData, physicalResourceId=None, noEcho=False, reason=None): 11 | responseUrl = event['ResponseURL'] 12 | 13 | print(responseUrl) 14 | 15 | responseBody = { 16 | 'Status': responseStatus, 17 | 'Reason': reason or "See the details in CloudWatch Log Stream: {}".format(context.log_stream_name), 18 | 'PhysicalResourceId': physicalResourceId or context.log_stream_name, 19 | 'StackId': event['StackId'], 20 | 'RequestId': event['RequestId'], 21 | 'LogicalResourceId': event['LogicalResourceId'], 22 | 'NoEcho': noEcho, 23 | 'Data': responseData 24 | } 25 | 26 | json_responseBody = json.dumps(responseBody) 27 | 28 | print("Response body:") 29 | print(json_responseBody) 30 | 31 | headers = { 32 | 'content-type': '', 33 | 'content-length': str(len(json_responseBody)) 34 | } 35 | 36 | try: 37 | response = http.request('PUT', responseUrl, headers=headers, body=json_responseBody) 38 | print("Status code:", response.status) 39 | 40 | 41 | except Exception as e: 42 | 43 | print("send(..) failed executing http.request(..):", e) 44 | 45 | 46 | def lambda_handler(event, context): 47 | ec2 = boto3.client('ec2') 48 | vpc_response = ec2.describe_vpcs( 49 | Filters=[ 50 | { 51 | 'Name': 'is-default', 52 | 'Values': [ 53 | 'true' 54 | ] 55 | } 56 | ] 57 | ) 58 | vpc_id = vpc_response['Vpcs'][0]['VpcId'] 59 | 60 | subnet_response = ec2.describe_subnets( 61 | Filters=[ 62 | { 63 | 'Name': 'vpc-id', 64 | 'Values': [ 65 | vpc_id 66 | ] 67 | } 68 | ] 69 | ) 70 | subnet_ids = [subnet['SubnetId'] for subnet in subnet_response['Subnets']] 71 | 72 | response_data = { 73 | 'VpcId': vpc_id, 74 | 'Subnets': subnet_ids 75 | } 76 | 77 | return send(event, context, "SUCCESS", response_data) 78 | -------------------------------------------------------------------------------- /workflow2_docsplitter/sam-app/template.yaml: -------------------------------------------------------------------------------- 1 | Transform: 'AWS::Serverless-2016-10-31' 2 | 3 | Description: SAM template for docsplitter 4 | 5 | Resources: 6 | GetALBInfoFunction: 7 | Type: 'AWS::Serverless::Function' 8 | Properties: 9 | CodeUri: functions/getalbinfo_function/ 10 | Role: !GetAtt DocSplitterRole.Arn 11 | Description: Uses AWS EC2 API to obtain VpcId and Subnet Ids required for creating the ALB 12 | Runtime: python3.8 13 | Handler: index.lambda_handler 14 | Timeout: 120 15 | MemorySize: 256 16 | 17 | GetALBInfoFunctionInvoke: 18 | Type: AWS::CloudFormation::CustomResource 19 | Version: "1.0" 20 | Properties: 21 | ServiceToken: !GetAtt GetALBInfoFunction.Arn 22 | 23 | DocSplitterFunction: 24 | Type: 'AWS::Serverless::Function' 25 | Properties: 26 | Description: Splits input PDF based on each page's class; used as ALB target 27 | Role: !GetAtt DocSplitterRole.Arn 28 | Timeout: 900 29 | MemorySize: 10240 30 | PackageType: Image 31 | ImageConfig: 32 | Command: [ "index.lambda_handler" ] 33 | Metadata: 34 | Dockerfile: Dockerfile 35 | DockerContext: functions/docsplitter_function/ 36 | DockerTag: v1 37 | 38 | LoadBalancer: 39 | Type: AWS::ElasticLoadBalancingV2::LoadBalancer 40 | Properties: 41 | SecurityGroups: 42 | - !GetAtt SecurityGroup.GroupId 43 | Subnets: !GetAtt GetALBInfoFunctionInvoke.Subnets 44 | Type: application 45 | 46 | Listener: 47 | Type: AWS::ElasticLoadBalancingV2::Listener 48 | Properties: 49 | DefaultActions: 50 | - Type: forward 51 | TargetGroupArn: !Ref TargetGroup 52 | LoadBalancerArn: !Ref LoadBalancer 53 | Port: 80 54 | 
Protocol: HTTP 55 | 56 | LambdaInvokePermission: 57 | Type: AWS::Lambda::Permission 58 | Properties: 59 | FunctionName: !GetAtt DocSplitterFunction.Arn 60 | Action: 'lambda:InvokeFunction' 61 | Principal: elasticloadbalancing.amazonaws.com 62 | 63 | TargetGroup: 64 | Type: AWS::ElasticLoadBalancingV2::TargetGroup 65 | DependsOn: 66 | - LambdaInvokePermission 67 | Properties: 68 | HealthCheckEnabled: false 69 | Name: DocSplitterTargetGroup 70 | TargetType: lambda 71 | Targets: 72 | - Id: !GetAtt DocSplitterFunction.Arn 73 | 74 | SecurityGroup: 75 | Type: AWS::EC2::SecurityGroup 76 | Properties: 77 | VpcId: !GetAtt GetALBInfoFunctionInvoke.VpcId 78 | GroupDescription: security group for DocSplitter application load balancer 79 | SecurityGroupIngress: 80 | - IpProtocol: tcp 81 | FromPort: 80 82 | ToPort: 80 83 | CidrIp: 0.0.0.0/0 84 | 85 | DocSplitterRole: 86 | Type: 'AWS::IAM::Role' 87 | Properties: 88 | AssumeRolePolicyDocument: 89 | Statement: 90 | - Effect: Allow 91 | Principal: 92 | Service: 93 | - lambda.amazonaws.com 94 | Action: 95 | - 'sts:AssumeRole' 96 | Description: Role executes Lambda functions 97 | MaxSessionDuration: 43200 98 | Policies: 99 | - PolicyName: root 100 | PolicyDocument: 101 | Version: 2012-10-17 102 | Statement: 103 | - Effect: Allow 104 | Action: 105 | - 's3:PutObject' 106 | - 's3:GetObject' 107 | - 's3:ListBucket' 108 | - 'textract:*' 109 | - 'lambda:*' 110 | - 'comprehend:*' 111 | - 'logs:*' 112 | - 'ec2:DescribeSubnets' 113 | - 'ec2:DescribeVpcs' 114 | Resource: '*' 115 | 116 | Outputs: 117 | LoadBalancerDnsName: 118 | Description: DNS Name of the DocSplitter application load balancer with the Lambda target group. 119 | Value: !GetAtt LoadBalancer.DNSName 120 | -------------------------------------------------------------------------------- /workflow2_docsplitter/sample_request_folder/sample_local_request.py: -------------------------------------------------------------------------------- 1 | 2 | import requests 3 | 4 | def split_local_pdf(dns_name, input_pdf_path, bucket_name, endpoint_arn): 5 | if dns_name.endswith("/"): 6 | path = "local_file" 7 | else: 8 | path = "/local_file" 9 | 10 | with open(input_pdf_path, "rb") as file: 11 | files = { 12 | "input_pdf_file": file 13 | } 14 | url = dns_name + f"{path}?bucket_name={bucket_name}&endpoint_arn={endpoint_arn}" 15 | response = requests.post(url, files=files) 16 | print(response.text) 17 | 18 | 19 | if __name__ == "__main__": 20 | dns_name = "Enter your dns_name" 21 | bucket_name = "Enter your bucket_name " 22 | endpoint_arn = "Enter your endpoint arn" 23 | 24 | input_pdf_path = "~/sample.pdf" 25 | 26 | split_local_pdf(dns_name, input_pdf_path, bucket_name, endpoint_arn) 27 | -------------------------------------------------------------------------------- /workflow2_docsplitter/sample_request_folder/sample_s3_request.py: -------------------------------------------------------------------------------- 1 | 2 | import requests 3 | 4 | def split_s3_pdf(dns_name, input_pdf_uri, bucket_name, endpoint_arn): 5 | if dns_name.endswith("/"): 6 | path = "s3_file" 7 | else: 8 | path = "/s3_file" 9 | 10 | queries = f"?bucket_name={bucket_name}&endpoint_arn={endpoint_arn}&input_pdf_uri={input_pdf_uri}" 11 | url = dns_name + path + queries 12 | response = requests.post(url) 13 | print(response.text) 14 | 15 | 16 | if __name__ == "__main__": 17 | dns_name = "Enter your dns_name" 18 | bucket_name = "Enter your bucket_name " 19 | endpoint_arn = "Enter your endpoint arn" 20 | 21 | input_pdf_path = "s3://sample.pdf" 22 | 23 | 
split_s3_pdf(dns_name, input_pdf_path, bucket_name, endpoint_arn)
24 |
--------------------------------------------------------------------------------
/workflow3_local/local_docsplitter.py:
--------------------------------------------------------------------------------
1 |
2 | import os
3 | import boto3
4 | from shutil import rmtree
5 | from datetime import datetime
6 | from pdf2image import convert_from_path
7 | from PyPDF2 import PdfFileReader, PdfFileWriter
8 | from textractcaller.t_call import call_textract
9 | from textractprettyprinter.t_pretty_print import get_lines_string
10 |
11 |
12 | def call_textract_on_image(textract, image_path):
13 |     # return JSON response containing the text extracted from the image
14 |     print(f"Inputting {image_path} into Textract")
15 |     # send in local image file content (is base64-encoded by the API)
16 |     with open(image_path, "rb") as f:
17 |         return call_textract(input_document=f.read(), boto3_textract_client=textract)
18 |
19 |
20 | def call_comprehend(text, comprehend, endpoint_arn):
21 |     # send in raw text for a document as well as the Comprehend custom classification model ARN
22 |     # returns JSON response containing the document's predicted class
23 |     print("Inputting text into Comprehend model")
24 |     return comprehend.classify_document(
25 |         Text=text,
26 |         EndpointArn=endpoint_arn
27 |     )
28 |
29 |
30 | def add_page_to_class(i, _class, pages_by_class):
31 |     # appends a page number (integer) to the list of page numbers (value in key-value pair)
32 |     # stores information on how the original multi-page input PDF file is divided by class
33 |     # stores page numbers in order, so the pages of the final multi-class PDFs will be in order
34 |     if _class in pages_by_class:
35 |         pages_by_class[_class].append(i)
36 |     else:
37 |         pages_by_class[_class] = [i]
38 |     print(f"Added page {i + 1} to {_class}\n")
39 |
40 |
41 | def create_directories(dirs_list):
42 |     for dir_name in dirs_list:
43 |         try:
44 |             os.mkdir(dir_name)
45 |         except FileExistsError:
46 |             pass
47 |     print("Created directories")
48 |
49 |
50 | def create_output_pdfs(input_pdf_path, pages_by_class, output_dir_path, output_dir_name):
51 |     # loops through each class in the pages_by_class dictionary to get all of the input PDF page numbers
52 |     # creates a new PDF for each class using the corresponding input PDF pages
53 |     # the outputted multi-class PDFs are located in the output folder
54 |
55 |     with open(input_pdf_path, "rb") as f:
56 |         for _class in pages_by_class:
57 |             output = PdfFileWriter()
58 |             page_numbers = pages_by_class[_class]
59 |             input_pdf = PdfFileReader(f)
60 |
61 |             for page_num in page_numbers:
62 |                 output.addPage(input_pdf.getPage(page_num))
63 |             with open(f"{output_dir_path}/{_class}.pdf", "wb") as output_stream:
64 |                 output.write(output_stream)
65 |
66 |             print(f"Created PDF for {_class}")
67 |
68 |
69 | def split_input_pdf_by_class(input_pdf_path, temp_dir_path, endpoint_arn, _id):
70 |     # loops through each page of the inputted multi-page PDF
71 |     # converts each page into an image stored in the local temporary folder
72 |     # each image is inputted into the Textract API; text is extracted and the JSON response is parsed
73 |     # raw text is inputted into the Comprehend model API using its endpoint ARN
74 |     # the JSON response is parsed to find the predicted class
75 |     # the input PDF's page number is assigned to the predicted class in the pages_by_class dictionary
76 |     textract = boto3.client('textract')
77 |     comprehend = boto3.client('comprehend')
78 |     pages_by_class = {}
79 |
80 |     # converts PDF into images
81
| image_paths = convert_from_path( 82 | pdf_path=input_pdf_path, 83 | fmt='jpeg', 84 | paths_only=True, 85 | output_folder=temp_dir_path 86 | ) 87 | 88 | # process each image 89 | for i, image_path in enumerate(image_paths): 90 | textract_response = call_textract_on_image(textract, image_path) 91 | raw_text = get_lines_string(textract_json=textract_response) 92 | comprehend_response = call_comprehend(raw_text, comprehend, endpoint_arn) 93 | _class = comprehend_response['Classes'][0]['Name'] 94 | add_page_to_class(i, _class, pages_by_class) 95 | 96 | print("Input PDF has been split up and classified\n") 97 | return pages_by_class 98 | 99 | 100 | def main(endpoint_arn, choice, input_pdf_info): 101 | root_path = os.path.dirname(os.path.abspath(__file__)) 102 | _id = datetime.now().strftime("%Y%m%d%H%M%S") 103 | temp_dir_name = f"workflow2_temp_documents-{_id}" 104 | temp_dir_path = f"{root_path}/{temp_dir_name}" 105 | output_dir_name = f"workflow2_output_documents-{_id}" 106 | output_dir_path = f"{root_path}/{output_dir_name}" 107 | create_directories([temp_dir_path, output_dir_path]) 108 | s3 = boto3.client('s3') 109 | 110 | if choice == "s3": 111 | input_pdf_uri = input_pdf_info 112 | bucket_name = input_pdf_uri.split("/")[2] 113 | input_pdf_key = input_pdf_uri.split(bucket_name + "/", 1)[1] 114 | input_pdf_path = f"{temp_dir_path}/input.pdf" 115 | with open(input_pdf_path, "wb") as data: 116 | s3.download_fileobj(bucket_name, input_pdf_key, data) 117 | elif choice == "local": 118 | input_pdf_path = input_pdf_info 119 | 120 | # pages_by_class is a dictionary 121 | # key is class name; value is list of page numbers belonging to the key class 122 | pages_by_class = split_input_pdf_by_class(input_pdf_path, temp_dir_path, endpoint_arn, _id) 123 | 124 | create_output_pdfs(input_pdf_path, pages_by_class, output_dir_path, output_dir_name) 125 | rmtree(temp_dir_path) 126 | print("Multi-class PDFs have been created in the output folder, " + output_dir_path) 127 | 128 | 129 | print("Welcome to the local document splitter!") 130 | endpoint_arn = input("Please enter the ARN of the Comprehend classification endpoint: ") 131 | 132 | print("Would you like to split a local file or a file stored in S3?") 133 | choice = input("Please enter 'local' or 's3': ") 134 | if choice == 'local': 135 | input_pdf_info = input("Please enter the absolute local file path of the input PDF: ") 136 | elif choice == 's3': 137 | input_pdf_info = input("Please enter the S3 URI of the input PDF: ") 138 | 139 | main(endpoint_arn, choice, input_pdf_info) 140 | -------------------------------------------------------------------------------- /workflow3_local/local_endpointbuilder.py: -------------------------------------------------------------------------------- 1 | 2 | import os 3 | import csv 4 | import boto3 5 | import botocore 6 | import filetype 7 | from json import dumps 8 | from time import sleep 9 | from itertools import repeat 10 | from datetime import datetime 11 | from shutil import rmtree, copytree 12 | from pdf2image import convert_from_path 13 | from concurrent.futures import ThreadPoolExecutor 14 | from textractcaller.t_call import call_textract 15 | from textractprettyprinter.t_pretty_print import get_lines_string 16 | 17 | 18 | def create_training_dataset(dataset_path, csv_name, bucket_name, _id): 19 | root_path = os.path.dirname(os.path.abspath(__file__)) 20 | temp_dir_path = root_path + "/workflow1.5_local_temp" 21 | csv_file_path = temp_dir_path + "/" + csv_name 22 | 23 | # create local directories that store 
the files generated by this script 24 | create_temp_directories(temp_dir_path, dataset_path) 25 | 26 | # create training dataset CSV 27 | with open(csv_file_path, "w", newline="") as file: 28 | new_image_info = get_processed_images(temp_dir_path) 29 | items = enumerate(new_image_info) 30 | with ThreadPoolExecutor() as executor: 31 | rows = executor.map(add_image_to_csv, items, repeat(temp_dir_path)) 32 | 33 | # write rows to CSV 34 | writer = csv.writer(file) 35 | writer.writerows(rows) 36 | 37 | # upload CSV to S3 bucket 38 | s3 = boto3.client("s3") 39 | s3.upload_file(csv_file_path, bucket_name, csv_name) 40 | print("Training dataset CSV file has been uploaded to S3") 41 | 42 | # delete the local directories containing the CSV, images, and PDFs 43 | rmtree(temp_dir_path) 44 | print("Deleted local CSV, PDFs, and images") 45 | 46 | 47 | def add_image_to_csv(item, temp_dir_path): 48 | i = item[0] 49 | image_info = item[1] 50 | _class = image_info[1] 51 | 52 | if len(image_info) == 2: 53 | # extract text from an image that was originally an image 54 | rel_path = image_info[0] 55 | image_path = f"{temp_dir_path}/local_dataset_docs/{rel_path}" 56 | textract_json = call_textract_on_image(image_path) 57 | raw_text = get_lines_string(textract_json=textract_json) 58 | 59 | elif len(image_info) == 3: 60 | # extract text from PDF that has been converted into one or more images 61 | file_name = image_info[0] 62 | pdf_num_pages = image_info[2] 63 | # raw_text will store text extracted from all of the images that compose the PDF 64 | raw_text = "" 65 | image_path_start = f"{temp_dir_path}/images_processed/{file_name}" 66 | 67 | for page_num in range(pdf_num_pages): 68 | path = f"{image_path_start}-img-{page_num}.jpg" 69 | textract_json = call_textract_on_image(path) 70 | # parse JSON response to get raw text 71 | image_text = get_lines_string(textract_json=textract_json) 72 | raw_text += image_text 73 | 74 | # row is returned; contains document's class and the text within it 75 | print(f"Created row {i + 1}") 76 | return [_class, raw_text] 77 | 78 | 79 | def call_textract_on_image(image_path): 80 | # return JSON response containing the text extracted from the image 81 | print(f"Inputting {image_path} into Textract") 82 | textract = boto3.client("textract") 83 | # send in local image file content (is base64-encoded by API) 84 | with open(image_path, "rb") as f: 85 | return call_textract(input_document=f.read(), boto3_textract_client=textract) 86 | 87 | 88 | def process_file(file_info, temp_dir_path): 89 | file_path = file_info[0] 90 | file_extension = file_info[1] 91 | 92 | rel_path = os.path.relpath(file_path, temp_dir_path + "/local_dataset_docs").lstrip("./") 93 | _class = rel_path.split("/", 1)[0] 94 | 95 | if file_extension == "pdf": 96 | # create images from PDFs 97 | # new images, located in the images_processed directory, will be used 98 | file_name = file_path.split("/")[-1] 99 | pdf_num_pages = pdf_to_images(file_path, temp_dir_path, file_name) 100 | image_info = (file_name, _class, pdf_num_pages) 101 | else: 102 | # the image stored in the local_dataset_docs directory will be used 103 | image_info = (rel_path, _class) 104 | 105 | return image_info 106 | 107 | 108 | def get_processed_images(temp_dir_path): 109 | valid_file_extensions = {"pdf", "jpg", "png"} 110 | # file_info is a list of tuples with 2 elements: (file_path, file_extension) 111 | file_info = [] 112 | 113 | # loop through all files in local_dataset_docs 114 | for root, dirs, files in os.walk(temp_dir_path + 
"/local_dataset_docs"): 115 | for file in files: 116 | file_path = os.path.join(root, file) 117 | file_type = filetype.guess(file_path) 118 | if file_type is None: 119 | # filter out folders 120 | continue 121 | file_extension = file_type.extension 122 | if file_extension in valid_file_extensions: 123 | # only accept PDF, JPEG, and PNG files 124 | file_info.append((file_path, file_extension)) 125 | 126 | with ThreadPoolExecutor() as executor: 127 | # new_image_info is a generator of tuples 128 | # if the file is an image, the tuple is (rel_path, _class) 129 | # if the file is a PDF, the tuple is (file_name, _class, pdf_num_pages) 130 | new_image_info = executor.map(process_file, file_info, repeat(temp_dir_path)) 131 | return list(new_image_info) 132 | 133 | 134 | def create_image(image, image_path_start, i): 135 | new_image_path = f"{image_path_start}-img-{i}.jpg" 136 | image.save(new_image_path, "JPEG") 137 | 138 | 139 | def pdf_to_images(pdf_path, temp_dir_path, file_name): 140 | # convert PDF into images 141 | # return number of pages in PDF as an integer 142 | images = convert_from_path(pdf_path) 143 | pdf_num_pages = len(images) 144 | image_path_start = f"{temp_dir_path}/images_processed/{file_name}" 145 | with ThreadPoolExecutor() as executor: 146 | executor.map(create_image, images, repeat(image_path_start), range(pdf_num_pages)) 147 | return pdf_num_pages 148 | 149 | 150 | def create_temp_directories(temp_path, dataset_path): 151 | try: 152 | os.mkdir(temp_path) 153 | os.mkdir(temp_path + "/images_processed") 154 | copytree(dataset_path, temp_path + "/local_dataset_docs") 155 | except FileExistsError: 156 | pass 157 | print("Created temporary local directories") 158 | 159 | 160 | def create_comprehend_role(bucket_name, role_name, iam_comprehend_policy_name): 161 | iam = boto3.client("iam") 162 | try: 163 | # create IAM role with trust policy 164 | iam_assume_role_policy = dumps({ 165 | "Version": "2012-10-17", 166 | "Statement": { 167 | "Effect": "Allow", 168 | "Principal": 169 | {"Service": "comprehend.amazonaws.com"}, 170 | "Action": "sts:AssumeRole" 171 | } 172 | }) 173 | iam_create_response = iam.create_role( 174 | RoleName=role_name, 175 | AssumeRolePolicyDocument=iam_assume_role_policy, 176 | MaxSessionDuration=21600 177 | ) 178 | 179 | role_arn = iam_create_response['Role']['Arn'] 180 | print("IAM role created") 181 | 182 | except botocore.exceptions.ClientError as error: 183 | # if role already exists 184 | if error.response["Error"]["Code"] == "EntityAlreadyExists": 185 | iam_get_role_response = iam.get_role( 186 | RoleName=role_name 187 | ) 188 | role_arn = iam_get_role_response["Role"]["Arn"] 189 | print("IAM role already exists") 190 | else: 191 | raise error 192 | 193 | try: 194 | # create policy that allows role to access the CSV training dataset in S3 195 | iam_comprehend_policy_document = dumps({ 196 | "Version": "2012-10-17", 197 | "Statement": [ 198 | { 199 | "Action": [ 200 | "s3:GetObject" 201 | ], 202 | "Resource": [ 203 | f"arn:aws:s3:::{bucket_name}/*" 204 | ], 205 | "Effect": "Allow" 206 | }, 207 | { 208 | "Action": [ 209 | "s3:ListBucket" 210 | ], 211 | "Resource": [ 212 | f"arn:aws:s3:::{bucket_name}" 213 | ], 214 | "Effect": "Allow" 215 | }, 216 | { 217 | "Action": [ 218 | "s3:PutObject" 219 | ], 220 | "Resource": [ 221 | f"arn:aws:s3:::{bucket_name}/*" 222 | ], 223 | "Effect": "Allow" 224 | } 225 | ] 226 | }) 227 | iam_create_policy_response = iam.create_policy( 228 | PolicyName=iam_comprehend_policy_name, 229 | 
PolicyDocument=iam_comprehend_policy_document,
230 |         )
231 |
232 |         # attach S3 access policy to role
233 |         policy_arn = iam_create_policy_response["Policy"]["Arn"]
234 |         iam.attach_role_policy(
235 |             RoleName=role_name,
236 |             PolicyArn=policy_arn
237 |         )
238 |         print("IAM policy created and attached to role. Waiting to configure")
239 |
240 |         # wait for a minute before configuring the Comprehend model
241 |         # the IAM role and policy need time to propagate; without the wait, model creation throws an error
242 |         sleep(60)
243 |
244 |     except botocore.exceptions.ClientError as error:
245 |         # if the policy already exists
246 |         if error.response["Error"]["Code"] == "EntityAlreadyExists":
247 |             print("IAM policy already exists")
248 |         else:
249 |             raise error
250 |
251 |     return role_arn
252 |
253 |
254 | def get_endpoint_arn(csv_file_name, bucket_name, role_arn, model_name):
255 |     comprehend = boto3.client("comprehend")
256 |     region = boto3.session.Session().region_name
257 |     account_number = role_arn.split(":")[4]  # account ID is the fifth ':'-separated field of the role ARN
258 |
259 |     try:
260 |         # create the model with the CSV training dataset's S3 URI and the ARN of the IAM role
261 |         create_response = comprehend.create_document_classifier(
262 |             InputDataConfig={
263 |                 "S3Uri": f"s3://{bucket_name}/{csv_file_name}"
264 |             },
265 |             DataAccessRoleArn=role_arn,
266 |             DocumentClassifierName=model_name,
267 |             LanguageCode="en"
268 |         )
269 |
270 |         model_arn = create_response['DocumentClassifierArn']
271 |         print("Created Comprehend model")
272 |
273 |     except botocore.exceptions.ClientError as error:
274 |         # if the model already exists
275 |         if error.response["Error"]["Code"] == "ResourceInUseException":
276 |             model_arn = f"arn:aws:comprehend:{region}:{account_number}:document-classifier/{model_name}"
277 |             print("Model has already been created")
278 |         else:
279 |             raise error
280 |
281 |     describe_response = comprehend.describe_document_classifier(
282 |         DocumentClassifierArn=model_arn)
283 |     status = describe_response['DocumentClassifierProperties']['Status']
284 |
285 |     print("Model training...")
286 |     while status != "TRAINED":
287 |         if status == "IN_ERROR":
288 |             message = describe_response["DocumentClassifierProperties"]["Message"]
289 |             raise ValueError(f"The classifier is in error: {message}")
290 |         # update the model's status every 5 minutes if it has not finished training
291 |         sleep(300)
292 |         describe_response = comprehend.describe_document_classifier(
293 |             DocumentClassifierArn=model_arn)
294 |         status = describe_response["DocumentClassifierProperties"]["Status"]
295 |
296 |     print("Model trained. Creating endpoint")
297 |     try:
298 |         endpoint_response = comprehend.create_endpoint(
299 |             EndpointName=model_name,
300 |             ModelArn=model_arn,
301 |             DesiredInferenceUnits=10,
302 |         )
303 |         endpoint_arn = endpoint_response["EndpointArn"]
304 |
305 |         describe_response = comprehend.describe_endpoint(
306 |             EndpointArn=endpoint_arn
307 |         )
308 |         status = describe_response["EndpointProperties"]["Status"]
309 |         while status == "CREATING":
310 |             # update the endpoint's status every 3 minutes until creation finishes
311 |             sleep(180)
312 |             describe_response = comprehend.describe_endpoint(
313 |                 EndpointArn=endpoint_arn
314 |             )
315 |             status = describe_response["EndpointProperties"]["Status"]
316 |         if status != "IN_SERVICE":
317 |             message = describe_response["EndpointProperties"].get("Message", "")
318 |             raise ValueError(f"The endpoint is in error: {message}")
319 |
320 |     except botocore.exceptions.ClientError as error:
321 |         # if the endpoint already exists
322 |         if error.response["Error"]["Code"] == "ResourceInUseException":
323 |             endpoint_arn = f"arn:aws:comprehend:{region}:{account_number}:document-classifier-endpoint/{model_name}"
324 |             print("Endpoint has already been created")
325 |         else:
326 |             raise error
327 |     # the model's endpoint ARN is returned as a string
328 |     return endpoint_arn
329 |
330 |
331 | def main(bucket_name, dataset_path):
332 |     _id = datetime.now().strftime("%Y%m%d%H%M%S")
333 |     csv_file_name = f"comprehend_dataset_{_id}.csv"
334 |     model_name = f"DatasetBuilderModel-{_id}"
335 |     role_name = "AmazonComprehendServiceRole-" + model_name
336 |     policy_name = model_name + "AccessS3Policy"
337 |
338 |     create_training_dataset(dataset_path, csv_file_name, bucket_name, _id)
339 |     role_arn = create_comprehend_role(bucket_name, role_name, policy_name)
340 |     comprehend_endpoint_arn = get_endpoint_arn(csv_file_name, bucket_name, role_arn, model_name)
341 |     return comprehend_endpoint_arn
342 |
343 |
344 | print("Welcome to the local endpoint builder!\n")
345 | existing_bucket_name = input("Please enter the name of an existing S3 bucket you are able to access: ")
346 | local_dataset_path = input("Please enter the absolute local file path of the dataset folder: ")
347 | print("\nComprehend model endpoint ARN: " + main(existing_bucket_name, local_dataset_path))
348 |
--------------------------------------------------------------------------------