├── .DS_Store ├── Chapter 02 ├── Amazon Textract API Sample.ipynb ├── emp_app_printed.png ├── employment_history.png ├── form-derogatoire.jpg ├── job-application-form.pdf ├── receipt-image.png ├── sample-invoice.png └── two-column-image.jpeg ├── Chapter 03 └── Chapter 3 Introduction to Amazon Comprehend.ipynb ├── Chapter 04 ├── Chapter 4 Compliance and control.ipynb ├── bankstatement.JPG └── piiredact.png ├── Chapter 05 ├── Ch05-Kendra Search.ipynb ├── lambda │ └── index.py ├── resume_Sample.pdf ├── resume_sample.PNG └── template-export-textract.yml ├── Chapter 06 ├── chapter6-nlp-in-customer-service-github.ipynb └── topic-modeling │ └── initial │ └── complaints_data_initial.csv ├── Chapter 07 ├── chapter07 social media text analytics.ipynb ├── lib │ ├── test.py │ └── workshop.py └── project_path.py ├── Chapter 08 ├── contextual-ad-marking-for-content-monetization-with-nlp-github.ipynb └── media-content │ ├── ad-index.csv │ ├── adserver.csv │ └── bank-demo-prem-ranga.mp4 ├── Chapter 09 ├── chapter 09 metadata extraction.ipynb ├── compact_nx.html ├── events_graph.py ├── nx.html ├── requirements.txt └── sample_financial_news_doc.pdf ├── Chapter 10 ├── Reducing-localization-costs-with-machine-translation-github.ipynb ├── input │ └── aboutLRH.html └── output │ └── aboutLRH_DE.html ├── Chapter 11 ├── 2019-NAR-HBS.pdf ├── 2020-generational-trends-report-03-05-2020.pdf ├── Zillow-home-buyers-report.pdf └── faqs.csv ├── Chapter 12 ├── ch 12 automating claims processing.ipynb ├── invalidmedicalform.png └── validmedicalform.png ├── Chapter 13 ├── chapter13 Improving accuracy of document processing .ipynb ├── samplecheck.PNG └── text.py ├── Chapter 14 ├── chapter14-auditing-workflows-named-entity-detection-forGitHub.ipynb ├── input │ └── sample-loan-application.png └── train │ ├── entitylist.csv │ └── raw_txt.csv ├── Chapter 15 ├── cha15train.png ├── chapter15 classify documents with human in the loop.ipynb ├── chapter15retrain.png ├── documents │ └── train │ │ ├── Bank Statements │ │ ├── 3ba082d3b307398adaf9d55301831684.png │ │ ├── 445ea8e393ca62878d5e3a68f054a8e4.jpg │ │ ├── Howard Bank Sample Personal Bank Statement.jpg │ │ ├── bank statement template 07 (1)0.PNG │ │ ├── bank statement template 08 (1)0.PNG │ │ ├── bank statement template 09 (1)0.PNG │ │ ├── bank statement template 11 (1)0.PNG │ │ ├── bank statement template 11 (1)1.PNG │ │ ├── bank statement template 12 (1)0.PNG │ │ ├── bank statement template 12 (1)1.PNG │ │ ├── bank statement template 15 (1)1.PNG │ │ ├── bank statement template 15 (1)10.PNG │ │ ├── bank statement template 15 (1)11.PNG │ │ ├── bank statement template 15 (1)13.PNG │ │ ├── bank statement template 15 (1)14.PNG │ │ ├── bank statement template 15 (1)15.PNG │ │ ├── bank statement template 15 (1)17.PNG │ │ ├── bank statement template 15 (1)18.PNG │ │ ├── bank statement template 15 (1)2.PNG │ │ ├── bank statement template 15 (1)20.PNG │ │ ├── bank statement template 15 (1)21.PNG │ │ ├── bank statement template 15 (1)22.PNG │ │ ├── bank statement template 15 (1)23.PNG │ │ ├── bank statement template 15 (1)24.PNG │ │ ├── bank statement template 15 (1)25.PNG │ │ ├── bank statement template 15 (1)26.PNG │ │ ├── bank statement template 15 (1)27.PNG │ │ ├── bank statement template 15 (1)4.PNG │ │ ├── bank statement template 15 (1)5.PNG │ │ ├── bank statement template 15 (1)6.PNG │ │ ├── bank statement template 15 (1)8.PNG │ │ ├── bank statement template 15 (1)9.PNG │ │ ├── bank statement template 16 (1)0.PNG │ │ ├── bank statement template 16 (1)1.PNG │ │ ├── bank statement 
template 18 (1)0.PNG │ │ ├── bank statement template 20 (1)0.PNG │ │ ├── bank statement template 20 (1)1.PNG │ │ └── bank statement template 20 (1)2.PNG │ │ └── Pay Stubs │ │ ├── 29.PNG │ │ ├── 59f20ff3d6612.jpg │ │ ├── 59f212b5f02df.jpg │ │ ├── 59f213a817dd8.jpg │ │ ├── 59f214708f294.jpg │ │ ├── 59f214eabb325.jpg │ │ ├── 5acf71c1d4b14.jpg │ │ ├── 5acf72145500c.jpg │ │ ├── 5d945f376f89f101477294.jpg │ │ ├── 5d945f854fad0046913915.jpg │ │ ├── adp-sample-768x946.png │ │ ├── free-horizontal-paystub-template.png │ │ ├── free-long-creek-paystub-template.png │ │ ├── free-magenta-paystub-template.png │ │ ├── free-midnight-paystub-template.png │ │ ├── free-shamrock-paystub-template.png │ │ ├── free-sycamore-paystub-template.png │ │ ├── free-veritical blue-paystub-template.png │ │ ├── free-violet-paystub-template.png │ │ ├── free-white-paystub-template.png │ │ ├── pay-stub-with-logo-768x384.png │ │ ├── sample-pay-stub-2020-768x384.png │ │ ├── sample-pay-stub.png │ │ └── taxes-sample-pay-stub-768x614.png └── paystubsample.png ├── Chapter 16 ├── .DS_Store ├── Improve-accuracy-of-pdf-processing-with-Amazon-Textract-and-Amazon-A2I-forGitHub.ipynb ├── form-s20-LRHL-registration.pdf ├── form-s20-LRHL-registration_image.png ├── form-s20-SUBS1-registration.pdf ├── form-s20-SUBS1-registration_image.png ├── form-s20-SUBS2-registration.pdf ├── form-s20-SUBS2-registration_image.png └── tabular-sec.liquid.html ├── Chapter 17 ├── chapter17-deriving-insights-from-handwritten-content-forGitHub.ipynb ├── hw-receipt1.jpg ├── hw-receipt2.jpg └── qsmani-raw.json ├── LICENSE └── README.md /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/.DS_Store -------------------------------------------------------------------------------- /Chapter 02/Amazon Textract API Sample.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "!pip install amazon-textract-response-parser" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "\n", 19 | "import boto3\n", 20 | "from IPython.display import Image, display\n", 21 | "from trp import Document\n", 22 | "from PIL import Image as PImage, ImageDraw\n", 23 | "import time\n", 24 | "from IPython.display import IFrame" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "# In this section, we will deep dive into Amazon Textract APIs and its feature. \n", 32 | "Amazon Textract includes simple, easy-to-use APIs that can analyze image files and PDF files.\n", 33 | "Amazon Textract APIs can be classified into synchronous APIs for real time processing and asynchronous APIs for batch processing.\n", 34 | "We will deep dive into each:\n", 35 | "•\tSynchronous APIs(Real time processing use case)\n", 36 | "•\tAsynchronous APIs(Batch processing use cases)\n", 37 | "Synchronous APIs (Real time processing use case): There are two APIs which can help with real time analysis:\n", 38 | " Analyze Text \n", 39 | " Analyze Document API\n" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "# Curent AWS Region. 
Use this to choose corresponding S3 bucket with sample content\n", 49 | "\n", 50 | "mySession = boto3.session.Session()\n", 51 | "awsRegion = mySession.region_name" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "# S3 bucket that contains sample documents. Download the sample documents and craete an Amazon s3 Bucket \n", 61 | "\n", 62 | "s3BucketName = \"\"" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": null, 68 | "metadata": {}, 69 | "outputs": [], 70 | "source": [ 71 | "# Amazon S3 client\n", 72 | "s3 = boto3.client('s3')\n", 73 | "\n", 74 | "# Amazon Textract client\n", 75 | "textract = boto3.client('textract')" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [ 84 | "# 1. Detect text from image with" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": {}, 90 | "source": [ 91 | "https://docs.aws.amazon.com/textract/latest/dg/API_DetectDocumentText.html" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "metadata": {}, 98 | "outputs": [], 99 | "source": [ 100 | "# Document\n", 101 | "documentName = \"sample-invoice.png\"" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": null, 107 | "metadata": {}, 108 | "outputs": [], 109 | "source": [ 110 | "display(Image(filename=documentName))" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "metadata": {}, 117 | "outputs": [], 118 | "source": [ 119 | "# Read document content\n", 120 | "with open(documentName, 'rb') as document:\n", 121 | " imageBytes = bytearray(document.read())\n", 122 | "\n", 123 | "# Call Amazon Textract\n", 124 | "response = textract.detect_document_text(Document={'Bytes': imageBytes})\n" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": null, 130 | "metadata": {}, 131 | "outputs": [], 132 | "source": [ 133 | "import json\n", 134 | "\n", 135 | "print (json.dumps(response, indent=4, sort_keys=True))\n" 136 | ] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "metadata": {}, 141 | "source": [ 142 | "# 2. 
Detect text from S3 object" 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "metadata": {}, 148 | "source": [ 149 | "https://docs.aws.amazon.com/textract/latest/dg/API_DetectDocumentText.html" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "metadata": {}, 155 | "source": [ 156 | "## Lines and Words of Text - JSON Structure" 157 | ] 158 | }, 159 | { 160 | "cell_type": "markdown", 161 | "metadata": {}, 162 | "source": [ 163 | "https://docs.aws.amazon.com/textract/latest/dg/API_BoundingBox.html\n", 164 | "\n", 165 | "https://docs.aws.amazon.com/textract/latest/dg/text-location.html\n", 166 | "\n", 167 | "https://docs.aws.amazon.com/textract/latest/dg/how-it-works-lines-words.html" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": null, 173 | "metadata": {}, 174 | "outputs": [], 175 | "source": [ 176 | "# Reading order" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": null, 182 | "metadata": {}, 183 | "outputs": [], 184 | "source": [ 185 | "# Document\n", 186 | "documentName = \"two-column-image.jpeg\"" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": null, 192 | "metadata": {}, 193 | "outputs": [], 194 | "source": [ 195 | "display(Image(filename=documentName))" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": null, 201 | "metadata": {}, 202 | "outputs": [], 203 | "source": [ 204 | "import boto3\n", 205 | "\n", 206 | "s3 = boto3.resource('s3')\n", 207 | "s3.Bucket(s3BucketName).upload_file(documentName,documentName)" 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": null, 213 | "metadata": {}, 214 | "outputs": [], 215 | "source": [ 216 | "# Call Amazon Textract\n", 217 | "response = textract.detect_document_text(\n", 218 | " Document={\n", 219 | " 'S3Object': {\n", 220 | " 'Bucket': s3BucketName,\n", 221 | " 'Name': documentName\n", 222 | " }\n", 223 | " })\n", 224 | "\n", 225 | "print(response)" 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": null, 231 | "metadata": {}, 232 | "outputs": [], 233 | "source": [ 234 | "#using trp.py to parse the json into reading order\n", 235 | "doc = Document(response)\n", 236 | "for page in doc.pages:\n", 237 | " for line in page.getLinesInReadingOrder():\n", 238 | " print(line[1])" 239 | ] 240 | }, 241 | { 242 | "cell_type": "markdown", 243 | "metadata": {}, 244 | "source": [ 245 | "# Analyze Document API for tables and Forms: Key/Values" 246 | ] 247 | }, 248 | { 249 | "cell_type": "markdown", 250 | "metadata": {}, 251 | "source": [ 252 | "https://docs.aws.amazon.com/textract/latest/dg/API_AnalyzeDocument.html" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": null, 258 | "metadata": {}, 259 | "outputs": [], 260 | "source": [ 261 | "# Document\n", 262 | "documentName = \"sample-invoice.png\"" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": null, 268 | "metadata": {}, 269 | "outputs": [], 270 | "source": [ 271 | "display(Image(filename=documentName))" 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "execution_count": null, 277 | "metadata": {}, 278 | "outputs": [], 279 | "source": [ 280 | "\n", 281 | "s3.Bucket(s3BucketName).upload_file(documentName,documentName)" 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": null, 287 | "metadata": {}, 288 | "outputs": [], 289 | "source": [ 290 | "# Call Amazon Textract\n", 291 | "response = textract.analyze_document(\n", 292 | " 
Document={\n", 293 | " 'S3Object': {\n", 294 | " 'Bucket': s3BucketName,\n", 295 | " 'Name': documentName\n", 296 | " }\n", 297 | " },\n", 298 | " FeatureTypes=[\"FORMS\",\"TABLES\"])" 299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": null, 304 | "metadata": {}, 305 | "outputs": [], 306 | "source": [ 307 | "\n", 308 | "\n", 309 | "#print(response)\n", 310 | "\n", 311 | "doc = Document(response)\n", 312 | "\n", 313 | "for page in doc.pages:\n", 314 | " # Print fields\n", 315 | " print(\"Fields:\")\n", 316 | " for field in page.form.fields:\n", 317 | " print(\"Key: {}, Value: {}\".format(field.key, field.value))\n", 318 | "\n", 319 | " # Get field by key\n", 320 | " print(\"\\nGet Field by Key:\")\n", 321 | " key = \"Phone Number:\"\n", 322 | " field = page.form.getFieldByKey(key)\n", 323 | " if(field):\n", 324 | " print(\"Key: {}, Value: {}\".format(field.key, field.value))\n", 325 | "\n", 326 | " # Search fields by key\n", 327 | " print(\"\\nSearch Fields:\")\n", 328 | " key = \"address\"\n", 329 | " fields = page.form.searchFieldsByKey(key)\n", 330 | " for field in fields:\n", 331 | " print(\"Key: {}, Value: {}\".format(field.key, field.value))" 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": null, 337 | "metadata": {}, 338 | "outputs": [], 339 | "source": [ 340 | "doc = Document(response)\n", 341 | "\n", 342 | "for page in doc.pages:\n", 343 | " # Print tables\n", 344 | " for table in page.tables:\n", 345 | " for r, row in enumerate(table.rows):\n", 346 | " for c, cell in enumerate(row.cells):\n", 347 | " print(\"Table[{}][{}] = {}\".format(r, c, cell.text))" 348 | ] 349 | }, 350 | { 351 | "cell_type": "markdown", 352 | "metadata": {}, 353 | "source": [ 354 | "# 12. PDF Processing" 355 | ] 356 | }, 357 | { 358 | "cell_type": "markdown", 359 | "metadata": {}, 360 | "source": [ 361 | "https://docs.aws.amazon.com/textract/latest/dg/API_StartDocumentTextDetection.html\n", 362 | "https://docs.aws.amazon.com/textract/latest/dg/API_GetDocumentTextDetection.html\n", 363 | "https://docs.aws.amazon.com/textract/latest/dg/API_StartDocumentAnalysis.html\n", 364 | "https://docs.aws.amazon.com/textract/latest/dg/API_GetDocumentAnalysis.html" 365 | ] 366 | }, 367 | { 368 | "cell_type": "code", 369 | "execution_count": null, 370 | "metadata": {}, 371 | "outputs": [], 372 | "source": [ 373 | "def startJob(s3BucketName, objectName):\n", 374 | " response = None\n", 375 | " response = textract.start_document_text_detection(\n", 376 | " DocumentLocation={\n", 377 | " 'S3Object': {\n", 378 | " 'Bucket': s3BucketName,\n", 379 | " 'Name': objectName\n", 380 | " }\n", 381 | " })\n", 382 | "\n", 383 | " return response[\"JobId\"]\n", 384 | "\n", 385 | "def isJobComplete(jobId):\n", 386 | " response = textract.get_document_text_detection(JobId=jobId)\n", 387 | " status = response[\"JobStatus\"]\n", 388 | " print(\"Job status: {}\".format(status))\n", 389 | "\n", 390 | " while(status == \"IN_PROGRESS\"):\n", 391 | " time.sleep(5)\n", 392 | " response = textract.get_document_text_detection(JobId=jobId)\n", 393 | " status = response[\"JobStatus\"]\n", 394 | " print(\"Job status: {}\".format(status))\n", 395 | "\n", 396 | " return status\n", 397 | "\n", 398 | "def getJobResults(jobId):\n", 399 | "\n", 400 | " pages = []\n", 401 | " response = textract.get_document_text_detection(JobId=jobId)\n", 402 | " \n", 403 | " pages.append(response)\n", 404 | " print(\"Resultset page recieved: {}\".format(len(pages)))\n", 405 | " nextToken = None\n", 406 | " if('NextToken' 
in response):\n", 407 | " nextToken = response['NextToken']\n", 408 | "\n", 409 | " while(nextToken):\n", 410 | " response = textract.get_document_text_detection(JobId=jobId, NextToken=nextToken)\n", 411 | "\n", 412 | " pages.append(response)\n", 413 | " print(\"Resultset page recieved: {}\".format(len(pages)))\n", 414 | " nextToken = None\n", 415 | " if('NextToken' in response):\n", 416 | " nextToken = response['NextToken']\n", 417 | "\n", 418 | " return pages" 419 | ] 420 | }, 421 | { 422 | "cell_type": "code", 423 | "execution_count": null, 424 | "metadata": {}, 425 | "outputs": [], 426 | "source": [ 427 | "# Document\n", 428 | "documentName = \"job-application-form.pdf\"" 429 | ] 430 | }, 431 | { 432 | "cell_type": "code", 433 | "execution_count": null, 434 | "metadata": {}, 435 | "outputs": [], 436 | "source": [ 437 | "\n", 438 | "s3.Bucket(s3BucketName).upload_file(documentName,documentName)" 439 | ] 440 | }, 441 | { 442 | "cell_type": "code", 443 | "execution_count": null, 444 | "metadata": {}, 445 | "outputs": [], 446 | "source": [ 447 | "jobId = startJob(s3BucketName, documentName)\n", 448 | "print(\"Started job with id: {}\".format(jobId))\n", 449 | "if(isJobComplete(jobId)):\n", 450 | " response = getJobResults(jobId)\n", 451 | "\n", 452 | "#print(response)\n", 453 | "doc = Document(response)\n" 454 | ] 455 | }, 456 | { 457 | "cell_type": "code", 458 | "execution_count": null, 459 | "metadata": {}, 460 | "outputs": [], 461 | "source": [ 462 | "\n", 463 | "#Print detected text\n", 464 | "for page in doc.pages:\n", 465 | " for line in page.getLinesInReadingOrder():\n", 466 | " print(line[1])" 467 | ] 468 | }, 469 | { 470 | "cell_type": "markdown", 471 | "metadata": {}, 472 | "source": [ 473 | "# For Analyze expense API demo refer to Chapter 17 Visualizing Insights from handwritten content" 474 | ] 475 | }, 476 | { 477 | "cell_type": "markdown", 478 | "metadata": {}, 479 | "source": [ 480 | "# Clean UP" 481 | ] 482 | }, 483 | { 484 | "cell_type": "markdown", 485 | "metadata": {}, 486 | "source": [ 487 | "Delete the S3 bucket and sample documents from S3 https://docs.aws.amazon.com/AmazonS3/latest/userguide/delete-objects.html" 488 | ] 489 | } 490 | ], 491 | "metadata": { 492 | "kernelspec": { 493 | "display_name": "conda_python3", 494 | "language": "python", 495 | "name": "conda_python3" 496 | }, 497 | "language_info": { 498 | "codemirror_mode": { 499 | "name": "ipython", 500 | "version": 3 501 | }, 502 | "file_extension": ".py", 503 | "mimetype": "text/x-python", 504 | "name": "python", 505 | "nbconvert_exporter": "python", 506 | "pygments_lexer": "ipython3", 507 | "version": "3.6.13" 508 | } 509 | }, 510 | "nbformat": 4, 511 | "nbformat_minor": 2 512 | } 513 | -------------------------------------------------------------------------------- /Chapter 02/emp_app_printed.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 02/emp_app_printed.png -------------------------------------------------------------------------------- /Chapter 02/employment_history.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 02/employment_history.png -------------------------------------------------------------------------------- /Chapter 
02/form-derogatoire.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 02/form-derogatoire.jpg -------------------------------------------------------------------------------- /Chapter 02/job-application-form.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 02/job-application-form.pdf -------------------------------------------------------------------------------- /Chapter 02/receipt-image.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 02/receipt-image.png -------------------------------------------------------------------------------- /Chapter 02/sample-invoice.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 02/sample-invoice.png -------------------------------------------------------------------------------- /Chapter 02/two-column-image.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 02/two-column-image.jpeg -------------------------------------------------------------------------------- /Chapter 03/Chapter 3 Introduction to Amazon Comprehend.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "id": "331ec774", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "\n", 11 | "\n", 12 | "import boto3\n", 13 | "\n", 14 | "\n", 15 | "# Amazon Comprehend client\n", 16 | "comprehend = boto3.client('comprehend')\n" 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "id": "a06ecc2d", 22 | "metadata": {}, 23 | "source": [ 24 | "# Entity Extraction Text Analysis Real time API" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": null, 30 | "id": "b037763b", 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "SampleText=\"Packt is a publishing company founded in 2003 headquartered in Birmingham, UK, with offices in Mumbai, India. 
Packt primarily publishes print and electronic books and videos relating to information technology, including programming, web design, data analysis and hardware.\"" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": null, 40 | "id": "aa0adc0d", 41 | "metadata": {}, 42 | "outputs": [], 43 | "source": [ 44 | "response = comprehend.detect_entities(\n", 45 | " Text=SampleText,\n", 46 | " LanguageCode='en')" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "id": "b3ebd476", 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "import json\n", 57 | "\n", 58 | "print (json.dumps(response, indent=4, sort_keys=True))" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "id": "c33fe734", 64 | "metadata": {}, 65 | "source": [ 66 | "# Keyphrases Extraction Real Time APIs in French language" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "id": "f26663f7", 73 | "metadata": {}, 74 | "outputs": [], 75 | "source": [ 76 | "SampleText=\"Packt est une société d'édition fondée en 2003 dont le siège est à Birmingham, au Royaume-Uni, avec des bureaux à Mumbai, en Inde. Packt publie principalement des livres et des vidéos imprimés et électroniques relatifs aux technologies de l'information, y compris la programmation, la conception Web, l'analyse de données et le matériel\"" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": null, 82 | "id": "f32cee7e", 83 | "metadata": {}, 84 | "outputs": [], 85 | "source": [ 86 | "response = comprehend.detect_key_phrases(\n", 87 | " Text= SampleText,\n", 88 | " LanguageCode='fr'\n", 89 | ")" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": null, 95 | "id": "922c0674", 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [ 99 | "import json\n", 100 | "\n", 101 | "print (json.dumps(response, indent=4, sort_keys=True))" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "id": "6cb7dbca", 107 | "metadata": {}, 108 | "source": [ 109 | "# Batch Real time API for Amazon Detect Sentiment Demo \n", 110 | "Multiple Document Synchronous Processing mode where you can call Amazon Comprehend with a collection of up to 25 documents and receive a synchronous response. " 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "id": "75340b29", 116 | "metadata": {}, 117 | "source": [ 118 | "\n", 119 | "Packt Publication Book Reviews Setiment Analysis for the book \n", 120 | "\n", 121 | "40 Algorithms Every Programmer Should Know\n", 122 | "https://www.packtpub.com/product/40-algorithms-every-programmer-should-know/9781789801217" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "id": "881a0214", 128 | "metadata": {}, 129 | "source": [ 130 | "We are going to analyze some of the reviews for this book using batch snetiment analyis API." 
131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": null, 136 | "id": "b1702d83", 137 | "metadata": {}, 138 | "outputs": [], 139 | "source": [ 140 | "response = comprehend.batch_detect_sentiment(\n", 141 | " TextList=[\n", 142 | " 'Well this is an area of my interest and this book is packed with essential knowledge','kinda all in one With good examples and rather easy to follow', 'There are good examples and samples in the book.', '40 Algorithms every Programmer should know is a good start to a vast topic about algorithms'\n", 143 | " ],\n", 144 | " LanguageCode='en'\n", 145 | ")" 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": null, 151 | "id": "eb435781", 152 | "metadata": {}, 153 | "outputs": [], 154 | "source": [ 155 | "print (json.dumps(response, indent=4, sort_keys=True))" 156 | ] 157 | }, 158 | { 159 | "cell_type": "markdown", 160 | "id": "7e611cd7", 161 | "metadata": {}, 162 | "source": [ 163 | "# Since the book reviews were in different languages. lets Identify the various languages in this book review" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": null, 169 | "id": "280e8e6f", 170 | "metadata": {}, 171 | "outputs": [], 172 | "source": [ 173 | "response = comprehend.batch_detect_dominant_language(\n", 174 | " TextList=[\n", 175 | " 'It include recenet algorithm trend. it is very helpful.','Je ne lai pas encore lu entièrement mais le livre semble expliquer de façon suffisamment claire lensemble de ces algorithmes.'\n", 176 | " ]\n", 177 | ")" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": null, 183 | "id": "a4d8a274", 184 | "metadata": {}, 185 | "outputs": [], 186 | "source": [ 187 | "print (json.dumps(response, indent=4, sort_keys=True))" 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": null, 193 | "id": "9787222a", 194 | "metadata": {}, 195 | "outputs": [], 196 | "source": [] 197 | } 198 | ], 199 | "metadata": { 200 | "kernelspec": { 201 | "display_name": "conda_python3", 202 | "language": "python", 203 | "name": "conda_python3" 204 | }, 205 | "language_info": { 206 | "codemirror_mode": { 207 | "name": "ipython", 208 | "version": 3 209 | }, 210 | "file_extension": ".py", 211 | "mimetype": "text/x-python", 212 | "name": "python", 213 | "nbconvert_exporter": "python", 214 | "pygments_lexer": "ipython3", 215 | "version": "3.6.13" 216 | } 217 | }, 218 | "nbformat": 4, 219 | "nbformat_minor": 5 220 | } 221 | -------------------------------------------------------------------------------- /Chapter 04/Chapter 4 Compliance and control.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "6a204c35", 6 | "metadata": {}, 7 | "source": [ 8 | "# PII Detection and Redaction for setting compliance and control\n", 9 | "\n", 10 | "In this , we will be performing extracting the text from the documents using AWS Textract and then use Comprehend to perform pii detection. Then we will be using python function to redact that portion of the image. 
\n", 11 | "Here is conceptual architectural flow:\n", 12 | "\n", 13 | "![alt-text](piiredact.png)\n", 14 | "\n", 15 | "You can automate the entire end to end flow using step function and lambda for orchestration.\n", 16 | "\n", 17 | "We will walk you through following steps:\n", 18 | "\n", 19 | "## Step 1: Setup and install libraries \n", 20 | "## Step 2: Extract text from sample document\n", 21 | "## Step 3: Save the extracted text into text/csv file and uplaod to Amazon S3 bucket\n", 22 | "## Step 4: Check for PII using Amazon Comprehend Detect PII Sync API.\n", 23 | "## Step 5: Mask PII using Amazon Comprehend PII Analysis Job\n", 24 | "## Step 6: View the redacted/masked output in Amazon S3 Bucket\n" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "id": "bf848fe3", 30 | "metadata": {}, 31 | "source": [ 32 | "# Lets start with Step 1: Setup and install libraries" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "id": "34bcf133", 38 | "metadata": {}, 39 | "source": [ 40 | "import json\n", 41 | "import boto3\n", 42 | "import re\n", 43 | "import csv\n", 44 | "import sagemaker\n", 45 | "from sagemaker import get_execution_role\n", 46 | "from sagemaker.s3 import S3Uploader, S3Downloader\n", 47 | "import uuid\n", 48 | "import time\n", 49 | "import io\n", 50 | "from io import BytesIO\n", 51 | "import sys\n", 52 | "from pprint import pprint\n", 53 | "\n", 54 | "from IPython.display import Image, display\n", 55 | "from PIL import Image as PImage, ImageDraw" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": null, 61 | "id": "2ebd3fed", 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [ 65 | "!pip install amazon-textract-response-parser" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": null, 71 | "id": "4a8e44b8", 72 | "metadata": {}, 73 | "outputs": [], 74 | "source": [ 75 | "import pandas as pd\n", 76 | "import webbrowser, os\n", 77 | "import json\n", 78 | "import boto3\n", 79 | "import re\n", 80 | "import sagemaker\n", 81 | "from sagemaker import get_execution_role\n", 82 | "from sagemaker.s3 import S3Uploader, S3Downloader\n", 83 | "import uuid\n", 84 | "import time\n", 85 | "import io\n", 86 | "from io import BytesIO\n", 87 | "import sys\n", 88 | "from pprint import pprint\n", 89 | "\n", 90 | "from IPython.display import Image, display\n", 91 | "from PIL import Image as PImage, ImageDraw" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "id": "52f62fbd", 98 | "metadata": {}, 99 | "outputs": [], 100 | "source": [ 101 | "\n", 102 | "region = boto3.Session().region_name\n", 103 | "\n", 104 | "role = get_execution_role()\n", 105 | "print(role)\n", 106 | "\n", 107 | "bucket = sagemaker.Session().default_bucket()\n", 108 | "\n", 109 | "prefix = \"pii-detection-redaction\"\n", 110 | "bucket_path = \"https://s3-{}.amazonaws.com/{}\".format(region, bucket)\n" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "id": "167dde46", 116 | "metadata": {}, 117 | "source": [ 118 | "# Step 2: Extract text from sample document¶" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": null, 124 | "id": "475ed18a", 125 | "metadata": {}, 126 | "outputs": [], 127 | "source": [ 128 | "# Document\n", 129 | "documentName = \"bankstatement.JPG\"\n", 130 | "\n", 131 | "display(Image(filename=documentName))" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": null, 137 | "id": "772e4c06", 138 | "metadata": {}, 139 | "outputs": [], 140 | "source": [ 141 | "client = 
boto3.client(service_name='textract',\n", 142 | " region_name= 'us-east-1',\n", 143 | " endpoint_url='https://textract.us-east-1.amazonaws.com')\n", 144 | "\n", 145 | "with open(documentName, 'rb') as file:\n", 146 | " img_test = file.read()\n", 147 | " bytes_test = bytearray(img_test)\n", 148 | " print('Image loaded', documentName)\n", 149 | "\n", 150 | " # process using image bytes\n", 151 | "response = client.detect_document_text(Document={'Bytes': bytes_test})\n" 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": null, 157 | "id": "f994f746", 158 | "metadata": {}, 159 | "outputs": [], 160 | "source": [ 161 | "#Extract key values\n", 162 | "# Iterate over elements in the document\n", 163 | "from trp import Document\n", 164 | "\n", 165 | "\n", 166 | "doc = Document(response)\n", 167 | "page_string = ''\n", 168 | "for page in doc.pages:\n", 169 | " # Print lines and words\n", 170 | " \n", 171 | " for line in page.lines:\n", 172 | " #print((line.text))\n", 173 | " page_string += str(line.text)\n", 174 | "print(page_string)" 175 | ] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "id": "e4385858", 180 | "metadata": {}, 181 | "source": [ 182 | "# Step 3: Save the extracted text into text/csv file and uplaod to Amazon S3 bucket¶" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": null, 188 | "id": "b4367f93", 189 | "metadata": {}, 190 | "outputs": [], 191 | "source": [ 192 | "# Lets get the data into a text file\n", 193 | "text_filename = 'pii_data.txt'\n", 194 | "doc = Document(response)\n", 195 | "with open(text_filename, 'w', encoding='utf-8') as f:\n", 196 | " for page in doc.pages:\n", 197 | " # Print lines and words\n", 198 | " page_string = ''\n", 199 | " for line in page.lines:\n", 200 | " #print((line.text))\n", 201 | " page_string += str(line.text)\n", 202 | " #print(page_string)\n", 203 | " f.writelines(page_string + \"\\n\")" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": null, 209 | "id": "b62b48b3", 210 | "metadata": {}, 211 | "outputs": [], 212 | "source": [ 213 | "# Load the documents locally for later analysis\n", 214 | "with open(text_filename, \"r\") as fi:\n", 215 | " raw_texts = [line.strip() for line in fi.readlines()]" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": null, 221 | "id": "c40f2d22", 222 | "metadata": {}, 223 | "outputs": [], 224 | "source": [ 225 | "import boto3\n", 226 | "\n", 227 | "s3 = boto3.resource('s3')\n", 228 | "s3.Bucket(bucket).upload_file(\"pii_data.txt\", \"pii-detection-redaction/pii_data.txt\")" 229 | ] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "id": "69a5cb4c", 234 | "metadata": {}, 235 | "source": [ 236 | "# Step 4: Check for PII using Amazon Comprehend Detect PII Sync API" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": null, 242 | "id": "641fed7c", 243 | "metadata": {}, 244 | "outputs": [], 245 | "source": [ 246 | "comprehend = boto3.client(service_name='comprehend')" 247 | ] 248 | }, 249 | { 250 | "cell_type": "code", 251 | "execution_count": null, 252 | "id": "12ce5e4f", 253 | "metadata": {}, 254 | "outputs": [], 255 | "source": [ 256 | "# Call Amazon Comprehend and pass it the aggregated text from our image.\n", 257 | "\n", 258 | "piilist=comprehend.detect_pii_entities(Text = page_string, LanguageCode='en')\n", 259 | "redacted_box_color='red'\n", 260 | "dpi = 72\n", 261 | "pii_detection_threshold = 0.00\n", 262 | "print ('Finding PII text...')\n", 263 | "not_redacted=0\n", 264 | 
"redacted=0\n", 265 | "for pii in piilist['Entities']:\n", 266 | " print(pii['Type'])\n", 267 | " if pii['Score'] > pii_detection_threshold:\n", 268 | " print (\"detected as type '\"+pii['Type']+\"' and will be redacted.\")\n", 269 | " redacted+=1\n", 270 | " \n", 271 | " else:\n", 272 | " print (\" was detected as type '\"+pii['Type']+\"', but did not meet the confidence score threshold and will not be redacted.\")\n", 273 | " not_redacted+=1\n", 274 | "\n", 275 | "\n", 276 | "print (\"Found\", redacted, \"text boxes to redact.\")\n", 277 | "print (not_redacted, \"additional text boxes were detected, but did not meet the confidence score threshold.\")" 278 | ] 279 | }, 280 | { 281 | "cell_type": "markdown", 282 | "id": "69c52a92", 283 | "metadata": {}, 284 | "source": [ 285 | "# Step 5: Mask PII using Amazon Comprehend PII Analysis Job" 286 | ] 287 | }, 288 | { 289 | "cell_type": "markdown", 290 | "id": "275acf23", 291 | "metadata": {}, 292 | "source": [ 293 | "We will use StartPiiEntitiesDetectionJob API\n", 294 | "\n", 295 | "StartPiiEntitiesDetectionJob API starts an asynchronous PII entity detection job for a collection of documents.\n", 296 | "\n", 297 | "We would be using this API to perform pii detection and redaction for pii_data.txt which we had inspected above.\n" 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": null, 303 | "id": "ee42299c", 304 | "metadata": {}, 305 | "outputs": [], 306 | "source": [ 307 | "import uuid\n", 308 | "InputS3URI= \"s3://\"+bucket+ \"/pii-detection-redaction/pii_data.txt\"\n", 309 | "print(InputS3URI)\n", 310 | "OutputS3URI=\"s3://\"+bucket+\"/pii-detection-redaction\"\n", 311 | "print(OutputS3URI)\n", 312 | "job_uuid = uuid.uuid1()\n", 313 | "job_name = f\"pii-job-{job_uuid}\"" 314 | ] 315 | }, 316 | { 317 | "cell_type": "markdown", 318 | "id": "56212a64", 319 | "metadata": {}, 320 | "source": [ 321 | "# Adding Amazon Comprehend as an additional trusted entity to this role\n", 322 | "\n", 323 | "This step is needed if you want to pass the execution role of this Notebook while calling Comprehend APIs as well without creating an additional Role. \n", 324 | "\n", 325 | "\n", 326 | "\n", 327 | "On the IAM dashboard, please click on Roles on the left sidenav and search for this Role. Once the Role appears, click on the Role to go to its Summary page. Click on the Trust relationships tab on the Summary page to add Amazon Comprehend as an additional trusted entity.\n", 328 | "\n", 329 | "Click on **Edit trust relationship** and replace the JSON with this JSON.\n", 330 | "```\n", 331 | "{\n", 332 | " \"Version\": \"2012-10-17\",\n", 333 | " \"Statement\": [\n", 334 | " {\n", 335 | " \"Effect\": \"Allow\",\n", 336 | " \"Principal\": {\n", 337 | " \"Service\": \"comprehend.amazonaws.com\"\n", 338 | " },\n", 339 | " \"Action\": \"sts:AssumeRole\"\n", 340 | " }\n", 341 | " ]\n", 342 | "}\n", 343 | "```\n", 344 | "\n", 345 | "Once this is complete, click on Update Trust Policy and you are done." 
346 | ] 347 | }, 348 | { 349 | "cell_type": "code", 350 | "execution_count": null, 351 | "id": "1ac2943e", 352 | "metadata": {}, 353 | "outputs": [], 354 | "source": [ 355 | "role_name = role[role.rfind('/') + 1:]\n", 356 | "print(\"https://console.aws.amazon.com/iam/home?region={0}#/roles/{1}\".format(region, role_name))\n" 357 | ] 358 | }, 359 | { 360 | "cell_type": "code", 361 | "execution_count": null, 362 | "id": "dc816112", 363 | "metadata": {}, 364 | "outputs": [], 365 | "source": [ 366 | "\n", 367 | "response = comprehend.start_pii_entities_detection_job(\n", 368 | " InputDataConfig={\n", 369 | " 'S3Uri': InputS3URI,\n", 370 | " 'InputFormat': 'ONE_DOC_PER_FILE'\n", 371 | " },\n", 372 | " OutputDataConfig={\n", 373 | " 'S3Uri': OutputS3URI\n", 374 | " \n", 375 | " },\n", 376 | " Mode='ONLY_REDACTION',\n", 377 | " RedactionConfig={\n", 378 | " 'PiiEntityTypes': [\n", 379 | " 'ALL',\n", 380 | " ],\n", 381 | " 'MaskMode': 'MASK',\n", 382 | " 'MaskCharacter': '*'\n", 383 | " },\n", 384 | " DataAccessRoleArn = role,\n", 385 | " JobName=job_name,\n", 386 | " LanguageCode='en',\n", 387 | " \n", 388 | ")" 389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "execution_count": null, 394 | "id": "038aca17", 395 | "metadata": {}, 396 | "outputs": [], 397 | "source": [ 398 | "# Get the job ID\n", 399 | "events_job_id = response['JobId']\n", 400 | "job = comprehend.describe_pii_entities_detection_job(JobId=events_job_id)\n", 401 | "print(job)" 402 | ] 403 | }, 404 | { 405 | "cell_type": "markdown", 406 | "id": "d5c84f74", 407 | "metadata": {}, 408 | "source": [ 409 | "\n", 410 | "The job will take roughly 6-7 minutes. \n", 411 | "The below code is to check the status of the job. \n", 412 | "The cell execution would be completed after the job is completed.\n", 413 | "In case the job fails you can check the logs and status in AWS Console https://console.aws.amazon.com/comprehend/v2/home?region=us-east-1#analysis\n", 414 | "and try re running the job if you get this failure reason:\n", 415 | " NO_WRITE_ACCESS_TO_OUTPUT: The provided data access role does not have write access to the output S3 URI." 
416 | ] 417 | }, 418 | { 419 | "cell_type": "code", 420 | "execution_count": null, 421 | "id": "aeb94ab2", 422 | "metadata": {}, 423 | "outputs": [], 424 | "source": [ 425 | "from time import sleep\n", 426 | "# Get current job status\n", 427 | "job = comprehend.describe_pii_entities_detection_job(JobId=events_job_id)\n", 428 | "print(job)\n", 429 | "# Loop until job is completed\n", 430 | "waited = 0\n", 431 | "timeout_minutes = 10\n", 432 | "while job['PiiEntitiesDetectionJobProperties']['JobStatus'] != 'COMPLETED':\n", 433 | " sleep(60)\n", 434 | " waited += 60\n", 435 | " assert waited//60 < timeout_minutes, \"Job timed out after %d seconds.\" % waited\n", 436 | " job = comprehend.describe_pii_entities_detection_job(JobId=events_job_id)" 437 | ] 438 | }, 439 | { 440 | "cell_type": "code", 441 | "execution_count": null, 442 | "id": "6fe60743", 443 | "metadata": {}, 444 | "outputs": [], 445 | "source": [ 446 | "print(response)" 447 | ] 448 | }, 449 | { 450 | "cell_type": "markdown", 451 | "id": "8a44d579", 452 | "metadata": {}, 453 | "source": [ 454 | "# Step 6: View the redacted/masked output in Amazon S3 Bucket¶" 455 | ] 456 | }, 457 | { 458 | "cell_type": "code", 459 | "execution_count": null, 460 | "id": "4bb82ee2", 461 | "metadata": {}, 462 | "outputs": [], 463 | "source": [ 464 | "filename=\"pii_data.txt\"\n", 465 | "output_data_s3_file = job['PiiEntitiesDetectionJobProperties']['OutputDataConfig']['S3Uri'] + filename + '.out'\n", 466 | "print(output_data_s3_file)" 467 | ] 468 | }, 469 | { 470 | "cell_type": "code", 471 | "execution_count": null, 472 | "id": "21be6b75", 473 | "metadata": {}, 474 | "outputs": [], 475 | "source": [ 476 | "\n", 477 | "# The output filename is the input filename + \".out\"\n", 478 | "s3_client = boto3.client(service_name='s3')\n", 479 | "filename=\"pii_data.txt\"\n", 480 | "output_data_s3_file = job['PiiEntitiesDetectionJobProperties']['OutputDataConfig']['S3Uri'] + filename + '.out'\n", 481 | "print(output_data_s3_file)\n", 482 | "output_data_s3_filepath=output_data_s3_file.split(\"//\")[1].split(\"/\")[1]+\"/\"+output_data_s3_file.split(\"//\")[1].split(\"/\")[2]+\"/\"+output_data_s3_file.split(\"//\")[1].split(\"/\")[3]+\"/\"+output_data_s3_file.split(\"//\")[1].split(\"/\")[4]\n", 483 | "print(output_data_s3_filepath)\n", 484 | "\n", 485 | "f = BytesIO()\n", 486 | "s3_client.download_fileobj(bucket, output_data_s3_filepath, f)\n", 487 | "f.seek(0)\n", 488 | "print(f.getvalue())" 489 | ] 490 | }, 491 | { 492 | "cell_type": "markdown", 493 | "id": "686fc93b", 494 | "metadata": {}, 495 | "source": [ 496 | "Clean Up!\n", 497 | "\n", 498 | "Delete Amazon S3 Bucket https://docs.aws.amazon.com/AmazonS3/latest/userguide/delete-bucket.html" 499 | ] 500 | } 501 | ], 502 | "metadata": { 503 | "kernelspec": { 504 | "display_name": "conda_python3", 505 | "language": "python", 506 | "name": "conda_python3" 507 | }, 508 | "language_info": { 509 | "codemirror_mode": { 510 | "name": "ipython", 511 | "version": 3 512 | }, 513 | "file_extension": ".py", 514 | "mimetype": "text/x-python", 515 | "name": "python", 516 | "nbconvert_exporter": "python", 517 | "pygments_lexer": "ipython3", 518 | "version": "3.6.13" 519 | } 520 | }, 521 | "nbformat": 4, 522 | "nbformat_minor": 5 523 | } 524 | -------------------------------------------------------------------------------- /Chapter 04/bankstatement.JPG: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 04/bankstatement.JPG -------------------------------------------------------------------------------- /Chapter 04/piiredact.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 04/piiredact.png -------------------------------------------------------------------------------- /Chapter 05/Ch05-Kendra Search.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "id": "4f8f6eb9", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "\n", 11 | "import pandas as pd\n", 12 | "import webbrowser, os\n", 13 | "import json\n", 14 | "import boto3\n", 15 | "import re\n", 16 | "import sagemaker\n", 17 | "from sagemaker import get_execution_role\n", 18 | "from sagemaker.s3 import S3Uploader, S3Downloader\n", 19 | "import uuid\n", 20 | "import time\n", 21 | "import io\n", 22 | "from io import BytesIO\n", 23 | "import sys\n", 24 | "import csv\n", 25 | "from pprint import pprint\n", 26 | "from IPython.display import Image, display\n", 27 | "from PIL import Image as PImage, ImageDraw\n", 28 | "\n", 29 | "# Define IAM role\n", 30 | "role = get_execution_role()\n", 31 | "print(\"RoleArn: {}\".format(role))\n", 32 | "sess = sagemaker.Session()\n", 33 | "s3BucketName = \"\"\n", 34 | "prefix = 'chapter5'\n", 35 | "\n", 36 | "s3 = boto3.client('s3')" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": null, 42 | "id": "dd320b14", 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "\n", 47 | "# initialize the boto3 handle for comprehend\n", 48 | "comprehend = boto3.client('comprehend')\n", 49 | "textract= boto3.client('textract')\n", 50 | "kendra= boto3.client('kendra')" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "id": "49ded142", 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "# Document\n", 61 | "documentName = \"resume_Sample.pdf\"" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": null, 67 | "id": "aa7e908a", 68 | "metadata": {}, 69 | "outputs": [], 70 | "source": [ 71 | "\n", 72 | "s3 = boto3.resource('s3')\n", 73 | "s3.Bucket(s3BucketName).upload_file(documentName,documentName)" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": null, 79 | "id": "d6097210", 80 | "metadata": {}, 81 | "outputs": [], 82 | "source": [ 83 | "def startJob(s3BucketName, objectName):\n", 84 | " response = None\n", 85 | " response = textract.start_document_text_detection(\n", 86 | " DocumentLocation={\n", 87 | " 'S3Object': {\n", 88 | " 'Bucket': s3BucketName,\n", 89 | " 'Name': objectName\n", 90 | " }\n", 91 | " })\n", 92 | "\n", 93 | " return response[\"JobId\"]\n", 94 | "\n", 95 | "def isJobComplete(jobId):\n", 96 | " response = textract.get_document_text_detection(JobId=jobId)\n", 97 | " status = response[\"JobStatus\"]\n", 98 | " print(\"Job status: {}\".format(status))\n", 99 | "\n", 100 | " while(status == \"IN_PROGRESS\"):\n", 101 | " time.sleep(5)\n", 102 | " response = textract.get_document_text_detection(JobId=jobId)\n", 103 | " status = response[\"JobStatus\"]\n", 104 | " print(\"Job status: {}\".format(status))\n", 105 | "\n", 106 | " return 
status\n", 107 | "\n", 108 | "def getJobResults(jobId):\n", 109 | "\n", 110 | " pages = []\n", 111 | " response = textract.get_document_text_detection(JobId=jobId)\n", 112 | " \n", 113 | " pages.append(response)\n", 114 | " print(\"Resultset page recieved: {}\".format(len(pages)))\n", 115 | " nextToken = None\n", 116 | " if('NextToken' in response):\n", 117 | " nextToken = response['NextToken']\n", 118 | "\n", 119 | " while(nextToken):\n", 120 | " response = textract.get_document_text_detection(JobId=jobId, NextToken=nextToken)\n", 121 | "\n", 122 | " pages.append(response)\n", 123 | " print(\"Resultset page recieved: {}\".format(len(pages)))\n", 124 | " nextToken = None\n", 125 | " if('NextToken' in response):\n", 126 | " nextToken = response['NextToken']\n", 127 | "\n", 128 | " return pages" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "id": "e4988775", 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [ 138 | "\n", 139 | "\n", 140 | "jobId = startJob(s3BucketName, documentName)\n", 141 | "print(\"Started job with id: {}\".format(jobId))\n", 142 | "if(isJobComplete(jobId)):\n", 143 | " response = getJobResults(jobId)\n", 144 | "\n", 145 | "#print(response)\n", 146 | "\n", 147 | "\n" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": null, 153 | "id": "ed4a963e", 154 | "metadata": {}, 155 | "outputs": [], 156 | "source": [ 157 | "# Print detected text\n", 158 | "text=\"\"\n", 159 | "for resultPage in response:\n", 160 | " for item in resultPage[\"Blocks\"]:\n", 161 | " if item[\"BlockType\"] == \"LINE\":\n", 162 | " #print ('\\033[94m' + item[\"Text\"] + '\\033[0m')\n", 163 | " text += item['Text']+\"\\n\"\n", 164 | "print(text)" 165 | ] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "id": "58dba8da", 170 | "metadata": {}, 171 | "source": [ 172 | "# Call Amazon Comprehend" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": null, 178 | "id": "c65788e6", 179 | "metadata": {}, 180 | "outputs": [], 181 | "source": [ 182 | "entities= comprehend.detect_entities(Text=text, LanguageCode='en')\n" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": null, 188 | "id": "e096d750", 189 | "metadata": {}, 190 | "outputs": [], 191 | "source": [ 192 | "print(json.dumps(entities, sort_keys=True, indent=4))" 193 | ] 194 | }, 195 | { 196 | "cell_type": "markdown", 197 | "id": "f7ffc4c2", 198 | "metadata": {}, 199 | "source": [ 200 | "# Create Kendra Index \n", 201 | "go to Kendra console https://console.aws.amazon.com/kendra/home?region=us-east-1#indexes/create\n", 202 | "to create an index by following book instructions and skip creating using API.\n", 203 | " \n", 204 | "Alternatively, Please craete an IAM role and provide in Role ARN, \n", 205 | "\n", 206 | "https://docs.aws.amazon.com/kendra/latest/dg/deploying.html" 207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "execution_count": null, 212 | "id": "3b2abed6", 213 | "metadata": {}, 214 | "outputs": [], 215 | "source": [ 216 | "# run this code only once as it will craete multiple indexes\n", 217 | "#response = kendra.create_index(\n", 218 | "# Name='Search',\n", 219 | "# Edition='DEVELOPER_EDITION',\n", 220 | "# RoleArn='')\n", 221 | "#print(response)\n" 222 | ] 223 | }, 224 | { 225 | "cell_type": "markdown", 226 | "id": "e5b4f468", 227 | "metadata": {}, 228 | "source": [ 229 | "Get IndexId from Console and paste it in ID or run above code to create Index which will give 36 digit Index ID." 
230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": null, 235 | "id": "c525a4d3", 236 | "metadata": {}, 237 | "outputs": [], 238 | "source": [ 239 | "response = kendra.update_index(\n", 240 | " Id=\"\",\n", 241 | " DocumentMetadataConfigurationUpdates=[\n", 242 | " {\n", 243 | " 'Name':'ORGANIZATION',\n", 244 | " 'Type':'STRING_LIST_VALUE',\n", 245 | " 'Search': {\n", 246 | " 'Facetable': True,\n", 247 | " 'Searchable': True,\n", 248 | " 'Displayable': True\n", 249 | " }\n", 250 | " },\n", 251 | " {\n", 252 | " 'Name':'PERSON',\n", 253 | " 'Type':'STRING_LIST_VALUE',\n", 254 | " 'Search': {\n", 255 | " 'Facetable': False,\n", 256 | " 'Searchable': True,\n", 257 | " 'Displayable': True\n", 258 | " }\n", 259 | " },\n", 260 | " {\n", 261 | " 'Name':'DATE',\n", 262 | " 'Type':'STRING_LIST_VALUE',\n", 263 | " 'Search': {\n", 264 | " 'Facetable': False,\n", 265 | " 'Searchable': True,\n", 266 | " 'Displayable': True\n", 267 | " }\n", 268 | " },\n", 269 | " {\n", 270 | " 'Name':'COMMERCIAL_ITEM',\n", 271 | " 'Type':'STRING_LIST_VALUE',\n", 272 | " 'Search': {\n", 273 | " 'Facetable': True,\n", 274 | " 'Searchable': False,\n", 275 | " 'Displayable': True\n", 276 | " }\n", 277 | " },\n", 278 | " {\n", 279 | " 'Name':'OTHER',\n", 280 | " 'Type':'STRING_LIST_VALUE',\n", 281 | " 'Search': {\n", 282 | " 'Facetable': True,\n", 283 | " 'Searchable': True,\n", 284 | " 'Displayable': True\n", 285 | " }\n", 286 | " }\n", 287 | " ,\n", 288 | " {\n", 289 | " 'Name':'QUANTITY',\n", 290 | " 'Type':'STRING_LIST_VALUE',\n", 291 | " 'Search': {\n", 292 | " 'Facetable': True,\n", 293 | " 'Searchable': True,\n", 294 | " 'Displayable': True\n", 295 | " }\n", 296 | " }\n", 297 | " ,\n", 298 | " {\n", 299 | " 'Name':'TITLE',\n", 300 | " 'Type':'STRING_LIST_VALUE',\n", 301 | " 'Search': {\n", 302 | " 'Facetable': False,\n", 303 | " 'Searchable': True,\n", 304 | " 'Displayable': True\n", 305 | " }\n", 306 | " }\n", 307 | " ])\n", 308 | " \n", 309 | "print(response)" 310 | ] 311 | }, 312 | { 313 | "cell_type": "code", 314 | "execution_count": null, 315 | "id": "1a32c481", 316 | "metadata": {}, 317 | "outputs": [], 318 | "source": [ 319 | "#List of categories recognized by Comprehend \n", 320 | "categories = [\"ORGANIZATION\", \"PERSON\", \"DATE\", \"COMMERCIAL_ITEM\", \"OTHER\", \"TITLE\", \"QUANTITY\"]" 321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": null, 326 | "id": "f56eccf1", 327 | "metadata": {}, 328 | "outputs": [], 329 | "source": [ 330 | "#List of JSON objects to store entities\n", 331 | "entity_data = dict()\n", 332 | "#List of observed text strings recognized as categories\n", 333 | "category_text = dict()\n", 334 | "#Frequency of each text string\n", 335 | "text_frequency = dict()\n", 336 | "#The Kendra attributes JSON object with metadata list to be populated\n", 337 | "attributes = dict()\n", 338 | "metadata = dict()" 339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "execution_count": null, 344 | "id": "80a6b07e", 345 | "metadata": {}, 346 | "outputs": [], 347 | "source": [ 348 | "for et in categories:\n", 349 | " entity_data[et] = set()\n", 350 | " #print(entity_data[et])\n", 351 | " category_text[et] = []\n", 352 | " text_frequency[et] = dict()" 353 | ] 354 | }, 355 | { 356 | "cell_type": "code", 357 | "execution_count": null, 358 | "id": "5fffe276", 359 | "metadata": {}, 360 | "outputs": [], 361 | "source": [ 362 | "for e in entities[\"Entities\"]:\n", 363 | " if (e[\"Text\"].isprintable()) and (not \"\\\"\" in e[\"Text\"]) and (not 
e[\"Text\"].upper() in category_text[e[\"Type\"]]):\n", 364 | " #Append the text to entity data to be used for a Kendra custom attribute\n", 365 | " entity_data[e[\"Type\"]].add(e[\"Text\"])\n", 366 | " #Keep track of text in upper case so that we don't treat the same text written in different cases differently\n", 367 | " category_text[e[\"Type\"]].append(e[\"Text\"].upper())\n", 368 | " #Keep track of the frequency of the text so that we can take the text with highest frequency of occurrance\n", 369 | " text_frequency[e[\"Type\"]][e[\"Text\"].upper()] = 1\n", 370 | " elif (e[\"Text\"].upper() in category_text[e[\"Type\"]]):\n", 371 | " #Keep track of the frequency of the text so that we can take the text with highest frequency of occurrance\n", 372 | " text_frequency[e[\"Type\"]][e[\"Text\"].upper()] += 1\n", 373 | "\n", 374 | "print(entity_data)" 375 | ] 376 | }, 377 | { 378 | "cell_type": "code", 379 | "execution_count": null, 380 | "id": "6fa3c949", 381 | "metadata": {}, 382 | "outputs": [], 383 | "source": [ 384 | "#Populate the metadata list\n", 385 | "elimit = 10\n", 386 | "for et in categories:\n", 387 | " #Take at most elimit number of recognized text strings having the highest frequency of occurrance\n", 388 | " el = [pair[0] for pair in sorted(text_frequency[et].items(), key=lambda item: item[1], reverse=True)][0:elimit]\n", 389 | " metadata[et] = [d for d in entity_data[et] if d.upper() in el]\n", 390 | "metadata[\"_source_uri\"] = documentName\n", 391 | "attributes[\"Attributes\"] = metadata\n", 392 | "print(json.dumps(attributes, sort_keys=True, indent=4))" 393 | ] 394 | }, 395 | { 396 | "cell_type": "code", 397 | "execution_count": null, 398 | "id": "7d3fe867", 399 | "metadata": {}, 400 | "outputs": [], 401 | "source": [ 402 | "with open(\"metadata.json\", \"w\") as f:\n", 403 | " json.dump(attributes, f)" 404 | ] 405 | }, 406 | { 407 | "cell_type": "code", 408 | "execution_count": null, 409 | "id": "d7abba45", 410 | "metadata": {}, 411 | "outputs": [], 412 | "source": [ 413 | "s3 = boto3.client('s3')\n", 414 | "prefix= 'meta/'\n", 415 | "with open(\"metadata.json\", \"rb\") as f:\n", 416 | " #s3.upload_fileobj(f,s3BucketName, prefix+\"resume_Sample.pdf.metadata.json\")\n", 417 | " s3.upload_file( \"metadata.json\", s3BucketName,'%s/%s' % (\"meta\",\"resume_Sample.pdf.metadata.json\"))\n", 418 | "print(\"Uploaded to Amazon S3 meta folder\")" 419 | ] 420 | }, 421 | { 422 | "cell_type": "markdown", 423 | "id": "36d2cec1", 424 | "metadata": {}, 425 | "source": [ 426 | "# Run Kendra Sync in AWS Console" 427 | ] 428 | }, 429 | { 430 | "cell_type": "markdown", 431 | "id": "b64a0f5c", 432 | "metadata": {}, 433 | "source": [ 434 | "# Clean UP" 435 | ] 436 | }, 437 | { 438 | "cell_type": "markdown", 439 | "id": "161959b7", 440 | "metadata": {}, 441 | "source": [ 442 | "# Delete the Amazon S3 Data source and the Kendra Index \n", 443 | "https://docs.aws.amazon.com/kendra/latest/dg/delete-data-source.html" 444 | ] 445 | } 446 | ], 447 | "metadata": { 448 | "kernelspec": { 449 | "display_name": "conda_python3", 450 | "language": "python", 451 | "name": "conda_python3" 452 | }, 453 | "language_info": { 454 | "codemirror_mode": { 455 | "name": "ipython", 456 | "version": 3 457 | }, 458 | "file_extension": ".py", 459 | "mimetype": "text/x-python", 460 | "name": "python", 461 | "nbconvert_exporter": "python", 462 | "pygments_lexer": "ipython3", 463 | "version": "3.6.13" 464 | } 465 | }, 466 | "nbformat": 4, 467 | "nbformat_minor": 5 468 | } 469 | 
-------------------------------------------------------------------------------- /Chapter 05/lambda/index.py: -------------------------------------------------------------------------------- 1 | from elasticsearch import Elasticsearch, RequestsHttpConnection 2 | import requests 3 | from aws_requests_auth.aws_auth import AWSRequestsAuth 4 | from requests_aws4auth import AWS4Auth 5 | import base64 6 | from s3transfer.manager import TransferManager 7 | import os 8 | import os.path 9 | import sys 10 | import boto3 11 | import json 12 | import io 13 | from io import BytesIO 14 | import sys 15 | from trp import Document 16 | 17 | try: 18 | from urllib.parse import unquote_plus 19 | except ImportError: 20 | from urllib import unquote_plus 21 | 22 | 23 | print('setting up boto3') 24 | 25 | root = os.environ["LAMBDA_TASK_ROOT"] 26 | sys.path.insert(0, root) 27 | print(boto3.__version__) 28 | print('core path setup') 29 | s3 = boto3.resource('s3') 30 | s3client = boto3.client('s3') 31 | 32 | host= os.environ['esDomain'] 33 | print("ES DOMAIN IS..........") 34 | region=os.environ['AWS_REGION'] 35 | 36 | service = 'es' 37 | credentials = boto3.Session().get_credentials() 38 | 39 | def connectES(): 40 | print ('Connecting to the ES Endpoint {0}') 41 | awsauth = AWS4Auth(credentials.access_key, 42 | credentials.secret_key, 43 | region, service, 44 | session_token=credentials.token) 45 | try: 46 | es = Elasticsearch( 47 | hosts=[{'host': host, 'port': 443}], 48 | http_auth = awsauth, 49 | use_ssl=True, 50 | verify_certs=True, 51 | connection_class=RequestsHttpConnection) 52 | return es 53 | except Exception as E: 54 | print("Unable to connect to {0}") 55 | print(E) 56 | exit(3) 57 | print("sucess seting up es") 58 | 59 | print("setting up Textract") 60 | # get the results 61 | textract = boto3.client( 62 | service_name='textract', 63 | region_name=region) 64 | 65 | print('initializing comprehend') 66 | comprehend = boto3.client(service_name='comprehend', region_name=region) 67 | print('done') 68 | 69 | def outputForm(page): 70 | csvData = [] 71 | for field in page.form.fields: 72 | csvItem = [] 73 | if(field.key): 74 | csvItem.append(field.key.text) 75 | else: 76 | csvItem.append("") 77 | if(field.value): 78 | csvItem.append(field.value.text) 79 | else: 80 | csvItem.append("") 81 | csvData.append(csvItem) 82 | return csvData 83 | 84 | def outputTable(page): 85 | csvData = [] 86 | print("//////////////////") 87 | #print(page) 88 | for table in page.tables: 89 | csvRow = [] 90 | csvRow.append("Table") 91 | csvData.append(csvRow) 92 | for row in table.rows: 93 | csvRow = [] 94 | for cell in row.cells: 95 | csvRow.append(cell.text) 96 | csvData.append(csvRow) 97 | csvData.append([]) 98 | csvData.append([]) 99 | return csvData 100 | # --------------- Main Lambda Handler ------------------ 101 | 102 | 103 | def handler(event, context): 104 | print("Received event: " + json.dumps(event, indent=2)) 105 | 106 | # Get the object from the event and show its content type 107 | bucket = event['Records'][0]['s3']['bucket']['name'] 108 | key = unquote_plus(event['Records'][0]['s3']['object']['key']) 109 | print("key is"+key) 110 | print("bucket is"+bucket) 111 | text="" 112 | textvalues=[] 113 | textvalues_entity={} 114 | try: 115 | s3.Bucket(bucket).download_file(Key=key,Filename='/tmp/{}') 116 | # Read document content 117 | with open('/tmp/{}', 'rb') as document: 118 | imageBytes = bytearray(document.read()) 119 | print("Object downloaded") 120 | response = textract.analyze_document(Document={'Bytes': 
imageBytes},FeatureTypes=["TABLES", "FORMS"]) 121 | document = Document(response) 122 | table=[] 123 | forms=[] 124 | #print(document) 125 | for page in document.pages: 126 | table = outputTable(page) 127 | forms = outputForm(page) 128 | print(table) 129 | blocks=response['Blocks'] 130 | for block in blocks: 131 | if block['BlockType'] == 'LINE': 132 | text += block['Text']+"\n" 133 | print(text) 134 | # Extracting Key Phrases 135 | keyphrase_response = comprehend.detect_key_phrases(Text=text, LanguageCode='en') 136 | KeyPhraseList=keyphrase_response.get("KeyPhrases") 137 | for s in KeyPhraseList: 138 | textvalues.append(s.get("Text")) 139 | 140 | detect_entity= comprehend.detect_entities(Text=text, LanguageCode='en') 141 | EntityList=detect_entity.get("Entities") 142 | for s in EntityList: 143 | textvalues_entity.update([(s.get("Type").strip('\t\n\r'),s.get("Text").strip('\t\n\r'))]) 144 | 145 | s3url= 'https://s3.console.aws.amazon.com/s3/object/'+bucket+'/'+key+'?region='+region 146 | 147 | searchdata={'s3link':s3url,'KeyPhrases':textvalues,'Entity':textvalues_entity,'text':text, 'table':table, 'forms':forms} 148 | print(searchdata) 149 | print("connecting to ES") 150 | es=connectES() 151 | #es.index(index="resume-search", doc_type="_doc", body=searchdata) 152 | es.index(index="document", doc_type="_doc", body=searchdata) 153 | print("data uploaded to Elasticsearch") 154 | return 'keyphrases Successfully Uploaded' 155 | except Exception as e: 156 | print(e) 157 | print('Error: ') 158 | raise e 159 | -------------------------------------------------------------------------------- /Chapter 05/resume_Sample.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 05/resume_Sample.pdf -------------------------------------------------------------------------------- /Chapter 05/resume_sample.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 05/resume_sample.PNG -------------------------------------------------------------------------------- /Chapter 05/template-export-textract.yml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: 2010-09-09 2 | Transform: 3 | - AWS::Serverless-2016-10-31 4 | Parameters: 5 | DOMAINNAME: 6 | Description: Name for the Amazon ES domain that this template will create. Domain 7 | names must start with a lowercase letter and must be between 3 and 28 characters. 8 | Valid characters are a-z (lowercase only), 0-9. 
9 | Type: String 10 | Default: documentsearchapp 11 | CognitoAdminEmail: 12 | Type: String 13 | Default: abc@amazon.com 14 | AllowedPattern: ^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$ 15 | Description: E-mail address of the Cognito admin name 16 | Mappings: 17 | SourceCode: 18 | General: 19 | S3Bucket: solutions 20 | KeyPrefix: centralized-logging/v2.2.0 21 | Resources: 22 | ComprehendKeyPhraseAnalysis: 23 | Properties: 24 | Description: Triggered by S3 review upload to the repo bucket and start the 25 | key phrase analysis via Amazon Comprehend 26 | Handler: comprehend.handler 27 | MemorySize: 128 28 | Policies: 29 | Statement: 30 | - Sid: comprehend 31 | Effect: Allow 32 | Action: 33 | - comprehend:* 34 | Resource: '*' 35 | - Sid: textract 36 | Effect: Allow 37 | Action: 38 | - textract:* 39 | Resource: '*' 40 | - Sid: s3 41 | Effect: Allow 42 | Action: 43 | - s3:*Object 44 | Resource: 45 | Fn::Sub: arn:aws:s3:::${S3}/* 46 | - Sid: es 47 | Effect: Allow 48 | Action: 49 | - es:* 50 | Resource: '*' 51 | Environment: 52 | Variables: 53 | bucket: 54 | Ref: S3 55 | esDomain: 56 | Fn::GetAtt: 57 | - ElasticsearchDomain 58 | - DomainEndpoint 59 | Runtime: python3.6 60 | Timeout: 300 61 | CodeUri: s3://forindexing/3c0a3b1c981cda97ffabeb704fd0abd2 62 | Type: AWS::Serverless::Function 63 | S3: 64 | Type: AWS::S3::Bucket 65 | TestS3BucketEventPermission: 66 | Type: AWS::Lambda::Permission 67 | Properties: 68 | Action: lambda:invokeFunction 69 | SourceAccount: 70 | Ref: AWS::AccountId 71 | FunctionName: 72 | Ref: ComprehendKeyPhraseAnalysis 73 | SourceArn: 74 | Fn::GetAtt: 75 | - S3 76 | - Arn 77 | Principal: s3.amazonaws.com 78 | ApplyNotificationFunctionRole: 79 | Type: AWS::IAM::Role 80 | Properties: 81 | AssumeRolePolicyDocument: 82 | Version: '2012-10-17' 83 | Statement: 84 | - Effect: Allow 85 | Principal: 86 | Service: lambda.amazonaws.com 87 | Action: sts:AssumeRole 88 | ManagedPolicyArns: 89 | - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole 90 | Path: / 91 | Policies: 92 | - PolicyName: S3BucketNotificationPolicy 93 | PolicyDocument: 94 | Version: '2012-10-17' 95 | Statement: 96 | - Sid: AllowBucketNotification 97 | Effect: Allow 98 | Action: s3:PutBucketNotification 99 | Resource: 100 | - Fn::Sub: arn:aws:s3:::${S3} 101 | - Fn::Sub: arn:aws:s3:::${S3}/* 102 | ApplyBucketNotificationFunction: 103 | Type: AWS::Lambda::Function 104 | Properties: 105 | Description: Dummy function, just logs the received event 106 | Handler: index.handler 107 | Runtime: python3.9 108 | Role: 109 | Fn::GetAtt: 110 | - ApplyNotificationFunctionRole 111 | - Arn 112 | Timeout: 240 113 | Code: 114 | ZipFile: "import boto3\nimport logging\nimport json\nimport cfnresponse\n\n\ 115 | s3Client = boto3.client('s3')\nlogger = logging.getLogger()\nlogger.setLevel(logging.DEBUG)\n\ 116 | \ndef addBucketNotification(bucketName, notificationId, functionArn):\n\ 117 | \ notificationResponse = s3Client.put_bucket_notification_configuration(\n\ 118 | \ Bucket=bucketName,\n NotificationConfiguration={\n 'LambdaFunctionConfigurations':\ 119 | \ [\n {\n 'Id': notificationId,\n 'LambdaFunctionArn':\ 120 | \ functionArn,\n 'Events': [\n 's3:ObjectCreated:*'\n\ 121 | \ ]\n },\n ]\n }\n )\n return notificationResponse\n\ 122 | \ndef create(properties, physical_id):\n bucketName = properties['S3Bucket']\n\ 123 | \ notificationId = properties['NotificationId']\n functionArn = properties['FunctionARN']\n\ 124 | \ response = addBucketNotification(bucketName, notificationId, functionArn)\n\ 125 | \ 
logger.info('AddBucketNotification response: %s' % json.dumps(response))\n\ 126 | \ return cfnresponse.SUCCESS, physical_id\n\ndef update(properties, physical_id):\n\ 127 | \ return cfnresponse.SUCCESS, None\n\ndef delete(properties, physical_id):\n\ 128 | \ return cfnresponse.SUCCESS, None\n\ndef handler(event, context):\n logger.info('Received\ 129 | \ event: %s' % json.dumps(event))\n\n status = cfnresponse.FAILED\n new_physical_id\ 130 | \ = None\n\n try:\n properties = event.get('ResourceProperties')\n \ 131 | \ physical_id = event.get('PhysicalResourceId')\n\n status, new_physical_id\ 132 | \ = {\n 'Create': create,\n 'Update': update,\n 'Delete':\ 133 | \ delete\n }.get(event['RequestType'], lambda x, y: (cfnresponse.FAILED,\ 134 | \ None))(properties, physical_id)\n except Exception as e:\n logger.error('Exception:\ 135 | \ %s' % e)\n status = cfnresponse.FAILED\n finally:\n cfnresponse.send(event,\ 136 | \ context, status, {}, new_physical_id)\n" 137 | UserPool: 138 | Type: AWS::Cognito::UserPool 139 | Properties: 140 | UserPoolName: 141 | Fn::Sub: ${DOMAINNAME}_kibana_access 142 | AutoVerifiedAttributes: 143 | - email 144 | MfaConfiguration: 'OFF' 145 | EmailVerificationSubject: 146 | Ref: AWS::StackName 147 | Schema: 148 | - Name: name 149 | AttributeDataType: String 150 | Mutable: true 151 | Required: true 152 | - Name: email 153 | AttributeDataType: String 154 | Mutable: false 155 | Required: true 156 | UserPoolGroup: 157 | Type: AWS::Cognito::UserPoolGroup 158 | Properties: 159 | Description: User pool group for Kibana access 160 | GroupName: 161 | Fn::Sub: ${DOMAINNAME}_kibana_access_group 162 | Precedence: 0 163 | UserPoolId: 164 | Ref: UserPool 165 | UserPoolClient: 166 | Type: AWS::Cognito::UserPoolClient 167 | Properties: 168 | ClientName: 169 | Fn::Sub: ${DOMAINNAME}-client 170 | GenerateSecret: false 171 | UserPoolId: 172 | Ref: UserPool 173 | IdentityPool: 174 | Type: AWS::Cognito::IdentityPool 175 | Properties: 176 | IdentityPoolName: 177 | Fn::Sub: ${DOMAINNAME}Identity 178 | AllowUnauthenticatedIdentities: true 179 | CognitoIdentityProviders: 180 | - ClientId: 181 | Ref: UserPoolClient 182 | ProviderName: 183 | Fn::GetAtt: 184 | - UserPool 185 | - ProviderName 186 | CognitoUnAuthorizedRole: 187 | Type: AWS::IAM::Role 188 | Properties: 189 | AssumeRolePolicyDocument: 190 | Version: '2012-10-17' 191 | Statement: 192 | - Effect: Allow 193 | Principal: 194 | Federated: cognito-identity.amazonaws.com 195 | Action: 196 | - sts:AssumeRoleWithWebIdentity 197 | Condition: 198 | StringEquals: 199 | cognito-identity.amazonaws.com:aud: 200 | Ref: IdentityPool 201 | ForAnyValue:StringLike: 202 | cognito-identity.amazonaws.com:amr: unauthenticated 203 | Policies: 204 | - PolicyName: CognitoUnauthorizedPolicy 205 | PolicyDocument: 206 | Version: '2012-10-17' 207 | Statement: 208 | - Effect: Allow 209 | Action: 210 | - mobileanalytics:PutEvents 211 | - cognito-sync:BulkPublish 212 | - cognito-sync:DescribeIdentityPoolUsage 213 | - cognito-sync:GetBulkPublishDetails 214 | - cognito-sync:GetCognitoEvents 215 | - cognito-sync:GetIdentityPoolConfiguration 216 | - cognito-sync:ListIdentityPoolUsage 217 | - cognito-sync:SetCognitoEvents 218 | - congito-sync:SetIdentityPoolConfiguration 219 | Resource: 220 | Fn::Sub: arn:aws:cognito-identity:${AWS::Region}:${AWS::AccountId}:identitypool/${IdentityPool} 221 | CognitoAuthorizedRole: 222 | Type: AWS::IAM::Role 223 | Properties: 224 | AssumeRolePolicyDocument: 225 | Version: '2012-10-17' 226 | Statement: 227 | - Effect: Allow 228 | 
Principal: 229 | Federated: cognito-identity.amazonaws.com 230 | Action: 231 | - sts:AssumeRoleWithWebIdentity 232 | Condition: 233 | StringEquals: 234 | cognito-identity.amazonaws.com:aud: 235 | Ref: IdentityPool 236 | ForAnyValue:StringLike: 237 | cognito-identity.amazonaws.com:amr: authenticated 238 | Policies: 239 | - PolicyName: CognitoAuthorizedPolicy 240 | PolicyDocument: 241 | Version: '2012-10-17' 242 | Statement: 243 | - Effect: Allow 244 | Action: 245 | - mobileanalytics:PutEvents 246 | - cognito-sync:BulkPublish 247 | - cognito-sync:DescribeIdentityPoolUsage 248 | - cognito-sync:GetBulkPublishDetails 249 | - cognito-sync:GetCognitoEvents 250 | - cognito-sync:GetIdentityPoolConfiguration 251 | - cognito-sync:ListIdentityPoolUsage 252 | - cognito-sync:SetCognitoEvents 253 | - congito-sync:SetIdentityPoolConfiguration 254 | - cognito-identity:DeleteIdentityPool 255 | - cognito-identity:DescribeIdentityPool 256 | - cognito-identity:GetIdentityPoolRoles 257 | - cognito-identity:GetOpenIdTokenForDeveloperIdentity 258 | - cognito-identity:ListIdentities 259 | - cognito-identity:LookupDeveloperIdentity 260 | - cognito-identity:MergeDeveloperIdentities 261 | - cognito-identity:UnlikeDeveloperIdentity 262 | - cognito-identity:UpdateIdentityPool 263 | Resource: 264 | Fn::Sub: arn:aws:cognito-identity:${AWS::Region}:${AWS::AccountId}:identitypool/${IdentityPool} 265 | CognitoESAccessRole: 266 | Type: AWS::IAM::Role 267 | Properties: 268 | ManagedPolicyArns: 269 | - arn:aws:iam::aws:policy/AmazonESCognitoAccess 270 | AssumeRolePolicyDocument: 271 | Version: '2012-10-17' 272 | Statement: 273 | - Effect: Allow 274 | Principal: 275 | Service: es.amazonaws.com 276 | Action: 277 | - sts:AssumeRole 278 | IdentityPoolRoleMapping: 279 | Type: AWS::Cognito::IdentityPoolRoleAttachment 280 | Properties: 281 | IdentityPoolId: 282 | Ref: IdentityPool 283 | Roles: 284 | authenticated: 285 | Fn::GetAtt: 286 | - CognitoAuthorizedRole 287 | - Arn 288 | unauthenticated: 289 | Fn::GetAtt: 290 | - CognitoUnAuthorizedRole 291 | - Arn 292 | AdminUser: 293 | Type: AWS::Cognito::UserPoolUser 294 | Properties: 295 | DesiredDeliveryMediums: 296 | - EMAIL 297 | UserAttributes: 298 | - Name: email 299 | Value: 300 | Ref: CognitoAdminEmail 301 | Username: 302 | Ref: CognitoAdminEmail 303 | UserPoolId: 304 | Ref: UserPool 305 | SetupESCognito: 306 | Type: Custom::SetupESCognito 307 | Version: 1.0 308 | Properties: 309 | ServiceToken: 310 | Fn::GetAtt: 311 | - LambdaESCognito 312 | - Arn 313 | Domain: 314 | Ref: DOMAINNAME 315 | CognitoDomain: 316 | Fn::Sub: ${DOMAINNAME}-${AWS::AccountId} 317 | UserPoolId: 318 | Ref: UserPool 319 | IdentityPoolId: 320 | Ref: IdentityPool 321 | RoleArn: 322 | Fn::GetAtt: 323 | - CognitoESAccessRole 324 | - Arn 325 | LambdaESCognito: 326 | Type: AWS::Lambda::Function 327 | Properties: 328 | Description: Centralized Logging - Lambda function to enable cognito authentication 329 | for kibana 330 | Environment: 331 | Variables: 332 | LOG_LEVEL: INFO 333 | Handler: index.handler 334 | Runtime: nodejs12.x 335 | Timeout: 600 336 | Role: 337 | Fn::GetAtt: 338 | - LambdaESCognitoRole 339 | - Arn 340 | Code: 341 | S3Bucket: 342 | Fn::Join: 343 | - '-' 344 | - - Fn::FindInMap: 345 | - SourceCode 346 | - General 347 | - S3Bucket 348 | - Ref: AWS::Region 349 | S3Key: 350 | Fn::Join: 351 | - / 352 | - - Fn::FindInMap: 353 | - SourceCode 354 | - General 355 | - KeyPrefix 356 | - clog-auth.zip 357 | LambdaESCognitoRole: 358 | Type: AWS::IAM::Role 359 | DependsOn: ElasticsearchDomain 360 | Properties: 
361 | AssumeRolePolicyDocument: 362 | Version: '2012-10-17' 363 | Statement: 364 | - Effect: Allow 365 | Principal: 366 | Service: 367 | - lambda.amazonaws.com 368 | Action: 369 | - sts:AssumeRole 370 | Path: / 371 | Policies: 372 | - PolicyName: root 373 | PolicyDocument: 374 | Version: '2012-10-17' 375 | Statement: 376 | - Effect: Allow 377 | Action: 378 | - logs:CreateLogGroup 379 | - logs:CreateLogStream 380 | - logs:PutLogEvents 381 | Resource: arn:aws:logs:*:*:* 382 | - Effect: Allow 383 | Action: 384 | - es:UpdateElasticsearchDomainConfig 385 | Resource: 386 | Fn::Sub: arn:aws:es:${AWS::Region}:${AWS::AccountId}:domain/${DOMAINNAME} 387 | - Effect: Allow 388 | Action: 389 | - cognito-idp:CreateUserPoolDomain 390 | - cognito-idp:DeleteUserPoolDomain 391 | Resource: 392 | Fn::GetAtt: 393 | - UserPool 394 | - Arn 395 | - Effect: Allow 396 | Action: 397 | - iam:PassRole 398 | Resource: 399 | Fn::GetAtt: 400 | - CognitoESAccessRole 401 | - Arn 402 | ElasticsearchDomain: 403 | Type: AWS::Elasticsearch::Domain 404 | Properties: 405 | DomainName: 406 | Ref: DOMAINNAME 407 | ElasticsearchVersion: '6.3' 408 | ElasticsearchClusterConfig: 409 | InstanceCount: '1' 410 | InstanceType: t2.small.elasticsearch 411 | EBSOptions: 412 | EBSEnabled: true 413 | Iops: 0 414 | VolumeSize: 10 415 | VolumeType: gp2 416 | SnapshotOptions: 417 | AutomatedSnapshotStartHour: '0' 418 | AccessPolicies: 419 | Version: '2012-10-17' 420 | Statement: 421 | - Action: es:* 422 | Principal: 423 | AWS: 424 | Fn::Sub: 425 | - arn:aws:sts::${AWS::AccountId}:assumed-role/${AuthRole}/CognitoIdentityCredentials 426 | - AuthRole: 427 | Ref: CognitoAuthorizedRole 428 | Effect: Allow 429 | Resource: 430 | Fn::Sub: arn:aws:es:${AWS::Region}:${AWS::AccountId}:domain/${DOMAINNAME}/* 431 | ApplyNotification: 432 | Type: Custom::ApplyNotification 433 | Properties: 434 | ServiceToken: 435 | Fn::GetAtt: 436 | - ApplyBucketNotificationFunction 437 | - Arn 438 | S3Bucket: 439 | Ref: S3 440 | FunctionARN: 441 | Fn::GetAtt: 442 | - ComprehendKeyPhraseAnalysis 443 | - Arn 444 | NotificationId: S3ObjectCreatedEvent 445 | Outputs: 446 | S3KeyPhraseBucket: 447 | Value: 448 | Fn::Sub: https://console.aws.amazon.com/s3/buckets/${S3}/?region=us-east-1 449 | KibanaLoginURL: 450 | Description: Kibana login URL 451 | Value: 452 | Fn::Sub: https://${ElasticsearchDomain.DomainEndpoint}/_plugin/kibana/ 453 | -------------------------------------------------------------------------------- /Chapter 07/lib/test.py: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Chapter 07/project_path.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | 4 | module_path = os.path.abspath(os.path.join(os.pardir)) 5 | if module_path not in sys.path: 6 | sys.path.append(module_path) -------------------------------------------------------------------------------- /Chapter 08/contextual-ad-marking-for-content-monetization-with-nlp-github.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "f1da84ae", 6 | "metadata": {}, 7 | "source": [ 8 | "# Using NLP for content monetization\n", 9 | "This is an accompanying notebook to Chapter 8 of the book - Natural Language Processing with AWS AI Services. 
Please do not use this notebook directly as there are prerequisites and dependent steps required to be performed as documented in the book. Briefly in this chapter, we look at a use case of how to use AWS services specifically NLP to enable monetization of your video content. The following high level steps (along with where the instructions are) walk through the solution:\n", 10 | "1. Upload a video file to an Amazon S3 bucket - Refer to the book\n", 11 | "2. Use AWS Elemental MediaConvert to create brodcast streams - Refer to the book\n", 12 | "3. Run a transcription of the video file using Amazon Transcribe - Refer to this notebook\n", 13 | "4. Run an Amazon Comprehend Topic Modeling job to extract topics - Refer to this notebook\n", 14 | "5. Select the ad markers based on topics extracted - Refer to this notebook\n", 15 | "6. Stitch into an Ad decision server URL - Refer to this notebook\n", 16 | "7. Create an AWS Elemental MediaTailor configuration - Refer to the book\n", 17 | "8. Play the ad embedded video to test - Refer to the book" 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "id": "66010cce", 23 | "metadata": {}, 24 | "source": [ 25 | "## Transcribe section" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": null, 31 | "id": "6ebea48c", 32 | "metadata": {}, 33 | "outputs": [], 34 | "source": [ 35 | "import pandas as pd\n", 36 | "import json\n", 37 | "import boto3\n", 38 | "import re\n", 39 | "import uuid\n", 40 | "import time\n", 41 | "import io\n", 42 | "import os\n", 43 | "from io import BytesIO\n", 44 | "import sys\n", 45 | "import csv\n", 46 | "from IPython.display import Image, display\n", 47 | "from PIL import Image as PImage, ImageDraw" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": null, 53 | "id": "db26300e", 54 | "metadata": {}, 55 | "outputs": [], 56 | "source": [ 57 | "# create topic-modeling/raw folder we need down the line\n", 58 | "directory = \"topic-modeling\"\n", 59 | "parent_dir = os.getcwd()\n", 60 | " \n", 61 | "# Path\n", 62 | "path = os.path.join(parent_dir, directory)\n", 63 | "os.makedirs(path, exist_ok = True)\n", 64 | "print(\"Directory '%s' created successfully\" %directory)\n", 65 | "\n", 66 | "directory = \"raw\"\n", 67 | "parent_dir = os.getcwd()+'/topic-modeling'\n", 68 | " \n", 69 | "# Path\n", 70 | "path = os.path.join(parent_dir, directory)\n", 71 | "os.makedirs(path, exist_ok = True)\n", 72 | "print(\"Directory '%s' created successfully\" %directory)" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": null, 78 | "id": "21c72aea", 79 | "metadata": {}, 80 | "outputs": [], 81 | "source": [ 82 | "bucket=''\n", 83 | "prefix='chapter8'\n", 84 | "s3=boto3.client('s3')" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": null, 90 | "id": "6e304bdf", 91 | "metadata": {}, 92 | "outputs": [], 93 | "source": [ 94 | "import time\n", 95 | "import boto3\n", 96 | "\n", 97 | "def transcribe_file(job_name, file_uri, transcribe_client):\n", 98 | " transcribe_client.start_transcription_job(\n", 99 | " TranscriptionJobName=job_name,\n", 100 | " Media={'MediaFileUri': file_uri},\n", 101 | " MediaFormat='mp4',\n", 102 | " LanguageCode='en-US'\n", 103 | " )" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": null, 109 | "id": "3a28030a", 110 | "metadata": {}, 111 | "outputs": [], 112 | "source": [ 113 | "job_name = 'media-monetization-transcribe-3'" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": null, 119 | "id": 
"fbf4f49f", 120 | "metadata": {}, 121 | "outputs": [], 122 | "source": [ 123 | "transcribe_client = boto3.client('transcribe')\n", 124 | "file_uri = 's3://'+bucket+'/'+prefix+'/'+'rawvideo/bank-demo-prem-ranga.mp4'\n", 125 | "transcribe_file(job_name, file_uri, transcribe_client)" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": null, 131 | "id": "c6d2d81b", 132 | "metadata": {}, 133 | "outputs": [], 134 | "source": [ 135 | "job = transcribe_client.get_transcription_job(TranscriptionJobName=job_name)\n", 136 | "job_status = job['TranscriptionJob']['TranscriptionJobStatus']\n", 137 | "if job_status in ['COMPLETED', 'FAILED']:\n", 138 | " print(f\"Job {job_name} is {job_status}.\")\n", 139 | " if job_status == 'COMPLETED':\n", 140 | " print(f\"Download the transcript from\\n\"\n", 141 | " f\"\\t{job['TranscriptionJob']['Transcript']['TranscriptFileUri']}\")" 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "id": "19c1d2c6", 147 | "metadata": {}, 148 | "source": [ 149 | "## Comprehend Topic Modeling Section" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "id": "0f76a306", 155 | "metadata": {}, 156 | "source": [ 157 | "### First get the transcript" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": null, 163 | "id": "35d14975", 164 | "metadata": {}, 165 | "outputs": [], 166 | "source": [ 167 | "# Load the csv file into a Pandas DataFrame for easy manipulation\n", 168 | "raw_df = pd.read_json(job['TranscriptionJob']['Transcript']['TranscriptFileUri'])\n", 169 | "raw_df.shape" 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": null, 175 | "id": "ac7e18e1", 176 | "metadata": {}, 177 | "outputs": [], 178 | "source": [ 179 | "raw_df.head()" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": null, 185 | "id": "2fe83f93", 186 | "metadata": {}, 187 | "outputs": [], 188 | "source": [ 189 | "# Let's drop the rest of the columns, we only need the transcript for our solution\n", 190 | "raw_df = pd.DataFrame(raw_df.at['transcripts','results'].copy())" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": null, 196 | "id": "454f1765", 197 | "metadata": {}, 198 | "outputs": [], 199 | "source": [ 200 | "#Convert this back to the CSV file\n", 201 | "raw_df.to_csv('topic-modeling/raw/transcript.csv', header=False, index=False)" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": null, 207 | "id": "b30f7c6a", 208 | "metadata": {}, 209 | "outputs": [], 210 | "source": [ 211 | "directory = \"job-input\"\n", 212 | "parent_dir = os.getcwd()+'/topic-modeling'\n", 213 | "# Path\n", 214 | "path = os.path.join(parent_dir, directory)\n", 215 | "os.makedirs(path, exist_ok = True)\n", 216 | "print(\"Directory '%s' created successfully\" %directory)" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": null, 222 | "id": "f7249ec9", 223 | "metadata": {}, 224 | "outputs": [], 225 | "source": [ 226 | "import csv\n", 227 | "# Run Regex expression to create a list of sentences\n", 228 | "folderpath = r\"topic-modeling/raw\" # make sure to put the 'r' in front and provide the folder where your files are\n", 229 | "filepaths = [os.path.join(folderpath, name) for name in os.listdir(folderpath) if not name.startswith('.')] # do not select hidden directories\n", 230 | "fnfull = \"topic-modeling/job-input/transcript_formatted.csv\"\n", 231 | "for path in filepaths:\n", 232 | " print(path)\n", 233 | " with open(path, 'r') as f:\n", 
234 | " content = f.read() # Read the whole file\n", 235 | " lines = content.split('.') # a list of all sentences\n", 236 | " with open(fnfull, \"w\", encoding='utf-8') as ff:\n", 237 | " csv_writer = csv.writer(ff, delimiter=',', quotechar = '\"')\n", 238 | " for num,line in enumerate(lines): # for each sentence\n", 239 | " csv_writer.writerow([line])\n", 240 | "f.close()" 241 | ] 242 | }, 243 | { 244 | "cell_type": "code", 245 | "execution_count": null, 246 | "id": "0cc5f379", 247 | "metadata": {}, 248 | "outputs": [], 249 | "source": [ 250 | "# Upload the CSV file to the input prefix in S3 to be used in the topic modeling job\n", 251 | "s3.upload_file('topic-modeling/job-input/transcript_formatted.csv', bucket, prefix+'/topic-modeling/job-input/tm-input.csv')\n", 252 | "print('s3://'+bucket+'/'+prefix+'/topic-modeling/job-input/tm-input.csv')" 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "id": "42f1014b", 258 | "metadata": {}, 259 | "source": [ 260 | "### Now follow the instructions in the book to run the topic modeling job from the Amazon Comprehend console" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "id": "1b4862fa", 266 | "metadata": {}, 267 | "source": [ 268 | "### Process Topic Modeling Results" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": null, 274 | "id": "47baa716", 275 | "metadata": {}, 276 | "outputs": [], 277 | "source": [ 278 | "# Let's first download the results of the topic modeling job. \n", 279 | "# Please copy the output data location from your topic modeling job for this step and use it below\n", 280 | "\n", 281 | "# create topic-modeling/results folder\n", 282 | "directory = \"results\"\n", 283 | "parent_dir = os.getcwd()+'/topic-modeling'\n", 284 | " \n", 285 | "# Path\n", 286 | "path = os.path.join(parent_dir, directory)\n", 287 | "os.makedirs(path, exist_ok = True)\n", 288 | "print(\"Directory '%s' created successfully\" %directory)\n", 289 | "\n", 290 | "#tpprefix = prefix+'/'+''\n", 291 | "tpprefix = prefix+'/'+'topic-modeling//output/output.tar.gz'\n", 292 | "s3.download_file(bucket, tpprefix, 'topic-modeling/results/output.tar.gz')\n", 293 | "!tar -xzvf topic-modeling/results/output.tar.gz" 294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "execution_count": null, 299 | "id": "f29067b8", 300 | "metadata": {}, 301 | "outputs": [], 302 | "source": [ 303 | "# Now load each of the resulting CSV files to their own DataFrames\n", 304 | "tt_df = pd.read_csv('topic-terms.csv')\n", 305 | "dt_df = pd.read_csv('doc-topics.csv')" 306 | ] 307 | }, 308 | { 309 | "cell_type": "code", 310 | "execution_count": null, 311 | "id": "915342d0", 312 | "metadata": {}, 313 | "outputs": [], 314 | "source": [ 315 | "# the topic terms DataFrame contains the topic number, what term corresponds to the topic, and \n", 316 | "# the weightage of this term contributing to the topic\n", 317 | "for i,x in tt_df.iterrows():\n", 318 | " print(str(x['topic'])+\":\"+x['term']+\":\"+str(x['weight']))" 319 | ] 320 | }, 321 | { 322 | "cell_type": "code", 323 | "execution_count": null, 324 | "id": "525911ff", 325 | "metadata": {}, 326 | "outputs": [], 327 | "source": [ 328 | "# We may have multiple topics in the same line, but for this example we are not interested in these duplicates, so we will drop it\n", 329 | "dt_df = dt_df.drop_duplicates(subset=['docname'])" 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": null, 335 | "id": "fa01262f", 336 | "metadata": {}, 337 | "outputs": [], 338 | "source": [ 
339 | "# Filter the rows in the mean range of weightage for a topic\n", 340 | "ttdf_max = tt_df.groupby(['topic'], sort=False)['weight'].max()" 341 | ] 342 | }, 343 | { 344 | "cell_type": "code", 345 | "execution_count": null, 346 | "id": "01f1fc1c", 347 | "metadata": {}, 348 | "outputs": [], 349 | "source": [ 350 | "ttdf_max.head()" 351 | ] 352 | }, 353 | { 354 | "cell_type": "code", 355 | "execution_count": null, 356 | "id": "b8dc6243", 357 | "metadata": {}, 358 | "outputs": [], 359 | "source": [ 360 | "# Load these into its own DataFrame and remove terms that are masked\n", 361 | "newtt_df = pd.DataFrame()\n", 362 | "for x in ttdf_max:\n", 363 | " newtt_df = newtt_df.append(tt_df.query('weight == @x'))\n", 364 | "newtt_df = newtt_df.reset_index(drop=True) \n", 365 | "adtopic = newtt_df.at[0,'term']" 366 | ] 367 | }, 368 | { 369 | "cell_type": "markdown", 370 | "id": "2e83f342", 371 | "metadata": {}, 372 | "source": [ 373 | "## Ad marking for Media Tailor\n", 374 | "I have provided a sample csv containing content metadata for looking up ads. For this example, we'll use the topics we discovered from our topic modeling job as the key to fetch the cmsid & vid. We will then substitute these in the VAST ad marker URL before creating the AWS Elemental Media Tailor configuration." 375 | ] 376 | }, 377 | { 378 | "cell_type": "code", 379 | "execution_count": null, 380 | "id": "939a893b", 381 | "metadata": {}, 382 | "outputs": [], 383 | "source": [ 384 | "#Get the ad content for marking our input video\n", 385 | "adindex_df = pd.read_csv('media-content/ad-index.csv', header=None, index_col=0)\n", 386 | "adindex_df" 387 | ] 388 | }, 389 | { 390 | "cell_type": "markdown", 391 | "id": "43ac583c", 392 | "metadata": {}, 393 | "source": [ 394 | "#### Lookup a topic from our ad index file" 395 | ] 396 | }, 397 | { 398 | "cell_type": "code", 399 | "execution_count": null, 400 | "id": "097ed2a7", 401 | "metadata": {}, 402 | "outputs": [], 403 | "source": [ 404 | "#Lookup the cmsid and vid for content as the topic\n", 405 | "advalue = adindex_df.loc[adtopic]\n", 406 | "print(advalue[1] + \" and \" + advalue[2])" 407 | ] 408 | }, 409 | { 410 | "cell_type": "code", 411 | "execution_count": null, 412 | "id": "71095949", 413 | "metadata": {}, 414 | "outputs": [], 415 | "source": [ 416 | "#Now we will create the AdMarker URL to use with AWS Elemental MediaTailor. 
\n", 417 | "#Lets first copy the placeholder URL available in our github repo which has a pre-roll, mid-roll and post-roll segments filled in\n", 418 | "ad_rawurl = pd.read_csv('media-content/adserver.csv', header=None).at[0,0].split('&')\n", 419 | "ad_rawurl" 420 | ] 421 | }, 422 | { 423 | "cell_type": "code", 424 | "execution_count": null, 425 | "id": "82870766", 426 | "metadata": {}, 427 | "outputs": [], 428 | "source": [ 429 | "ad_formattedurl = ''\n", 430 | "for x in ad_rawurl:\n", 431 | " if 'cmsid' in x:\n", 432 | " x = advalue[1]\n", 433 | " if 'vid' in x:\n", 434 | " x = advalue[2]\n", 435 | " ad_formattedurl += x + '&'\n", 436 | " \n", 437 | "ad_formattedurl = ad_formattedurl.rstrip('&')\n", 438 | "ad_formattedurl" 439 | ] 440 | }, 441 | { 442 | "cell_type": "markdown", 443 | "id": "84197e06", 444 | "metadata": {}, 445 | "source": [ 446 | "## Resume from Creating AWS Elemental MediaTailor Configuration section in Chapter 8 of the book" 447 | ] 448 | } 449 | ], 450 | "metadata": { 451 | "kernelspec": { 452 | "display_name": "conda_python3", 453 | "language": "python", 454 | "name": "conda_python3" 455 | }, 456 | "language_info": { 457 | "codemirror_mode": { 458 | "name": "ipython", 459 | "version": 3 460 | }, 461 | "file_extension": ".py", 462 | "mimetype": "text/x-python", 463 | "name": "python", 464 | "nbconvert_exporter": "python", 465 | "pygments_lexer": "ipython3", 466 | "version": "3.6.13" 467 | } 468 | }, 469 | "nbformat": 4, 470 | "nbformat_minor": 5 471 | } 472 | -------------------------------------------------------------------------------- /Chapter 08/media-content/ad-index.csv: -------------------------------------------------------------------------------- 1 | insight,cmsid=496,vid=short_onecue 2 | automate,cmsid=176,vid=short_tencue 3 | text,cmsid=496,vid=short_onecue 4 | content,cmsid=176,vid=short_tencue 5 | result,cmsid=496,vid=short_onecue 6 | infrastructure,cmsid=176,vid=short_tencue 7 | compute,cmsid=496,vid=short_onecue 8 | document,cmsid=176,vid=short_tencue -------------------------------------------------------------------------------- /Chapter 08/media-content/adserver.csv: -------------------------------------------------------------------------------- 1 | https://pubads.g.doubleclick.net/gampad/ads?sz=640x480&iu=/124319096/external/ad_rule_samples&ciu_szs=300x250&ad_rule=1&impl=s&gdfp_req=1&env=vp&output=vmap&unviewed_position_start=1&cust_params=deployment%3Ddevsite%26sample_ar%3Dpremidpost&cmsid=&vid=&correlator=[avail.random] -------------------------------------------------------------------------------- /Chapter 08/media-content/bank-demo-prem-ranga.mp4: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 08/media-content/bank-demo-prem-ranga.mp4 -------------------------------------------------------------------------------- /Chapter 09/compact_nx.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 |
6 |

7 |
8 | 9 | 11 | 12 | 29 | 30 | 31 | 32 | 33 |
34 | 35 | 36 | 107 | 108 | -------------------------------------------------------------------------------- /Chapter 09/events_graph.py: -------------------------------------------------------------------------------- 1 | """ 2 | Helper functions and constants for Comprehend Events semantic network graphing. 3 | """ 4 | 5 | from collections import Counter 6 | from matplotlib import cm, colors 7 | import networkx as nx 8 | from pyvis.network import Network 9 | 10 | 11 | ENTITY_TYPES = ['DATE', 'FACILITY', 'LOCATION', 'MONETARY_VALUE', 'ORGANIZATION', 12 | 'PERSON', 'PERSON_TITLE', 'QUANTITY', 'STOCK_CODE'] 13 | 14 | TRIGGER_TYPES = ['BANKRUPTCY', 'EMPLOYMENT', 'CORPORATE_ACQUISITION', 15 | 'INVESTMENT_GENERAL', 'CORPORATE_MERGER', 'IPO', 'RIGHTS_ISSUE', 16 | 'SECONDARY_OFFERING', 'SHELF_OFFERING', 'TENDER_OFFERING', 'STOCK_SPLIT'] 17 | 18 | PROPERTY_MAP = { 19 | "event": {"size": 10, "shape": "box", "color": "#dbe3e5"}, 20 | "entity_group": {"size": 6, "shape": "dot", "color": "#776d8a"}, 21 | "entity": {"size": 4, "shape": "square", "color": "#f3e6e3"}, 22 | "trigger": {"size": 4, "shape": "diamond", "color": "#f3e6e3"} 23 | } 24 | 25 | def get_color_map(tags): 26 | spectral = cm.get_cmap("Spectral", len(tags)) 27 | tag_colors = [colors.rgb2hex(spectral(i)) for i in range(len(tags))] 28 | color_map = dict(zip(*(tags, tag_colors))) 29 | color_map.update({'ROLE': 'grey'}) 30 | return color_map 31 | 32 | COLOR_MAP = get_color_map(ENTITY_TYPES + TRIGGER_TYPES) 33 | COLOR_MAP['ROLE'] = "grey" 34 | 35 | IFRAME_DIMS = ("600", "800") 36 | 37 | 38 | def get_canonical_mention(mentions, method="longest"): 39 | extents = enumerate([m['Text'] for m in mentions]) 40 | if method == "longest": 41 | name = sorted(extents, key=lambda x: len(x[1])) 42 | elif method == "most_common": 43 | name = [Counter(extents).most_common()[0][0]] 44 | else: 45 | name = [list(extents)[0]] 46 | return [mentions[name[-1][0]]] 47 | 48 | 49 | def get_nodes_and_edges( 50 | result, node_types=['event', 'trigger', 'entity_group', 'entity'], thr=0.0 51 | ): 52 | """Convert results to (nodelist, edgelist) depending on specified entity types.""" 53 | nodes = [] 54 | edges = [] 55 | event_nodes = [] 56 | entity_nodes = [] 57 | entity_group_nodes = [] 58 | trigger_nodes = [] 59 | 60 | # Nodes are (id, type, tag, score, mention_type) tuples. 
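    # For illustration only (made-up values, not from an actual Comprehend Events response):
    #   an event node could look like   ("ev0", "CORPORATE_ACQUISITION", "CORPORATE_ACQUISITION", 0.98, "event")
    #   an entity mention node like     ("gr0-en1", "ORGANIZATION", "Amazon", 0.99, "entity")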
61 | if 'event' in node_types: 62 | event_nodes = [ 63 | ( 64 | "ev%d" % i, 65 | t['Type'], 66 | t['Type'], 67 | t['Score'], 68 | "event" 69 | ) 70 | for i, e in enumerate(result['Events']) 71 | for t in e['Triggers'][:1] 72 | if t['GroupScore'] > thr 73 | ] 74 | nodes.extend(event_nodes) 75 | 76 | if 'trigger' in node_types: 77 | trigger_nodes = [ 78 | ( 79 | "ev%d-tr%d" % (i, j), 80 | t['Type'], 81 | t['Text'], 82 | t['Score'], 83 | "trigger" 84 | ) 85 | for i, e in enumerate(result['Events']) 86 | for j, t in enumerate(e['Triggers']) 87 | if t['Score'] > thr 88 | ] 89 | trigger_nodes = list({t[1:3]: t for t in trigger_nodes}.values()) 90 | nodes.extend(trigger_nodes) 91 | 92 | if 'entity_group' in node_types: 93 | entity_group_nodes = [ 94 | ( 95 | "gr%d" % i, 96 | m['Type'], 97 | m['Text'] if 'entity' not in node_types else m['Type'], 98 | m['Score'], 99 | "entity_group" 100 | ) 101 | for i, e in enumerate(result['Entities']) 102 | for m in get_canonical_mention(e['Mentions']) 103 | if m['GroupScore'] > thr 104 | ] 105 | nodes.extend(entity_group_nodes) 106 | 107 | if 'entity' in node_types: 108 | entity_nodes = [ 109 | ( 110 | "gr%d-en%d" % (i, j), 111 | m['Type'], 112 | m['Text'], 113 | m['Score'], 114 | "entity" 115 | ) 116 | for i, e in enumerate(result['Entities']) 117 | for j, m in enumerate(e['Mentions']) 118 | if m['Score'] > thr 119 | ] 120 | entity_nodes = list({t[1:3]: t for t in entity_nodes}.values()) 121 | nodes.extend(entity_nodes) 122 | 123 | # Edges are (trigger_id, node_id, role, score, type) tuples. 124 | if event_nodes and entity_group_nodes: 125 | edges.extend([ 126 | ("ev%d" % i, "gr%d" % a['EntityIndex'], a['Role'], a['Score'], "argument") 127 | for i, e in enumerate(result['Events']) 128 | for j, a in enumerate(e['Arguments']) 129 | #if a['Score'] > THR 130 | ]) 131 | 132 | if entity_nodes and entity_group_nodes: 133 | entity_keys = set([n[0] for n in entity_nodes]) 134 | edges.extend([ 135 | ("gr%d" % i, "gr%d-en%d" % (i, j), "", m['GroupScore'], "coref") 136 | for i, e in enumerate(result['Entities']) 137 | for j, m in enumerate(e['Mentions']) 138 | if "gr%d-en%d" % (i, j) in entity_keys 139 | if m['GroupScore'] > thr 140 | ]) 141 | 142 | if event_nodes and trigger_nodes: 143 | trigger_keys = set([n[0] for n in trigger_nodes]) 144 | edges.extend([ 145 | ("ev%d" % i, "ev%d-tr%d" % (i, j), "", a['GroupScore'], "coref") 146 | for i, e in enumerate(result['Events']) 147 | for j, a in enumerate(e['Triggers']) 148 | if "ev%d-tr%d" % (i, j) in trigger_keys 149 | if a['GroupScore'] > thr 150 | ]) 151 | 152 | return nodes, edges 153 | 154 | 155 | def build_network_graph(nodelist, edgelist, drop_isolates=True): 156 | G = nx.Graph() 157 | # Iterate over triggers and entity mentions. 
158 | for mention_id, tag, extent, score, mtype in nodelist: 159 | G.add_node( 160 | mention_id, 161 | label=extent, 162 | tag=tag, 163 | group=mtype, 164 | size=PROPERTY_MAP[mtype]['size'], 165 | color=COLOR_MAP[tag], 166 | shape=PROPERTY_MAP[mtype]['shape'] 167 | ) 168 | # Iterate over argument role assignments 169 | if edgelist: 170 | for n1_id, n2_id, role, score, etype in edgelist: 171 | label = role if etype == "argument" else "coref" 172 | G.add_edges_from( 173 | [(n1_id, n2_id)], 174 | label=role, 175 | weight=score*100, 176 | color="grey" 177 | ) 178 | # Drop mentions that don't participate in events 179 | if len(edgelist) > 0 and drop_isolates: 180 | G.remove_nodes_from(list(nx.isolates(G))) 181 | return G 182 | 183 | 184 | def plot(result, node_types, filename="nx.html", thr=0.0): 185 | nodes, edges = get_nodes_and_edges(result, node_types, thr) 186 | G = build_network_graph( 187 | nodes, edges, 188 | drop_isolates=True 189 | ) 190 | nt = Network(*IFRAME_DIMS, notebook=True, heading="") 191 | nt.from_nx(G) 192 | display(nt.show(filename)) -------------------------------------------------------------------------------- /Chapter 09/nx.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 |
6 |

7 |
8 | 9 | 11 | 12 | 29 | 30 | 31 | 32 | 33 |
34 | 35 | 36 | 107 | 108 | -------------------------------------------------------------------------------- /Chapter 09/requirements.txt: -------------------------------------------------------------------------------- 1 | ipywidgets==7.5.1 2 | networkx==2.5 3 | pandas==1.1.3 4 | pyvis==0.1.8.2 5 | spacy==2.2.4 6 | smart-open==3.0.0 -------------------------------------------------------------------------------- /Chapter 09/sample_financial_news_doc.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 09/sample_financial_news_doc.pdf -------------------------------------------------------------------------------- /Chapter 10/Reducing-localization-costs-with-machine-translation-github.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "copyrighted-memphis", 6 | "metadata": {}, 7 | "source": [ 8 | "# Reducing Localization costs and improving accuracy with Amazon Translate\n", 9 | "\n", 10 | "This is an accompanying notebook for Chapter 10 - Reducing locationlization costs and improving accuracy from the Natural Language Processing with AWS AI Services book. Please make sure to read the instructions provided in the book prior to attempting this notebook. In this chapter we will walkthrough a solution example of how to automate the translation of your web pages and save on localization costs using Amazon Translate. Organizations looking to expand internationally no longer have to implement time consuming and cost prohibitive localization projects to change their web pages, they can leverage [Amazon Translate](https://aws.amazon.com/translate/) which is a neural ML powered translation service as part of the development lifecycle to automatically convert web pages into multiple languages. We will show you how in this notebook. " 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "id": "usual-holder", 16 | "metadata": {}, 17 | "source": [ 18 | "## Input HTML Web Page\n", 19 | "\n", 20 | "For this example we will use an `About Us` HTML and Javascript page the authors created for the fictional **Family Bank**, a subsidiary of the fictional LiveRight financial organization. The page looks as shown in the cell below and is assumed to be part of an overall organizational website that has an `About Us` link leading to this page. " 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "id": "scientific-yukon", 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "# display the About Us page\n", 31 | "from IPython.display import IFrame\n", 32 | "IFrame(src='./input/aboutLRH.html', width=800, height=400)" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "id": "prostate-intent", 38 | "metadata": {}, 39 | "source": [ 40 | "#### Let us now review the HTML and Javascript source code for this page\n", 41 | "As we see below, this has a small HTML div block, and a corresponding Script block to print the current date. The Style block provides some CSS styling for our page." 
42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": null, 47 | "id": "horizontal-buying", 48 | "metadata": {}, 49 | "outputs": [], 50 | "source": [ 51 | "!pygmentize './input/aboutLRH.html'" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "id": "special-schema", 57 | "metadata": {}, 58 | "source": [ 59 | "## Prepare for Translation" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "id": "diverse-modem", 66 | "metadata": {}, 67 | "outputs": [], 68 | "source": [ 69 | "# Install the HTML parser\n", 70 | "!pip install beautifulsoup4" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": null, 76 | "id": "falling-pontiac", 77 | "metadata": {}, 78 | "outputs": [], 79 | "source": [ 80 | "html_doc = ''\n", 81 | "input_htm = './input/aboutLRH.html'\n", 82 | "with open(input_htm) as f:\n", 83 | " content = f.readlines()\n", 84 | "for i in content:\n", 85 | " html_doc += i+' '" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": null, 91 | "id": "dirty-heart", 92 | "metadata": {}, 93 | "outputs": [], 94 | "source": [ 95 | "from bs4 import BeautifulSoup\n", 96 | "soup = BeautifulSoup(html_doc, 'html.parser')" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "id": "earlier-coaching", 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "# HTML tags containing text we are interested in translating\n", 107 | "tags = ['title','h1','h2','p']" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "id": "weighted-shooting", 114 | "metadata": {}, 115 | "outputs": [], 116 | "source": [ 117 | "# Now we will extract the text content from the HTML for each tag in our tags list and load this to a new dict\n", 118 | "x_dict = {}\n", 119 | "for tag in tags:\n", 120 | " x_dict[tag] = getattr(getattr(soup, tag),'string')\n", 121 | "x_dict" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "id": "ethical-promotion", 127 | "metadata": {}, 128 | "source": [ 129 | "## Translate to target languages\n", 130 | "We will now translate the input text from English to German, Spanish, Tamil and Hindi" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": null, 136 | "id": "ecological-satisfaction", 137 | "metadata": {}, 138 | "outputs": [], 139 | "source": [ 140 | "import boto3\n", 141 | "\n", 142 | "translate = boto3.client(service_name='translate', region_name='us-east-1', use_ssl=True)\n", 143 | "out_text = {}\n", 144 | "languages = ['de','es','ta','hi']\n", 145 | "\n", 146 | "for target_lang in languages:\n", 147 | " out_dict = {}\n", 148 | " for key in x_dict:\n", 149 | " result = translate.translate_text(Text=x_dict[key], \n", 150 | " SourceLanguageCode=\"en\", TargetLanguageCode=target_lang)\n", 151 | " out_dict[key] = result.get('TranslatedText')\n", 152 | " out_text[target_lang] = out_dict\n", 153 | "\n", 154 | "print(\"German Version of Website Text\")\n", 155 | "print(\"******************************\")\n", 156 | "print(out_text['de'])\n", 157 | "print(\"******************************\")\n", 158 | "print(\"Spanish Version of Website Text\")\n", 159 | "print(\"******************************\")\n", 160 | "print(out_text['es'])\n", 161 | "print(\"******************************\")\n", 162 | "print(\"Tamil Version of Website Text\")\n", 163 | "print(\"******************************\")\n", 164 | "print(out_text['ta'])\n", 165 | "print(\"******************************\")\n", 166 | "print(\"Hindi Version of 
Website Text\")\n", 167 | "print(\"******************************\")\n", 168 | "print(out_text['hi'])\n", 169 | "print(\"******************************\")\n" 170 | ] 171 | }, 172 | { 173 | "cell_type": "markdown", 174 | "id": "difficult-palace", 175 | "metadata": {}, 176 | "source": [ 177 | "## Build webpages for translated text\n", 178 | "We will now create separate HTML web pages for each of the translated languages and display them" 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "id": "touched-champion", 184 | "metadata": {}, 185 | "source": [ 186 | "### German Webpage" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": null, 192 | "id": "actual-ballet", 193 | "metadata": {}, 194 | "outputs": [], 195 | "source": [ 196 | "web_de = soup" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": null, 202 | "id": "infrared-barbados", 203 | "metadata": {}, 204 | "outputs": [], 205 | "source": [ 206 | "web_de.title.string = out_text['de']['title']\n", 207 | "web_de.h1.string = out_text['de']['h1']\n", 208 | "web_de.h2.string = out_text['de']['h2']\n", 209 | "web_de.p.string = out_text['de']['p']" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": null, 215 | "id": "polyphonic-blast", 216 | "metadata": {}, 217 | "outputs": [], 218 | "source": [ 219 | "de_html = web_de.prettify()\n", 220 | "with open('./output/aboutLRH_DE.html','w') as de_w:\n", 221 | " de_w.write(de_html)" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": null, 227 | "id": "southeast-display", 228 | "metadata": {}, 229 | "outputs": [], 230 | "source": [ 231 | "# display the About Us page in German\n", 232 | "from IPython.display import IFrame\n", 233 | "IFrame(src='./output/aboutLRH_DE.html', width=800, height=500)" 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "id": "purple-northeast", 239 | "metadata": {}, 240 | "source": [ 241 | "### Spanish Webpage" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": null, 247 | "id": "supreme-turning", 248 | "metadata": {}, 249 | "outputs": [], 250 | "source": [ 251 | "web_es = soup\n", 252 | "web_es.title.string = out_text['es']['title']\n", 253 | "web_es.h1.string = out_text['es']['h1']\n", 254 | "web_es.h2.string = out_text['es']['h2']\n", 255 | "web_es.p.string = out_text['es']['p']" 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": null, 261 | "id": "horizontal-skirt", 262 | "metadata": {}, 263 | "outputs": [], 264 | "source": [ 265 | "es_html = web_es.prettify()\n", 266 | "with open('./output/aboutLRH_ES.html','w') as es_w:\n", 267 | " es_w.write(es_html)" 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": null, 273 | "id": "signed-clinton", 274 | "metadata": {}, 275 | "outputs": [], 276 | "source": [ 277 | "# display the About Us page in German\n", 278 | "from IPython.display import IFrame\n", 279 | "IFrame(src='./output/aboutLRH_ES.html', width=800, height=500)" 280 | ] 281 | }, 282 | { 283 | "cell_type": "markdown", 284 | "id": "important-sample", 285 | "metadata": {}, 286 | "source": [ 287 | "### Hindi Webpage" 288 | ] 289 | }, 290 | { 291 | "cell_type": "code", 292 | "execution_count": null, 293 | "id": "upset-corner", 294 | "metadata": {}, 295 | "outputs": [], 296 | "source": [ 297 | "web_hi = soup\n", 298 | "web_hi.title.string = out_text['hi']['title']\n", 299 | "web_hi.h1.string = out_text['hi']['h1']\n", 300 | "web_hi.h2.string = out_text['hi']['h2']\n", 301 | 
"web_hi.p.string = out_text['hi']['p']" 302 | ] 303 | }, 304 | { 305 | "cell_type": "code", 306 | "execution_count": null, 307 | "id": "subtle-demand", 308 | "metadata": {}, 309 | "outputs": [], 310 | "source": [ 311 | "hi_html = web_hi.prettify()\n", 312 | "with open('./output/aboutLRH_HI.html','w') as hi_w:\n", 313 | " hi_w.write(hi_html)" 314 | ] 315 | }, 316 | { 317 | "cell_type": "code", 318 | "execution_count": null, 319 | "id": "dressed-kennedy", 320 | "metadata": {}, 321 | "outputs": [], 322 | "source": [ 323 | "# display the About Us page in German\n", 324 | "from IPython.display import IFrame\n", 325 | "IFrame(src='./output/aboutLRH_HI.html', width=800, height=500)" 326 | ] 327 | }, 328 | { 329 | "cell_type": "markdown", 330 | "id": "connected-facing", 331 | "metadata": {}, 332 | "source": [ 333 | "### Tamil Webpage" 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": null, 339 | "id": "copyrighted-nation", 340 | "metadata": {}, 341 | "outputs": [], 342 | "source": [ 343 | "web_ta = soup\n", 344 | "web_ta.title.string = out_text['ta']['title']\n", 345 | "web_ta.h1.string = out_text['ta']['h1']\n", 346 | "web_ta.h2.string = out_text['ta']['h2']\n", 347 | "web_ta.p.string = out_text['ta']['p']" 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": null, 353 | "id": "loved-directive", 354 | "metadata": {}, 355 | "outputs": [], 356 | "source": [ 357 | "ta_html = web_ta.prettify()\n", 358 | "with open('./output/aboutLRH_TA.html','w') as ta_w:\n", 359 | " ta_w.write(ta_html)" 360 | ] 361 | }, 362 | { 363 | "cell_type": "code", 364 | "execution_count": null, 365 | "id": "outdoor-pound", 366 | "metadata": {}, 367 | "outputs": [], 368 | "source": [ 369 | "# display the About Us page in German\n", 370 | "from IPython.display import IFrame\n", 371 | "IFrame(src='./output/aboutLRH_TA.html', width=800, height=500)" 372 | ] 373 | }, 374 | { 375 | "cell_type": "markdown", 376 | "id": "municipal-disaster", 377 | "metadata": {}, 378 | "source": [ 379 | "## End of Notebook\n", 380 | "Please return back to the book to continue reading from there" 381 | ] 382 | } 383 | ], 384 | "metadata": { 385 | "kernelspec": { 386 | "display_name": "conda_python3", 387 | "language": "python", 388 | "name": "conda_python3" 389 | }, 390 | "language_info": { 391 | "codemirror_mode": { 392 | "name": "ipython", 393 | "version": 3 394 | }, 395 | "file_extension": ".py", 396 | "mimetype": "text/x-python", 397 | "name": "python", 398 | "nbconvert_exporter": "python", 399 | "pygments_lexer": "ipython3", 400 | "version": "3.6.10" 401 | } 402 | }, 403 | "nbformat": 4, 404 | "nbformat_minor": 5 405 | } 406 | -------------------------------------------------------------------------------- /Chapter 10/input/aboutLRH.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Live Well with LiveRight 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |

Family Bank Holdings

14 |

Date:

15 |
16 |
17 |

Who we are and what we do

18 |

A wholly owned subsidiary of LiveRight, we are the nation's largest bank for SMB owners and cooperative societies, with more than 4500 branches spread across the nation, servicing more than 5 million customers and continuing to grow. 19 | We offer a number of lending products to our customers including checking and savings accounts, lending, credit cards, deposits, insurance, IRA and more. Started in 1787 as a family owned business providing low interest loans for farmers struggling with poor harvests, LiveRight helped these farmers design long distance water channels from lakes in neighboring districts 20 | to their lands. The initial success helped these farmers invest their wealth in LiveRight and later led to our cooperative range of products that allowed farmers to own a part of LiveRight. 21 | In 1850 we moved our HeadQuarters to New York city to help build the economy of our nation by providing low interest lending products to small to medium business owners looking to start or expand their business. 22 | From 2 branches then to 4500 branches today, the trust of our customers helped us grow to become the nation's largest SMB bank.

23 |

24 |
25 |
26 | 35 | 36 | 92 | 93 | -------------------------------------------------------------------------------- /Chapter 10/output/aboutLRH_DE.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | Lebe gut mit LiveRight 6 | 7 | 8 | 10 | 12 | 14 | 16 | 18 | 19 | 20 |

21 | Beteiligungen für Familienbanken 22 |

23 |

24 | Date: 25 | 26 | 27 |

28 |
29 |
30 |

31 | Wer wir sind und was wir machen 32 |

33 |

34 |

35 | Als hundertprozentige Tochtergesellschaft von Liveright sind wir die größte Bank des Landes für SMB-Eigentümer und Genossenschaftsgesellschaften mit mehr als 4500 Filialen im ganzen Land, betreuen mehr als 5 Millionen Kunden und wachsen weiter. 36 | Wir bieten unseren Kunden eine Reihe von Kreditprodukten an, darunter Scheck- und Sparkonten, Kredite, Kreditkarten, Einlagen, Versicherungen, IRA und mehr. LiveRight wurde 1787 als Familienunternehmen gegründet, das Zinsdarlehen für Landwirte, die mit schlechten Ernten zu kämpfen hatten, und half diesen Landwirten, Fernwasserkanäle von Seen in benachbarten Bezirken zu entwerfen 37 | in ihr Land. Der anfängliche Erfolg half diesen Landwirten, ihr Vermögen in Liveright zu investieren, und führte später zu unserer kooperativen Produktpalette, die es den Landwirten ermöglichte, einen Teil von LiverRight zu besitzen. 38 | Im Jahr 1850 verlegten wir unseren Hauptsitz nach New York City, um zum Aufbau der Wirtschaft unseres Landes beizutragen, indem wir kleinen bis mittleren Unternehmern, die ihr Geschäft beginnen oder ausbauen möchten, Produkte mit niedrigen Zinsen zur Verfügung stellen. 39 | Von 2 Filialen bis hin zu 4500 Filialen heute hat uns das Vertrauen unserer Kunden geholfen, zur größten KMB-Bank des Landes zu werden. 40 |

41 |

42 |
43 |
44 | 53 | 54 | 109 | 110 | -------------------------------------------------------------------------------- /Chapter 11/2019-NAR-HBS.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 11/2019-NAR-HBS.pdf -------------------------------------------------------------------------------- /Chapter 11/2020-generational-trends-report-03-05-2020.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 11/2020-generational-trends-report-03-05-2020.pdf -------------------------------------------------------------------------------- /Chapter 11/Zillow-home-buyers-report.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 11/Zillow-home-buyers-report.pdf -------------------------------------------------------------------------------- /Chapter 11/faqs.csv: -------------------------------------------------------------------------------- 1 | what is important for home buyers?,"Home resale value, ability to rent out, assigned parking, smart home capabilities, good school zones, preferred neighborhoods are all important for home buyers" 2 | what do first-time home buyers want?,First-time buyers were more interested in receiving help from their agent in determining how much they could afford than repeat buyers. More buyers of new homes (10 percent) wanted help with paperwork compared to other buyer types. Married couples wanted to negotiate the terms of sale (13 percent) more than any other household composition. Single males and unmarried couples wanted help to find the right home (both 54 percent) more than other household compositions. There were many benefits for buyers using a real estate agent with the foremost reported as being the buyer(s) receiving help in understanding the buying process (61 percent). 3 | how are increased prices impacting sellers?,Increased home prices have lowered the share of home sellers who report they delayed the sale of their home because their home was worth less than their mortgage. 4 | was internet used during the home search?,Fifty-five percent of buyers who used the internet during their home search process ultimately found the home that they purchased through the internet. Forty percent of buyers who did not use the internet during their home search process found their home through a real estate agent compared to only 28 percent of buyers who did use the internet. 5 | do buyers use real estate agents?,Buyers typically interviewed only one real estate agent before working with them and the most important factor was that the agent was honest and trustworthy. Another important factor was the agent’s experience. Recent buyers were overall very satisfied with their real estate agent’s skills and qualities and definitely would use their agent again or recommend them to others. 
-------------------------------------------------------------------------------- /Chapter 12/ch 12 automating claims processing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "id": "bdfb92d5", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "import boto3\n", 11 | "import json\n", 12 | "import boto3\n", 13 | "import re\n", 14 | "import csv\n", 15 | "import sagemaker\n", 16 | "from sagemaker import get_execution_role\n", 17 | "from sagemaker.s3 import S3Uploader, S3Downloader\n", 18 | "import uuid\n", 19 | "import time\n", 20 | "import io\n", 21 | "from io import BytesIO\n", 22 | "import sys\n", 23 | "from pprint import pprint\n", 24 | "\n", 25 | "from IPython.display import Image, display\n", 26 | "from PIL import Image as PImage, ImageDraw" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": null, 32 | "id": "9face02c", 33 | "metadata": {}, 34 | "outputs": [], 35 | "source": [ 36 | "!pip install amazon-textract-response-parser" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": null, 42 | "id": "d2f74c64", 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "role = get_execution_role()\n", 47 | "#print(\"RoleArn: {}\".format(role))\n", 48 | "\n", 49 | "sess = sagemaker.Session()\n", 50 | "bucket = sess.default_bucket()\n", 51 | "prefix = 'claims-process-textract'" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "id": "27ed57ed", 57 | "metadata": {}, 58 | "source": [ 59 | "# Valid Document" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "id": "215958fe", 66 | "metadata": {}, 67 | "outputs": [], 68 | "source": [ 69 | "# Document\n", 70 | "documentName = \"validmedicalform.png\"\n", 71 | "\n", 72 | "display(Image(filename=documentName))" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": null, 78 | "id": "bc6c1b06", 79 | "metadata": {}, 80 | "outputs": [], 81 | "source": [ 82 | "# process using image bytes\n", 83 | "def calltextract(documentName): \n", 84 | " client = boto3.client(service_name='textract',\n", 85 | " region_name= 'us-east-1',\n", 86 | " endpoint_url='https://textract.us-east-1.amazonaws.com')\n", 87 | "\n", 88 | " with open(documentName, 'rb') as file:\n", 89 | " img_test = file.read()\n", 90 | " bytes_test = bytearray(img_test)\n", 91 | " print('Image loaded', documentName)\n", 92 | "\n", 93 | " # process using image bytes\n", 94 | " response = client.analyze_document(Document={'Bytes': bytes_test}, FeatureTypes=['FORMS'])\n", 95 | "\n", 96 | " return response" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "id": "c8fad04d", 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "response= calltextract(documentName)\n", 107 | "print(response)" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "id": "a2a17fe9", 114 | "metadata": {}, 115 | "outputs": [], 116 | "source": [ 117 | "#Extract key values\n", 118 | "# Iterate over elements in the document\n", 119 | "from trp import Document\n", 120 | "def getformkeyvalue(response):\n", 121 | " doc = Document(response)\n", 122 | " #print(doc)\n", 123 | " key_map = {}\n", 124 | " for page in doc.pages:\n", 125 | " # Print fields\n", 126 | " for field in page.form.fields:\n", 127 | " if field is None or field.key is None or field.value is None:\n", 128 | " continue\n", 129 | " #print(\"Field: Key: 
{}, Value: {}\".format(field.key.text, field.value.text))\n", 130 | " key_map[field.key.text] = field.value.text\n", 131 | " return key_map" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": null, 137 | "id": "dd2c5e4f", 138 | "metadata": {}, 139 | "outputs": [], 140 | "source": [ 141 | "get_form_keys = getformkeyvalue(response)\n", 142 | "print(get_form_keys)" 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "id": "15c6f940", 148 | "metadata": {}, 149 | "source": [ 150 | "# Check for validation using business rules\n", 151 | "Checking if claim Id is 12 digit and zip code is digit" 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": null, 157 | "id": "4fc3276e", 158 | "metadata": {}, 159 | "outputs": [], 160 | "source": [ 161 | "def validate(body):\n", 162 | " json_acceptable_string = body.replace(\"'\", \"\\\"\")\n", 163 | " json_data = json.loads(json_acceptable_string)\n", 164 | " print(json_data)\n", 165 | " zip = json_data['ZIP CODE']\n", 166 | " id = json_data['ID NUMBER']\n", 167 | "\n", 168 | " if(not zip.strip().isdigit()):\n", 169 | " return False, id, \"Zip code invalid\"\n", 170 | " length = len(id.strip())\n", 171 | " if(length != 12):\n", 172 | " return False, id, \"Invalid claim Id\"\n", 173 | " return True, id, \"Ok\"" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": null, 179 | "id": "3fb2f299", 180 | "metadata": {}, 181 | "outputs": [], 182 | "source": [ 183 | " # Validate \n", 184 | "textract_json= json.dumps(get_form_keys,indent=2)\n", 185 | "res, formid, result = validate(textract_json)\n", 186 | "print(result)\n", 187 | "print(formid)" 188 | ] 189 | }, 190 | { 191 | "cell_type": "markdown", 192 | "id": "2fbe1b4c", 193 | "metadata": {}, 194 | "source": [ 195 | "# Valid Medical Intake Form send to Comprehend medical to gain insights" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": null, 201 | "id": "eb11d4fa", 202 | "metadata": {}, 203 | "outputs": [], 204 | "source": [ 205 | "comprehend = boto3.client(service_name='comprehendmedical')\n" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": null, 211 | "id": "333ef20f", 212 | "metadata": {}, 213 | "outputs": [], 214 | "source": [ 215 | "# Detect medical entities\n", 216 | "cm_json_data = comprehend.detect_entities_v2(Text=textract_json)\n", 217 | "print(\"\\nMedical Entities\\n========\")\n", 218 | "for entity in cm_json_data[\"Entities\"]:\n", 219 | " print(\"- {}\".format(entity[\"Text\"]))\n", 220 | " print (\" Type: {}\".format(entity[\"Type\"]))\n", 221 | " print (\" Category: {}\".format(entity[\"Category\"]))\n", 222 | " if(entity[\"Traits\"]):\n", 223 | " print(\" Traits:\")\n", 224 | " for trait in entity[\"Traits\"]:\n", 225 | " print (\" - {}\".format(trait[\"Name\"]))\n", 226 | " print(\"\\n\")" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "id": "c38b1101", 232 | "metadata": {}, 233 | "source": [ 234 | "Writing entities to CSV File" 235 | ] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": null, 240 | "id": "30f4d931", 241 | "metadata": {}, 242 | "outputs": [], 243 | "source": [ 244 | "\n", 245 | "def printtocsv(cm_json_data,formid): \n", 246 | " entities = cm_json_data['Entities']\n", 247 | " TEMP_FILE = 'cmresult.csv'\n", 248 | " with open(TEMP_FILE, 'w') as csvfile: # 'w' will truncate the file\n", 249 | " filewriter = csv.writer(csvfile, delimiter=',',\n", 250 | " quotechar='|', quoting=csv.QUOTE_MINIMAL)\n", 251 | " 
filewriter.writerow([ 'ID','Category', 'Type', 'Text'])\n", 252 | "        for entity in entities:\n", 253 | "            filewriter.writerow([formid, entity['Category'], entity['Type'], entity['Text']])\n", 254 | "\n", 255 | "    filename = \"procedureresult/\" + formid + \".csv\"\n", 256 | "\n", 257 | "    \n", 258 | "    S3Uploader.upload(TEMP_FILE, 's3://{}/{}'.format(bucket, prefix))\n", 259 | "    print(\"successfully parsed:\" + filename)" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": null, 265 | "id": "f91792f9", 266 | "metadata": {}, 267 | "outputs": [], 268 | "source": [ 269 | "printtocsv(cm_json_data,formid)" 270 | ] 271 | }, 272 | { 273 | "cell_type": "markdown", 274 | "id": "53cae2b2", 275 | "metadata": {}, 276 | "source": [ 277 | "# Invalid Claim" 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": null, 283 | "id": "a15b8ea4", 284 | "metadata": {}, 285 | "outputs": [], 286 | "source": [ 287 | "InvalidDocument = \"invalidmedicalform.png\"\n", 288 | "\n", 289 | "display(Image(filename=InvalidDocument))" 290 | ] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "execution_count": null, 295 | "id": "2b7e3801", 296 | "metadata": {}, 297 | "outputs": [], 298 | "source": [ 299 | "response = calltextract(InvalidDocument)" 300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": null, 305 | "id": "10d16911", 306 | "metadata": {}, 307 | "outputs": [], 308 | "source": [ 309 | "get_form_keys = getformkeyvalue(response)\n", 310 | "print(get_form_keys)" 311 | ] 312 | }, 313 | { 314 | "cell_type": "code", 315 | "execution_count": null, 316 | "id": "e0b0ca01", 317 | "metadata": {}, 318 | "outputs": [], 319 | "source": [ 320 | " # Validate the invalid document \n", 321 | "textract_json= json.dumps(get_form_keys,indent=2)\n", 322 | "res, formid, result = validate(textract_json)\n", 323 | "print(result)\n", 324 | "print(formid)\n", 325 | "print(res)" 326 | ] 327 | }, 328 | { 329 | "cell_type": "markdown", 330 | "id": "8bf1dfc9", 331 | "metadata": {}, 332 | "source": [ 333 | "# Notify stakeholders that it is invalid" 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": null, 339 | "id": "18d9dbab", 340 | "metadata": {}, 341 | "outputs": [], 342 | "source": [ 343 | "sns = boto3.client('sns')" 344 | ] 345 | }, 346 | { 347 | "cell_type": "markdown", 348 | "id": "f0bbdb8d", 349 | "metadata": {}, 350 | "source": [ 351 | "# Go to https://console.aws.amazon.com/sns/v3/home?region=us-east-1#/homepage and create a topic as per the book's instructions" 352 | ] 353 | }, 354 | { 355 | "cell_type": "code", 356 | "execution_count": null, 357 | "id": "ed26545b", 358 | "metadata": {}, 359 | "outputs": [], 360 | "source": [ 361 | "topicARN=\"\"" 362 | ] 363 | }, 364 | { 365 | "cell_type": "code", 366 | "execution_count": null, 367 | "id": "7227c6d0", 368 | "metadata": {}, 369 | "outputs": [], 370 | "source": [ 371 | "snsbody = \"Content:\" + str(textract_json) + \"Reason:\" + str(result)\n", 372 | "print(snsbody)" 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": null, 378 | "id": "0a1c110f", 379 | "metadata": {}, 380 | "outputs": [], 381 | "source": [ 382 | "try:\n", 383 | "    response = sns.publish(\n", 384 | "    TargetArn = topicARN,\n", 385 | "    Message= snsbody\n", 386 | "    )\n", 387 | "    print(response)\n", 388 | "except Exception as e:\n", 389 | "    print(\"Failed while publishing the SNS notification\")\n", 390 | "    print(e)\n" 391 | ] 392 | }, 393 | { 394 | "cell_type": "markdown", 395 | "id": "01eabab9", 396 | "metadata": {}, 397 | "source": [ 398 | 
"# Check your email for notification" 399 | ] 400 | }, 401 | { 402 | "cell_type": "markdown", 403 | "id": "7935053e", 404 | "metadata": {}, 405 | "source": [ 406 | "# Clean UP\n", 407 | "Delete the topic you created from Console https://console.aws.amazon.com/sns/v3/home?region=us-east-1#/topic/" 408 | ] 409 | }, 410 | { 411 | "cell_type": "markdown", 412 | "id": "70c96e49", 413 | "metadata": {}, 414 | "source": [ 415 | "Delete the Amazon s3 bucket and the files in the buckethttps://docs.aws.amazon.com/AmazonS3/latest/userguide/delete-bucket.html" 416 | ] 417 | } 418 | ], 419 | "metadata": { 420 | "kernelspec": { 421 | "display_name": "conda_python3", 422 | "language": "python", 423 | "name": "conda_python3" 424 | }, 425 | "language_info": { 426 | "codemirror_mode": { 427 | "name": "ipython", 428 | "version": 3 429 | }, 430 | "file_extension": ".py", 431 | "mimetype": "text/x-python", 432 | "name": "python", 433 | "nbconvert_exporter": "python", 434 | "pygments_lexer": "ipython3", 435 | "version": "3.6.13" 436 | } 437 | }, 438 | "nbformat": 4, 439 | "nbformat_minor": 5 440 | } 441 | -------------------------------------------------------------------------------- /Chapter 12/invalidmedicalform.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 12/invalidmedicalform.png -------------------------------------------------------------------------------- /Chapter 12/validmedicalform.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 12/validmedicalform.png -------------------------------------------------------------------------------- /Chapter 13/chapter13 Improving accuracy of document processing .ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Textract Anazlyze API Invoke with human in the loop" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "import boto3\n", 17 | "import uuid\n", 18 | "import time\n", 19 | "import re\n", 20 | "import pprint\n", 21 | "import json\n", 22 | "pp = pprint.PrettyPrinter(indent=4)" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": null, 28 | "metadata": {}, 29 | "outputs": [], 30 | "source": [ 31 | "# Amazon Textract client\n", 32 | "textract = boto3.client('textract')\n", 33 | "\n", 34 | "# Amazon S3 client \n", 35 | "s3 = boto3.client('s3')" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": null, 41 | "metadata": {}, 42 | "outputs": [], 43 | "source": [ 44 | "humanLoopName = str(uuid.uuid4())" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "Enter the name of the S3 bucket you craeted and uplaoded the Samplecheck document" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "bucket=\"\"" 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 | "source": [ 67 | "Enter the flow definition ARN or human review workflow arn\n", 68 | "by copying the arn from Console 
https://console.aws.amazon.com/a2i/home?region=us-east-1#/human-review-workflows\n", 69 | "    \n" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": null, 75 | "metadata": {}, 76 | "outputs": [], 77 | "source": [ 78 | "humanLoopConfig = {\n", 79 | "    'FlowDefinitionArn':\"\",\n", 80 | "    'HumanLoopName':humanLoopName, \n", 81 | "    'DataAttributes': { 'ContentClassifiers': [ 'FreeOfPersonallyIdentifiableInformation' ]}\n", 82 | "}" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "Enter the bucket name in Bucket and the sample document name in Name" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": null, 95 | "metadata": {}, 96 | "outputs": [], 97 | "source": [ 98 | "response = textract.analyze_document(\n", 99 | "        Document={'S3Object': {'Bucket': bucket, 'Name': \"samplecheck.PNG\"}},\n", 100 | "        FeatureTypes=[\"FORMS\"], \n", 101 | "        HumanLoopConfig=humanLoopConfig\n", 102 | "    )" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": null, 108 | "metadata": {}, 109 | "outputs": [], 110 | "source": [ 111 | "print(response)" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "Paste the workteam ARN below, which you copied when creating the private workteam, or \n", 119 | "you can find it at this link: https://console.aws.amazon.com/sagemaker/groundtruth?region=us-east-1#/labeling-workforces" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": null, 125 | "metadata": {}, 126 | "outputs": [], 127 | "source": [ 128 | "WORKTEAM_ARN= \"\"" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "metadata": {}, 135 | "outputs": [], 136 | "source": [ 137 | "# Amazon SageMaker client\n", 138 | "sagemaker = boto3.client('sagemaker')\n", 139 | "\n", 140 | "# Amazon Augmented AI (A2I) client\n", 141 | "a2i = boto3.client('sagemaker-a2i-runtime')" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "metadata": {}, 148 | "outputs": [], 149 | "source": [ 150 | "\n", 151 | "\n", 152 | "workteamName = WORKTEAM_ARN[WORKTEAM_ARN.rfind('/') + 1:]\n", 153 | "print(\"Navigate to the private worker portal and do the tasks. 
Make sure you've invited yourself to your workteam!\")\n", 154 | "print('https://' + sagemaker.describe_workteam(WorkteamName=workteamName)['Workteam']['SubDomain'])\n", 155 | "\n" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": null, 161 | "metadata": {}, 162 | "outputs": [], 163 | "source": [ 164 | "completed_human_loops = []\n", 165 | "resp = a2i.describe_human_loop(HumanLoopName=humanLoopName)\n", 166 | "print(f'HumanLoop Name: {humanLoopName}')\n", 167 | "print(f'HumanLoop Status: {resp[\"HumanLoopStatus\"]}')\n", 168 | "#print(f'HumanLoop Output Destination: {resp[\"HumanLoopOutput\"]}')\n", 169 | "print('\\n')\n", 170 | "    \n", 171 | "if resp[\"HumanLoopStatus\"] == \"Completed\":\n", 172 | "    completed_human_loops.append(resp)" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": null, 178 | "metadata": {}, 179 | "outputs": [], 180 | "source": [ 181 | "\n", 182 | "for resp in completed_human_loops:\n", 183 | "    splitted_string = re.split('s3://' + bucket + '/', resp['HumanLoopOutput']['OutputS3Uri'])\n", 184 | "    output_bucket_key = splitted_string[1]\n", 185 | "    print(output_bucket_key)\n", 186 | "    response = s3.get_object(Bucket= bucket, Key=output_bucket_key)\n", 187 | "    content = response[\"Body\"].read()\n", 188 | "    json_output = json.loads(content)\n", 189 | "    pp.pprint(json_output)\n", 190 | "    print('\\n')\n" 191 | ] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "metadata": {}, 196 | "source": [ 197 | "# Clean Up\n", 198 | "## Delete the human review workflow: https://console.aws.amazon.com/a2i/home?region=us-east-1#/human-review-workflows\n", 199 | "    \n", 200 | "## Delete the private workforce: https://console.aws.amazon.com/sagemaker/groundtruth?region=us-east-1#/labeling-workforces/private-details/a2i-demos\n", 201 | "\n", 202 | "## Delete the Amazon S3 bucket \n" 203 | ] 204 | } 205 | ], 206 | "metadata": { 207 | "kernelspec": { 208 | "display_name": "conda_python3", 209 | "language": "python", 210 | "name": "conda_python3" 211 | }, 212 | "language_info": { 213 | "codemirror_mode": { 214 | "name": "ipython", 215 | "version": 3 216 | }, 217 | "file_extension": ".py", 218 | "mimetype": "text/x-python", 219 | "name": "python", 220 | "nbconvert_exporter": "python", 221 | "pygments_lexer": "ipython3", 222 | "version": "3.6.13" 223 | } 224 | }, 225 | "nbformat": 4, 226 | "nbformat_minor": 4 227 | } 228 | -------------------------------------------------------------------------------- /Chapter 13/samplecheck.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 13/samplecheck.PNG -------------------------------------------------------------------------------- /Chapter 13/text.py: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Chapter 14/input/sample-loan-application.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 14/input/sample-loan-application.png -------------------------------------------------------------------------------- /Chapter 14/train/entitylist.csv: -------------------------------------------------------------------------------- 1 | 
Text,Type 2 | Country:UK,PERSON 3 | Country:CAN,PERSON 4 | Country:FRA,PERSON 5 | Years:10,PERSON 6 | Years:13,PERSON 7 | Years:17,PERSON 8 | Cell Phone:(281)123 4567,PERSON 9 | Cell Phone:(345)789 0123,PERSON 10 | Cell Phone:(999)999 9999,PERSON 11 | Cell Phone:(666)999 7777,PERSON 12 | Name:Kwaku Mensah,PERSON 13 | Name:Jane Doe,PERSON 14 | Name:John Smith,PERSON 15 | TOTAL $:8000.00/month,PERSON 16 | TOTAL $:9000.00/month,PERSON 17 | TOTAL $:7000.00/month,PERSON 18 | TOTAL $:6000.00/month,PERSON 19 | Social Security Number:123 - 45 - 6789,PERSON 20 | Social Security Number:111 - 11 - 1111,PERSON 21 | Social Security Number:222 - 22 - 2222,PERSON 22 | Social Security Number:234 - 56 - 7890,PERSON 23 | Date of Birth:01 / 01 / 1953,PERSON 24 | Date of Birth:01 / 01 / 1963,PERSON 25 | Date of Birth:01 / 01 / 1966,PERSON 26 | Date of Birth:02/ 01 / 1976,PERSON 27 | Country:ABC,GHOST 28 | Country:DEFG,GHOST 29 | Country:KAFP,GHOST 30 | Country:BLAH,GHOST 31 | Years:0,GHOST 32 | Years:999,GHOST 33 | Cell Phone:147,GHOST 34 | Cell Phone:1234,GHOST 35 | Cell Phone:7777,GHOST 36 | Cell Phone:000,GHOST 37 | Name:F R,GHOST 38 | Name:R A,GHOST 39 | Name:C E,GHOST 40 | Name:Z Z,GHOST 41 | TOTAL $:90.00/month,GHOST 42 | TOTAL $:800.00/month,GHOST 43 | TOTAL $:120.00/month,GHOST 44 | TOTAL $:88/m,GHOST 45 | Social Security Number:-54-,GHOST 46 | Social Security Number:777,GHOST 47 | Social Security Number:2222,GHOST 48 | Social Security Number:090,GHOST 49 | Date of Birth:1853,GHOST 50 | Date of Birth:196,GHOST 51 | Date of Birth:1780,GHOST 52 | Date of Birth:10000,GHOST -------------------------------------------------------------------------------- /Chapter 15/cha15train.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/cha15train.png -------------------------------------------------------------------------------- /Chapter 15/chapter15retrain.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/chapter15retrain.png -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/3ba082d3b307398adaf9d55301831684.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/3ba082d3b307398adaf9d55301831684.png -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/445ea8e393ca62878d5e3a68f054a8e4.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/445ea8e393ca62878d5e3a68f054a8e4.jpg -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/Howard Bank Sample Personal Bank Statement.jpg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/Howard Bank Sample Personal Bank Statement.jpg -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 07 (1)0.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 07 (1)0.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 08 (1)0.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 08 (1)0.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 09 (1)0.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 09 (1)0.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 11 (1)0.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 11 (1)0.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 11 (1)1.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 11 (1)1.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 12 (1)0.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 12 (1)0.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 12 (1)1.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 12 (1)1.PNG -------------------------------------------------------------------------------- /Chapter 
15/documents/train/Bank Statements/bank statement template 15 (1)1.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)1.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)10.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)10.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)11.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)11.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)13.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)13.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)14.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)14.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)15.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)15.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)17.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)17.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)18.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 
15/documents/train/Bank Statements/bank statement template 15 (1)18.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)2.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)2.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)20.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)20.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)21.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)21.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)22.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)22.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)23.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)23.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)24.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)24.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)25.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)25.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)26.PNG: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)26.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)27.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)27.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)4.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)4.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)5.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)5.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)6.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)6.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)8.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)8.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)9.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)9.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 16 (1)0.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 16 (1)0.PNG 
-------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 16 (1)1.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 16 (1)1.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 18 (1)0.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 18 (1)0.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 20 (1)0.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 20 (1)0.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 20 (1)1.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 20 (1)1.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 20 (1)2.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 20 (1)2.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/29.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/29.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/59f20ff3d6612.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/59f20ff3d6612.jpg -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/59f212b5f02df.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/59f212b5f02df.jpg 
-------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/59f213a817dd8.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/59f213a817dd8.jpg -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/59f214708f294.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/59f214708f294.jpg -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/59f214eabb325.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/59f214eabb325.jpg -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/5acf71c1d4b14.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/5acf71c1d4b14.jpg -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/5acf72145500c.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/5acf72145500c.jpg -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/5d945f376f89f101477294.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/5d945f376f89f101477294.jpg -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/5d945f854fad0046913915.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/5d945f854fad0046913915.jpg -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/adp-sample-768x946.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/adp-sample-768x946.png -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/free-horizontal-paystub-template.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/free-horizontal-paystub-template.png -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/free-long-creek-paystub-template.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/free-long-creek-paystub-template.png -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/free-magenta-paystub-template.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/free-magenta-paystub-template.png -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/free-midnight-paystub-template.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/free-midnight-paystub-template.png -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/free-shamrock-paystub-template.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/free-shamrock-paystub-template.png -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/free-sycamore-paystub-template.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/free-sycamore-paystub-template.png -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/free-veritical blue-paystub-template.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/free-veritical blue-paystub-template.png -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/free-violet-paystub-template.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/free-violet-paystub-template.png -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay 
Stubs/free-white-paystub-template.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/free-white-paystub-template.png -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/pay-stub-with-logo-768x384.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/pay-stub-with-logo-768x384.png -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/sample-pay-stub-2020-768x384.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/sample-pay-stub-2020-768x384.png -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/sample-pay-stub.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/sample-pay-stub.png -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/taxes-sample-pay-stub-768x614.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/taxes-sample-pay-stub-768x614.png -------------------------------------------------------------------------------- /Chapter 15/paystubsample.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/paystubsample.png -------------------------------------------------------------------------------- /Chapter 16/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 16/.DS_Store -------------------------------------------------------------------------------- /Chapter 16/form-s20-LRHL-registration.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 16/form-s20-LRHL-registration.pdf -------------------------------------------------------------------------------- /Chapter 16/form-s20-LRHL-registration_image.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 
16/form-s20-LRHL-registration_image.png -------------------------------------------------------------------------------- /Chapter 16/form-s20-SUBS1-registration.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 16/form-s20-SUBS1-registration.pdf -------------------------------------------------------------------------------- /Chapter 16/form-s20-SUBS1-registration_image.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 16/form-s20-SUBS1-registration_image.png -------------------------------------------------------------------------------- /Chapter 16/form-s20-SUBS2-registration.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 16/form-s20-SUBS2-registration.pdf -------------------------------------------------------------------------------- /Chapter 16/form-s20-SUBS2-registration_image.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 16/form-s20-SUBS2-registration_image.png -------------------------------------------------------------------------------- /Chapter 16/tabular-sec.liquid.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 10 | 11 | 12 |
[The HTML markup of tabular-sec.liquid.html (template lines 13-65) was rendered away in this listing; only the visible text of the worker task template survives. The template, which backs the Amazon A2I human review step in Chapter 16, contains:
- An "Instructions" panel: "Please review the SEC registration form inputs, and make corrections where appropriate."
- The original document image under the heading "Original Registration Form - Page 1".
- A correction table under the heading "Please enter your modifications below", with the columns Line Nr, Detected Text, Confidence, Change Required, Corrected Text, and Comments. Its rows are produced by the liquid loop {% for pair in task.input.document %} ... {% endfor %}, which renders {{ pair.linenr }} plus the review input fields for each detected line.]
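For orientation, below is a minimal sketch (not taken from the book's notebooks) of how a liquid worker template such as tabular-sec.liquid.html is typically registered and invoked with Amazon A2I via boto3. The resource names, ARNs, bucket path, and the exact keys inside task.input.document (linenr, detectedtext, confidence) are illustrative assumptions, since the template's field names were lost in this rendering.

```python
# Hypothetical sketch: wire a liquid worker template into an A2I human review loop.
# All names, ARNs, and input keys below are placeholders, not values from this repo.
import json
import boto3

sagemaker = boto3.client('sagemaker')
a2i = boto3.client('sagemaker-a2i-runtime')

# 1. Register the liquid template as a human task UI.
with open('tabular-sec.liquid.html') as f:
    template_html = f.read()

ui = sagemaker.create_human_task_ui(
    HumanTaskUiName='sec-registration-review-ui',  # assumed name
    UiTemplate={'Content': template_html},
)

# 2. A flow definition ties the UI to a work team and an S3 output location.
#    (It must reach Active status before human loops can be started.)
flow = sagemaker.create_flow_definition(
    FlowDefinitionName='sec-registration-review-flow',                 # assumed name
    RoleArn='arn:aws:iam::123456789012:role/A2IExecutionRole',         # placeholder
    HumanLoopConfig={
        'WorkteamArn': 'arn:aws:sagemaker:us-east-1:123456789012:workteam/private-crowd/my-team',  # placeholder
        'HumanTaskUiArn': ui['HumanTaskUiArn'],
        'TaskTitle': 'Review SEC registration form lines',
        'TaskDescription': 'Correct the text detected by Amazon Textract',
        'TaskCount': 1,
    },
    OutputConfig={'S3OutputPath': 's3://my-bucket/a2i-output'},        # placeholder
)

# 3. Start a human loop; the template's task.input.document maps to this list.
human_loop_input = {
    'document': [
        {'linenr': 1, 'detectedtext': 'FORM S-20', 'confidence': 99.1},
        {'linenr': 2, 'detectedtext': 'REGISTRATION STATEMENT', 'confidence': 97.4},
    ]
}
a2i.start_human_loop(
    HumanLoopName='sec-review-loop-001',            # assumed name
    FlowDefinitionArn=flow['FlowDefinitionArn'],
    HumanLoopInput={'InputContent': json.dumps(human_loop_input)},
)
```

Reviewers then see one table row per entry in `document`, and their corrections are written as JSON to the flow definition's S3 output path.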
-------------------------------------------------------------------------------- /Chapter 17/chapter17-deriving-insights-from-handwritten-content-forGitHub.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "26ba00e7", 6 | "metadata": {}, 7 | "source": [ 8 | "# Deriving insights from handwritten content using Amazon Textract and Amazon Quicksight\n", 9 | "\n", 10 | "This notebook is an accompanying utility for `Chapter 17 - Deriving insights from handwritten content` from the PACKT book **Natural Language Processing with AWS AI Services**. Please read the chapter and the instructions before trying this notebook. " 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": null, 16 | "id": "be3568f5", 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "# STEP 0 - CELL 1\n", 21 | "import boto3\n", 22 | "import json\n", 23 | "import csv\n", 24 | "import os\n", 25 | "\n", 26 | "infile = 'qsmani-raw.json'\n", 27 | "outfile = 'qsmani-formatted.json'\n", 28 | "bucket = ''\n", 29 | "prefix = 'chapter17' # change this prefix if you like" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "id": "5ae09b21", 35 | "metadata": {}, 36 | "source": [ 37 | "### Update QuickSight Manifest\n", 38 | "We will replace the S3 bucket and prefix from the raw manifest file with what you have entered in STEP 0 - CELL 1 above. We will then create a new formatted manifest file that will be used for creating a dataset with [Amazon QuickSight](https://aws.amazon.com/quicksight/) based on the content we extract from the handwritten documents." 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": null, 44 | "id": "c2b074b2", 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "# STEP 1 - CELL 1\n", 49 | "import json\n", 50 | "manifest = open(infile,'r')\n", 51 | "ln = json.load(manifest)\n", 52 | "t = json.dumps(ln['fileLocations'][0]['URIPrefixes'])\n", 53 | "t = t.replace('bucket',bucket).replace('prefix',prefix)\n", 54 | "ln['fileLocations'][0]['URIPrefixes'] = json.loads(t)\n", 55 | "with open(outfile,'w', encoding='utf-8') as out:\n", 56 | " json.dump(ln,out, ensure_ascii=False, indent=4)" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "id": "3bf01e5d", 63 | "metadata": {}, 64 | "outputs": [], 65 | "source": [ 66 | "# STEP 1 - CELL 2\n", 67 | "s3 = boto3.client('s3')\n", 68 | "s3.upload_file(outfile,bucket,prefix+'/'+outfile)\n", 69 | "print(\"Manifest file uploaded to: s3://{}/{}\".format(bucket,prefix+'/'+outfile))" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "id": "251c7992", 75 | "metadata": {}, 76 | "source": [ 77 | "### Extract handwritten content using Textract\n", 78 | "In this section, we will install the [Amazon Textract Response Parser](https://github.com/aws-samples/amazon-textract-response-parser/blob/master/src-python/README.md), use the [Amazon Textract boto3 library](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html) to detect text from our handwritten images, and upload the contents into a CSV file which will be stored in your [Amazon S3 bucket](https://aws.amazon.com/s3/)" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "id": "246cbfa6", 85 | "metadata": {}, 86 | "outputs": [], 87 | "source": [ 88 | "#STEP 2 - CELL 1\n", 89 | "!python -m pip install amazon-textract-response-parser" 90 | ] 91 | }, 92 | { 93 | 
"cell_type": "code", 94 | "execution_count": null, 95 | "id": "e0b2a18f", 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [ 99 | "# STEP 2 - CELL 2\n", 100 | "from trp import Document\n", 101 | "textract = boto3.client('textract')" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": null, 107 | "id": "4807cf21", 108 | "metadata": {}, 109 | "outputs": [], 110 | "source": [ 111 | "# STEP 2 - CELL 3\n", 112 | "for docs in os.listdir('.'):\n", 113 | " if docs.endswith('jpg'):\n", 114 | " with open(docs, 'rb') as img:\n", 115 | " img_test = img.read()\n", 116 | " bytes_test = bytearray(img_test)\n", 117 | " print('Extracted text from ', docs)\n", 118 | " response = textract.analyze_document(Document={'Bytes': bytes_test}, FeatureTypes=['TABLES','FORMS'])\n", 119 | " text = Document(response)\n", 120 | " for page in text.pages:\n", 121 | " for table in page.tables:\n", 122 | " csvout = docs.replace('jpg','csv')\n", 123 | " with open(csvout, 'w', newline='') as csvf:\n", 124 | " tab = csv.writer(csvf, delimiter=',')\n", 125 | " for r, row in enumerate(table.rows):\n", 126 | " csvrow = []\n", 127 | " for c, cell in enumerate(row.cells):\n", 128 | " if cell.text:\n", 129 | " csvrow.append(cell.text.replace('$','').rstrip())\n", 130 | " tab.writerow(csvrow)\n", 131 | " s3.upload_file(csvout,bucket,prefix+'/dashboard/'+csvout)\n", 132 | " print(\"CSV file for document {} uploaded to: s3://{}/{}\".format(docs,bucket,prefix+'/dashboard/'+csvout))" 133 | ] 134 | }, 135 | { 136 | "cell_type": "markdown", 137 | "id": "91f606cd", 138 | "metadata": {}, 139 | "source": [ 140 | "### CONCLUSION\n", 141 | "That concludes the steps for the notebook. Please continue to follow the instructions from Chapter 17 in the book to understand how you can visualize and generate insights from your handwritten content using **[Amazon QuickSight](https://aws.amazon.com/quicksight/)**." 
142 | ] 143 | } 144 | ], 145 | "metadata": { 146 | "kernelspec": { 147 | "display_name": "conda_python3", 148 | "language": "python", 149 | "name": "conda_python3" 150 | }, 151 | "language_info": { 152 | "codemirror_mode": { 153 | "name": "ipython", 154 | "version": 3 155 | }, 156 | "file_extension": ".py", 157 | "mimetype": "text/x-python", 158 | "name": "python", 159 | "nbconvert_exporter": "python", 160 | "pygments_lexer": "ipython3", 161 | "version": "3.6.13" 162 | } 163 | }, 164 | "nbformat": 4, 165 | "nbformat_minor": 5 166 | } 167 | -------------------------------------------------------------------------------- /Chapter 17/hw-receipt1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 17/hw-receipt1.jpg -------------------------------------------------------------------------------- /Chapter 17/hw-receipt2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 17/hw-receipt2.jpg -------------------------------------------------------------------------------- /Chapter 17/qsmani-raw.json: -------------------------------------------------------------------------------- 1 | { 2 | "fileLocations": [ 3 | { 4 | "URIPrefixes": [ 5 | "s3://bucket/prefix/dashboard" 6 | ] 7 | } 8 | ], 9 | "globalUploadSettings": { 10 | "format": "CSV", 11 | "delimiter": ",", 12 | "textqualifier": "'", 13 | "containsHeader": "true" 14 | } 15 | } 16 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 Packt 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | # Natural-Language-Processing-with-AWS-AI-Services 5 | 6 | Natural Language Processing with AWS AI Services 7 | 8 | This is the code repository for [Natural Language Processing with AWS AI Services](https://www.packtpub.com/product/natural-language-processing-with-aws-ai-services/9781801812535?utm_source=github&utm_medium=repository&utm_campaign=9781801812535), published by Packt. 9 | 10 | **Derive strategic insights from unstructured data with Amazon Textract and Amazon Comprehend** 11 | 12 | ## What is this book about? 13 | The book includes Python code examples for Amazon Textract, Amazon Comprehend, and other AWS AI services to build a variety of serverless NLP workflows at scale with little prior machine learning knowledge. Packed with real-life business examples, this book will help you to navigate a day in the life of an AWS AI specialist with ease. 14 | 15 | This book covers the following exciting features: 16 | * Automate various NLP workflows on AWS to accelerate business outcomes 17 | * Use Amazon Textract for text, tables, and handwriting recognition from images and PDF files 18 | * Gain insights from unstructured text in the form of sentiment analysis, topic modeling, and more using Amazon Comprehend 19 | * Set up end-to-end document processing pipelines to understand the role of humans in the loop 20 | * Develop NLP-based intelligent search solutions with just a few lines of code 21 | * Create both real-time and batch document processing pipelines using Python 22 | 23 | If you feel this book is for you, get your [copy](https://www.amazon.com/dp/1801812535) today! 24 | 25 | https://www.packtpub.com/ 27 | 28 | 29 | ## Instructions and Navigations 30 | All of the code is organized into folders. 31 | 32 | The code will look like the following: 33 | ``` 34 | # Define IAM role 35 | role = get_execution_role() 36 | print("RoleArn: {}".format(role)) 37 | sess = sagemaker.Session() 38 | s3BucketName = '' 39 | prefix = 'chapter5' 40 | ``` 41 | 42 | **Following is what you need for this book:** 43 | If you're an NLP developer or data scientist looking to get started with AWS AI services to implement various NLP scenarios quickly, this book is for you. It will show you how easy it is to integrate AI in applications with just a few lines of code. A basic understanding of machine learning (ML) concepts is necessary to understand the concepts covered. Experience with Jupyter notebooks and Python will be helpful. 44 | 45 | With the following software and hardware list you can run all code files present in the book (Chapter 1-18). 46 | 47 | ### Software and Hardware List 48 | 49 | | Software required | OS required | 50 | | --------------------------------------------| -----------------------------------| 51 | | Access and signing up to an AWS account | Windows, Mac OS X, and Linux (Any) | 52 | | Creating a SageMaker Jupyter Notebook | Windows, Mac OS X, and Linux (Any | 53 | | Creating an Amazon S3 bucket | Windows, Mac OS X, and Linux (Any | 54 | 55 | We also provide a PDF file that has color images of the screenshots/diagrams used in this book. [Click here to download it](https://static.packt-cdn.com/downloads/9781801812535_ColorImages.pdf). 56 | 57 | The Code in Action videos for this book can be viewed at https://bit.ly/3vPvDkj. 
58 | 59 | 60 | ### Related products 61 | * Machine Learning with Amazon SageMaker Cookbook [[Packt]](https://www.packtpub.com/product/machine-learning-with-amazon-sagemaker-cookbook/9781800567030?utm_source=github&utm_medium=repository&utm_campaign=9781800567030) [[Amazon]](https://www.amazon.com/dp/1800567030) 62 | 63 | * Amazon SageMaker Best Practices [[Packt]](https://www.packtpub.com/product/amazon-sagemaker-best-practices/9781801070522?utm_source=github&utm_medium=repository&utm_campaign=9781801070522) [[Amazon]](https://www.amazon.com/dp/1801070520) 64 | 65 | ## Get to Know the Authors 66 | **Mona M** 67 | is a senior AI/ML specialist solutions architect at AWS. She is a highly skilled IT professional, with more than 10 years' experience in software design, development, and integration across diverse work environments. As an AWS solutions architect, her role is to ensure customer success in building applications and services on the AWS platform. She is responsible for crafting a highly scalable, flexible, and resilient cloud architecture that addresses customer business problems. She has published multiple blogs on AI and NLP on the AWS AI channel along with research papers on AI-powered search solutions. 68 | 69 | **Premkumar Rangarajan** 70 | is an enterprise solutions architect, specializing in AI/ML at Amazon Web Services. He has 25 years of experience in the IT industry in a variety of roles, including delivery lead, integration specialist, and enterprise architect. He has significant architecture and management experience in delivering large-scale programs across various industries and platforms. He is passionate about helping customers solve ML and AI problems. 71 | ### Download a free PDF 72 | 73 | If you have already purchased a print or Kindle version of this book, you can get a DRM-free PDF version at no cost.
Simply click on the link to claim your free PDF.
74 | https://packt.link/free-ebook/9781801812535
--------------------------------------------------------------------------------