├── .DS_Store ├── Chapter 02 ├── Amazon Textract API Sample.ipynb ├── emp_app_printed.png ├── employment_history.png ├── form-derogatoire.jpg ├── job-application-form.pdf ├── receipt-image.png ├── sample-invoice.png └── two-column-image.jpeg ├── Chapter 03 └── Chapter 3 Introduction to Amazon Comprehend.ipynb ├── Chapter 04 ├── Chapter 4 Compliance and control.ipynb ├── bankstatement.JPG └── piiredact.png ├── Chapter 05 ├── Ch05-Kendra Search.ipynb ├── lambda │ └── index.py ├── resume_Sample.pdf ├── resume_sample.PNG └── template-export-textract.yml ├── Chapter 06 ├── chapter6-nlp-in-customer-service-github.ipynb └── topic-modeling │ └── initial │ └── complaints_data_initial.csv ├── Chapter 07 ├── chapter07 social media text analytics.ipynb ├── lib │ ├── test.py │ └── workshop.py └── project_path.py ├── Chapter 08 ├── contextual-ad-marking-for-content-monetization-with-nlp-github.ipynb └── media-content │ ├── ad-index.csv │ ├── adserver.csv │ └── bank-demo-prem-ranga.mp4 ├── Chapter 09 ├── chapter 09 metadata extraction.ipynb ├── compact_nx.html ├── events_graph.py ├── nx.html ├── requirements.txt └── sample_financial_news_doc.pdf ├── Chapter 10 ├── Reducing-localization-costs-with-machine-translation-github.ipynb ├── input │ └── aboutLRH.html └── output │ └── aboutLRH_DE.html ├── Chapter 11 ├── 2019-NAR-HBS.pdf ├── 2020-generational-trends-report-03-05-2020.pdf ├── Zillow-home-buyers-report.pdf └── faqs.csv ├── Chapter 12 ├── ch 12 automating claims processing.ipynb ├── invalidmedicalform.png └── validmedicalform.png ├── Chapter 13 ├── chapter13 Improving accuracy of document processing .ipynb ├── samplecheck.PNG └── text.py ├── Chapter 14 ├── chapter14-auditing-workflows-named-entity-detection-forGitHub.ipynb ├── input │ └── sample-loan-application.png └── train │ ├── entitylist.csv │ └── raw_txt.csv ├── Chapter 15 ├── cha15train.png ├── chapter15 classify documents with human in the loop.ipynb ├── chapter15retrain.png ├── documents │ └── train │ │ ├── Bank Statements │ │ ├── 3ba082d3b307398adaf9d55301831684.png │ │ ├── 445ea8e393ca62878d5e3a68f054a8e4.jpg │ │ ├── Howard Bank Sample Personal Bank Statement.jpg │ │ ├── bank statement template 07 (1)0.PNG │ │ ├── bank statement template 08 (1)0.PNG │ │ ├── bank statement template 09 (1)0.PNG │ │ ├── bank statement template 11 (1)0.PNG │ │ ├── bank statement template 11 (1)1.PNG │ │ ├── bank statement template 12 (1)0.PNG │ │ ├── bank statement template 12 (1)1.PNG │ │ ├── bank statement template 15 (1)1.PNG │ │ ├── bank statement template 15 (1)10.PNG │ │ ├── bank statement template 15 (1)11.PNG │ │ ├── bank statement template 15 (1)13.PNG │ │ ├── bank statement template 15 (1)14.PNG │ │ ├── bank statement template 15 (1)15.PNG │ │ ├── bank statement template 15 (1)17.PNG │ │ ├── bank statement template 15 (1)18.PNG │ │ ├── bank statement template 15 (1)2.PNG │ │ ├── bank statement template 15 (1)20.PNG │ │ ├── bank statement template 15 (1)21.PNG │ │ ├── bank statement template 15 (1)22.PNG │ │ ├── bank statement template 15 (1)23.PNG │ │ ├── bank statement template 15 (1)24.PNG │ │ ├── bank statement template 15 (1)25.PNG │ │ ├── bank statement template 15 (1)26.PNG │ │ ├── bank statement template 15 (1)27.PNG │ │ ├── bank statement template 15 (1)4.PNG │ │ ├── bank statement template 15 (1)5.PNG │ │ ├── bank statement template 15 (1)6.PNG │ │ ├── bank statement template 15 (1)8.PNG │ │ ├── bank statement template 15 (1)9.PNG │ │ ├── bank statement template 16 (1)0.PNG │ │ ├── bank statement template 16 (1)1.PNG │ │ ├── bank statement 
template 18 (1)0.PNG │ │ ├── bank statement template 20 (1)0.PNG │ │ ├── bank statement template 20 (1)1.PNG │ │ └── bank statement template 20 (1)2.PNG │ │ └── Pay Stubs │ │ ├── 29.PNG │ │ ├── 59f20ff3d6612.jpg │ │ ├── 59f212b5f02df.jpg │ │ ├── 59f213a817dd8.jpg │ │ ├── 59f214708f294.jpg │ │ ├── 59f214eabb325.jpg │ │ ├── 5acf71c1d4b14.jpg │ │ ├── 5acf72145500c.jpg │ │ ├── 5d945f376f89f101477294.jpg │ │ ├── 5d945f854fad0046913915.jpg │ │ ├── adp-sample-768x946.png │ │ ├── free-horizontal-paystub-template.png │ │ ├── free-long-creek-paystub-template.png │ │ ├── free-magenta-paystub-template.png │ │ ├── free-midnight-paystub-template.png │ │ ├── free-shamrock-paystub-template.png │ │ ├── free-sycamore-paystub-template.png │ │ ├── free-veritical blue-paystub-template.png │ │ ├── free-violet-paystub-template.png │ │ ├── free-white-paystub-template.png │ │ ├── pay-stub-with-logo-768x384.png │ │ ├── sample-pay-stub-2020-768x384.png │ │ ├── sample-pay-stub.png │ │ └── taxes-sample-pay-stub-768x614.png └── paystubsample.png ├── Chapter 16 ├── .DS_Store ├── Improve-accuracy-of-pdf-processing-with-Amazon-Textract-and-Amazon-A2I-forGitHub.ipynb ├── form-s20-LRHL-registration.pdf ├── form-s20-LRHL-registration_image.png ├── form-s20-SUBS1-registration.pdf ├── form-s20-SUBS1-registration_image.png ├── form-s20-SUBS2-registration.pdf ├── form-s20-SUBS2-registration_image.png └── tabular-sec.liquid.html ├── Chapter 17 ├── chapter17-deriving-insights-from-handwritten-content-forGitHub.ipynb ├── hw-receipt1.jpg ├── hw-receipt2.jpg └── qsmani-raw.json ├── LICENSE └── README.md /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/.DS_Store -------------------------------------------------------------------------------- /Chapter 02/Amazon Textract API Sample.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "!pip install amazon-textract-response-parser" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "\n", 19 | "import boto3\n", 20 | "from IPython.display import Image, display\n", 21 | "from trp import Document\n", 22 | "from PIL import Image as PImage, ImageDraw\n", 23 | "import time\n", 24 | "from IPython.display import IFrame" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "# In this section, we will deep dive into Amazon Textract APIs and its feature. \n", 32 | "Amazon Textract includes simple, easy-to-use APIs that can analyze image files and PDF files.\n", 33 | "Amazon Textract APIs can be classified into synchronous APIs for real time processing and asynchronous APIs for batch processing.\n", 34 | "We will deep dive into each:\n", 35 | "•\tSynchronous APIs(Real time processing use case)\n", 36 | "•\tAsynchronous APIs(Batch processing use cases)\n", 37 | "Synchronous APIs (Real time processing use case): There are two APIs which can help with real time analysis:\n", 38 | " Analyze Text \n", 39 | " Analyze Document API\n" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "# Curent AWS Region. 
Use this to choose corresponding S3 bucket with sample content\n", 49 | "\n", 50 | "mySession = boto3.session.Session()\n", 51 | "awsRegion = mySession.region_name" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "# S3 bucket that contains sample documents. Download the sample documents and craete an Amazon s3 Bucket \n", 61 | "\n", 62 | "s3BucketName = \"\"" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": null, 68 | "metadata": {}, 69 | "outputs": [], 70 | "source": [ 71 | "# Amazon S3 client\n", 72 | "s3 = boto3.client('s3')\n", 73 | "\n", 74 | "# Amazon Textract client\n", 75 | "textract = boto3.client('textract')" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [ 84 | "# 1. Detect text from image with" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": {}, 90 | "source": [ 91 | "https://docs.aws.amazon.com/textract/latest/dg/API_DetectDocumentText.html" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "metadata": {}, 98 | "outputs": [], 99 | "source": [ 100 | "# Document\n", 101 | "documentName = \"sample-invoice.png\"" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": null, 107 | "metadata": {}, 108 | "outputs": [], 109 | "source": [ 110 | "display(Image(filename=documentName))" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "metadata": {}, 117 | "outputs": [], 118 | "source": [ 119 | "# Read document content\n", 120 | "with open(documentName, 'rb') as document:\n", 121 | " imageBytes = bytearray(document.read())\n", 122 | "\n", 123 | "# Call Amazon Textract\n", 124 | "response = textract.detect_document_text(Document={'Bytes': imageBytes})\n" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": null, 130 | "metadata": {}, 131 | "outputs": [], 132 | "source": [ 133 | "import json\n", 134 | "\n", 135 | "print (json.dumps(response, indent=4, sort_keys=True))\n" 136 | ] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "metadata": {}, 141 | "source": [ 142 | "# 2. 
Detect text from S3 object" 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "metadata": {}, 148 | "source": [ 149 | "https://docs.aws.amazon.com/textract/latest/dg/API_DetectDocumentText.html" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "metadata": {}, 155 | "source": [ 156 | "## Lines and Words of Text - JSON Structure" 157 | ] 158 | }, 159 | { 160 | "cell_type": "markdown", 161 | "metadata": {}, 162 | "source": [ 163 | "https://docs.aws.amazon.com/textract/latest/dg/API_BoundingBox.html\n", 164 | "\n", 165 | "https://docs.aws.amazon.com/textract/latest/dg/text-location.html\n", 166 | "\n", 167 | "https://docs.aws.amazon.com/textract/latest/dg/how-it-works-lines-words.html" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": null, 173 | "metadata": {}, 174 | "outputs": [], 175 | "source": [ 176 | "# Reading order" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": null, 182 | "metadata": {}, 183 | "outputs": [], 184 | "source": [ 185 | "# Document\n", 186 | "documentName = \"two-column-image.jpeg\"" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": null, 192 | "metadata": {}, 193 | "outputs": [], 194 | "source": [ 195 | "display(Image(filename=documentName))" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": null, 201 | "metadata": {}, 202 | "outputs": [], 203 | "source": [ 204 | "import boto3\n", 205 | "\n", 206 | "s3 = boto3.resource('s3')\n", 207 | "s3.Bucket(s3BucketName).upload_file(documentName,documentName)" 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": null, 213 | "metadata": {}, 214 | "outputs": [], 215 | "source": [ 216 | "# Call Amazon Textract\n", 217 | "response = textract.detect_document_text(\n", 218 | " Document={\n", 219 | " 'S3Object': {\n", 220 | " 'Bucket': s3BucketName,\n", 221 | " 'Name': documentName\n", 222 | " }\n", 223 | " })\n", 224 | "\n", 225 | "print(response)" 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": null, 231 | "metadata": {}, 232 | "outputs": [], 233 | "source": [ 234 | "#using trp.py to parse the json into reading order\n", 235 | "doc = Document(response)\n", 236 | "for page in doc.pages:\n", 237 | " for line in page.getLinesInReadingOrder():\n", 238 | " print(line[1])" 239 | ] 240 | }, 241 | { 242 | "cell_type": "markdown", 243 | "metadata": {}, 244 | "source": [ 245 | "# Analyze Document API for tables and Forms: Key/Values" 246 | ] 247 | }, 248 | { 249 | "cell_type": "markdown", 250 | "metadata": {}, 251 | "source": [ 252 | "https://docs.aws.amazon.com/textract/latest/dg/API_AnalyzeDocument.html" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": null, 258 | "metadata": {}, 259 | "outputs": [], 260 | "source": [ 261 | "# Document\n", 262 | "documentName = \"sample-invoice.png\"" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": null, 268 | "metadata": {}, 269 | "outputs": [], 270 | "source": [ 271 | "display(Image(filename=documentName))" 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "execution_count": null, 277 | "metadata": {}, 278 | "outputs": [], 279 | "source": [ 280 | "\n", 281 | "s3.Bucket(s3BucketName).upload_file(documentName,documentName)" 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": null, 287 | "metadata": {}, 288 | "outputs": [], 289 | "source": [ 290 | "# Call Amazon Textract\n", 291 | "response = textract.analyze_document(\n", 292 | " 
Document={\n", 293 | " 'S3Object': {\n", 294 | " 'Bucket': s3BucketName,\n", 295 | " 'Name': documentName\n", 296 | " }\n", 297 | " },\n", 298 | " FeatureTypes=[\"FORMS\",\"TABLES\"])" 299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": null, 304 | "metadata": {}, 305 | "outputs": [], 306 | "source": [ 307 | "\n", 308 | "\n", 309 | "#print(response)\n", 310 | "\n", 311 | "doc = Document(response)\n", 312 | "\n", 313 | "for page in doc.pages:\n", 314 | " # Print fields\n", 315 | " print(\"Fields:\")\n", 316 | " for field in page.form.fields:\n", 317 | " print(\"Key: {}, Value: {}\".format(field.key, field.value))\n", 318 | "\n", 319 | " # Get field by key\n", 320 | " print(\"\\nGet Field by Key:\")\n", 321 | " key = \"Phone Number:\"\n", 322 | " field = page.form.getFieldByKey(key)\n", 323 | " if(field):\n", 324 | " print(\"Key: {}, Value: {}\".format(field.key, field.value))\n", 325 | "\n", 326 | " # Search fields by key\n", 327 | " print(\"\\nSearch Fields:\")\n", 328 | " key = \"address\"\n", 329 | " fields = page.form.searchFieldsByKey(key)\n", 330 | " for field in fields:\n", 331 | " print(\"Key: {}, Value: {}\".format(field.key, field.value))" 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": null, 337 | "metadata": {}, 338 | "outputs": [], 339 | "source": [ 340 | "doc = Document(response)\n", 341 | "\n", 342 | "for page in doc.pages:\n", 343 | " # Print tables\n", 344 | " for table in page.tables:\n", 345 | " for r, row in enumerate(table.rows):\n", 346 | " for c, cell in enumerate(row.cells):\n", 347 | " print(\"Table[{}][{}] = {}\".format(r, c, cell.text))" 348 | ] 349 | }, 350 | { 351 | "cell_type": "markdown", 352 | "metadata": {}, 353 | "source": [ 354 | "# 12. PDF Processing" 355 | ] 356 | }, 357 | { 358 | "cell_type": "markdown", 359 | "metadata": {}, 360 | "source": [ 361 | "https://docs.aws.amazon.com/textract/latest/dg/API_StartDocumentTextDetection.html\n", 362 | "https://docs.aws.amazon.com/textract/latest/dg/API_GetDocumentTextDetection.html\n", 363 | "https://docs.aws.amazon.com/textract/latest/dg/API_StartDocumentAnalysis.html\n", 364 | "https://docs.aws.amazon.com/textract/latest/dg/API_GetDocumentAnalysis.html" 365 | ] 366 | }, 367 | { 368 | "cell_type": "code", 369 | "execution_count": null, 370 | "metadata": {}, 371 | "outputs": [], 372 | "source": [ 373 | "def startJob(s3BucketName, objectName):\n", 374 | " response = None\n", 375 | " response = textract.start_document_text_detection(\n", 376 | " DocumentLocation={\n", 377 | " 'S3Object': {\n", 378 | " 'Bucket': s3BucketName,\n", 379 | " 'Name': objectName\n", 380 | " }\n", 381 | " })\n", 382 | "\n", 383 | " return response[\"JobId\"]\n", 384 | "\n", 385 | "def isJobComplete(jobId):\n", 386 | " response = textract.get_document_text_detection(JobId=jobId)\n", 387 | " status = response[\"JobStatus\"]\n", 388 | " print(\"Job status: {}\".format(status))\n", 389 | "\n", 390 | " while(status == \"IN_PROGRESS\"):\n", 391 | " time.sleep(5)\n", 392 | " response = textract.get_document_text_detection(JobId=jobId)\n", 393 | " status = response[\"JobStatus\"]\n", 394 | " print(\"Job status: {}\".format(status))\n", 395 | "\n", 396 | " return status\n", 397 | "\n", 398 | "def getJobResults(jobId):\n", 399 | "\n", 400 | " pages = []\n", 401 | " response = textract.get_document_text_detection(JobId=jobId)\n", 402 | " \n", 403 | " pages.append(response)\n", 404 | " print(\"Resultset page recieved: {}\".format(len(pages)))\n", 405 | " nextToken = None\n", 406 | " if('NextToken' 
in response):\n", 407 | " nextToken = response['NextToken']\n", 408 | "\n", 409 | " while(nextToken):\n", 410 | " response = textract.get_document_text_detection(JobId=jobId, NextToken=nextToken)\n", 411 | "\n", 412 | " pages.append(response)\n", 413 | " print(\"Resultset page recieved: {}\".format(len(pages)))\n", 414 | " nextToken = None\n", 415 | " if('NextToken' in response):\n", 416 | " nextToken = response['NextToken']\n", 417 | "\n", 418 | " return pages" 419 | ] 420 | }, 421 | { 422 | "cell_type": "code", 423 | "execution_count": null, 424 | "metadata": {}, 425 | "outputs": [], 426 | "source": [ 427 | "# Document\n", 428 | "documentName = \"job-application-form.pdf\"" 429 | ] 430 | }, 431 | { 432 | "cell_type": "code", 433 | "execution_count": null, 434 | "metadata": {}, 435 | "outputs": [], 436 | "source": [ 437 | "\n", 438 | "s3.Bucket(s3BucketName).upload_file(documentName,documentName)" 439 | ] 440 | }, 441 | { 442 | "cell_type": "code", 443 | "execution_count": null, 444 | "metadata": {}, 445 | "outputs": [], 446 | "source": [ 447 | "jobId = startJob(s3BucketName, documentName)\n", 448 | "print(\"Started job with id: {}\".format(jobId))\n", 449 | "if(isJobComplete(jobId)):\n", 450 | " response = getJobResults(jobId)\n", 451 | "\n", 452 | "#print(response)\n", 453 | "doc = Document(response)\n" 454 | ] 455 | }, 456 | { 457 | "cell_type": "code", 458 | "execution_count": null, 459 | "metadata": {}, 460 | "outputs": [], 461 | "source": [ 462 | "\n", 463 | "#Print detected text\n", 464 | "for page in doc.pages:\n", 465 | " for line in page.getLinesInReadingOrder():\n", 466 | " print(line[1])" 467 | ] 468 | }, 469 | { 470 | "cell_type": "markdown", 471 | "metadata": {}, 472 | "source": [ 473 | "# For Analyze expense API demo refer to Chapter 17 Visualizing Insights from handwritten content" 474 | ] 475 | }, 476 | { 477 | "cell_type": "markdown", 478 | "metadata": {}, 479 | "source": [ 480 | "# Clean UP" 481 | ] 482 | }, 483 | { 484 | "cell_type": "markdown", 485 | "metadata": {}, 486 | "source": [ 487 | "Delete the S3 bucket and sample documents from S3 https://docs.aws.amazon.com/AmazonS3/latest/userguide/delete-objects.html" 488 | ] 489 | } 490 | ], 491 | "metadata": { 492 | "kernelspec": { 493 | "display_name": "conda_python3", 494 | "language": "python", 495 | "name": "conda_python3" 496 | }, 497 | "language_info": { 498 | "codemirror_mode": { 499 | "name": "ipython", 500 | "version": 3 501 | }, 502 | "file_extension": ".py", 503 | "mimetype": "text/x-python", 504 | "name": "python", 505 | "nbconvert_exporter": "python", 506 | "pygments_lexer": "ipython3", 507 | "version": "3.6.13" 508 | } 509 | }, 510 | "nbformat": 4, 511 | "nbformat_minor": 2 512 | } 513 | -------------------------------------------------------------------------------- /Chapter 02/emp_app_printed.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 02/emp_app_printed.png -------------------------------------------------------------------------------- /Chapter 02/employment_history.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 02/employment_history.png -------------------------------------------------------------------------------- /Chapter 
02/form-derogatoire.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 02/form-derogatoire.jpg -------------------------------------------------------------------------------- /Chapter 02/job-application-form.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 02/job-application-form.pdf -------------------------------------------------------------------------------- /Chapter 02/receipt-image.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 02/receipt-image.png -------------------------------------------------------------------------------- /Chapter 02/sample-invoice.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 02/sample-invoice.png -------------------------------------------------------------------------------- /Chapter 02/two-column-image.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 02/two-column-image.jpeg -------------------------------------------------------------------------------- /Chapter 03/Chapter 3 Introduction to Amazon Comprehend.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "id": "331ec774", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "\n", 11 | "\n", 12 | "import boto3\n", 13 | "\n", 14 | "\n", 15 | "# Amazon Comprehend client\n", 16 | "comprehend = boto3.client('comprehend')\n" 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "id": "a06ecc2d", 22 | "metadata": {}, 23 | "source": [ 24 | "# Entity Extraction Text Analysis Real time API" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": null, 30 | "id": "b037763b", 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "SampleText=\"Packt is a publishing company founded in 2003 headquartered in Birmingham, UK, with offices in Mumbai, India. 
Packt primarily publishes print and electronic books and videos relating to information technology, including programming, web design, data analysis and hardware.\"" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": null, 40 | "id": "aa0adc0d", 41 | "metadata": {}, 42 | "outputs": [], 43 | "source": [ 44 | "response = comprehend.detect_entities(\n", 45 | " Text=SampleText,\n", 46 | " LanguageCode='en')" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "id": "b3ebd476", 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "import json\n", 57 | "\n", 58 | "print (json.dumps(response, indent=4, sort_keys=True))" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "id": "c33fe734", 64 | "metadata": {}, 65 | "source": [ 66 | "# Keyphrases Extraction Real Time APIs in French language" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "id": "f26663f7", 73 | "metadata": {}, 74 | "outputs": [], 75 | "source": [ 76 | "SampleText=\"Packt est une société d'édition fondée en 2003 dont le siège est à Birmingham, au Royaume-Uni, avec des bureaux à Mumbai, en Inde. Packt publie principalement des livres et des vidéos imprimés et électroniques relatifs aux technologies de l'information, y compris la programmation, la conception Web, l'analyse de données et le matériel\"" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": null, 82 | "id": "f32cee7e", 83 | "metadata": {}, 84 | "outputs": [], 85 | "source": [ 86 | "response = comprehend.detect_key_phrases(\n", 87 | " Text= SampleText,\n", 88 | " LanguageCode='fr'\n", 89 | ")" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": null, 95 | "id": "922c0674", 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [ 99 | "import json\n", 100 | "\n", 101 | "print (json.dumps(response, indent=4, sort_keys=True))" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "id": "6cb7dbca", 107 | "metadata": {}, 108 | "source": [ 109 | "# Batch Real time API for Amazon Detect Sentiment Demo \n", 110 | "Multiple Document Synchronous Processing mode where you can call Amazon Comprehend with a collection of up to 25 documents and receive a synchronous response. " 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "id": "75340b29", 116 | "metadata": {}, 117 | "source": [ 118 | "\n", 119 | "Packt Publication Book Reviews Setiment Analysis for the book \n", 120 | "\n", 121 | "40 Algorithms Every Programmer Should Know\n", 122 | "https://www.packtpub.com/product/40-algorithms-every-programmer-should-know/9781789801217" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "id": "881a0214", 128 | "metadata": {}, 129 | "source": [ 130 | "We are going to analyze some of the reviews for this book using batch snetiment analyis API." 
131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": null, 136 | "id": "b1702d83", 137 | "metadata": {}, 138 | "outputs": [], 139 | "source": [ 140 | "response = comprehend.batch_detect_sentiment(\n", 141 | " TextList=[\n", 142 | " 'Well this is an area of my interest and this book is packed with essential knowledge','kinda all in one With good examples and rather easy to follow', 'There are good examples and samples in the book.', '40 Algorithms every Programmer should know is a good start to a vast topic about algorithms'\n", 143 | " ],\n", 144 | " LanguageCode='en'\n", 145 | ")" 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": null, 151 | "id": "eb435781", 152 | "metadata": {}, 153 | "outputs": [], 154 | "source": [ 155 | "print (json.dumps(response, indent=4, sort_keys=True))" 156 | ] 157 | }, 158 | { 159 | "cell_type": "markdown", 160 | "id": "7e611cd7", 161 | "metadata": {}, 162 | "source": [ 163 | "# Since the book reviews were in different languages. lets Identify the various languages in this book review" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": null, 169 | "id": "280e8e6f", 170 | "metadata": {}, 171 | "outputs": [], 172 | "source": [ 173 | "response = comprehend.batch_detect_dominant_language(\n", 174 | " TextList=[\n", 175 | " 'It include recenet algorithm trend. it is very helpful.','Je ne lai pas encore lu entièrement mais le livre semble expliquer de façon suffisamment claire lensemble de ces algorithmes.'\n", 176 | " ]\n", 177 | ")" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": null, 183 | "id": "a4d8a274", 184 | "metadata": {}, 185 | "outputs": [], 186 | "source": [ 187 | "print (json.dumps(response, indent=4, sort_keys=True))" 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": null, 193 | "id": "9787222a", 194 | "metadata": {}, 195 | "outputs": [], 196 | "source": [] 197 | } 198 | ], 199 | "metadata": { 200 | "kernelspec": { 201 | "display_name": "conda_python3", 202 | "language": "python", 203 | "name": "conda_python3" 204 | }, 205 | "language_info": { 206 | "codemirror_mode": { 207 | "name": "ipython", 208 | "version": 3 209 | }, 210 | "file_extension": ".py", 211 | "mimetype": "text/x-python", 212 | "name": "python", 213 | "nbconvert_exporter": "python", 214 | "pygments_lexer": "ipython3", 215 | "version": "3.6.13" 216 | } 217 | }, 218 | "nbformat": 4, 219 | "nbformat_minor": 5 220 | } 221 | -------------------------------------------------------------------------------- /Chapter 04/Chapter 4 Compliance and control.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "6a204c35", 6 | "metadata": {}, 7 | "source": [ 8 | "# PII Detection and Redaction for setting compliance and control\n", 9 | "\n", 10 | "In this , we will be performing extracting the text from the documents using AWS Textract and then use Comprehend to perform pii detection. Then we will be using python function to redact that portion of the image. 
\n", 11 | "Here is conceptual architectural flow:\n", 12 | "\n", 13 | "![alt-text](piiredact.png)\n", 14 | "\n", 15 | "You can automate the entire end to end flow using step function and lambda for orchestration.\n", 16 | "\n", 17 | "We will walk you through following steps:\n", 18 | "\n", 19 | "## Step 1: Setup and install libraries \n", 20 | "## Step 2: Extract text from sample document\n", 21 | "## Step 3: Save the extracted text into text/csv file and uplaod to Amazon S3 bucket\n", 22 | "## Step 4: Check for PII using Amazon Comprehend Detect PII Sync API.\n", 23 | "## Step 5: Mask PII using Amazon Comprehend PII Analysis Job\n", 24 | "## Step 6: View the redacted/masked output in Amazon S3 Bucket\n" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "id": "bf848fe3", 30 | "metadata": {}, 31 | "source": [ 32 | "# Lets start with Step 1: Setup and install libraries" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "id": "34bcf133", 38 | "metadata": {}, 39 | "source": [ 40 | "import json\n", 41 | "import boto3\n", 42 | "import re\n", 43 | "import csv\n", 44 | "import sagemaker\n", 45 | "from sagemaker import get_execution_role\n", 46 | "from sagemaker.s3 import S3Uploader, S3Downloader\n", 47 | "import uuid\n", 48 | "import time\n", 49 | "import io\n", 50 | "from io import BytesIO\n", 51 | "import sys\n", 52 | "from pprint import pprint\n", 53 | "\n", 54 | "from IPython.display import Image, display\n", 55 | "from PIL import Image as PImage, ImageDraw" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": null, 61 | "id": "2ebd3fed", 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [ 65 | "!pip install amazon-textract-response-parser" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": null, 71 | "id": "4a8e44b8", 72 | "metadata": {}, 73 | "outputs": [], 74 | "source": [ 75 | "import pandas as pd\n", 76 | "import webbrowser, os\n", 77 | "import json\n", 78 | "import boto3\n", 79 | "import re\n", 80 | "import sagemaker\n", 81 | "from sagemaker import get_execution_role\n", 82 | "from sagemaker.s3 import S3Uploader, S3Downloader\n", 83 | "import uuid\n", 84 | "import time\n", 85 | "import io\n", 86 | "from io import BytesIO\n", 87 | "import sys\n", 88 | "from pprint import pprint\n", 89 | "\n", 90 | "from IPython.display import Image, display\n", 91 | "from PIL import Image as PImage, ImageDraw" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "id": "52f62fbd", 98 | "metadata": {}, 99 | "outputs": [], 100 | "source": [ 101 | "\n", 102 | "region = boto3.Session().region_name\n", 103 | "\n", 104 | "role = get_execution_role()\n", 105 | "print(role)\n", 106 | "\n", 107 | "bucket = sagemaker.Session().default_bucket()\n", 108 | "\n", 109 | "prefix = \"pii-detection-redaction\"\n", 110 | "bucket_path = \"https://s3-{}.amazonaws.com/{}\".format(region, bucket)\n" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "id": "167dde46", 116 | "metadata": {}, 117 | "source": [ 118 | "# Step 2: Extract text from sample document¶" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": null, 124 | "id": "475ed18a", 125 | "metadata": {}, 126 | "outputs": [], 127 | "source": [ 128 | "# Document\n", 129 | "documentName = \"bankstatement.JPG\"\n", 130 | "\n", 131 | "display(Image(filename=documentName))" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": null, 137 | "id": "772e4c06", 138 | "metadata": {}, 139 | "outputs": [], 140 | "source": [ 141 | "client = 
boto3.client(service_name='textract',\n", 142 | " region_name= 'us-east-1',\n", 143 | " endpoint_url='https://textract.us-east-1.amazonaws.com')\n", 144 | "\n", 145 | "with open(documentName, 'rb') as file:\n", 146 | " img_test = file.read()\n", 147 | " bytes_test = bytearray(img_test)\n", 148 | " print('Image loaded', documentName)\n", 149 | "\n", 150 | " # process using image bytes\n", 151 | "response = client.detect_document_text(Document={'Bytes': bytes_test})\n" 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": null, 157 | "id": "f994f746", 158 | "metadata": {}, 159 | "outputs": [], 160 | "source": [ 161 | "#Extract key values\n", 162 | "# Iterate over elements in the document\n", 163 | "from trp import Document\n", 164 | "\n", 165 | "\n", 166 | "doc = Document(response)\n", 167 | "page_string = ''\n", 168 | "for page in doc.pages:\n", 169 | " # Print lines and words\n", 170 | " \n", 171 | " for line in page.lines:\n", 172 | " #print((line.text))\n", 173 | " page_string += str(line.text)\n", 174 | "print(page_string)" 175 | ] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "id": "e4385858", 180 | "metadata": {}, 181 | "source": [ 182 | "# Step 3: Save the extracted text into text/csv file and uplaod to Amazon S3 bucket¶" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": null, 188 | "id": "b4367f93", 189 | "metadata": {}, 190 | "outputs": [], 191 | "source": [ 192 | "# Lets get the data into a text file\n", 193 | "text_filename = 'pii_data.txt'\n", 194 | "doc = Document(response)\n", 195 | "with open(text_filename, 'w', encoding='utf-8') as f:\n", 196 | " for page in doc.pages:\n", 197 | " # Print lines and words\n", 198 | " page_string = ''\n", 199 | " for line in page.lines:\n", 200 | " #print((line.text))\n", 201 | " page_string += str(line.text)\n", 202 | " #print(page_string)\n", 203 | " f.writelines(page_string + \"\\n\")" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": null, 209 | "id": "b62b48b3", 210 | "metadata": {}, 211 | "outputs": [], 212 | "source": [ 213 | "# Load the documents locally for later analysis\n", 214 | "with open(text_filename, \"r\") as fi:\n", 215 | " raw_texts = [line.strip() for line in fi.readlines()]" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": null, 221 | "id": "c40f2d22", 222 | "metadata": {}, 223 | "outputs": [], 224 | "source": [ 225 | "import boto3\n", 226 | "\n", 227 | "s3 = boto3.resource('s3')\n", 228 | "s3.Bucket(bucket).upload_file(\"pii_data.txt\", \"pii-detection-redaction/pii_data.txt\")" 229 | ] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "id": "69a5cb4c", 234 | "metadata": {}, 235 | "source": [ 236 | "# Step 4: Check for PII using Amazon Comprehend Detect PII Sync API" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": null, 242 | "id": "641fed7c", 243 | "metadata": {}, 244 | "outputs": [], 245 | "source": [ 246 | "comprehend = boto3.client(service_name='comprehend')" 247 | ] 248 | }, 249 | { 250 | "cell_type": "code", 251 | "execution_count": null, 252 | "id": "12ce5e4f", 253 | "metadata": {}, 254 | "outputs": [], 255 | "source": [ 256 | "# Call Amazon Comprehend and pass it the aggregated text from our image.\n", 257 | "\n", 258 | "piilist=comprehend.detect_pii_entities(Text = page_string, LanguageCode='en')\n", 259 | "redacted_box_color='red'\n", 260 | "dpi = 72\n", 261 | "pii_detection_threshold = 0.00\n", 262 | "print ('Finding PII text...')\n", 263 | "not_redacted=0\n", 264 | 
"redacted=0\n", 265 | "for pii in piilist['Entities']:\n", 266 | " print(pii['Type'])\n", 267 | " if pii['Score'] > pii_detection_threshold:\n", 268 | " print (\"detected as type '\"+pii['Type']+\"' and will be redacted.\")\n", 269 | " redacted+=1\n", 270 | " \n", 271 | " else:\n", 272 | " print (\" was detected as type '\"+pii['Type']+\"', but did not meet the confidence score threshold and will not be redacted.\")\n", 273 | " not_redacted+=1\n", 274 | "\n", 275 | "\n", 276 | "print (\"Found\", redacted, \"text boxes to redact.\")\n", 277 | "print (not_redacted, \"additional text boxes were detected, but did not meet the confidence score threshold.\")" 278 | ] 279 | }, 280 | { 281 | "cell_type": "markdown", 282 | "id": "69c52a92", 283 | "metadata": {}, 284 | "source": [ 285 | "# Step 5: Mask PII using Amazon Comprehend PII Analysis Job" 286 | ] 287 | }, 288 | { 289 | "cell_type": "markdown", 290 | "id": "275acf23", 291 | "metadata": {}, 292 | "source": [ 293 | "We will use StartPiiEntitiesDetectionJob API\n", 294 | "\n", 295 | "StartPiiEntitiesDetectionJob API starts an asynchronous PII entity detection job for a collection of documents.\n", 296 | "\n", 297 | "We would be using this API to perform pii detection and redaction for pii_data.txt which we had inspected above.\n" 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": null, 303 | "id": "ee42299c", 304 | "metadata": {}, 305 | "outputs": [], 306 | "source": [ 307 | "import uuid\n", 308 | "InputS3URI= \"s3://\"+bucket+ \"/pii-detection-redaction/pii_data.txt\"\n", 309 | "print(InputS3URI)\n", 310 | "OutputS3URI=\"s3://\"+bucket+\"/pii-detection-redaction\"\n", 311 | "print(OutputS3URI)\n", 312 | "job_uuid = uuid.uuid1()\n", 313 | "job_name = f\"pii-job-{job_uuid}\"" 314 | ] 315 | }, 316 | { 317 | "cell_type": "markdown", 318 | "id": "56212a64", 319 | "metadata": {}, 320 | "source": [ 321 | "# Adding Amazon Comprehend as an additional trusted entity to this role\n", 322 | "\n", 323 | "This step is needed if you want to pass the execution role of this Notebook while calling Comprehend APIs as well without creating an additional Role. \n", 324 | "\n", 325 | "\n", 326 | "\n", 327 | "On the IAM dashboard, please click on Roles on the left sidenav and search for this Role. Once the Role appears, click on the Role to go to its Summary page. Click on the Trust relationships tab on the Summary page to add Amazon Comprehend as an additional trusted entity.\n", 328 | "\n", 329 | "Click on **Edit trust relationship** and replace the JSON with this JSON.\n", 330 | "```\n", 331 | "{\n", 332 | " \"Version\": \"2012-10-17\",\n", 333 | " \"Statement\": [\n", 334 | " {\n", 335 | " \"Effect\": \"Allow\",\n", 336 | " \"Principal\": {\n", 337 | " \"Service\": \"comprehend.amazonaws.com\"\n", 338 | " },\n", 339 | " \"Action\": \"sts:AssumeRole\"\n", 340 | " }\n", 341 | " ]\n", 342 | "}\n", 343 | "```\n", 344 | "\n", 345 | "Once this is complete, click on Update Trust Policy and you are done." 
346 | ] 347 | }, 348 | { 349 | "cell_type": "code", 350 | "execution_count": null, 351 | "id": "1ac2943e", 352 | "metadata": {}, 353 | "outputs": [], 354 | "source": [ 355 | "role_name = role[role.rfind('/') + 1:]\n", 356 | "print(\"https://console.aws.amazon.com/iam/home?region={0}#/roles/{1}\".format(region, role_name))\n" 357 | ] 358 | }, 359 | { 360 | "cell_type": "code", 361 | "execution_count": null, 362 | "id": "dc816112", 363 | "metadata": {}, 364 | "outputs": [], 365 | "source": [ 366 | "\n", 367 | "response = comprehend.start_pii_entities_detection_job(\n", 368 | " InputDataConfig={\n", 369 | " 'S3Uri': InputS3URI,\n", 370 | " 'InputFormat': 'ONE_DOC_PER_FILE'\n", 371 | " },\n", 372 | " OutputDataConfig={\n", 373 | " 'S3Uri': OutputS3URI\n", 374 | " \n", 375 | " },\n", 376 | " Mode='ONLY_REDACTION',\n", 377 | " RedactionConfig={\n", 378 | " 'PiiEntityTypes': [\n", 379 | " 'ALL',\n", 380 | " ],\n", 381 | " 'MaskMode': 'MASK',\n", 382 | " 'MaskCharacter': '*'\n", 383 | " },\n", 384 | " DataAccessRoleArn = role,\n", 385 | " JobName=job_name,\n", 386 | " LanguageCode='en',\n", 387 | " \n", 388 | ")" 389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "execution_count": null, 394 | "id": "038aca17", 395 | "metadata": {}, 396 | "outputs": [], 397 | "source": [ 398 | "# Get the job ID\n", 399 | "events_job_id = response['JobId']\n", 400 | "job = comprehend.describe_pii_entities_detection_job(JobId=events_job_id)\n", 401 | "print(job)" 402 | ] 403 | }, 404 | { 405 | "cell_type": "markdown", 406 | "id": "d5c84f74", 407 | "metadata": {}, 408 | "source": [ 409 | "\n", 410 | "The job will take roughly 6-7 minutes. \n", 411 | "The below code is to check the status of the job. \n", 412 | "The cell execution would be completed after the job is completed.\n", 413 | "In case the job fails you can check the logs and status in AWS Console https://console.aws.amazon.com/comprehend/v2/home?region=us-east-1#analysis\n", 414 | "and try re running the job if you get this failure reason:\n", 415 | " NO_WRITE_ACCESS_TO_OUTPUT: The provided data access role does not have write access to the output S3 URI." 
416 | ] 417 | }, 418 | { 419 | "cell_type": "code", 420 | "execution_count": null, 421 | "id": "aeb94ab2", 422 | "metadata": {}, 423 | "outputs": [], 424 | "source": [ 425 | "from time import sleep\n", 426 | "# Get current job status\n", 427 | "job = comprehend.describe_pii_entities_detection_job(JobId=events_job_id)\n", 428 | "print(job)\n", 429 | "# Loop until job is completed\n", 430 | "waited = 0\n", 431 | "timeout_minutes = 10\n", 432 | "while job['PiiEntitiesDetectionJobProperties']['JobStatus'] != 'COMPLETED':\n", 433 | " sleep(60)\n", 434 | " waited += 60\n", 435 | " assert waited//60 < timeout_minutes, \"Job timed out after %d seconds.\" % waited\n", 436 | " job = comprehend.describe_pii_entities_detection_job(JobId=events_job_id)" 437 | ] 438 | }, 439 | { 440 | "cell_type": "code", 441 | "execution_count": null, 442 | "id": "6fe60743", 443 | "metadata": {}, 444 | "outputs": [], 445 | "source": [ 446 | "print(response)" 447 | ] 448 | }, 449 | { 450 | "cell_type": "markdown", 451 | "id": "8a44d579", 452 | "metadata": {}, 453 | "source": [ 454 | "# Step 6: View the redacted/masked output in Amazon S3 Bucket¶" 455 | ] 456 | }, 457 | { 458 | "cell_type": "code", 459 | "execution_count": null, 460 | "id": "4bb82ee2", 461 | "metadata": {}, 462 | "outputs": [], 463 | "source": [ 464 | "filename=\"pii_data.txt\"\n", 465 | "output_data_s3_file = job['PiiEntitiesDetectionJobProperties']['OutputDataConfig']['S3Uri'] + filename + '.out'\n", 466 | "print(output_data_s3_file)" 467 | ] 468 | }, 469 | { 470 | "cell_type": "code", 471 | "execution_count": null, 472 | "id": "21be6b75", 473 | "metadata": {}, 474 | "outputs": [], 475 | "source": [ 476 | "\n", 477 | "# The output filename is the input filename + \".out\"\n", 478 | "s3_client = boto3.client(service_name='s3')\n", 479 | "filename=\"pii_data.txt\"\n", 480 | "output_data_s3_file = job['PiiEntitiesDetectionJobProperties']['OutputDataConfig']['S3Uri'] + filename + '.out'\n", 481 | "print(output_data_s3_file)\n", 482 | "output_data_s3_filepath=output_data_s3_file.split(\"//\")[1].split(\"/\")[1]+\"/\"+output_data_s3_file.split(\"//\")[1].split(\"/\")[2]+\"/\"+output_data_s3_file.split(\"//\")[1].split(\"/\")[3]+\"/\"+output_data_s3_file.split(\"//\")[1].split(\"/\")[4]\n", 483 | "print(output_data_s3_filepath)\n", 484 | "\n", 485 | "f = BytesIO()\n", 486 | "s3_client.download_fileobj(bucket, output_data_s3_filepath, f)\n", 487 | "f.seek(0)\n", 488 | "print(f.getvalue())" 489 | ] 490 | }, 491 | { 492 | "cell_type": "markdown", 493 | "id": "686fc93b", 494 | "metadata": {}, 495 | "source": [ 496 | "Clean Up!\n", 497 | "\n", 498 | "Delete Amazon S3 Bucket https://docs.aws.amazon.com/AmazonS3/latest/userguide/delete-bucket.html" 499 | ] 500 | } 501 | ], 502 | "metadata": { 503 | "kernelspec": { 504 | "display_name": "conda_python3", 505 | "language": "python", 506 | "name": "conda_python3" 507 | }, 508 | "language_info": { 509 | "codemirror_mode": { 510 | "name": "ipython", 511 | "version": 3 512 | }, 513 | "file_extension": ".py", 514 | "mimetype": "text/x-python", 515 | "name": "python", 516 | "nbconvert_exporter": "python", 517 | "pygments_lexer": "ipython3", 518 | "version": "3.6.13" 519 | } 520 | }, 521 | "nbformat": 4, 522 | "nbformat_minor": 5 523 | } 524 | -------------------------------------------------------------------------------- /Chapter 04/bankstatement.JPG: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 04/bankstatement.JPG -------------------------------------------------------------------------------- /Chapter 04/piiredact.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 04/piiredact.png -------------------------------------------------------------------------------- /Chapter 05/Ch05-Kendra Search.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "id": "4f8f6eb9", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "\n", 11 | "import pandas as pd\n", 12 | "import webbrowser, os\n", 13 | "import json\n", 14 | "import boto3\n", 15 | "import re\n", 16 | "import sagemaker\n", 17 | "from sagemaker import get_execution_role\n", 18 | "from sagemaker.s3 import S3Uploader, S3Downloader\n", 19 | "import uuid\n", 20 | "import time\n", 21 | "import io\n", 22 | "from io import BytesIO\n", 23 | "import sys\n", 24 | "import csv\n", 25 | "from pprint import pprint\n", 26 | "from IPython.display import Image, display\n", 27 | "from PIL import Image as PImage, ImageDraw\n", 28 | "\n", 29 | "# Define IAM role\n", 30 | "role = get_execution_role()\n", 31 | "print(\"RoleArn: {}\".format(role))\n", 32 | "sess = sagemaker.Session()\n", 33 | "s3BucketName = \"\"\n", 34 | "prefix = 'chapter5'\n", 35 | "\n", 36 | "s3 = boto3.client('s3')" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": null, 42 | "id": "dd320b14", 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "\n", 47 | "# initialize the boto3 handle for comprehend\n", 48 | "comprehend = boto3.client('comprehend')\n", 49 | "textract= boto3.client('textract')\n", 50 | "kendra= boto3.client('kendra')" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "id": "49ded142", 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "# Document\n", 61 | "documentName = \"resume_Sample.pdf\"" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": null, 67 | "id": "aa7e908a", 68 | "metadata": {}, 69 | "outputs": [], 70 | "source": [ 71 | "\n", 72 | "s3 = boto3.resource('s3')\n", 73 | "s3.Bucket(s3BucketName).upload_file(documentName,documentName)" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": null, 79 | "id": "d6097210", 80 | "metadata": {}, 81 | "outputs": [], 82 | "source": [ 83 | "def startJob(s3BucketName, objectName):\n", 84 | " response = None\n", 85 | " response = textract.start_document_text_detection(\n", 86 | " DocumentLocation={\n", 87 | " 'S3Object': {\n", 88 | " 'Bucket': s3BucketName,\n", 89 | " 'Name': objectName\n", 90 | " }\n", 91 | " })\n", 92 | "\n", 93 | " return response[\"JobId\"]\n", 94 | "\n", 95 | "def isJobComplete(jobId):\n", 96 | " response = textract.get_document_text_detection(JobId=jobId)\n", 97 | " status = response[\"JobStatus\"]\n", 98 | " print(\"Job status: {}\".format(status))\n", 99 | "\n", 100 | " while(status == \"IN_PROGRESS\"):\n", 101 | " time.sleep(5)\n", 102 | " response = textract.get_document_text_detection(JobId=jobId)\n", 103 | " status = response[\"JobStatus\"]\n", 104 | " print(\"Job status: {}\".format(status))\n", 105 | "\n", 106 | " return 
status\n", 107 | "\n", 108 | "def getJobResults(jobId):\n", 109 | "\n", 110 | " pages = []\n", 111 | " response = textract.get_document_text_detection(JobId=jobId)\n", 112 | " \n", 113 | " pages.append(response)\n", 114 | " print(\"Resultset page recieved: {}\".format(len(pages)))\n", 115 | " nextToken = None\n", 116 | " if('NextToken' in response):\n", 117 | " nextToken = response['NextToken']\n", 118 | "\n", 119 | " while(nextToken):\n", 120 | " response = textract.get_document_text_detection(JobId=jobId, NextToken=nextToken)\n", 121 | "\n", 122 | " pages.append(response)\n", 123 | " print(\"Resultset page recieved: {}\".format(len(pages)))\n", 124 | " nextToken = None\n", 125 | " if('NextToken' in response):\n", 126 | " nextToken = response['NextToken']\n", 127 | "\n", 128 | " return pages" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "id": "e4988775", 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [ 138 | "\n", 139 | "\n", 140 | "jobId = startJob(s3BucketName, documentName)\n", 141 | "print(\"Started job with id: {}\".format(jobId))\n", 142 | "if(isJobComplete(jobId)):\n", 143 | " response = getJobResults(jobId)\n", 144 | "\n", 145 | "#print(response)\n", 146 | "\n", 147 | "\n" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": null, 153 | "id": "ed4a963e", 154 | "metadata": {}, 155 | "outputs": [], 156 | "source": [ 157 | "# Print detected text\n", 158 | "text=\"\"\n", 159 | "for resultPage in response:\n", 160 | " for item in resultPage[\"Blocks\"]:\n", 161 | " if item[\"BlockType\"] == \"LINE\":\n", 162 | " #print ('\\033[94m' + item[\"Text\"] + '\\033[0m')\n", 163 | " text += item['Text']+\"\\n\"\n", 164 | "print(text)" 165 | ] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "id": "58dba8da", 170 | "metadata": {}, 171 | "source": [ 172 | "# Call Amazon Comprehend" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": null, 178 | "id": "c65788e6", 179 | "metadata": {}, 180 | "outputs": [], 181 | "source": [ 182 | "entities= comprehend.detect_entities(Text=text, LanguageCode='en')\n" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": null, 188 | "id": "e096d750", 189 | "metadata": {}, 190 | "outputs": [], 191 | "source": [ 192 | "print(json.dumps(entities, sort_keys=True, indent=4))" 193 | ] 194 | }, 195 | { 196 | "cell_type": "markdown", 197 | "id": "f7ffc4c2", 198 | "metadata": {}, 199 | "source": [ 200 | "# Create Kendra Index \n", 201 | "go to Kendra console https://console.aws.amazon.com/kendra/home?region=us-east-1#indexes/create\n", 202 | "to create an index by following book instructions and skip creating using API.\n", 203 | " \n", 204 | "Alternatively, Please craete an IAM role and provide in Role ARN, \n", 205 | "\n", 206 | "https://docs.aws.amazon.com/kendra/latest/dg/deploying.html" 207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "execution_count": null, 212 | "id": "3b2abed6", 213 | "metadata": {}, 214 | "outputs": [], 215 | "source": [ 216 | "# run this code only once as it will craete multiple indexes\n", 217 | "#response = kendra.create_index(\n", 218 | "# Name='Search',\n", 219 | "# Edition='DEVELOPER_EDITION',\n", 220 | "# RoleArn='')\n", 221 | "#print(response)\n" 222 | ] 223 | }, 224 | { 225 | "cell_type": "markdown", 226 | "id": "e5b4f468", 227 | "metadata": {}, 228 | "source": [ 229 | "Get IndexId from Console and paste it in ID or run above code to create Index which will give 36 digit Index ID." 
230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": null, 235 | "id": "c525a4d3", 236 | "metadata": {}, 237 | "outputs": [], 238 | "source": [ 239 | "response = kendra.update_index(\n", 240 | " Id=\"\",\n", 241 | " DocumentMetadataConfigurationUpdates=[\n", 242 | " {\n", 243 | " 'Name':'ORGANIZATION',\n", 244 | " 'Type':'STRING_LIST_VALUE',\n", 245 | " 'Search': {\n", 246 | " 'Facetable': True,\n", 247 | " 'Searchable': True,\n", 248 | " 'Displayable': True\n", 249 | " }\n", 250 | " },\n", 251 | " {\n", 252 | " 'Name':'PERSON',\n", 253 | " 'Type':'STRING_LIST_VALUE',\n", 254 | " 'Search': {\n", 255 | " 'Facetable': False,\n", 256 | " 'Searchable': True,\n", 257 | " 'Displayable': True\n", 258 | " }\n", 259 | " },\n", 260 | " {\n", 261 | " 'Name':'DATE',\n", 262 | " 'Type':'STRING_LIST_VALUE',\n", 263 | " 'Search': {\n", 264 | " 'Facetable': False,\n", 265 | " 'Searchable': True,\n", 266 | " 'Displayable': True\n", 267 | " }\n", 268 | " },\n", 269 | " {\n", 270 | " 'Name':'COMMERCIAL_ITEM',\n", 271 | " 'Type':'STRING_LIST_VALUE',\n", 272 | " 'Search': {\n", 273 | " 'Facetable': True,\n", 274 | " 'Searchable': False,\n", 275 | " 'Displayable': True\n", 276 | " }\n", 277 | " },\n", 278 | " {\n", 279 | " 'Name':'OTHER',\n", 280 | " 'Type':'STRING_LIST_VALUE',\n", 281 | " 'Search': {\n", 282 | " 'Facetable': True,\n", 283 | " 'Searchable': True,\n", 284 | " 'Displayable': True\n", 285 | " }\n", 286 | " }\n", 287 | " ,\n", 288 | " {\n", 289 | " 'Name':'QUANTITY',\n", 290 | " 'Type':'STRING_LIST_VALUE',\n", 291 | " 'Search': {\n", 292 | " 'Facetable': True,\n", 293 | " 'Searchable': True,\n", 294 | " 'Displayable': True\n", 295 | " }\n", 296 | " }\n", 297 | " ,\n", 298 | " {\n", 299 | " 'Name':'TITLE',\n", 300 | " 'Type':'STRING_LIST_VALUE',\n", 301 | " 'Search': {\n", 302 | " 'Facetable': False,\n", 303 | " 'Searchable': True,\n", 304 | " 'Displayable': True\n", 305 | " }\n", 306 | " }\n", 307 | " ])\n", 308 | " \n", 309 | "print(response)" 310 | ] 311 | }, 312 | { 313 | "cell_type": "code", 314 | "execution_count": null, 315 | "id": "1a32c481", 316 | "metadata": {}, 317 | "outputs": [], 318 | "source": [ 319 | "#List of categories recognized by Comprehend \n", 320 | "categories = [\"ORGANIZATION\", \"PERSON\", \"DATE\", \"COMMERCIAL_ITEM\", \"OTHER\", \"TITLE\", \"QUANTITY\"]" 321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": null, 326 | "id": "f56eccf1", 327 | "metadata": {}, 328 | "outputs": [], 329 | "source": [ 330 | "#List of JSON objects to store entities\n", 331 | "entity_data = dict()\n", 332 | "#List of observed text strings recognized as categories\n", 333 | "category_text = dict()\n", 334 | "#Frequency of each text string\n", 335 | "text_frequency = dict()\n", 336 | "#The Kendra attributes JSON object with metadata list to be populated\n", 337 | "attributes = dict()\n", 338 | "metadata = dict()" 339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "execution_count": null, 344 | "id": "80a6b07e", 345 | "metadata": {}, 346 | "outputs": [], 347 | "source": [ 348 | "for et in categories:\n", 349 | " entity_data[et] = set()\n", 350 | " #print(entity_data[et])\n", 351 | " category_text[et] = []\n", 352 | " text_frequency[et] = dict()" 353 | ] 354 | }, 355 | { 356 | "cell_type": "code", 357 | "execution_count": null, 358 | "id": "5fffe276", 359 | "metadata": {}, 360 | "outputs": [], 361 | "source": [ 362 | "for e in entities[\"Entities\"]:\n", 363 | " if (e[\"Text\"].isprintable()) and (not \"\\\"\" in e[\"Text\"]) and (not 
e[\"Text\"].upper() in category_text[e[\"Type\"]]):\n", 364 | " #Append the text to entity data to be used for a Kendra custom attribute\n", 365 | " entity_data[e[\"Type\"]].add(e[\"Text\"])\n", 366 | " #Keep track of text in upper case so that we don't treat the same text written in different cases differently\n", 367 | " category_text[e[\"Type\"]].append(e[\"Text\"].upper())\n", 368 | " #Keep track of the frequency of the text so that we can take the text with highest frequency of occurrance\n", 369 | " text_frequency[e[\"Type\"]][e[\"Text\"].upper()] = 1\n", 370 | " elif (e[\"Text\"].upper() in category_text[e[\"Type\"]]):\n", 371 | " #Keep track of the frequency of the text so that we can take the text with highest frequency of occurrance\n", 372 | " text_frequency[e[\"Type\"]][e[\"Text\"].upper()] += 1\n", 373 | "\n", 374 | "print(entity_data)" 375 | ] 376 | }, 377 | { 378 | "cell_type": "code", 379 | "execution_count": null, 380 | "id": "6fa3c949", 381 | "metadata": {}, 382 | "outputs": [], 383 | "source": [ 384 | "#Populate the metadata list\n", 385 | "elimit = 10\n", 386 | "for et in categories:\n", 387 | " #Take at most elimit number of recognized text strings having the highest frequency of occurrance\n", 388 | " el = [pair[0] for pair in sorted(text_frequency[et].items(), key=lambda item: item[1], reverse=True)][0:elimit]\n", 389 | " metadata[et] = [d for d in entity_data[et] if d.upper() in el]\n", 390 | "metadata[\"_source_uri\"] = documentName\n", 391 | "attributes[\"Attributes\"] = metadata\n", 392 | "print(json.dumps(attributes, sort_keys=True, indent=4))" 393 | ] 394 | }, 395 | { 396 | "cell_type": "code", 397 | "execution_count": null, 398 | "id": "7d3fe867", 399 | "metadata": {}, 400 | "outputs": [], 401 | "source": [ 402 | "with open(\"metadata.json\", \"w\") as f:\n", 403 | " json.dump(attributes, f)" 404 | ] 405 | }, 406 | { 407 | "cell_type": "code", 408 | "execution_count": null, 409 | "id": "d7abba45", 410 | "metadata": {}, 411 | "outputs": [], 412 | "source": [ 413 | "s3 = boto3.client('s3')\n", 414 | "prefix= 'meta/'\n", 415 | "with open(\"metadata.json\", \"rb\") as f:\n", 416 | " #s3.upload_fileobj(f,s3BucketName, prefix+\"resume_Sample.pdf.metadata.json\")\n", 417 | " s3.upload_file( \"metadata.json\", s3BucketName,'%s/%s' % (\"meta\",\"resume_Sample.pdf.metadata.json\"))\n", 418 | "print(\"Uploaded to Amazon S3 meta folder\")" 419 | ] 420 | }, 421 | { 422 | "cell_type": "markdown", 423 | "id": "36d2cec1", 424 | "metadata": {}, 425 | "source": [ 426 | "# Run Kendra Sync in AWS Console" 427 | ] 428 | }, 429 | { 430 | "cell_type": "markdown", 431 | "id": "b64a0f5c", 432 | "metadata": {}, 433 | "source": [ 434 | "# Clean UP" 435 | ] 436 | }, 437 | { 438 | "cell_type": "markdown", 439 | "id": "161959b7", 440 | "metadata": {}, 441 | "source": [ 442 | "# Delete the Amazon S3 Data source and the Kendra Index \n", 443 | "https://docs.aws.amazon.com/kendra/latest/dg/delete-data-source.html" 444 | ] 445 | } 446 | ], 447 | "metadata": { 448 | "kernelspec": { 449 | "display_name": "conda_python3", 450 | "language": "python", 451 | "name": "conda_python3" 452 | }, 453 | "language_info": { 454 | "codemirror_mode": { 455 | "name": "ipython", 456 | "version": 3 457 | }, 458 | "file_extension": ".py", 459 | "mimetype": "text/x-python", 460 | "name": "python", 461 | "nbconvert_exporter": "python", 462 | "pygments_lexer": "ipython3", 463 | "version": "3.6.13" 464 | } 465 | }, 466 | "nbformat": 4, 467 | "nbformat_minor": 5 468 | } 469 | 
-------------------------------------------------------------------------------- /Chapter 05/lambda/index.py: -------------------------------------------------------------------------------- 1 | from elasticsearch import Elasticsearch, RequestsHttpConnection 2 | import requests 3 | from aws_requests_auth.aws_auth import AWSRequestsAuth 4 | from requests_aws4auth import AWS4Auth 5 | import base64 6 | from s3transfer.manager import TransferManager 7 | import os 8 | import os.path 9 | import sys 10 | import boto3 11 | import json 12 | import io 13 | from io import BytesIO 14 | import sys 15 | from trp import Document 16 | 17 | try: 18 | from urllib.parse import unquote_plus 19 | except ImportError: 20 | from urllib import unquote_plus 21 | 22 | 23 | print('setting up boto3') 24 | 25 | root = os.environ["LAMBDA_TASK_ROOT"] 26 | sys.path.insert(0, root) 27 | print(boto3.__version__) 28 | print('core path setup') 29 | s3 = boto3.resource('s3') 30 | s3client = boto3.client('s3') 31 | 32 | host= os.environ['esDomain'] 33 | print("ES DOMAIN IS..........") 34 | region=os.environ['AWS_REGION'] 35 | 36 | service = 'es' 37 | credentials = boto3.Session().get_credentials() 38 | 39 | def connectES(): 40 | print ('Connecting to the ES Endpoint {0}') 41 | awsauth = AWS4Auth(credentials.access_key, 42 | credentials.secret_key, 43 | region, service, 44 | session_token=credentials.token) 45 | try: 46 | es = Elasticsearch( 47 | hosts=[{'host': host, 'port': 443}], 48 | http_auth = awsauth, 49 | use_ssl=True, 50 | verify_certs=True, 51 | connection_class=RequestsHttpConnection) 52 | return es 53 | except Exception as E: 54 | print("Unable to connect to {0}") 55 | print(E) 56 | exit(3) 57 | print("sucess seting up es") 58 | 59 | print("setting up Textract") 60 | # get the results 61 | textract = boto3.client( 62 | service_name='textract', 63 | region_name=region) 64 | 65 | print('initializing comprehend') 66 | comprehend = boto3.client(service_name='comprehend', region_name=region) 67 | print('done') 68 | 69 | def outputForm(page): 70 | csvData = [] 71 | for field in page.form.fields: 72 | csvItem = [] 73 | if(field.key): 74 | csvItem.append(field.key.text) 75 | else: 76 | csvItem.append("") 77 | if(field.value): 78 | csvItem.append(field.value.text) 79 | else: 80 | csvItem.append("") 81 | csvData.append(csvItem) 82 | return csvData 83 | 84 | def outputTable(page): 85 | csvData = [] 86 | print("//////////////////") 87 | #print(page) 88 | for table in page.tables: 89 | csvRow = [] 90 | csvRow.append("Table") 91 | csvData.append(csvRow) 92 | for row in table.rows: 93 | csvRow = [] 94 | for cell in row.cells: 95 | csvRow.append(cell.text) 96 | csvData.append(csvRow) 97 | csvData.append([]) 98 | csvData.append([]) 99 | return csvData 100 | # --------------- Main Lambda Handler ------------------ 101 | 102 | 103 | def handler(event, context): 104 | print("Received event: " + json.dumps(event, indent=2)) 105 | 106 | # Get the object from the event and show its content type 107 | bucket = event['Records'][0]['s3']['bucket']['name'] 108 | key = unquote_plus(event['Records'][0]['s3']['object']['key']) 109 | print("key is"+key) 110 | print("bucket is"+bucket) 111 | text="" 112 | textvalues=[] 113 | textvalues_entity={} 114 | try: 115 | s3.Bucket(bucket).download_file(Key=key,Filename='/tmp/{}') 116 | # Read document content 117 | with open('/tmp/{}', 'rb') as document: 118 | imageBytes = bytearray(document.read()) 119 | print("Object downloaded") 120 | response = textract.analyze_document(Document={'Bytes': 
imageBytes},FeatureTypes=["TABLES", "FORMS"]) 121 | document = Document(response) 122 | table=[] 123 | forms=[] 124 | #print(document) 125 | for page in document.pages: 126 | table = outputTable(page) 127 | forms = outputForm(page) 128 | print(table) 129 | blocks=response['Blocks'] 130 | for block in blocks: 131 | if block['BlockType'] == 'LINE': 132 | text += block['Text']+"\n" 133 | print(text) 134 | # Extracting Key Phrases 135 | keyphrase_response = comprehend.detect_key_phrases(Text=text, LanguageCode='en') 136 | KeyPhraseList=keyphrase_response.get("KeyPhrases") 137 | for s in KeyPhraseList: 138 | textvalues.append(s.get("Text")) 139 | 140 | detect_entity= comprehend.detect_entities(Text=text, LanguageCode='en') 141 | EntityList=detect_entity.get("Entities") 142 | for s in EntityList: 143 | textvalues_entity.update([(s.get("Type").strip('\t\n\r'),s.get("Text").strip('\t\n\r'))]) 144 | 145 | s3url= 'https://s3.console.aws.amazon.com/s3/object/'+bucket+'/'+key+'?region='+region 146 | 147 | searchdata={'s3link':s3url,'KeyPhrases':textvalues,'Entity':textvalues_entity,'text':text, 'table':table, 'forms':forms} 148 | print(searchdata) 149 | print("connecting to ES") 150 | es=connectES() 151 | #es.index(index="resume-search", doc_type="_doc", body=searchdata) 152 | es.index(index="document", doc_type="_doc", body=searchdata) 153 | print("data uploaded to Elasticsearch") 154 | return 'keyphrases Successfully Uploaded' 155 | except Exception as e: 156 | print(e) 157 | print('Error: ') 158 | raise e 159 | -------------------------------------------------------------------------------- /Chapter 05/resume_Sample.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 05/resume_Sample.pdf -------------------------------------------------------------------------------- /Chapter 05/resume_sample.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 05/resume_sample.PNG -------------------------------------------------------------------------------- /Chapter 05/template-export-textract.yml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: 2010-09-09 2 | Transform: 3 | - AWS::Serverless-2016-10-31 4 | Parameters: 5 | DOMAINNAME: 6 | Description: Name for the Amazon ES domain that this template will create. Domain 7 | names must start with a lowercase letter and must be between 3 and 28 characters. 8 | Valid characters are a-z (lowercase only), 0-9. 
9 | Type: String 10 | Default: documentsearchapp 11 | CognitoAdminEmail: 12 | Type: String 13 | Default: abc@amazon.com 14 | AllowedPattern: ^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$ 15 | Description: E-mail address of the Cognito admin name 16 | Mappings: 17 | SourceCode: 18 | General: 19 | S3Bucket: solutions 20 | KeyPrefix: centralized-logging/v2.2.0 21 | Resources: 22 | ComprehendKeyPhraseAnalysis: 23 | Properties: 24 | Description: Triggered by S3 review upload to the repo bucket and start the 25 | key phrase analysis via Amazon Comprehend 26 | Handler: comprehend.handler 27 | MemorySize: 128 28 | Policies: 29 | Statement: 30 | - Sid: comprehend 31 | Effect: Allow 32 | Action: 33 | - comprehend:* 34 | Resource: '*' 35 | - Sid: textract 36 | Effect: Allow 37 | Action: 38 | - textract:* 39 | Resource: '*' 40 | - Sid: s3 41 | Effect: Allow 42 | Action: 43 | - s3:*Object 44 | Resource: 45 | Fn::Sub: arn:aws:s3:::${S3}/* 46 | - Sid: es 47 | Effect: Allow 48 | Action: 49 | - es:* 50 | Resource: '*' 51 | Environment: 52 | Variables: 53 | bucket: 54 | Ref: S3 55 | esDomain: 56 | Fn::GetAtt: 57 | - ElasticsearchDomain 58 | - DomainEndpoint 59 | Runtime: python3.6 60 | Timeout: 300 61 | CodeUri: s3://forindexing/3c0a3b1c981cda97ffabeb704fd0abd2 62 | Type: AWS::Serverless::Function 63 | S3: 64 | Type: AWS::S3::Bucket 65 | TestS3BucketEventPermission: 66 | Type: AWS::Lambda::Permission 67 | Properties: 68 | Action: lambda:invokeFunction 69 | SourceAccount: 70 | Ref: AWS::AccountId 71 | FunctionName: 72 | Ref: ComprehendKeyPhraseAnalysis 73 | SourceArn: 74 | Fn::GetAtt: 75 | - S3 76 | - Arn 77 | Principal: s3.amazonaws.com 78 | ApplyNotificationFunctionRole: 79 | Type: AWS::IAM::Role 80 | Properties: 81 | AssumeRolePolicyDocument: 82 | Version: '2012-10-17' 83 | Statement: 84 | - Effect: Allow 85 | Principal: 86 | Service: lambda.amazonaws.com 87 | Action: sts:AssumeRole 88 | ManagedPolicyArns: 89 | - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole 90 | Path: / 91 | Policies: 92 | - PolicyName: S3BucketNotificationPolicy 93 | PolicyDocument: 94 | Version: '2012-10-17' 95 | Statement: 96 | - Sid: AllowBucketNotification 97 | Effect: Allow 98 | Action: s3:PutBucketNotification 99 | Resource: 100 | - Fn::Sub: arn:aws:s3:::${S3} 101 | - Fn::Sub: arn:aws:s3:::${S3}/* 102 | ApplyBucketNotificationFunction: 103 | Type: AWS::Lambda::Function 104 | Properties: 105 | Description: Dummy function, just logs the received event 106 | Handler: index.handler 107 | Runtime: python3.9 108 | Role: 109 | Fn::GetAtt: 110 | - ApplyNotificationFunctionRole 111 | - Arn 112 | Timeout: 240 113 | Code: 114 | ZipFile: "import boto3\nimport logging\nimport json\nimport cfnresponse\n\n\ 115 | s3Client = boto3.client('s3')\nlogger = logging.getLogger()\nlogger.setLevel(logging.DEBUG)\n\ 116 | \ndef addBucketNotification(bucketName, notificationId, functionArn):\n\ 117 | \ notificationResponse = s3Client.put_bucket_notification_configuration(\n\ 118 | \ Bucket=bucketName,\n NotificationConfiguration={\n 'LambdaFunctionConfigurations':\ 119 | \ [\n {\n 'Id': notificationId,\n 'LambdaFunctionArn':\ 120 | \ functionArn,\n 'Events': [\n 's3:ObjectCreated:*'\n\ 121 | \ ]\n },\n ]\n }\n )\n return notificationResponse\n\ 122 | \ndef create(properties, physical_id):\n bucketName = properties['S3Bucket']\n\ 123 | \ notificationId = properties['NotificationId']\n functionArn = properties['FunctionARN']\n\ 124 | \ response = addBucketNotification(bucketName, notificationId, functionArn)\n\ 125 | \ 
logger.info('AddBucketNotification response: %s' % json.dumps(response))\n\ 126 | \ return cfnresponse.SUCCESS, physical_id\n\ndef update(properties, physical_id):\n\ 127 | \ return cfnresponse.SUCCESS, None\n\ndef delete(properties, physical_id):\n\ 128 | \ return cfnresponse.SUCCESS, None\n\ndef handler(event, context):\n logger.info('Received\ 129 | \ event: %s' % json.dumps(event))\n\n status = cfnresponse.FAILED\n new_physical_id\ 130 | \ = None\n\n try:\n properties = event.get('ResourceProperties')\n \ 131 | \ physical_id = event.get('PhysicalResourceId')\n\n status, new_physical_id\ 132 | \ = {\n 'Create': create,\n 'Update': update,\n 'Delete':\ 133 | \ delete\n }.get(event['RequestType'], lambda x, y: (cfnresponse.FAILED,\ 134 | \ None))(properties, physical_id)\n except Exception as e:\n logger.error('Exception:\ 135 | \ %s' % e)\n status = cfnresponse.FAILED\n finally:\n cfnresponse.send(event,\ 136 | \ context, status, {}, new_physical_id)\n" 137 | UserPool: 138 | Type: AWS::Cognito::UserPool 139 | Properties: 140 | UserPoolName: 141 | Fn::Sub: ${DOMAINNAME}_kibana_access 142 | AutoVerifiedAttributes: 143 | - email 144 | MfaConfiguration: 'OFF' 145 | EmailVerificationSubject: 146 | Ref: AWS::StackName 147 | Schema: 148 | - Name: name 149 | AttributeDataType: String 150 | Mutable: true 151 | Required: true 152 | - Name: email 153 | AttributeDataType: String 154 | Mutable: false 155 | Required: true 156 | UserPoolGroup: 157 | Type: AWS::Cognito::UserPoolGroup 158 | Properties: 159 | Description: User pool group for Kibana access 160 | GroupName: 161 | Fn::Sub: ${DOMAINNAME}_kibana_access_group 162 | Precedence: 0 163 | UserPoolId: 164 | Ref: UserPool 165 | UserPoolClient: 166 | Type: AWS::Cognito::UserPoolClient 167 | Properties: 168 | ClientName: 169 | Fn::Sub: ${DOMAINNAME}-client 170 | GenerateSecret: false 171 | UserPoolId: 172 | Ref: UserPool 173 | IdentityPool: 174 | Type: AWS::Cognito::IdentityPool 175 | Properties: 176 | IdentityPoolName: 177 | Fn::Sub: ${DOMAINNAME}Identity 178 | AllowUnauthenticatedIdentities: true 179 | CognitoIdentityProviders: 180 | - ClientId: 181 | Ref: UserPoolClient 182 | ProviderName: 183 | Fn::GetAtt: 184 | - UserPool 185 | - ProviderName 186 | CognitoUnAuthorizedRole: 187 | Type: AWS::IAM::Role 188 | Properties: 189 | AssumeRolePolicyDocument: 190 | Version: '2012-10-17' 191 | Statement: 192 | - Effect: Allow 193 | Principal: 194 | Federated: cognito-identity.amazonaws.com 195 | Action: 196 | - sts:AssumeRoleWithWebIdentity 197 | Condition: 198 | StringEquals: 199 | cognito-identity.amazonaws.com:aud: 200 | Ref: IdentityPool 201 | ForAnyValue:StringLike: 202 | cognito-identity.amazonaws.com:amr: unauthenticated 203 | Policies: 204 | - PolicyName: CognitoUnauthorizedPolicy 205 | PolicyDocument: 206 | Version: '2012-10-17' 207 | Statement: 208 | - Effect: Allow 209 | Action: 210 | - mobileanalytics:PutEvents 211 | - cognito-sync:BulkPublish 212 | - cognito-sync:DescribeIdentityPoolUsage 213 | - cognito-sync:GetBulkPublishDetails 214 | - cognito-sync:GetCognitoEvents 215 | - cognito-sync:GetIdentityPoolConfiguration 216 | - cognito-sync:ListIdentityPoolUsage 217 | - cognito-sync:SetCognitoEvents 218 | - congito-sync:SetIdentityPoolConfiguration 219 | Resource: 220 | Fn::Sub: arn:aws:cognito-identity:${AWS::Region}:${AWS::AccountId}:identitypool/${IdentityPool} 221 | CognitoAuthorizedRole: 222 | Type: AWS::IAM::Role 223 | Properties: 224 | AssumeRolePolicyDocument: 225 | Version: '2012-10-17' 226 | Statement: 227 | - Effect: Allow 228 | 
Principal: 229 | Federated: cognito-identity.amazonaws.com 230 | Action: 231 | - sts:AssumeRoleWithWebIdentity 232 | Condition: 233 | StringEquals: 234 | cognito-identity.amazonaws.com:aud: 235 | Ref: IdentityPool 236 | ForAnyValue:StringLike: 237 | cognito-identity.amazonaws.com:amr: authenticated 238 | Policies: 239 | - PolicyName: CognitoAuthorizedPolicy 240 | PolicyDocument: 241 | Version: '2012-10-17' 242 | Statement: 243 | - Effect: Allow 244 | Action: 245 | - mobileanalytics:PutEvents 246 | - cognito-sync:BulkPublish 247 | - cognito-sync:DescribeIdentityPoolUsage 248 | - cognito-sync:GetBulkPublishDetails 249 | - cognito-sync:GetCognitoEvents 250 | - cognito-sync:GetIdentityPoolConfiguration 251 | - cognito-sync:ListIdentityPoolUsage 252 | - cognito-sync:SetCognitoEvents 253 | - congito-sync:SetIdentityPoolConfiguration 254 | - cognito-identity:DeleteIdentityPool 255 | - cognito-identity:DescribeIdentityPool 256 | - cognito-identity:GetIdentityPoolRoles 257 | - cognito-identity:GetOpenIdTokenForDeveloperIdentity 258 | - cognito-identity:ListIdentities 259 | - cognito-identity:LookupDeveloperIdentity 260 | - cognito-identity:MergeDeveloperIdentities 261 | - cognito-identity:UnlikeDeveloperIdentity 262 | - cognito-identity:UpdateIdentityPool 263 | Resource: 264 | Fn::Sub: arn:aws:cognito-identity:${AWS::Region}:${AWS::AccountId}:identitypool/${IdentityPool} 265 | CognitoESAccessRole: 266 | Type: AWS::IAM::Role 267 | Properties: 268 | ManagedPolicyArns: 269 | - arn:aws:iam::aws:policy/AmazonESCognitoAccess 270 | AssumeRolePolicyDocument: 271 | Version: '2012-10-17' 272 | Statement: 273 | - Effect: Allow 274 | Principal: 275 | Service: es.amazonaws.com 276 | Action: 277 | - sts:AssumeRole 278 | IdentityPoolRoleMapping: 279 | Type: AWS::Cognito::IdentityPoolRoleAttachment 280 | Properties: 281 | IdentityPoolId: 282 | Ref: IdentityPool 283 | Roles: 284 | authenticated: 285 | Fn::GetAtt: 286 | - CognitoAuthorizedRole 287 | - Arn 288 | unauthenticated: 289 | Fn::GetAtt: 290 | - CognitoUnAuthorizedRole 291 | - Arn 292 | AdminUser: 293 | Type: AWS::Cognito::UserPoolUser 294 | Properties: 295 | DesiredDeliveryMediums: 296 | - EMAIL 297 | UserAttributes: 298 | - Name: email 299 | Value: 300 | Ref: CognitoAdminEmail 301 | Username: 302 | Ref: CognitoAdminEmail 303 | UserPoolId: 304 | Ref: UserPool 305 | SetupESCognito: 306 | Type: Custom::SetupESCognito 307 | Version: 1.0 308 | Properties: 309 | ServiceToken: 310 | Fn::GetAtt: 311 | - LambdaESCognito 312 | - Arn 313 | Domain: 314 | Ref: DOMAINNAME 315 | CognitoDomain: 316 | Fn::Sub: ${DOMAINNAME}-${AWS::AccountId} 317 | UserPoolId: 318 | Ref: UserPool 319 | IdentityPoolId: 320 | Ref: IdentityPool 321 | RoleArn: 322 | Fn::GetAtt: 323 | - CognitoESAccessRole 324 | - Arn 325 | LambdaESCognito: 326 | Type: AWS::Lambda::Function 327 | Properties: 328 | Description: Centralized Logging - Lambda function to enable cognito authentication 329 | for kibana 330 | Environment: 331 | Variables: 332 | LOG_LEVEL: INFO 333 | Handler: index.handler 334 | Runtime: nodejs12.x 335 | Timeout: 600 336 | Role: 337 | Fn::GetAtt: 338 | - LambdaESCognitoRole 339 | - Arn 340 | Code: 341 | S3Bucket: 342 | Fn::Join: 343 | - '-' 344 | - - Fn::FindInMap: 345 | - SourceCode 346 | - General 347 | - S3Bucket 348 | - Ref: AWS::Region 349 | S3Key: 350 | Fn::Join: 351 | - / 352 | - - Fn::FindInMap: 353 | - SourceCode 354 | - General 355 | - KeyPrefix 356 | - clog-auth.zip 357 | LambdaESCognitoRole: 358 | Type: AWS::IAM::Role 359 | DependsOn: ElasticsearchDomain 360 | Properties: 
361 | AssumeRolePolicyDocument: 362 | Version: '2012-10-17' 363 | Statement: 364 | - Effect: Allow 365 | Principal: 366 | Service: 367 | - lambda.amazonaws.com 368 | Action: 369 | - sts:AssumeRole 370 | Path: / 371 | Policies: 372 | - PolicyName: root 373 | PolicyDocument: 374 | Version: '2012-10-17' 375 | Statement: 376 | - Effect: Allow 377 | Action: 378 | - logs:CreateLogGroup 379 | - logs:CreateLogStream 380 | - logs:PutLogEvents 381 | Resource: arn:aws:logs:*:*:* 382 | - Effect: Allow 383 | Action: 384 | - es:UpdateElasticsearchDomainConfig 385 | Resource: 386 | Fn::Sub: arn:aws:es:${AWS::Region}:${AWS::AccountId}:domain/${DOMAINNAME} 387 | - Effect: Allow 388 | Action: 389 | - cognito-idp:CreateUserPoolDomain 390 | - cognito-idp:DeleteUserPoolDomain 391 | Resource: 392 | Fn::GetAtt: 393 | - UserPool 394 | - Arn 395 | - Effect: Allow 396 | Action: 397 | - iam:PassRole 398 | Resource: 399 | Fn::GetAtt: 400 | - CognitoESAccessRole 401 | - Arn 402 | ElasticsearchDomain: 403 | Type: AWS::Elasticsearch::Domain 404 | Properties: 405 | DomainName: 406 | Ref: DOMAINNAME 407 | ElasticsearchVersion: '6.3' 408 | ElasticsearchClusterConfig: 409 | InstanceCount: '1' 410 | InstanceType: t2.small.elasticsearch 411 | EBSOptions: 412 | EBSEnabled: true 413 | Iops: 0 414 | VolumeSize: 10 415 | VolumeType: gp2 416 | SnapshotOptions: 417 | AutomatedSnapshotStartHour: '0' 418 | AccessPolicies: 419 | Version: '2012-10-17' 420 | Statement: 421 | - Action: es:* 422 | Principal: 423 | AWS: 424 | Fn::Sub: 425 | - arn:aws:sts::${AWS::AccountId}:assumed-role/${AuthRole}/CognitoIdentityCredentials 426 | - AuthRole: 427 | Ref: CognitoAuthorizedRole 428 | Effect: Allow 429 | Resource: 430 | Fn::Sub: arn:aws:es:${AWS::Region}:${AWS::AccountId}:domain/${DOMAINNAME}/* 431 | ApplyNotification: 432 | Type: Custom::ApplyNotification 433 | Properties: 434 | ServiceToken: 435 | Fn::GetAtt: 436 | - ApplyBucketNotificationFunction 437 | - Arn 438 | S3Bucket: 439 | Ref: S3 440 | FunctionARN: 441 | Fn::GetAtt: 442 | - ComprehendKeyPhraseAnalysis 443 | - Arn 444 | NotificationId: S3ObjectCreatedEvent 445 | Outputs: 446 | S3KeyPhraseBucket: 447 | Value: 448 | Fn::Sub: https://console.aws.amazon.com/s3/buckets/${S3}/?region=us-east-1 449 | KibanaLoginURL: 450 | Description: Kibana login URL 451 | Value: 452 | Fn::Sub: https://${ElasticsearchDomain.DomainEndpoint}/_plugin/kibana/ 453 | -------------------------------------------------------------------------------- /Chapter 07/lib/test.py: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Chapter 07/project_path.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | 4 | module_path = os.path.abspath(os.path.join(os.pardir)) 5 | if module_path not in sys.path: 6 | sys.path.append(module_path) -------------------------------------------------------------------------------- /Chapter 08/contextual-ad-marking-for-content-monetization-with-nlp-github.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "f1da84ae", 6 | "metadata": {}, 7 | "source": [ 8 | "# Using NLP for content monetization\n", 9 | "This is an accompanying notebook to Chapter 8 of the book - Natural Language Processing with AWS AI Services. 
Please do not use this notebook directly as there are prerequisites and dependent steps required to be performed as documented in the book. Briefly in this chapter, we look at a use case of how to use AWS services specifically NLP to enable monetization of your video content. The following high level steps (along with where the instructions are) walk through the solution:\n", 10 | "1. Upload a video file to an Amazon S3 bucket - Refer to the book\n", 11 | "2. Use AWS Elemental MediaConvert to create brodcast streams - Refer to the book\n", 12 | "3. Run a transcription of the video file using Amazon Transcribe - Refer to this notebook\n", 13 | "4. Run an Amazon Comprehend Topic Modeling job to extract topics - Refer to this notebook\n", 14 | "5. Select the ad markers based on topics extracted - Refer to this notebook\n", 15 | "6. Stitch into an Ad decision server URL - Refer to this notebook\n", 16 | "7. Create an AWS Elemental MediaTailor configuration - Refer to the book\n", 17 | "8. Play the ad embedded video to test - Refer to the book" 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "id": "66010cce", 23 | "metadata": {}, 24 | "source": [ 25 | "## Transcribe section" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": null, 31 | "id": "6ebea48c", 32 | "metadata": {}, 33 | "outputs": [], 34 | "source": [ 35 | "import pandas as pd\n", 36 | "import json\n", 37 | "import boto3\n", 38 | "import re\n", 39 | "import uuid\n", 40 | "import time\n", 41 | "import io\n", 42 | "import os\n", 43 | "from io import BytesIO\n", 44 | "import sys\n", 45 | "import csv\n", 46 | "from IPython.display import Image, display\n", 47 | "from PIL import Image as PImage, ImageDraw" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": null, 53 | "id": "db26300e", 54 | "metadata": {}, 55 | "outputs": [], 56 | "source": [ 57 | "# create topic-modeling/raw folder we need down the line\n", 58 | "directory = \"topic-modeling\"\n", 59 | "parent_dir = os.getcwd()\n", 60 | " \n", 61 | "# Path\n", 62 | "path = os.path.join(parent_dir, directory)\n", 63 | "os.makedirs(path, exist_ok = True)\n", 64 | "print(\"Directory '%s' created successfully\" %directory)\n", 65 | "\n", 66 | "directory = \"raw\"\n", 67 | "parent_dir = os.getcwd()+'/topic-modeling'\n", 68 | " \n", 69 | "# Path\n", 70 | "path = os.path.join(parent_dir, directory)\n", 71 | "os.makedirs(path, exist_ok = True)\n", 72 | "print(\"Directory '%s' created successfully\" %directory)" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": null, 78 | "id": "21c72aea", 79 | "metadata": {}, 80 | "outputs": [], 81 | "source": [ 82 | "bucket=''\n", 83 | "prefix='chapter8'\n", 84 | "s3=boto3.client('s3')" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": null, 90 | "id": "6e304bdf", 91 | "metadata": {}, 92 | "outputs": [], 93 | "source": [ 94 | "import time\n", 95 | "import boto3\n", 96 | "\n", 97 | "def transcribe_file(job_name, file_uri, transcribe_client):\n", 98 | " transcribe_client.start_transcription_job(\n", 99 | " TranscriptionJobName=job_name,\n", 100 | " Media={'MediaFileUri': file_uri},\n", 101 | " MediaFormat='mp4',\n", 102 | " LanguageCode='en-US'\n", 103 | " )" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": null, 109 | "id": "3a28030a", 110 | "metadata": {}, 111 | "outputs": [], 112 | "source": [ 113 | "job_name = 'media-monetization-transcribe-3'" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": null, 119 | "id": 
"fbf4f49f", 120 | "metadata": {}, 121 | "outputs": [], 122 | "source": [ 123 | "transcribe_client = boto3.client('transcribe')\n", 124 | "file_uri = 's3://'+bucket+'/'+prefix+'/'+'rawvideo/bank-demo-prem-ranga.mp4'\n", 125 | "transcribe_file(job_name, file_uri, transcribe_client)" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": null, 131 | "id": "c6d2d81b", 132 | "metadata": {}, 133 | "outputs": [], 134 | "source": [ 135 | "job = transcribe_client.get_transcription_job(TranscriptionJobName=job_name)\n", 136 | "job_status = job['TranscriptionJob']['TranscriptionJobStatus']\n", 137 | "if job_status in ['COMPLETED', 'FAILED']:\n", 138 | " print(f\"Job {job_name} is {job_status}.\")\n", 139 | " if job_status == 'COMPLETED':\n", 140 | " print(f\"Download the transcript from\\n\"\n", 141 | " f\"\\t{job['TranscriptionJob']['Transcript']['TranscriptFileUri']}\")" 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "id": "19c1d2c6", 147 | "metadata": {}, 148 | "source": [ 149 | "## Comprehend Topic Modeling Section" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "id": "0f76a306", 155 | "metadata": {}, 156 | "source": [ 157 | "### First get the transcript" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": null, 163 | "id": "35d14975", 164 | "metadata": {}, 165 | "outputs": [], 166 | "source": [ 167 | "# Load the csv file into a Pandas DataFrame for easy manipulation\n", 168 | "raw_df = pd.read_json(job['TranscriptionJob']['Transcript']['TranscriptFileUri'])\n", 169 | "raw_df.shape" 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": null, 175 | "id": "ac7e18e1", 176 | "metadata": {}, 177 | "outputs": [], 178 | "source": [ 179 | "raw_df.head()" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": null, 185 | "id": "2fe83f93", 186 | "metadata": {}, 187 | "outputs": [], 188 | "source": [ 189 | "# Let's drop the rest of the columns, we only need the transcript for our solution\n", 190 | "raw_df = pd.DataFrame(raw_df.at['transcripts','results'].copy())" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": null, 196 | "id": "454f1765", 197 | "metadata": {}, 198 | "outputs": [], 199 | "source": [ 200 | "#Convert this back to the CSV file\n", 201 | "raw_df.to_csv('topic-modeling/raw/transcript.csv', header=False, index=False)" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": null, 207 | "id": "b30f7c6a", 208 | "metadata": {}, 209 | "outputs": [], 210 | "source": [ 211 | "directory = \"job-input\"\n", 212 | "parent_dir = os.getcwd()+'/topic-modeling'\n", 213 | "# Path\n", 214 | "path = os.path.join(parent_dir, directory)\n", 215 | "os.makedirs(path, exist_ok = True)\n", 216 | "print(\"Directory '%s' created successfully\" %directory)" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": null, 222 | "id": "f7249ec9", 223 | "metadata": {}, 224 | "outputs": [], 225 | "source": [ 226 | "import csv\n", 227 | "# Run Regex expression to create a list of sentences\n", 228 | "folderpath = r\"topic-modeling/raw\" # make sure to put the 'r' in front and provide the folder where your files are\n", 229 | "filepaths = [os.path.join(folderpath, name) for name in os.listdir(folderpath) if not name.startswith('.')] # do not select hidden directories\n", 230 | "fnfull = \"topic-modeling/job-input/transcript_formatted.csv\"\n", 231 | "for path in filepaths:\n", 232 | " print(path)\n", 233 | " with open(path, 'r') as f:\n", 
234 | " content = f.read() # Read the whole file\n", 235 | " lines = content.split('.') # a list of all sentences\n", 236 | " with open(fnfull, \"w\", encoding='utf-8') as ff:\n", 237 | " csv_writer = csv.writer(ff, delimiter=',', quotechar = '\"')\n", 238 | " for num,line in enumerate(lines): # for each sentence\n", 239 | " csv_writer.writerow([line])\n", 240 | "f.close()" 241 | ] 242 | }, 243 | { 244 | "cell_type": "code", 245 | "execution_count": null, 246 | "id": "0cc5f379", 247 | "metadata": {}, 248 | "outputs": [], 249 | "source": [ 250 | "# Upload the CSV file to the input prefix in S3 to be used in the topic modeling job\n", 251 | "s3.upload_file('topic-modeling/job-input/transcript_formatted.csv', bucket, prefix+'/topic-modeling/job-input/tm-input.csv')\n", 252 | "print('s3://'+bucket+'/'+prefix+'/topic-modeling/job-input/tm-input.csv')" 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "id": "42f1014b", 258 | "metadata": {}, 259 | "source": [ 260 | "### Now follow the instructions in the book to run the topic modeling job from the Amazon Comprehend console" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "id": "1b4862fa", 266 | "metadata": {}, 267 | "source": [ 268 | "### Process Topic Modeling Results" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": null, 274 | "id": "47baa716", 275 | "metadata": {}, 276 | "outputs": [], 277 | "source": [ 278 | "# Let's first download the results of the topic modeling job. \n", 279 | "# Please copy the output data location from your topic modeling job for this step and use it below\n", 280 | "\n", 281 | "# create topic-modeling/results folder\n", 282 | "directory = \"results\"\n", 283 | "parent_dir = os.getcwd()+'/topic-modeling'\n", 284 | " \n", 285 | "# Path\n", 286 | "path = os.path.join(parent_dir, directory)\n", 287 | "os.makedirs(path, exist_ok = True)\n", 288 | "print(\"Directory '%s' created successfully\" %directory)\n", 289 | "\n", 290 | "#tpprefix = prefix+'/'+''\n", 291 | "tpprefix = prefix+'/'+'topic-modeling//output/output.tar.gz'\n", 292 | "s3.download_file(bucket, tpprefix, 'topic-modeling/results/output.tar.gz')\n", 293 | "!tar -xzvf topic-modeling/results/output.tar.gz" 294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "execution_count": null, 299 | "id": "f29067b8", 300 | "metadata": {}, 301 | "outputs": [], 302 | "source": [ 303 | "# Now load each of the resulting CSV files to their own DataFrames\n", 304 | "tt_df = pd.read_csv('topic-terms.csv')\n", 305 | "dt_df = pd.read_csv('doc-topics.csv')" 306 | ] 307 | }, 308 | { 309 | "cell_type": "code", 310 | "execution_count": null, 311 | "id": "915342d0", 312 | "metadata": {}, 313 | "outputs": [], 314 | "source": [ 315 | "# the topic terms DataFrame contains the topic number, what term corresponds to the topic, and \n", 316 | "# the weightage of this term contributing to the topic\n", 317 | "for i,x in tt_df.iterrows():\n", 318 | " print(str(x['topic'])+\":\"+x['term']+\":\"+str(x['weight']))" 319 | ] 320 | }, 321 | { 322 | "cell_type": "code", 323 | "execution_count": null, 324 | "id": "525911ff", 325 | "metadata": {}, 326 | "outputs": [], 327 | "source": [ 328 | "# We may have multiple topics in the same line, but for this example we are not interested in these duplicates, so we will drop it\n", 329 | "dt_df = dt_df.drop_duplicates(subset=['docname'])" 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": null, 335 | "id": "fa01262f", 336 | "metadata": {}, 337 | "outputs": [], 338 | "source": [ 
339 | "# Filter the rows in the mean range of weightage for a topic\n", 340 | "ttdf_max = tt_df.groupby(['topic'], sort=False)['weight'].max()" 341 | ] 342 | }, 343 | { 344 | "cell_type": "code", 345 | "execution_count": null, 346 | "id": "01f1fc1c", 347 | "metadata": {}, 348 | "outputs": [], 349 | "source": [ 350 | "ttdf_max.head()" 351 | ] 352 | }, 353 | { 354 | "cell_type": "code", 355 | "execution_count": null, 356 | "id": "b8dc6243", 357 | "metadata": {}, 358 | "outputs": [], 359 | "source": [ 360 | "# Load these into its own DataFrame and remove terms that are masked\n", 361 | "newtt_df = pd.DataFrame()\n", 362 | "for x in ttdf_max:\n", 363 | " newtt_df = newtt_df.append(tt_df.query('weight == @x'))\n", 364 | "newtt_df = newtt_df.reset_index(drop=True) \n", 365 | "adtopic = newtt_df.at[0,'term']" 366 | ] 367 | }, 368 | { 369 | "cell_type": "markdown", 370 | "id": "2e83f342", 371 | "metadata": {}, 372 | "source": [ 373 | "## Ad marking for Media Tailor\n", 374 | "I have provided a sample csv containing content metadata for looking up ads. For this example, we'll use the topics we discovered from our topic modeling job as the key to fetch the cmsid & vid. We will then substitute these in the VAST ad marker URL before creating the AWS Elemental Media Tailor configuration." 375 | ] 376 | }, 377 | { 378 | "cell_type": "code", 379 | "execution_count": null, 380 | "id": "939a893b", 381 | "metadata": {}, 382 | "outputs": [], 383 | "source": [ 384 | "#Get the ad content for marking our input video\n", 385 | "adindex_df = pd.read_csv('media-content/ad-index.csv', header=None, index_col=0)\n", 386 | "adindex_df" 387 | ] 388 | }, 389 | { 390 | "cell_type": "markdown", 391 | "id": "43ac583c", 392 | "metadata": {}, 393 | "source": [ 394 | "#### Lookup a topic from our ad index file" 395 | ] 396 | }, 397 | { 398 | "cell_type": "code", 399 | "execution_count": null, 400 | "id": "097ed2a7", 401 | "metadata": {}, 402 | "outputs": [], 403 | "source": [ 404 | "#Lookup the cmsid and vid for content as the topic\n", 405 | "advalue = adindex_df.loc[adtopic]\n", 406 | "print(advalue[1] + \" and \" + advalue[2])" 407 | ] 408 | }, 409 | { 410 | "cell_type": "code", 411 | "execution_count": null, 412 | "id": "71095949", 413 | "metadata": {}, 414 | "outputs": [], 415 | "source": [ 416 | "#Now we will create the AdMarker URL to use with AWS Elemental MediaTailor. 
\n", 417 | "#Lets first copy the placeholder URL available in our github repo which has a pre-roll, mid-roll and post-roll segments filled in\n", 418 | "ad_rawurl = pd.read_csv('media-content/adserver.csv', header=None).at[0,0].split('&')\n", 419 | "ad_rawurl" 420 | ] 421 | }, 422 | { 423 | "cell_type": "code", 424 | "execution_count": null, 425 | "id": "82870766", 426 | "metadata": {}, 427 | "outputs": [], 428 | "source": [ 429 | "ad_formattedurl = ''\n", 430 | "for x in ad_rawurl:\n", 431 | " if 'cmsid' in x:\n", 432 | " x = advalue[1]\n", 433 | " if 'vid' in x:\n", 434 | " x = advalue[2]\n", 435 | " ad_formattedurl += x + '&'\n", 436 | " \n", 437 | "ad_formattedurl = ad_formattedurl.rstrip('&')\n", 438 | "ad_formattedurl" 439 | ] 440 | }, 441 | { 442 | "cell_type": "markdown", 443 | "id": "84197e06", 444 | "metadata": {}, 445 | "source": [ 446 | "## Resume from Creating AWS Elemental MediaTailor Configuration section in Chapter 8 of the book" 447 | ] 448 | } 449 | ], 450 | "metadata": { 451 | "kernelspec": { 452 | "display_name": "conda_python3", 453 | "language": "python", 454 | "name": "conda_python3" 455 | }, 456 | "language_info": { 457 | "codemirror_mode": { 458 | "name": "ipython", 459 | "version": 3 460 | }, 461 | "file_extension": ".py", 462 | "mimetype": "text/x-python", 463 | "name": "python", 464 | "nbconvert_exporter": "python", 465 | "pygments_lexer": "ipython3", 466 | "version": "3.6.13" 467 | } 468 | }, 469 | "nbformat": 4, 470 | "nbformat_minor": 5 471 | } 472 | -------------------------------------------------------------------------------- /Chapter 08/media-content/ad-index.csv: -------------------------------------------------------------------------------- 1 | insight,cmsid=496,vid=short_onecue 2 | automate,cmsid=176,vid=short_tencue 3 | text,cmsid=496,vid=short_onecue 4 | content,cmsid=176,vid=short_tencue 5 | result,cmsid=496,vid=short_onecue 6 | infrastructure,cmsid=176,vid=short_tencue 7 | compute,cmsid=496,vid=short_onecue 8 | document,cmsid=176,vid=short_tencue -------------------------------------------------------------------------------- /Chapter 08/media-content/adserver.csv: -------------------------------------------------------------------------------- 1 | https://pubads.g.doubleclick.net/gampad/ads?sz=640x480&iu=/124319096/external/ad_rule_samples&ciu_szs=300x250&ad_rule=1&impl=s&gdfp_req=1&env=vp&output=vmap&unviewed_position_start=1&cust_params=deployment%3Ddevsite%26sample_ar%3Dpremidpost&cmsid=&vid=&correlator=[avail.random] -------------------------------------------------------------------------------- /Chapter 08/media-content/bank-demo-prem-ranga.mp4: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 08/media-content/bank-demo-prem-ranga.mp4 -------------------------------------------------------------------------------- /Chapter 09/compact_nx.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 |
6 |

7 |
8 | 9 | 11 | 12 | 29 | 30 | 31 | 32 | 33 |
34 | 35 | 36 | 107 | 108 | -------------------------------------------------------------------------------- /Chapter 09/events_graph.py: -------------------------------------------------------------------------------- 1 | """ 2 | Helper functions and constants for Comprehend Events semantic network graphing. 3 | """ 4 | 5 | from collections import Counter 6 | from matplotlib import cm, colors 7 | import networkx as nx 8 | from pyvis.network import Network 9 | 10 | 11 | ENTITY_TYPES = ['DATE', 'FACILITY', 'LOCATION', 'MONETARY_VALUE', 'ORGANIZATION', 12 | 'PERSON', 'PERSON_TITLE', 'QUANTITY', 'STOCK_CODE'] 13 | 14 | TRIGGER_TYPES = ['BANKRUPTCY', 'EMPLOYMENT', 'CORPORATE_ACQUISITION', 15 | 'INVESTMENT_GENERAL', 'CORPORATE_MERGER', 'IPO', 'RIGHTS_ISSUE', 16 | 'SECONDARY_OFFERING', 'SHELF_OFFERING', 'TENDER_OFFERING', 'STOCK_SPLIT'] 17 | 18 | PROPERTY_MAP = { 19 | "event": {"size": 10, "shape": "box", "color": "#dbe3e5"}, 20 | "entity_group": {"size": 6, "shape": "dot", "color": "#776d8a"}, 21 | "entity": {"size": 4, "shape": "square", "color": "#f3e6e3"}, 22 | "trigger": {"size": 4, "shape": "diamond", "color": "#f3e6e3"} 23 | } 24 | 25 | def get_color_map(tags): 26 | spectral = cm.get_cmap("Spectral", len(tags)) 27 | tag_colors = [colors.rgb2hex(spectral(i)) for i in range(len(tags))] 28 | color_map = dict(zip(*(tags, tag_colors))) 29 | color_map.update({'ROLE': 'grey'}) 30 | return color_map 31 | 32 | COLOR_MAP = get_color_map(ENTITY_TYPES + TRIGGER_TYPES) 33 | COLOR_MAP['ROLE'] = "grey" 34 | 35 | IFRAME_DIMS = ("600", "800") 36 | 37 | 38 | def get_canonical_mention(mentions, method="longest"): 39 | extents = enumerate([m['Text'] for m in mentions]) 40 | if method == "longest": 41 | name = sorted(extents, key=lambda x: len(x[1])) 42 | elif method == "most_common": 43 | name = [Counter(extents).most_common()[0][0]] 44 | else: 45 | name = [list(extents)[0]] 46 | return [mentions[name[-1][0]]] 47 | 48 | 49 | def get_nodes_and_edges( 50 | result, node_types=['event', 'trigger', 'entity_group', 'entity'], thr=0.0 51 | ): 52 | """Convert results to (nodelist, edgelist) depending on specified entity types.""" 53 | nodes = [] 54 | edges = [] 55 | event_nodes = [] 56 | entity_nodes = [] 57 | entity_group_nodes = [] 58 | trigger_nodes = [] 59 | 60 | # Nodes are (id, type, tag, score, mention_type) tuples. 
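    # For illustration only (made-up values, not from an actual Comprehend Events response):
    #   an event node could look like   ("ev0", "CORPORATE_ACQUISITION", "CORPORATE_ACQUISITION", 0.98, "event")
    #   an entity mention node like     ("gr0-en1", "ORGANIZATION", "Amazon", 0.99, "entity")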
61 | if 'event' in node_types: 62 | event_nodes = [ 63 | ( 64 | "ev%d" % i, 65 | t['Type'], 66 | t['Type'], 67 | t['Score'], 68 | "event" 69 | ) 70 | for i, e in enumerate(result['Events']) 71 | for t in e['Triggers'][:1] 72 | if t['GroupScore'] > thr 73 | ] 74 | nodes.extend(event_nodes) 75 | 76 | if 'trigger' in node_types: 77 | trigger_nodes = [ 78 | ( 79 | "ev%d-tr%d" % (i, j), 80 | t['Type'], 81 | t['Text'], 82 | t['Score'], 83 | "trigger" 84 | ) 85 | for i, e in enumerate(result['Events']) 86 | for j, t in enumerate(e['Triggers']) 87 | if t['Score'] > thr 88 | ] 89 | trigger_nodes = list({t[1:3]: t for t in trigger_nodes}.values()) 90 | nodes.extend(trigger_nodes) 91 | 92 | if 'entity_group' in node_types: 93 | entity_group_nodes = [ 94 | ( 95 | "gr%d" % i, 96 | m['Type'], 97 | m['Text'] if 'entity' not in node_types else m['Type'], 98 | m['Score'], 99 | "entity_group" 100 | ) 101 | for i, e in enumerate(result['Entities']) 102 | for m in get_canonical_mention(e['Mentions']) 103 | if m['GroupScore'] > thr 104 | ] 105 | nodes.extend(entity_group_nodes) 106 | 107 | if 'entity' in node_types: 108 | entity_nodes = [ 109 | ( 110 | "gr%d-en%d" % (i, j), 111 | m['Type'], 112 | m['Text'], 113 | m['Score'], 114 | "entity" 115 | ) 116 | for i, e in enumerate(result['Entities']) 117 | for j, m in enumerate(e['Mentions']) 118 | if m['Score'] > thr 119 | ] 120 | entity_nodes = list({t[1:3]: t for t in entity_nodes}.values()) 121 | nodes.extend(entity_nodes) 122 | 123 | # Edges are (trigger_id, node_id, role, score, type) tuples. 124 | if event_nodes and entity_group_nodes: 125 | edges.extend([ 126 | ("ev%d" % i, "gr%d" % a['EntityIndex'], a['Role'], a['Score'], "argument") 127 | for i, e in enumerate(result['Events']) 128 | for j, a in enumerate(e['Arguments']) 129 | #if a['Score'] > THR 130 | ]) 131 | 132 | if entity_nodes and entity_group_nodes: 133 | entity_keys = set([n[0] for n in entity_nodes]) 134 | edges.extend([ 135 | ("gr%d" % i, "gr%d-en%d" % (i, j), "", m['GroupScore'], "coref") 136 | for i, e in enumerate(result['Entities']) 137 | for j, m in enumerate(e['Mentions']) 138 | if "gr%d-en%d" % (i, j) in entity_keys 139 | if m['GroupScore'] > thr 140 | ]) 141 | 142 | if event_nodes and trigger_nodes: 143 | trigger_keys = set([n[0] for n in trigger_nodes]) 144 | edges.extend([ 145 | ("ev%d" % i, "ev%d-tr%d" % (i, j), "", a['GroupScore'], "coref") 146 | for i, e in enumerate(result['Events']) 147 | for j, a in enumerate(e['Triggers']) 148 | if "ev%d-tr%d" % (i, j) in trigger_keys 149 | if a['GroupScore'] > thr 150 | ]) 151 | 152 | return nodes, edges 153 | 154 | 155 | def build_network_graph(nodelist, edgelist, drop_isolates=True): 156 | G = nx.Graph() 157 | # Iterate over triggers and entity mentions. 
158 | for mention_id, tag, extent, score, mtype in nodelist: 159 | G.add_node( 160 | mention_id, 161 | label=extent, 162 | tag=tag, 163 | group=mtype, 164 | size=PROPERTY_MAP[mtype]['size'], 165 | color=COLOR_MAP[tag], 166 | shape=PROPERTY_MAP[mtype]['shape'] 167 | ) 168 | # Iterate over argument role assignments 169 | if edgelist: 170 | for n1_id, n2_id, role, score, etype in edgelist: 171 | label = role if etype == "argument" else "coref" 172 | G.add_edges_from( 173 | [(n1_id, n2_id)], 174 | label=role, 175 | weight=score*100, 176 | color="grey" 177 | ) 178 | # Drop mentions that don't participate in events 179 | if len(edgelist) > 0 and drop_isolates: 180 | G.remove_nodes_from(list(nx.isolates(G))) 181 | return G 182 | 183 | 184 | def plot(result, node_types, filename="nx.html", thr=0.0): 185 | nodes, edges = get_nodes_and_edges(result, node_types, thr) 186 | G = build_network_graph( 187 | nodes, edges, 188 | drop_isolates=True 189 | ) 190 | nt = Network(*IFRAME_DIMS, notebook=True, heading="") 191 | nt.from_nx(G) 192 | display(nt.show(filename)) -------------------------------------------------------------------------------- /Chapter 09/nx.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 |
6 |

7 |
8 | 9 | 11 | 12 | 29 | 30 | 31 | 32 | 33 |
34 | 35 | 36 | 107 | 108 | -------------------------------------------------------------------------------- /Chapter 09/requirements.txt: -------------------------------------------------------------------------------- 1 | ipywidgets==7.5.1 2 | networkx==2.5 3 | pandas==1.1.3 4 | pyvis==0.1.8.2 5 | spacy==2.2.4 6 | smart-open==3.0.0 -------------------------------------------------------------------------------- /Chapter 09/sample_financial_news_doc.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 09/sample_financial_news_doc.pdf -------------------------------------------------------------------------------- /Chapter 10/Reducing-localization-costs-with-machine-translation-github.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "copyrighted-memphis", 6 | "metadata": {}, 7 | "source": [ 8 | "# Reducing Localization costs and improving accuracy with Amazon Translate\n", 9 | "\n", 10 | "This is an accompanying notebook for Chapter 10 - Reducing locationlization costs and improving accuracy from the Natural Language Processing with AWS AI Services book. Please make sure to read the instructions provided in the book prior to attempting this notebook. In this chapter we will walkthrough a solution example of how to automate the translation of your web pages and save on localization costs using Amazon Translate. Organizations looking to expand internationally no longer have to implement time consuming and cost prohibitive localization projects to change their web pages, they can leverage [Amazon Translate](https://aws.amazon.com/translate/) which is a neural ML powered translation service as part of the development lifecycle to automatically convert web pages into multiple languages. We will show you how in this notebook. " 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "id": "usual-holder", 16 | "metadata": {}, 17 | "source": [ 18 | "## Input HTML Web Page\n", 19 | "\n", 20 | "For this example we will use an `About Us` HTML and Javascript page the authors created for the fictional **Family Bank**, a subsidiary of the fictional LiveRight financial organization. The page looks as shown in the cell below and is assumed to be part of an overall organizational website that has an `About Us` link leading to this page. " 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "id": "scientific-yukon", 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "# display the About Us page\n", 31 | "from IPython.display import IFrame\n", 32 | "IFrame(src='./input/aboutLRH.html', width=800, height=400)" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "id": "prostate-intent", 38 | "metadata": {}, 39 | "source": [ 40 | "#### Let us now review the HTML and Javascript source code for this page\n", 41 | "As we see below, this has a small HTML div block, and a corresponding Script block to print the current date. The Style block provides some CSS styling for our page." 
42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": null, 47 | "id": "horizontal-buying", 48 | "metadata": {}, 49 | "outputs": [], 50 | "source": [ 51 | "!pygmentize './input/aboutLRH.html'" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "id": "special-schema", 57 | "metadata": {}, 58 | "source": [ 59 | "## Prepare for Translation" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "id": "diverse-modem", 66 | "metadata": {}, 67 | "outputs": [], 68 | "source": [ 69 | "# Install the HTML parser\n", 70 | "!pip install beautifulsoup4" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": null, 76 | "id": "falling-pontiac", 77 | "metadata": {}, 78 | "outputs": [], 79 | "source": [ 80 | "html_doc = ''\n", 81 | "input_htm = './input/aboutLRH.html'\n", 82 | "with open(input_htm) as f:\n", 83 | " content = f.readlines()\n", 84 | "for i in content:\n", 85 | " html_doc += i+' '" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": null, 91 | "id": "dirty-heart", 92 | "metadata": {}, 93 | "outputs": [], 94 | "source": [ 95 | "from bs4 import BeautifulSoup\n", 96 | "soup = BeautifulSoup(html_doc, 'html.parser')" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "id": "earlier-coaching", 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "# HTML tags containing text we are interested in translating\n", 107 | "tags = ['title','h1','h2','p']" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "id": "weighted-shooting", 114 | "metadata": {}, 115 | "outputs": [], 116 | "source": [ 117 | "# Now we will extract the text content from the HTML for each tag in our tags list and load this to a new dict\n", 118 | "x_dict = {}\n", 119 | "for tag in tags:\n", 120 | " x_dict[tag] = getattr(getattr(soup, tag),'string')\n", 121 | "x_dict" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "id": "ethical-promotion", 127 | "metadata": {}, 128 | "source": [ 129 | "## Translate to target languages\n", 130 | "We will now translate the input text from English to German, Spanish, Tamil and Hindi" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": null, 136 | "id": "ecological-satisfaction", 137 | "metadata": {}, 138 | "outputs": [], 139 | "source": [ 140 | "import boto3\n", 141 | "\n", 142 | "translate = boto3.client(service_name='translate', region_name='us-east-1', use_ssl=True)\n", 143 | "out_text = {}\n", 144 | "languages = ['de','es','ta','hi']\n", 145 | "\n", 146 | "for target_lang in languages:\n", 147 | " out_dict = {}\n", 148 | " for key in x_dict:\n", 149 | " result = translate.translate_text(Text=x_dict[key], \n", 150 | " SourceLanguageCode=\"en\", TargetLanguageCode=target_lang)\n", 151 | " out_dict[key] = result.get('TranslatedText')\n", 152 | " out_text[target_lang] = out_dict\n", 153 | "\n", 154 | "print(\"German Version of Website Text\")\n", 155 | "print(\"******************************\")\n", 156 | "print(out_text['de'])\n", 157 | "print(\"******************************\")\n", 158 | "print(\"Spanish Version of Website Text\")\n", 159 | "print(\"******************************\")\n", 160 | "print(out_text['es'])\n", 161 | "print(\"******************************\")\n", 162 | "print(\"Tamil Version of Website Text\")\n", 163 | "print(\"******************************\")\n", 164 | "print(out_text['ta'])\n", 165 | "print(\"******************************\")\n", 166 | "print(\"Hindi Version of 
Website Text\")\n", 167 | "print(\"******************************\")\n", 168 | "print(out_text['hi'])\n", 169 | "print(\"******************************\")\n" 170 | ] 171 | }, 172 | { 173 | "cell_type": "markdown", 174 | "id": "difficult-palace", 175 | "metadata": {}, 176 | "source": [ 177 | "## Build webpages for translated text\n", 178 | "We will now create separate HTML web pages for each of the translated languages and display them" 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "id": "touched-champion", 184 | "metadata": {}, 185 | "source": [ 186 | "### German Webpage" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": null, 192 | "id": "actual-ballet", 193 | "metadata": {}, 194 | "outputs": [], 195 | "source": [ 196 | "web_de = soup" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": null, 202 | "id": "infrared-barbados", 203 | "metadata": {}, 204 | "outputs": [], 205 | "source": [ 206 | "web_de.title.string = out_text['de']['title']\n", 207 | "web_de.h1.string = out_text['de']['h1']\n", 208 | "web_de.h2.string = out_text['de']['h2']\n", 209 | "web_de.p.string = out_text['de']['p']" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": null, 215 | "id": "polyphonic-blast", 216 | "metadata": {}, 217 | "outputs": [], 218 | "source": [ 219 | "de_html = web_de.prettify()\n", 220 | "with open('./output/aboutLRH_DE.html','w') as de_w:\n", 221 | " de_w.write(de_html)" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": null, 227 | "id": "southeast-display", 228 | "metadata": {}, 229 | "outputs": [], 230 | "source": [ 231 | "# display the About Us page in German\n", 232 | "from IPython.display import IFrame\n", 233 | "IFrame(src='./output/aboutLRH_DE.html', width=800, height=500)" 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "id": "purple-northeast", 239 | "metadata": {}, 240 | "source": [ 241 | "### Spanish Webpage" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": null, 247 | "id": "supreme-turning", 248 | "metadata": {}, 249 | "outputs": [], 250 | "source": [ 251 | "web_es = soup\n", 252 | "web_es.title.string = out_text['es']['title']\n", 253 | "web_es.h1.string = out_text['es']['h1']\n", 254 | "web_es.h2.string = out_text['es']['h2']\n", 255 | "web_es.p.string = out_text['es']['p']" 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": null, 261 | "id": "horizontal-skirt", 262 | "metadata": {}, 263 | "outputs": [], 264 | "source": [ 265 | "es_html = web_es.prettify()\n", 266 | "with open('./output/aboutLRH_ES.html','w') as es_w:\n", 267 | " es_w.write(es_html)" 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": null, 273 | "id": "signed-clinton", 274 | "metadata": {}, 275 | "outputs": [], 276 | "source": [ 277 | "# display the About Us page in German\n", 278 | "from IPython.display import IFrame\n", 279 | "IFrame(src='./output/aboutLRH_ES.html', width=800, height=500)" 280 | ] 281 | }, 282 | { 283 | "cell_type": "markdown", 284 | "id": "important-sample", 285 | "metadata": {}, 286 | "source": [ 287 | "### Hindi Webpage" 288 | ] 289 | }, 290 | { 291 | "cell_type": "code", 292 | "execution_count": null, 293 | "id": "upset-corner", 294 | "metadata": {}, 295 | "outputs": [], 296 | "source": [ 297 | "web_hi = soup\n", 298 | "web_hi.title.string = out_text['hi']['title']\n", 299 | "web_hi.h1.string = out_text['hi']['h1']\n", 300 | "web_hi.h2.string = out_text['hi']['h2']\n", 301 | 
"web_hi.p.string = out_text['hi']['p']" 302 | ] 303 | }, 304 | { 305 | "cell_type": "code", 306 | "execution_count": null, 307 | "id": "subtle-demand", 308 | "metadata": {}, 309 | "outputs": [], 310 | "source": [ 311 | "hi_html = web_hi.prettify()\n", 312 | "with open('./output/aboutLRH_HI.html','w') as hi_w:\n", 313 | " hi_w.write(hi_html)" 314 | ] 315 | }, 316 | { 317 | "cell_type": "code", 318 | "execution_count": null, 319 | "id": "dressed-kennedy", 320 | "metadata": {}, 321 | "outputs": [], 322 | "source": [ 323 | "# display the About Us page in German\n", 324 | "from IPython.display import IFrame\n", 325 | "IFrame(src='./output/aboutLRH_HI.html', width=800, height=500)" 326 | ] 327 | }, 328 | { 329 | "cell_type": "markdown", 330 | "id": "connected-facing", 331 | "metadata": {}, 332 | "source": [ 333 | "### Tamil Webpage" 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": null, 339 | "id": "copyrighted-nation", 340 | "metadata": {}, 341 | "outputs": [], 342 | "source": [ 343 | "web_ta = soup\n", 344 | "web_ta.title.string = out_text['ta']['title']\n", 345 | "web_ta.h1.string = out_text['ta']['h1']\n", 346 | "web_ta.h2.string = out_text['ta']['h2']\n", 347 | "web_ta.p.string = out_text['ta']['p']" 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": null, 353 | "id": "loved-directive", 354 | "metadata": {}, 355 | "outputs": [], 356 | "source": [ 357 | "ta_html = web_ta.prettify()\n", 358 | "with open('./output/aboutLRH_TA.html','w') as ta_w:\n", 359 | " ta_w.write(ta_html)" 360 | ] 361 | }, 362 | { 363 | "cell_type": "code", 364 | "execution_count": null, 365 | "id": "outdoor-pound", 366 | "metadata": {}, 367 | "outputs": [], 368 | "source": [ 369 | "# display the About Us page in German\n", 370 | "from IPython.display import IFrame\n", 371 | "IFrame(src='./output/aboutLRH_TA.html', width=800, height=500)" 372 | ] 373 | }, 374 | { 375 | "cell_type": "markdown", 376 | "id": "municipal-disaster", 377 | "metadata": {}, 378 | "source": [ 379 | "## End of Notebook\n", 380 | "Please return back to the book to continue reading from there" 381 | ] 382 | } 383 | ], 384 | "metadata": { 385 | "kernelspec": { 386 | "display_name": "conda_python3", 387 | "language": "python", 388 | "name": "conda_python3" 389 | }, 390 | "language_info": { 391 | "codemirror_mode": { 392 | "name": "ipython", 393 | "version": 3 394 | }, 395 | "file_extension": ".py", 396 | "mimetype": "text/x-python", 397 | "name": "python", 398 | "nbconvert_exporter": "python", 399 | "pygments_lexer": "ipython3", 400 | "version": "3.6.10" 401 | } 402 | }, 403 | "nbformat": 4, 404 | "nbformat_minor": 5 405 | } 406 | -------------------------------------------------------------------------------- /Chapter 10/input/aboutLRH.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Live Well with LiveRight 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |

Family Bank Holdings

14 |

Date:

15 |
16 |
17 |

Who we are and what we do

18 |

A wholly owned subsidiary of LiveRight, we are the nation's largest bank for SMB owners and cooperative societies, with more than 4500 branches spread across the nation, servicing more than 5 million customers and continuing to grow. 19 | We offer a number of lending products to our customers including checking and savings accounts, lending, credit cards, deposits, insurance, IRA and more. Started in 1787 as a family owned business providing low interest loans for farmers struggling with poor harvests, LiveRight helped these farmers design long distance water channels from lakes in neighboring districts 20 | to their lands. The initial success helped these farmers invest their wealth in LiveRight and later led to our cooperative range of products that allowed farmers to own a part of LiveRight. 21 | In 1850 we moved our HeadQuarters to New York city to help build the economy of our nation by providing low interest lending products to small to medium business owners looking to start or expand their business. 22 | From 2 branches then to 4500 branches today, the trust of our customers helped us grow to become the nation's largest SMB bank.

23 |

24 |
25 |
26 | 35 | 36 | 92 | 93 | -------------------------------------------------------------------------------- /Chapter 10/output/aboutLRH_DE.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | Lebe gut mit LiveRight 6 | 7 | 8 | 10 | 12 | 14 | 16 | 18 | 19 | 20 |

21 | Beteiligungen für Familienbanken 22 |

23 |

24 | Date: 25 | 26 | 27 |

28 |
29 |
30 |

31 | Wer wir sind und was wir machen 32 |

33 |

34 |

35 | Als hundertprozentige Tochtergesellschaft von Liveright sind wir die größte Bank des Landes für SMB-Eigentümer und Genossenschaftsgesellschaften mit mehr als 4500 Filialen im ganzen Land, betreuen mehr als 5 Millionen Kunden und wachsen weiter. 36 | Wir bieten unseren Kunden eine Reihe von Kreditprodukten an, darunter Scheck- und Sparkonten, Kredite, Kreditkarten, Einlagen, Versicherungen, IRA und mehr. LiveRight wurde 1787 als Familienunternehmen gegründet, das Zinsdarlehen für Landwirte, die mit schlechten Ernten zu kämpfen hatten, und half diesen Landwirten, Fernwasserkanäle von Seen in benachbarten Bezirken zu entwerfen 37 | in ihr Land. Der anfängliche Erfolg half diesen Landwirten, ihr Vermögen in Liveright zu investieren, und führte später zu unserer kooperativen Produktpalette, die es den Landwirten ermöglichte, einen Teil von LiverRight zu besitzen. 38 | Im Jahr 1850 verlegten wir unseren Hauptsitz nach New York City, um zum Aufbau der Wirtschaft unseres Landes beizutragen, indem wir kleinen bis mittleren Unternehmern, die ihr Geschäft beginnen oder ausbauen möchten, Produkte mit niedrigen Zinsen zur Verfügung stellen. 39 | Von 2 Filialen bis hin zu 4500 Filialen heute hat uns das Vertrauen unserer Kunden geholfen, zur größten KMB-Bank des Landes zu werden. 40 |

41 |

42 |
43 |
44 | 53 | 54 | 109 | 110 | -------------------------------------------------------------------------------- /Chapter 11/2019-NAR-HBS.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 11/2019-NAR-HBS.pdf -------------------------------------------------------------------------------- /Chapter 11/2020-generational-trends-report-03-05-2020.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 11/2020-generational-trends-report-03-05-2020.pdf -------------------------------------------------------------------------------- /Chapter 11/Zillow-home-buyers-report.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 11/Zillow-home-buyers-report.pdf -------------------------------------------------------------------------------- /Chapter 11/faqs.csv: -------------------------------------------------------------------------------- 1 | what is important for home buyers?,"Home resale value, ability to rent out, assigned parking, smart home capabilities, good school zones, preferred neighborhoods are all important for home buyers" 2 | what do first-time home buyers want?,First-time buyers were more interested in receiving help from their agent in determining how much they could afford than repeat buyers. More buyers of new homes (10 percent) wanted help with paperwork compared to other buyer types. Married couples wanted to negotiate the terms of sale (13 percent) more than any other household composition. Single males and unmarried couples wanted help to find the right home (both 54 percent) more than other household compositions. There were many benefits for buyers using a real estate agent with the foremost reported as being the buyer(s) receiving help in understanding the buying process (61 percent). 3 | how are increased prices impacting sellers?,Increased home prices have lowered the share of home sellers who report they delayed the sale of their home because their home was worth less than their mortgage. 4 | was internet used during the home search?,Fifty-five percent of buyers who used the internet during their home search process ultimately found the home that they purchased through the internet. Forty percent of buyers who did not use the internet during their home search process found their home through a real estate agent compared to only 28 percent of buyers who did use the internet. 5 | do buyers use real estate agents?,Buyers typically interviewed only one real estate agent before working with them and the most important factor was that the agent was honest and trustworthy. Another important factor was the agent’s experience. Recent buyers were overall very satisfied with their real estate agent’s skills and qualities and definitely would use their agent again or recommend them to others. 
-------------------------------------------------------------------------------- /Chapter 12/ch 12 automating claims processing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "id": "bdfb92d5", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "import boto3\n", 11 | "import json\n", 12 | "import boto3\n", 13 | "import re\n", 14 | "import csv\n", 15 | "import sagemaker\n", 16 | "from sagemaker import get_execution_role\n", 17 | "from sagemaker.s3 import S3Uploader, S3Downloader\n", 18 | "import uuid\n", 19 | "import time\n", 20 | "import io\n", 21 | "from io import BytesIO\n", 22 | "import sys\n", 23 | "from pprint import pprint\n", 24 | "\n", 25 | "from IPython.display import Image, display\n", 26 | "from PIL import Image as PImage, ImageDraw" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": null, 32 | "id": "9face02c", 33 | "metadata": {}, 34 | "outputs": [], 35 | "source": [ 36 | "!pip install amazon-textract-response-parser" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": null, 42 | "id": "d2f74c64", 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "role = get_execution_role()\n", 47 | "#print(\"RoleArn: {}\".format(role))\n", 48 | "\n", 49 | "sess = sagemaker.Session()\n", 50 | "bucket = sess.default_bucket()\n", 51 | "prefix = 'claims-process-textract'" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "id": "27ed57ed", 57 | "metadata": {}, 58 | "source": [ 59 | "# Valid Document" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "id": "215958fe", 66 | "metadata": {}, 67 | "outputs": [], 68 | "source": [ 69 | "# Document\n", 70 | "documentName = \"validmedicalform.png\"\n", 71 | "\n", 72 | "display(Image(filename=documentName))" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": null, 78 | "id": "bc6c1b06", 79 | "metadata": {}, 80 | "outputs": [], 81 | "source": [ 82 | "# process using image bytes\n", 83 | "def calltextract(documentName): \n", 84 | " client = boto3.client(service_name='textract',\n", 85 | " region_name= 'us-east-1',\n", 86 | " endpoint_url='https://textract.us-east-1.amazonaws.com')\n", 87 | "\n", 88 | " with open(documentName, 'rb') as file:\n", 89 | " img_test = file.read()\n", 90 | " bytes_test = bytearray(img_test)\n", 91 | " print('Image loaded', documentName)\n", 92 | "\n", 93 | " # process using image bytes\n", 94 | " response = client.analyze_document(Document={'Bytes': bytes_test}, FeatureTypes=['FORMS'])\n", 95 | "\n", 96 | " return response" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "id": "c8fad04d", 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "response= calltextract(documentName)\n", 107 | "print(response)" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "id": "a2a17fe9", 114 | "metadata": {}, 115 | "outputs": [], 116 | "source": [ 117 | "#Extract key values\n", 118 | "# Iterate over elements in the document\n", 119 | "from trp import Document\n", 120 | "def getformkeyvalue(response):\n", 121 | " doc = Document(response)\n", 122 | " #print(doc)\n", 123 | " key_map = {}\n", 124 | " for page in doc.pages:\n", 125 | " # Print fields\n", 126 | " for field in page.form.fields:\n", 127 | " if field is None or field.key is None or field.value is None:\n", 128 | " continue\n", 129 | " #print(\"Field: Key: 
{}, Value: {}\".format(field.key.text, field.value.text))\n", 130 | " key_map[field.key.text] = field.value.text\n", 131 | " return key_map" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": null, 137 | "id": "dd2c5e4f", 138 | "metadata": {}, 139 | "outputs": [], 140 | "source": [ 141 | "get_form_keys = getformkeyvalue(response)\n", 142 | "print(get_form_keys)" 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "id": "15c6f940", 148 | "metadata": {}, 149 | "source": [ 150 | "# Check for validation using business rules\n", 151 | "Checking if claim Id is 12 digit and zip code is digit" 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": null, 157 | "id": "4fc3276e", 158 | "metadata": {}, 159 | "outputs": [], 160 | "source": [ 161 | "def validate(body):\n", 162 | " json_acceptable_string = body.replace(\"'\", \"\\\"\")\n", 163 | " json_data = json.loads(json_acceptable_string)\n", 164 | " print(json_data)\n", 165 | " zip = json_data['ZIP CODE']\n", 166 | " id = json_data['ID NUMBER']\n", 167 | "\n", 168 | " if(not zip.strip().isdigit()):\n", 169 | " return False, id, \"Zip code invalid\"\n", 170 | " length = len(id.strip())\n", 171 | " if(length != 12):\n", 172 | " return False, id, \"Invalid claim Id\"\n", 173 | " return True, id, \"Ok\"" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": null, 179 | "id": "3fb2f299", 180 | "metadata": {}, 181 | "outputs": [], 182 | "source": [ 183 | " # Validate \n", 184 | "textract_json= json.dumps(get_form_keys,indent=2)\n", 185 | "res, formid, result = validate(textract_json)\n", 186 | "print(result)\n", 187 | "print(formid)" 188 | ] 189 | }, 190 | { 191 | "cell_type": "markdown", 192 | "id": "2fbe1b4c", 193 | "metadata": {}, 194 | "source": [ 195 | "# Valid Medical Intake Form send to Comprehend medical to gain insights" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": null, 201 | "id": "eb11d4fa", 202 | "metadata": {}, 203 | "outputs": [], 204 | "source": [ 205 | "comprehend = boto3.client(service_name='comprehendmedical')\n" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": null, 211 | "id": "333ef20f", 212 | "metadata": {}, 213 | "outputs": [], 214 | "source": [ 215 | "# Detect medical entities\n", 216 | "cm_json_data = comprehend.detect_entities_v2(Text=textract_json)\n", 217 | "print(\"\\nMedical Entities\\n========\")\n", 218 | "for entity in cm_json_data[\"Entities\"]:\n", 219 | " print(\"- {}\".format(entity[\"Text\"]))\n", 220 | " print (\" Type: {}\".format(entity[\"Type\"]))\n", 221 | " print (\" Category: {}\".format(entity[\"Category\"]))\n", 222 | " if(entity[\"Traits\"]):\n", 223 | " print(\" Traits:\")\n", 224 | " for trait in entity[\"Traits\"]:\n", 225 | " print (\" - {}\".format(trait[\"Name\"]))\n", 226 | " print(\"\\n\")" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "id": "c38b1101", 232 | "metadata": {}, 233 | "source": [ 234 | "Writing entities to CSV File" 235 | ] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": null, 240 | "id": "30f4d931", 241 | "metadata": {}, 242 | "outputs": [], 243 | "source": [ 244 | "\n", 245 | "def printtocsv(cm_json_data,formid): \n", 246 | " entities = cm_json_data['Entities']\n", 247 | " TEMP_FILE = 'cmresult.csv'\n", 248 | " with open(TEMP_FILE, 'w') as csvfile: # 'w' will truncate the file\n", 249 | " filewriter = csv.writer(csvfile, delimiter=',',\n", 250 | " quotechar='|', quoting=csv.QUOTE_MINIMAL)\n", 251 | " 
filewriter.writerow([ 'ID','Category', 'Type', 'Text'])\n", 252 | "        for entity in entities:\n", 253 | "            filewriter.writerow([formid, entity['Category'], entity['Type'], entity['Text']])\n", 254 | "\n", 255 | "    filename = \"procedureresult/\" + formid + \".csv\"\n", 256 | "\n", 257 | "    \n", 258 | "    S3Uploader.upload(TEMP_FILE, 's3://{}/{}'.format(bucket, prefix))\n", 259 | "    print(\"successfully parsed:\" + filename)" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": null, 265 | "id": "f91792f9", 266 | "metadata": {}, 267 | "outputs": [], 268 | "source": [ 269 | "printtocsv(cm_json_data,formid)" 270 | ] 271 | }, 272 | { 273 | "cell_type": "markdown", 274 | "id": "53cae2b2", 275 | "metadata": {}, 276 | "source": [ 277 | "# Invalid Claim" 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": null, 283 | "id": "a15b8ea4", 284 | "metadata": {}, 285 | "outputs": [], 286 | "source": [ 287 | "InvalidDocument = \"invalidmedicalform.png\"\n", 288 | "\n", 289 | "display(Image(filename=InvalidDocument))" 290 | ] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "execution_count": null, 295 | "id": "2b7e3801", 296 | "metadata": {}, 297 | "outputs": [], 298 | "source": [ 299 | "response = calltextract(InvalidDocument)" 300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": null, 305 | "id": "10d16911", 306 | "metadata": {}, 307 | "outputs": [], 308 | "source": [ 309 | "get_form_keys = getformkeyvalue(response)\n", 310 | "print(get_form_keys)" 311 | ] 312 | }, 313 | { 314 | "cell_type": "code", 315 | "execution_count": null, 316 | "id": "e0b0ca01", 317 | "metadata": {}, 318 | "outputs": [], 319 | "source": [ 320 | " # Validate the invalid document \n", 321 | "textract_json= json.dumps(get_form_keys,indent=2)\n", 322 | "res, formid, result = validate(textract_json)\n", 323 | "print(result)\n", 324 | "print(formid)\n", 325 | "print(res)" 326 | ] 327 | }, 328 | { 329 | "cell_type": "markdown", 330 | "id": "8bf1dfc9", 331 | "metadata": {}, 332 | "source": [ 333 | "# Notify stakeholders that it is invalid" 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": null, 339 | "id": "18d9dbab", 340 | "metadata": {}, 341 | "outputs": [], 342 | "source": [ 343 | "sns = boto3.client('sns')" 344 | ] 345 | }, 346 | { 347 | "cell_type": "markdown", 348 | "id": "f0bbdb8d", 349 | "metadata": {}, 350 | "source": [ 351 | "# Go to https://console.aws.amazon.com/sns/v3/home?region=us-east-1#/homepage and create a topic as per the book's instructions" 352 | ] 353 | }, 354 | { 355 | "cell_type": "code", 356 | "execution_count": null, 357 | "id": "ed26545b", 358 | "metadata": {}, 359 | "outputs": [], 360 | "source": [ 361 | "topicARN=\"\"" 362 | ] 363 | }, 364 | { 365 | "cell_type": "code", 366 | "execution_count": null, 367 | "id": "7227c6d0", 368 | "metadata": {}, 369 | "outputs": [], 370 | "source": [ 371 | "snsbody = \"Content:\" + str(textract_json) + \"Reason:\" + str(result)\n", 372 | "print(snsbody)" 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": null, 378 | "id": "0a1c110f", 379 | "metadata": {}, 380 | "outputs": [], 381 | "source": [ 382 | "try:\n", 383 | "    response = sns.publish(\n", 384 | "    TargetArn = topicARN,\n", 385 | "    Message= snsbody\n", 386 | "    )\n", 387 | "    print(response)\n", 388 | "except Exception as e:\n", 389 | "    print(\"Failed while publishing the SNS notification\")\n", 390 | "    print(e)\n" 391 | ] 392 | }, 393 | { 394 | "cell_type": "markdown", 395 | "id": "01eabab9", 396 | "metadata": {}, 397 | "source": [ 398 | 
"# Check your email for notification" 399 | ] 400 | }, 401 | { 402 | "cell_type": "markdown", 403 | "id": "7935053e", 404 | "metadata": {}, 405 | "source": [ 406 | "# Clean UP\n", 407 | "Delete the topic you created from Console https://console.aws.amazon.com/sns/v3/home?region=us-east-1#/topic/" 408 | ] 409 | }, 410 | { 411 | "cell_type": "markdown", 412 | "id": "70c96e49", 413 | "metadata": {}, 414 | "source": [ 415 | "Delete the Amazon s3 bucket and the files in the buckethttps://docs.aws.amazon.com/AmazonS3/latest/userguide/delete-bucket.html" 416 | ] 417 | } 418 | ], 419 | "metadata": { 420 | "kernelspec": { 421 | "display_name": "conda_python3", 422 | "language": "python", 423 | "name": "conda_python3" 424 | }, 425 | "language_info": { 426 | "codemirror_mode": { 427 | "name": "ipython", 428 | "version": 3 429 | }, 430 | "file_extension": ".py", 431 | "mimetype": "text/x-python", 432 | "name": "python", 433 | "nbconvert_exporter": "python", 434 | "pygments_lexer": "ipython3", 435 | "version": "3.6.13" 436 | } 437 | }, 438 | "nbformat": 4, 439 | "nbformat_minor": 5 440 | } 441 | -------------------------------------------------------------------------------- /Chapter 12/invalidmedicalform.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 12/invalidmedicalform.png -------------------------------------------------------------------------------- /Chapter 12/validmedicalform.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 12/validmedicalform.png -------------------------------------------------------------------------------- /Chapter 13/chapter13 Improving accuracy of document processing .ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Textract Anazlyze API Invoke with human in the loop" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "import boto3\n", 17 | "import uuid\n", 18 | "import time\n", 19 | "import re\n", 20 | "import pprint\n", 21 | "import json\n", 22 | "pp = pprint.PrettyPrinter(indent=4)" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": null, 28 | "metadata": {}, 29 | "outputs": [], 30 | "source": [ 31 | "# Amazon Textract client\n", 32 | "textract = boto3.client('textract')\n", 33 | "\n", 34 | "# Amazon S3 client \n", 35 | "s3 = boto3.client('s3')" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": null, 41 | "metadata": {}, 42 | "outputs": [], 43 | "source": [ 44 | "humanLoopName = str(uuid.uuid4())" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "Enter the name of the S3 bucket you craeted and uplaoded the Samplecheck document" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "bucket=\"\"" 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 | "source": [ 67 | "Enter the flow definition ARN or human review workflow arn\n", 68 | "by copying the arn from Console 
https://console.aws.amazon.com/a2i/home?region=us-east-1#/human-review-workflows\n", 69 | "    \n" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": null, 75 | "metadata": {}, 76 | "outputs": [], 77 | "source": [ 78 | "humanLoopConfig = {\n", 79 | "    'FlowDefinitionArn':\"\",\n", 80 | "    'HumanLoopName':humanLoopName, \n", 81 | "    'DataAttributes': { 'ContentClassifiers': [ 'FreeOfPersonallyIdentifiableInformation' ]}\n", 82 | "}" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "Enter the bucket name in Bucket and the sample document name in Name" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": null, 95 | "metadata": {}, 96 | "outputs": [], 97 | "source": [ 98 | "response = textract.analyze_document(\n", 99 | "        Document={'S3Object': {'Bucket': bucket, 'Name': \"samplecheck.PNG\"}},\n", 100 | "        FeatureTypes=[\"FORMS\"], \n", 101 | "        HumanLoopConfig=humanLoopConfig\n", 102 | "    )" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": null, 108 | "metadata": {}, 109 | "outputs": [], 110 | "source": [ 111 | "print(response)" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "Paste the workteam ARN below, which you copied when creating the private workteam, or \n", 119 | "you can find it at this link: https://console.aws.amazon.com/sagemaker/groundtruth?region=us-east-1#/labeling-workforces" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": null, 125 | "metadata": {}, 126 | "outputs": [], 127 | "source": [ 128 | "WORKTEAM_ARN= \"\"" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "metadata": {}, 135 | "outputs": [], 136 | "source": [ 137 | "# Amazon SageMaker client\n", 138 | "sagemaker = boto3.client('sagemaker')\n", 139 | "\n", 140 | "# Amazon Augmented AI (A2I) client\n", 141 | "a2i = boto3.client('sagemaker-a2i-runtime')" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "metadata": {}, 148 | "outputs": [], 149 | "source": [ 150 | "\n", 151 | "\n", 152 | "workteamName = WORKTEAM_ARN[WORKTEAM_ARN.rfind('/') + 1:]\n", 153 | "print(\"Navigate to the private worker portal and do the tasks. 
Make sure you've invited yourself to your workteam!\")\n", 154 | "print('https://' + sagemaker.describe_workteam(WorkteamName=workteamName)['Workteam']['SubDomain'])\n", 155 | "\n" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": null, 161 | "metadata": {}, 162 | "outputs": [], 163 | "source": [ 164 | "completed_human_loops = []\n", 165 | "resp = a2i.describe_human_loop(HumanLoopName=humanLoopName)\n", 166 | "print(f'HumanLoop Name: {humanLoopName}')\n", 167 | "print(f'HumanLoop Status: {resp[\"HumanLoopStatus\"]}')\n", 168 | "#print(f'HumanLoop Output Destination: {resp[\"HumanLoopOutput\"]}')\n", 169 | "print('\\n')\n", 170 | "    \n", 171 | "if resp[\"HumanLoopStatus\"] == \"Completed\":\n", 172 | "    completed_human_loops.append(resp)" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": null, 178 | "metadata": {}, 179 | "outputs": [], 180 | "source": [ 181 | "\n", 182 | "for resp in completed_human_loops:\n", 183 | "    splitted_string = re.split('s3://' + bucket + '/', resp['HumanLoopOutput']['OutputS3Uri'])\n", 184 | "    output_bucket_key = splitted_string[1]\n", 185 | "    print(output_bucket_key)\n", 186 | "    response = s3.get_object(Bucket= bucket, Key=output_bucket_key)\n", 187 | "    content = response[\"Body\"].read()\n", 188 | "    json_output = json.loads(content)\n", 189 | "    pp.pprint(json_output)\n", 190 | "    print('\\n')\n" 191 | ] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "metadata": {}, 196 | "source": [ 197 | "# Clean Up\n", 198 | "## Delete the human review workflow: https://console.aws.amazon.com/a2i/home?region=us-east-1#/human-review-workflows\n", 199 | "    \n", 200 | "## Delete the private workforce: https://console.aws.amazon.com/sagemaker/groundtruth?region=us-east-1#/labeling-workforces/private-details/a2i-demos\n", 201 | "\n", 202 | "## Delete the Amazon S3 bucket \n" 203 | ] 204 | } 205 | ], 206 | "metadata": { 207 | "kernelspec": { 208 | "display_name": "conda_python3", 209 | "language": "python", 210 | "name": "conda_python3" 211 | }, 212 | "language_info": { 213 | "codemirror_mode": { 214 | "name": "ipython", 215 | "version": 3 216 | }, 217 | "file_extension": ".py", 218 | "mimetype": "text/x-python", 219 | "name": "python", 220 | "nbconvert_exporter": "python", 221 | "pygments_lexer": "ipython3", 222 | "version": "3.6.13" 223 | } 224 | }, 225 | "nbformat": 4, 226 | "nbformat_minor": 4 227 | } 228 | -------------------------------------------------------------------------------- /Chapter 13/samplecheck.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 13/samplecheck.PNG -------------------------------------------------------------------------------- /Chapter 13/text.py: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Chapter 14/input/sample-loan-application.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 14/input/sample-loan-application.png -------------------------------------------------------------------------------- /Chapter 14/train/entitylist.csv: -------------------------------------------------------------------------------- 1 | 
Text,Type 2 | Country:UK,PERSON 3 | Country:CAN,PERSON 4 | Country:FRA,PERSON 5 | Years:10,PERSON 6 | Years:13,PERSON 7 | Years:17,PERSON 8 | Cell Phone:(281)123 4567,PERSON 9 | Cell Phone:(345)789 0123,PERSON 10 | Cell Phone:(999)999 9999,PERSON 11 | Cell Phone:(666)999 7777,PERSON 12 | Name:Kwaku Mensah,PERSON 13 | Name:Jane Doe,PERSON 14 | Name:John Smith,PERSON 15 | TOTAL $:8000.00/month,PERSON 16 | TOTAL $:9000.00/month,PERSON 17 | TOTAL $:7000.00/month,PERSON 18 | TOTAL $:6000.00/month,PERSON 19 | Social Security Number:123 - 45 - 6789,PERSON 20 | Social Security Number:111 - 11 - 1111,PERSON 21 | Social Security Number:222 - 22 - 2222,PERSON 22 | Social Security Number:234 - 56 - 7890,PERSON 23 | Date of Birth:01 / 01 / 1953,PERSON 24 | Date of Birth:01 / 01 / 1963,PERSON 25 | Date of Birth:01 / 01 / 1966,PERSON 26 | Date of Birth:02/ 01 / 1976,PERSON 27 | Country:ABC,GHOST 28 | Country:DEFG,GHOST 29 | Country:KAFP,GHOST 30 | Country:BLAH,GHOST 31 | Years:0,GHOST 32 | Years:999,GHOST 33 | Cell Phone:147,GHOST 34 | Cell Phone:1234,GHOST 35 | Cell Phone:7777,GHOST 36 | Cell Phone:000,GHOST 37 | Name:F R,GHOST 38 | Name:R A,GHOST 39 | Name:C E,GHOST 40 | Name:Z Z,GHOST 41 | TOTAL $:90.00/month,GHOST 42 | TOTAL $:800.00/month,GHOST 43 | TOTAL $:120.00/month,GHOST 44 | TOTAL $:88/m,GHOST 45 | Social Security Number:-54-,GHOST 46 | Social Security Number:777,GHOST 47 | Social Security Number:2222,GHOST 48 | Social Security Number:090,GHOST 49 | Date of Birth:1853,GHOST 50 | Date of Birth:196,GHOST 51 | Date of Birth:1780,GHOST 52 | Date of Birth:10000,GHOST -------------------------------------------------------------------------------- /Chapter 15/cha15train.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/cha15train.png -------------------------------------------------------------------------------- /Chapter 15/chapter15retrain.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/chapter15retrain.png -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/3ba082d3b307398adaf9d55301831684.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/3ba082d3b307398adaf9d55301831684.png -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/445ea8e393ca62878d5e3a68f054a8e4.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/445ea8e393ca62878d5e3a68f054a8e4.jpg -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/Howard Bank Sample Personal Bank Statement.jpg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/Howard Bank Sample Personal Bank Statement.jpg -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 07 (1)0.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 07 (1)0.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 08 (1)0.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 08 (1)0.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 09 (1)0.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 09 (1)0.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 11 (1)0.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 11 (1)0.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 11 (1)1.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 11 (1)1.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 12 (1)0.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 12 (1)0.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 12 (1)1.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 12 (1)1.PNG -------------------------------------------------------------------------------- /Chapter 
15/documents/train/Bank Statements/bank statement template 15 (1)1.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)1.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)10.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)10.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)11.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)11.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)13.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)13.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)14.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)14.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)15.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)15.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)17.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)17.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)18.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 
15/documents/train/Bank Statements/bank statement template 15 (1)18.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)2.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)2.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)20.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)20.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)21.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)21.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)22.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)22.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)23.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)23.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)24.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)24.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)25.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)25.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)26.PNG: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)26.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)27.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)27.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)4.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)4.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)5.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)5.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)6.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)6.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)8.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)8.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)9.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 15 (1)9.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 16 (1)0.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 16 (1)0.PNG 
-------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 16 (1)1.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 16 (1)1.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 18 (1)0.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 18 (1)0.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 20 (1)0.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 20 (1)0.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 20 (1)1.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 20 (1)1.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Bank Statements/bank statement template 20 (1)2.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Bank Statements/bank statement template 20 (1)2.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/29.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/29.PNG -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/59f20ff3d6612.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/59f20ff3d6612.jpg -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/59f212b5f02df.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/59f212b5f02df.jpg 
-------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/59f213a817dd8.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/59f213a817dd8.jpg -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/59f214708f294.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/59f214708f294.jpg -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/59f214eabb325.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/59f214eabb325.jpg -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/5acf71c1d4b14.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/5acf71c1d4b14.jpg -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/5acf72145500c.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/5acf72145500c.jpg -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/5d945f376f89f101477294.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/5d945f376f89f101477294.jpg -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/5d945f854fad0046913915.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/5d945f854fad0046913915.jpg -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/adp-sample-768x946.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/adp-sample-768x946.png -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/free-horizontal-paystub-template.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/free-horizontal-paystub-template.png -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/free-long-creek-paystub-template.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/free-long-creek-paystub-template.png -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/free-magenta-paystub-template.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/free-magenta-paystub-template.png -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/free-midnight-paystub-template.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/free-midnight-paystub-template.png -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/free-shamrock-paystub-template.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/free-shamrock-paystub-template.png -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/free-sycamore-paystub-template.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/free-sycamore-paystub-template.png -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/free-veritical blue-paystub-template.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/free-veritical blue-paystub-template.png -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/free-violet-paystub-template.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/free-violet-paystub-template.png -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay 
Stubs/free-white-paystub-template.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/free-white-paystub-template.png -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/pay-stub-with-logo-768x384.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/pay-stub-with-logo-768x384.png -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/sample-pay-stub-2020-768x384.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/sample-pay-stub-2020-768x384.png -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/sample-pay-stub.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/sample-pay-stub.png -------------------------------------------------------------------------------- /Chapter 15/documents/train/Pay Stubs/taxes-sample-pay-stub-768x614.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/documents/train/Pay Stubs/taxes-sample-pay-stub-768x614.png -------------------------------------------------------------------------------- /Chapter 15/paystubsample.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 15/paystubsample.png -------------------------------------------------------------------------------- /Chapter 16/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 16/.DS_Store -------------------------------------------------------------------------------- /Chapter 16/form-s20-LRHL-registration.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 16/form-s20-LRHL-registration.pdf -------------------------------------------------------------------------------- /Chapter 16/form-s20-LRHL-registration_image.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 
16/form-s20-LRHL-registration_image.png -------------------------------------------------------------------------------- /Chapter 16/form-s20-SUBS1-registration.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 16/form-s20-SUBS1-registration.pdf -------------------------------------------------------------------------------- /Chapter 16/form-s20-SUBS1-registration_image.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 16/form-s20-SUBS1-registration_image.png -------------------------------------------------------------------------------- /Chapter 16/form-s20-SUBS2-registration.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 16/form-s20-SUBS2-registration.pdf -------------------------------------------------------------------------------- /Chapter 16/form-s20-SUBS2-registration_image.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 16/form-s20-SUBS2-registration_image.png -------------------------------------------------------------------------------- /Chapter 16/tabular-sec.liquid.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 10 | 11 | 12 |
[The HTML markup of tabular-sec.liquid.html (template lines 13-65) was rendered away in this listing; only the visible text of the worker task template survives. The template, which backs the Amazon A2I human review step in Chapter 16, contains:
- An "Instructions" panel: "Please review the SEC registration form inputs, and make corrections where appropriate."
- The original document image under the heading "Original Registration Form - Page 1".
- A correction table under the heading "Please enter your modifications below", with the columns Line Nr, Detected Text, Confidence, Change Required, Corrected Text, and Comments. Its rows are produced by the liquid loop {% for pair in task.input.document %} ... {% endfor %}, which renders {{ pair.linenr }} plus the review input fields for each detected line.]
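For orientation, below is a minimal sketch (not taken from the book's notebooks) of how a liquid worker template such as tabular-sec.liquid.html is typically registered and invoked with Amazon A2I via boto3. The resource names, ARNs, bucket path, and the exact keys inside task.input.document (linenr, detectedtext, confidence) are illustrative assumptions, since the template's field names were lost in this rendering.

```python
# Hypothetical sketch: wire a liquid worker template into an A2I human review loop.
# All names, ARNs, and input keys below are placeholders, not values from this repo.
import json
import boto3

sagemaker = boto3.client('sagemaker')
a2i = boto3.client('sagemaker-a2i-runtime')

# 1. Register the liquid template as a human task UI.
with open('tabular-sec.liquid.html') as f:
    template_html = f.read()

ui = sagemaker.create_human_task_ui(
    HumanTaskUiName='sec-registration-review-ui',  # assumed name
    UiTemplate={'Content': template_html},
)

# 2. A flow definition ties the UI to a work team and an S3 output location.
#    (It must reach Active status before human loops can be started.)
flow = sagemaker.create_flow_definition(
    FlowDefinitionName='sec-registration-review-flow',                 # assumed name
    RoleArn='arn:aws:iam::123456789012:role/A2IExecutionRole',         # placeholder
    HumanLoopConfig={
        'WorkteamArn': 'arn:aws:sagemaker:us-east-1:123456789012:workteam/private-crowd/my-team',  # placeholder
        'HumanTaskUiArn': ui['HumanTaskUiArn'],
        'TaskTitle': 'Review SEC registration form lines',
        'TaskDescription': 'Correct the text detected by Amazon Textract',
        'TaskCount': 1,
    },
    OutputConfig={'S3OutputPath': 's3://my-bucket/a2i-output'},        # placeholder
)

# 3. Start a human loop; the template's task.input.document maps to this list.
human_loop_input = {
    'document': [
        {'linenr': 1, 'detectedtext': 'FORM S-20', 'confidence': 99.1},
        {'linenr': 2, 'detectedtext': 'REGISTRATION STATEMENT', 'confidence': 97.4},
    ]
}
a2i.start_human_loop(
    HumanLoopName='sec-review-loop-001',            # assumed name
    FlowDefinitionArn=flow['FlowDefinitionArn'],
    HumanLoopInput={'InputContent': json.dumps(human_loop_input)},
)
```

Reviewers then see one table row per entry in `document`, and their corrections are written as JSON to the flow definition's S3 output path.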
-------------------------------------------------------------------------------- /Chapter 17/chapter17-deriving-insights-from-handwritten-content-forGitHub.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "26ba00e7", 6 | "metadata": {}, 7 | "source": [ 8 | "# Deriving insights from handwritten content using Amazon Textract and Amazon Quicksight\n", 9 | "\n", 10 | "This notebook is an accompanying utility for `Chapter 17 - Deriving insights from handwritten content` from the PACKT book **Natural Language Processing with AWS AI Services**. Please read the chapter and the instructions before trying this notebook. " 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": null, 16 | "id": "be3568f5", 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "# STEP 0 - CELL 1\n", 21 | "import boto3\n", 22 | "import json\n", 23 | "import csv\n", 24 | "import os\n", 25 | "\n", 26 | "infile = 'qsmani-raw.json'\n", 27 | "outfile = 'qsmani-formatted.json'\n", 28 | "bucket = ''\n", 29 | "prefix = 'chapter17' # change this prefix if you like" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "id": "5ae09b21", 35 | "metadata": {}, 36 | "source": [ 37 | "### Update QuickSight Manifest\n", 38 | "We will replace the S3 bucket and prefix from the raw manifest file with what you have entered in STEP 0 - CELL 1 above. We will then create a new formatted manifest file that will be used for creating a dataset with [Amazon QuickSight](https://aws.amazon.com/quicksight/) based on the content we extract from the handwritten documents." 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": null, 44 | "id": "c2b074b2", 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "# STEP 1 - CELL 1\n", 49 | "import json\n", 50 | "manifest = open(infile,'r')\n", 51 | "ln = json.load(manifest)\n", 52 | "t = json.dumps(ln['fileLocations'][0]['URIPrefixes'])\n", 53 | "t = t.replace('bucket',bucket).replace('prefix',prefix)\n", 54 | "ln['fileLocations'][0]['URIPrefixes'] = json.loads(t)\n", 55 | "with open(outfile,'w', encoding='utf-8') as out:\n", 56 | " json.dump(ln,out, ensure_ascii=False, indent=4)" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "id": "3bf01e5d", 63 | "metadata": {}, 64 | "outputs": [], 65 | "source": [ 66 | "# STEP 1 - CELL 2\n", 67 | "s3 = boto3.client('s3')\n", 68 | "s3.upload_file(outfile,bucket,prefix+'/'+outfile)\n", 69 | "print(\"Manifest file uploaded to: s3://{}/{}\".format(bucket,prefix+'/'+outfile))" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "id": "251c7992", 75 | "metadata": {}, 76 | "source": [ 77 | "### Extract handwritten content using Textract\n", 78 | "In this section, we will install the [Amazon Textract Response Parser](https://github.com/aws-samples/amazon-textract-response-parser/blob/master/src-python/README.md), use the [Amazon Textract boto3 library](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html) to detect text from our handwritten images, and upload the contents into a CSV file which will be stored in your [Amazon S3 bucket](https://aws.amazon.com/s3/)" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "id": "246cbfa6", 85 | "metadata": {}, 86 | "outputs": [], 87 | "source": [ 88 | "#STEP 2 - CELL 1\n", 89 | "!python -m pip install amazon-textract-response-parser" 90 | ] 91 | }, 92 | { 93 | 
"cell_type": "code", 94 | "execution_count": null, 95 | "id": "e0b2a18f", 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [ 99 | "# STEP 2 - CELL 2\n", 100 | "from trp import Document\n", 101 | "textract = boto3.client('textract')" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": null, 107 | "id": "4807cf21", 108 | "metadata": {}, 109 | "outputs": [], 110 | "source": [ 111 | "# STEP 2 - CELL 3\n", 112 | "for docs in os.listdir('.'):\n", 113 | " if docs.endswith('jpg'):\n", 114 | " with open(docs, 'rb') as img:\n", 115 | " img_test = img.read()\n", 116 | " bytes_test = bytearray(img_test)\n", 117 | " print('Extracted text from ', docs)\n", 118 | " response = textract.analyze_document(Document={'Bytes': bytes_test}, FeatureTypes=['TABLES','FORMS'])\n", 119 | " text = Document(response)\n", 120 | " for page in text.pages:\n", 121 | " for table in page.tables:\n", 122 | " csvout = docs.replace('jpg','csv')\n", 123 | " with open(csvout, 'w', newline='') as csvf:\n", 124 | " tab = csv.writer(csvf, delimiter=',')\n", 125 | " for r, row in enumerate(table.rows):\n", 126 | " csvrow = []\n", 127 | " for c, cell in enumerate(row.cells):\n", 128 | " if cell.text:\n", 129 | " csvrow.append(cell.text.replace('$','').rstrip())\n", 130 | " tab.writerow(csvrow)\n", 131 | " s3.upload_file(csvout,bucket,prefix+'/dashboard/'+csvout)\n", 132 | " print(\"CSV file for document {} uploaded to: s3://{}/{}\".format(docs,bucket,prefix+'/dashboard/'+csvout))" 133 | ] 134 | }, 135 | { 136 | "cell_type": "markdown", 137 | "id": "91f606cd", 138 | "metadata": {}, 139 | "source": [ 140 | "### CONCLUSION\n", 141 | "That concludes the steps for the notebook. Please continue to follow the instructions from Chapter 17 in the book to understand how you can visualize and generate insights from your handwritten content using **[Amazon QuickSight](https://aws.amazon.com/quicksight/)**." 
142 | ] 143 | } 144 | ], 145 | "metadata": { 146 | "kernelspec": { 147 | "display_name": "conda_python3", 148 | "language": "python", 149 | "name": "conda_python3" 150 | }, 151 | "language_info": { 152 | "codemirror_mode": { 153 | "name": "ipython", 154 | "version": 3 155 | }, 156 | "file_extension": ".py", 157 | "mimetype": "text/x-python", 158 | "name": "python", 159 | "nbconvert_exporter": "python", 160 | "pygments_lexer": "ipython3", 161 | "version": "3.6.13" 162 | } 163 | }, 164 | "nbformat": 4, 165 | "nbformat_minor": 5 166 | } 167 | -------------------------------------------------------------------------------- /Chapter 17/hw-receipt1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 17/hw-receipt1.jpg -------------------------------------------------------------------------------- /Chapter 17/hw-receipt2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/31dcd3bcf33043ffb280725e61d3444399a08568/Chapter 17/hw-receipt2.jpg -------------------------------------------------------------------------------- /Chapter 17/qsmani-raw.json: -------------------------------------------------------------------------------- 1 | { 2 | "fileLocations": [ 3 | { 4 | "URIPrefixes": [ 5 | "s3://bucket/prefix/dashboard" 6 | ] 7 | } 8 | ], 9 | "globalUploadSettings": { 10 | "format": "CSV", 11 | "delimiter": ",", 12 | "textqualifier": "'", 13 | "containsHeader": "true" 14 | } 15 | } 16 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 Packt 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | # Natural-Language-Processing-with-AWS-AI-Services 5 | 6 | Natural Language Processing with AWS AI Services 7 | 8 | This is the code repository for [Natural Language Processing with AWS AI Services](https://www.packtpub.com/product/natural-language-processing-with-aws-ai-services/9781801812535?utm_source=github&utm_medium=repository&utm_campaign=9781801812535), published by Packt. 9 | 10 | **Derive strategic insights from unstructured data with Amazon Textract and Amazon Comprehend** 11 | 12 | ## What is this book about? 13 | The book includes Python code examples for Amazon Textract, Amazon Comprehend, and other AWS AI services to build a variety of serverless NLP workflows at scale with little prior machine learning knowledge. Packed with real-life business examples, this book will help you to navigate a day in the life of an AWS AI specialist with ease. 14 | 15 | This book covers the following exciting features: 16 | * Automate various NLP workflows on AWS to accelerate business outcomes 17 | * Use Amazon Textract for text, tables, and handwriting recognition from images and PDF files 18 | * Gain insights from unstructured text in the form of sentiment analysis, topic modeling, and more using Amazon Comprehend 19 | * Set up end-to-end document processing pipelines to understand the role of humans in the loop 20 | * Develop NLP-based intelligent search solutions with just a few lines of code 21 | * Create both real-time and batch document processing pipelines using Python 22 | 23 | If you feel this book is for you, get your [copy](https://www.amazon.com/dp/1801812535) today! 24 | 25 | https://www.packtpub.com/ 27 | 28 | 29 | ## Instructions and Navigations 30 | All of the code is organized into folders. 31 | 32 | The code will look like the following: 33 | ``` 34 | # Define IAM role 35 | role = get_execution_role() 36 | print("RoleArn: {}".format(role)) 37 | sess = sagemaker.Session() 38 | s3BucketName = '' 39 | prefix = 'chapter5' 40 | ``` 41 | 42 | **Following is what you need for this book:** 43 | If you're an NLP developer or data scientist looking to get started with AWS AI services to implement various NLP scenarios quickly, this book is for you. It will show you how easy it is to integrate AI in applications with just a few lines of code. A basic understanding of machine learning (ML) concepts is necessary to understand the concepts covered. Experience with Jupyter notebooks and Python will be helpful. 44 | 45 | With the following software and hardware list you can run all code files present in the book (Chapter 1-18). 46 | 47 | ### Software and Hardware List 48 | 49 | | Software required | OS required | 50 | | --------------------------------------------| -----------------------------------| 51 | | Access and signing up to an AWS account | Windows, Mac OS X, and Linux (Any) | 52 | | Creating a SageMaker Jupyter Notebook | Windows, Mac OS X, and Linux (Any | 53 | | Creating an Amazon S3 bucket | Windows, Mac OS X, and Linux (Any | 54 | 55 | We also provide a PDF file that has color images of the screenshots/diagrams used in this book. [Click here to download it](https://static.packt-cdn.com/downloads/9781801812535_ColorImages.pdf). 56 | 57 | The Code in Action videos for this book can be viewed at https://bit.ly/3vPvDkj. 
58 | 59 | 60 | ### Related products 61 | * Machine Learning with Amazon SageMaker Cookbook [[Packt]](https://www.packtpub.com/product/machine-learning-with-amazon-sagemaker-cookbook/9781800567030?utm_source=github&utm_medium=repository&utm_campaign=9781800567030) [[Amazon]](https://www.amazon.com/dp/1800567030) 62 | 63 | * Amazon SageMaker Best Practices [[Packt]](https://www.packtpub.com/product/amazon-sagemaker-best-practices/9781801070522?utm_source=github&utm_medium=repository&utm_campaign=9781801070522) [[Amazon]](https://www.amazon.com/dp/1801070520) 64 | 65 | ## Get to Know the Authors 66 | **Mona M** 67 | is a senior AI/ML specialist solutions architect at AWS. She is a highly skilled IT professional, with more than 10 years' experience in software design, development, and integration across diverse work environments. As an AWS solutions architect, her role is to ensure customer success in building applications and services on the AWS platform. She is responsible for crafting a highly scalable, flexible, and resilient cloud architecture that addresses customer business problems. She has published multiple blogs on AI and NLP on the AWS AI channel along with research papers on AI-powered search solutions. 68 | 69 | **Premkumar Rangarajan** 70 | is an enterprise solutions architect, specializing in AI/ML at Amazon Web Services. He has 25 years of experience in the IT industry in a variety of roles, including delivery lead, integration specialist, and enterprise architect. He has significant architecture and management experience in delivering large-scale programs across various industries and platforms. He is passionate about helping customers solve ML and AI problems. 71 | ### Download a free PDF 72 | 73 | If you have already purchased a print or Kindle version of this book, you can get a DRM-free PDF version at no cost.
Simply click on the link to claim your free PDF.
74 | https://packt.link/free-ebook/9781801812535
--------------------------------------------------------------------------------