├── .gitignore ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── Creating_visualizations_with_QuickSight.pdf ├── LICENSE ├── README.md ├── architecture.pdf ├── architecture.png ├── architecture.pptx ├── backend ├── lambdas │ ├── custom_resources │ │ └── lookout4metrics │ │ │ ├── lambda.py │ │ │ └── requirements.txt │ ├── locate_post │ │ └── lambda.py │ ├── process_post │ │ ├── examples │ │ │ ├── examples_eng.py │ │ │ └── examples_es.py │ │ ├── lambda.py │ │ ├── output_models │ │ │ └── models.py │ │ ├── prompt_selector.py │ │ ├── requirements.txt │ │ └── text_preprocessing.py │ ├── save_post │ │ ├── lambda.py │ │ └── requirements.txt │ └── workflow_from_sqs │ │ ├── lambda.py │ │ └── requirements.txt ├── state_machine │ └── process_post.asl.json └── template.yaml ├── data-streammer ├── mexico_big_cities.csv ├── new_years_resolutions_tweets.csv └── stream_posts.py ├── sample-files ├── phrases │ ├── 2022-09-12 │ │ ├── cb1bc907-32b8-11ed-a39b-010c818c1d51.json │ │ └── f4491b0d-32dd-11ed-96e7-1dd85b3a04d7.json │ └── 2022-09-13 │ │ ├── 0e79b260-33a6-11ed-a6f9-adab8308bd5b.json │ │ ├── 14a2cce4-33b3-11ed-9dd4-d752f0bb0867.json │ │ └── 88fd7722-33af-11ed-a356-b52a4d4d8f68.json └── tweets │ ├── 2022-09-12 │ ├── cb1bc907-32b8-11ed-a39b-010c818c1d51.json │ └── f4491b0d-32dd-11ed-96e7-1dd85b3a04d7.json │ └── 2022-09-13 │ ├── 0e79b260-33a6-11ed-a6f9-adab8308bd5b.json │ ├── 14a2cce4-33b3-11ed-9dd4-d752f0bb0867.json │ └── 88fd7722-33af-11ed-a356-b52a4d4d8f68.json └── stream-getter ├── .gitignore ├── Dockerfile ├── backoff.py ├── main.py ├── requirements.txt ├── sqs_helper.py └── stream_match.py /.gitignore: -------------------------------------------------------------------------------- 1 | # OS X extensions 2 | *.DS_Store 3 | 4 | # PyCharm 5 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can 6 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore 7 | # and can be added to the global gitignore or merged into this file. For a more nuclear 8 | # option (not recommended) you can uncomment the following to ignore the entire idea folder. 9 | .idea/ -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 
13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | -------------------------------------------------------------------------------- /Creating_visualizations_with_QuickSight.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/ai-powered-text-insights/ace836af93c6e0c16de8eab606bf0b83d7868156/Creating_visualizations_with_QuickSight.pdf -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 
2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | this software and associated documentation files (the "Software"), to deal in 5 | the Software without restriction, including without limitation the rights to 6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | the Software, and to permit persons to whom the Software is furnished to do so. 8 | 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # AI Powered Text Insights 2 | 3 | # Table of Contents 4 | 5 | 1. [Overview](#overview) 6 | 2. [Architecture](#architecture) 7 | 3. [Deployment](#deployment) 8 | - [Prerequisites](#prerequisites) 9 | - [Backend resources](#backend-resources) 10 | - [Test the application by streaming sample X.com posts](#optional-test-the-application-by-streaming-sample-xcom-posts) 11 | - [Stream real X.com posts to the application](#optional-stream-real-xcom-posts-to-the-application) 12 | 4. [Cost](#cost) 13 | - [Optional Costs](#optional-costs) 14 | 5. [Visualize your insights with Amazon QuickSight](#visualize-your-insights-with-amazon-quicksight) 15 | 6. [Cleanup](#clean-up) 16 | 7. [License](#license) 17 | 18 | 19 | 20 | ## Overview 21 | 22 | This package includes the code of a prototype that helps you gain insights into how your customers interact with you or your brand on social media. By leveraging Large Language Models (LLMs) on Amazon Bedrock, we are able to extract real-time insights (such as topic, entities, sentiment, and more) from short text of any kind (including posts on social media). We then use these insights to create rich visualizations in Amazon QuickSight and configure automated alerts using Amazon Lookout for Metrics.
The solution consists of a text processing pipeline (using AWS Lambda) that extracts the following insights from posts on social media: 23 | 24 | - Topic of the post 25 | - Sentiment of the post 26 | - Entities involved in the post 27 | - Location of the post (if present) 28 | - Links in the post (if present) 29 | - Keyphrases in the post 30 | 31 | The extraction is performed by an LLM (Claude 3 Haiku), which extracts this information and stores it in a JSON object as described below: 32 | 33 | ```json 34 | { 35 | "type": "object", 36 | "properties": { 37 | "topic": { 38 | "description": "the main topic of the post", 39 | "type": "string", 40 | "default": "" 41 | }, 42 | "location": { 43 | "description": "the location, if it exists, where the events occur", 44 | "type": "string", 45 | "default": "" 46 | }, 47 | "entities": { 48 | "description": "the entities involved in the post", 49 | "type": "list", 50 | "default": [] 51 | }, 52 | "keyphrases": { 53 | "description": "the keyphrases in the post", 54 | "type": "list", 55 | "default": [] 56 | }, 57 | "sentiment": { 58 | "description": "the sentiment of the post", 59 | "type": "string", 60 | "default": "" 61 | }, 62 | "links": { 63 | "description": "any links found within the post", 64 | "type": "list", 65 | "default": [] 66 | } 67 | } 68 | } 69 | 70 | ``` 71 | 72 | If a processed text contains a location, the coordinates of that location are obtained using Amazon Location Service. Processed posts are stored in an Amazon S3 bucket. Data stored in S3 is queried using Amazon Athena. Anomaly detection is performed on the volume of posts per category per period of time using Amazon Lookout for Metrics, and notifications are sent when anomalies are detected. All insights can be presented in a QuickSight dashboard. 73 | 74 | The application includes the following resources (directories): 75 | 76 | - `backend`: Contains all the code for the application and the deployment of the resources described in the Architecture diagram 77 | - `data-streamer`: A sample application that streams sample X.com posts about New Year's resolutions to the AI Powered Text Insights application 78 | - `stream-getter`: A sample application that streams posts from a real X.com account to the AI Powered Text Insights application 79 | - `sample-files`: Example JSON objects of processed posts. 80 | 81 | 82 | ## Architecture 83 | 84 | Deploying the sample application builds the following environment in the AWS Cloud: 85 | 86 | ![architecture](architecture.png) 87 | 88 | 1. An [Amazon Elastic Container Service](https://aws.amazon.com/ecs/) (Amazon ECS) task runs on serverless infrastructure managed by [AWS Fargate](https://aws.amazon.com/fargate/) and maintains an open connection to the social media platform. 89 | 2. The social media access tokens are securely stored in [AWS Systems Manager Parameter Store](https://aws.amazon.com/systems-manager/), and the container image is hosted on [Amazon Elastic Container Registry](https://aws.amazon.com/ecr/) (Amazon ECR). 90 | 3. When a new post arrives, it’s placed into an [Amazon Simple Queue Service](https://aws.amazon.com/sqs/) (SQS) queue. 91 | 4. The logic of the solution resides in [AWS Lambda](https://aws.amazon.com/lambda/) function microservices, coordinated by [AWS Step Functions](https://aws.amazon.com/step-functions/). 92 | 5. The post is processed in real time by one of the Large Language Models (LLMs) supported by [Amazon Bedrock](https://aws.amazon.com/bedrock). 93 | 6.
[Amazon Location Service](https://aws.amazon.com/location/) transforms a location name into coordinates. 94 | 7. The post and metadata (insights) are sent to [Amazon Simple Storage Service](https://aws.amazon.com/s3/) (Amazon S3), and [Amazon Athena](https://aws.amazon.com/athena/) queries the processed posts with standard SQL. 95 | 8. [Amazon Lookout for Metrics](https://aws.amazon.com/lookout-for-metrics/) looks for anomalies in the volume of mentions per category. [Amazon Simple Notification Service](https://aws.amazon.com/sns/) (Amazon SNS) sends an alert to users when an anomaly is detected. 96 | 9. We recommend setting up an [Amazon QuickSight](https://aws.amazon.com/quicksight/) dashboard so that business users can easily visualize insights. 97 | 98 | ## Deployment 99 | 100 | ### Prerequisites 101 | 102 | * AWS CLI. Refer to [Installing the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html) 103 | * AWS Credentials configured in your environment. Refer to 104 | [Configuration and credential file settings](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html) 105 | * AWS SAM CLI. Refer 106 | to [Installing the AWS SAM CLI](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-sam-cli-install.html) 107 | * AWS Copilot CLI. Refer to 108 | [Install Copilot](https://aws.github.io/copilot-cli/docs/getting-started/install/) 109 | * Docker. Refer to [Docker](https://www.docker.com/products/docker-desktop) 110 | * Get access to the Claude 3 Haiku model on Amazon Bedrock. Follow the instructions in the [model access guide](https://docs.aws.amazon.com/bedrock/latest/userguide/model-access.html). 111 | * Authenticate to the Amazon ECR public registry. Follow this [guide to authenticate](https://docs.aws.amazon.com/AmazonECR/latest/public/public-registries.html) 112 | * [Optional] Twitter application Bearer token. Refer 113 | to [OAuth 2.0 Bearer Token - Prerequisites](https://developer.twitter.com/en/docs/authentication/oauth-2-0/bearer-tokens) 114 | * [Optional] Twitter Filtered stream rules configured. Refer to the examples at the end of this document and to 115 | [Building rules for filtered stream](https://developer.twitter.com/en/docs/twitter-api/tweets/filtered-stream/integrate/build-a-rule) 116 | 117 | ### Backend resources 118 | 119 | Run the command below, from within the `backend/` directory, to deploy the backend: 120 | 121 | ``` 122 | sam build --use-container && sam deploy --guided 123 | ``` 124 | 125 | Follow the prompts. NOTE: Due to a constraint on database naming in Lookout for Metrics, please name your stack using the following regular expression pattern: [a-zA-Z0-9_]+ 126 | 127 | The command above deploys an AWS CloudFormation stack in your AWS account. You will need the stack's output values to deploy 128 | the Twitter stream getter container. 129 | 130 | #### 1. Data format 131 | 132 | This solution writes the post insights, stored as JSON files, into four S3 locations, **/posts**, **/links**, **/topics**, and **/phrases**, in the results bucket whose name is specified by the CloudFormation stack's outputs under "PostsBucketName". Under each subfolder, data is organized by day following the **YYYY-MM-dd 00:00:00** datetime format. 133 | 134 | Sample output files can be found in this repository under the **/sample-files** folder. 135 |
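For reference, a single record under the **/posts** prefix looks roughly like the sketch below. The field names follow what the `save_post` Lambda writes; all values are illustrative only (including the model identifier), and `location`, `longitude`, and `latitude` appear only when a location was detected. See the files under the **/sample-files** folder for authoritative examples.

```json
{
  "text": "Excited to start the new year with a 5k run! #NewYearsResolution",
  "user": "example_user",
  "created_at": "2022-09-12 10:15:00.000000",
  "source": "stream-getter",
  "platform": "twitter",
  "text_clean": "excited to start the new year with a 5k run! #newyearsresolution",
  "topic": "new year's resolutions",
  "model": "anthropic.claude-3-haiku-20240307-v1:0",
  "sentiment": "positive",
  "location": "Mexico City",
  "longitude": -99.1332,
  "latitude": 19.4326,
  "timestamp": "2022-09-12 10:15:02.123456",
  "count": 1
}
```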
136 | #### 2. Activate the Lookout for Metrics detector 137 | 138 | To allow you to provide historical data to the anomaly detector and [reduce the detector’s learning time](https://docs.aws.amazon.com/lookoutmetrics/latest/dev/services-athena.html), the prototype is deployed with the anomaly detector disabled. 139 | 140 | If you have historical data with the same format as the data generated by this solution, you may move it to the data S3 bucket generated by deploying the backend (PostsBucketName). Make sure to follow the format of the files in the **/sample-files** folder. 141 | 142 | [Follow the instructions](https://docs.aws.amazon.com/lookoutmetrics/latest/dev/gettingstarted-detector.html) to activate your detector; the detector’s name can be found in the CloudFormation stack’s outputs. 143 | 144 | Optionally, you can configure alerts for your anomaly detector. [Follow the instructions](https://docs.aws.amazon.com/lookoutmetrics/latest/dev/gettingstarted-detector.html) to create an alert that sends a notification to SNS; the SNS topic name is part of the CloudFormation stack’s outputs. 145 | 146 | ### (Optional) Test the application by streaming sample X.com posts 147 | 148 | This section is entirely optional. It will show you how to stream sample X.com posts to the Amazon SQS queue (2 in the architecture diagram) locally from your computer. 149 | 150 | Navigate to the ``data-streamer`` folder and run: 151 | 152 | ``` 153 | python stream_posts.py \ 154 | --queue_url <queue-url> \ 155 | --region <region> 156 | ``` 157 | 158 | ### (Optional) Stream real X.com posts to the application 159 | 160 | This section is entirely optional. It will show you how to deploy the assets under the **stream-getter** folder, which create an application (1 in the architecture diagram) that gets X.com posts using the [streaming API](https://developer.twitter.com/en/docs/tutorials/stream-tweets-in-real-time). 161 | 162 | Run the commands below, from within the `stream-getter/` directory, to deploy the container application: 163 | 164 | #### 1. Create application 165 | 166 | ``` 167 | copilot app init twitter-app 168 | ``` 169 | 170 | #### 2. Create environment 171 | 172 | ``` 173 | copilot env init --name test --region <region> 174 | ``` 175 | 176 | Replace `<region>` with the same region to which you deployed the backend resources previously. 177 | 178 | Follow the prompts, accepting the default values. 179 | 180 | The above command provisions the required network infrastructure (VPC, subnets, security groups, and more). In its 181 | default configuration, Copilot 182 | follows [AWS best practices](https://aws.amazon.com/blogs/containers/amazon-ecs-availability-best-practices/) and 183 | creates a VPC with two public and two private subnets in different Availability Zones (AZs). For security reasons, we'll 184 | soon configure the placement of the service as _private_. Because of that, the service will run on the private subnets 185 | and Copilot will automatically add NAT Gateways, but NAT Gateways increase the overall cost. In case you decide to run 186 | the application in a single AZ to have only one NAT Gateway **(not recommended)**, you can run the following command 187 | instead: 188 | 189 | ``` 190 | copilot env init --name test --region <region> \ 191 | --override-vpc-cidr 10.0.0.0/16 --override-public-cidrs 10.0.0.0/24 --override-private-cidrs 10.0.1.0/24 192 | ``` 193 | 194 | **Note:** The current implementation is designed to run only one container at a time.
Not only your Twitter account 195 | should allow you to have more than one Twitter's stream connection at a time, but the application also must be modified 196 | to handle other complexities such as duplicates (learn more 197 | in [Recovery and redundancy features](https://developer.twitter.com/en/docs/twitter-api/tweets/filtered-stream/integrate/recovery-and-redundancy-features)). 198 | Even though there will be only one container running at a time, having two AZs is still recommended, because in case 199 | one AZ is down, ECS can run the application in the other AZ. 200 | 201 | #### 3. Deploy the environment 202 | 203 | ``` 204 | copilot env deploy --name test 205 | ``` 206 | 207 | #### 4. Create service 208 | 209 | ``` 210 | copilot svc init --name stream-getter --svc-type "Backend Service" --dockerfile ./Dockerfile 211 | ``` 212 | 213 | #### 5. Create secret to store the Twitter Bearer token 214 | 215 | ``` 216 | copilot secret init --name TwitterBearerToken 217 | ``` 218 | 219 | When prompted to provide the secret, paste the Twitter Bearer token. 220 | 221 | #### 6. Edit service manifest 222 | 223 | Open the file `copilot/stream-getter/manifest.yml` and change its content to the following: 224 | 225 | ``` 226 | name: stream-getter 227 | type: Backend Service 228 | 229 | image: 230 | build: Dockerfile 231 | 232 | cpu: 256 233 | memory: 512 234 | count: 1 235 | exec: true 236 | 237 | network: 238 | vpc: 239 | placement: private 240 | 241 | variables: 242 | SQS_QUEUE_URL: 243 | LOG_LEVEL: info 244 | 245 | secrets: 246 | BEARER_TOKEN: /copilot/${COPILOT_APPLICATION_NAME}/${COPILOT_ENVIRONMENT_NAME}/secrets/TwitterBearerToken 247 | ``` 248 | 249 | Replace `` with the URL of the SQS queue deployed in your AWS account. 250 | 251 | You can use the following command to get the value from the backend AWS CloudFormation stack outputs 252 | (replace `` with the name of your backend stack): 253 | 254 | ``` 255 | aws cloudformation describe-stacks --stack-name \ 256 | --query "Stacks[].Outputs[?OutputKey=='TweetsQueueUrl'][] | [0].OutputValue" 257 | ``` 258 | 259 | #### 7. Add permission to write to the queue 260 | 261 | Create a new file in `copilot/stream-getter/addons/` called `sqs-policy.yaml` with the following content: 262 | 263 | ``` 264 | Parameters: 265 | App: 266 | Type: String 267 | Description: Your application's name. 268 | Env: 269 | Type: String 270 | Description: The environment name your service, job, or workflow is being deployed to. 271 | Name: 272 | Type: String 273 | Description: The name of the service, job, or workflow being deployed. 274 | 275 | Resources: 276 | QueuePolicy: 277 | Type: AWS::IAM::ManagedPolicy 278 | Properties: 279 | PolicyDocument: 280 | Version: 2012-10-17 281 | Statement: 282 | - Sid: SqsActions 283 | Effect: Allow 284 | Action: 285 | - sqs:SendMessage 286 | Resource: 287 | 288 | Outputs: 289 | QueuePolicyArn: 290 | Description: The ARN of the ManagedPolicy to attach to the task role. 291 | Value: !Ref QueuePolicy 292 | 293 | ``` 294 | 295 | Replace `` with the ARN of the SQS queue deployed in your AWS account. 296 | 297 | You can use the following command to get the value from the backend AWS CloudFormation stack outputs 298 | (replace `` with the name of your backend stack): 299 | 300 | ``` 301 | aws cloudformation describe-stacks --stack-name \ 302 | --query "Stacks[].Outputs[?OutputKey=='TweetsQueueArn'][] | [0].OutputValue" 303 | ``` 304 | 305 | After that, your directory should look like the following: 306 | 307 | ``` 308 | . 
309 | ├── Dockerfile 310 | ├── backoff.py 311 | ├── copilot 312 | │ ├── stream-getter 313 | │ │ ├── addons 314 | │ │ │ └── sqs-policy.yaml 315 | │ │ └── manifest.yml 316 | │ └── environments 317 | │ └── test 318 | │ └── manifest.yml 319 | ├── main.py 320 | ├── requirements.txt 321 | ├── sqs_helper.py 322 | └── stream_match.py 323 | ``` 324 | 325 | #### 8. Deploy service 326 | 327 | > **IMPORTANT:** The container will connect to the Twitter stream as soon as it starts, after deploying the service. You need your Twitter stream rules configured before connecting to the stream. Therefore, if you haven't configured the rules yet, configure them before proceeding. 328 | 329 | ``` 330 | copilot svc deploy --name stream-getter --env test 331 | ``` 332 | 333 | When the deployment finishes, you should have the container running inside ECS. To check the logs, run the following: 334 | 335 | ``` 336 | copilot svc logs --follow 337 | ``` 338 | 339 | #### 9. Rules examples for filtered stream 340 | 341 | Twitter provides endpoints that enable you to create and manage rules, and apply those rules to filter a stream of 342 | real-time tweets that will return matching public tweets. 343 | 344 | For instance, following is a rule that returns tweets from the accounts `@awscloud`, `@AWSSecurityInfo`, and `@AmazonScience`: 345 | 346 | ``` 347 | from:awscloud OR from:AWSSecurityInfo OR from:AmazonScience 348 | ``` 349 | 350 | To add that rule, issue a request like the following, replacing `` with the Twitter Bearer token: 351 | 352 | ``` 353 | curl -X POST 'https://api.twitter.com/2/tweets/search/stream/rules' \ 354 | -H "Content-type: application/json" \ 355 | -H "Authorization: Bearer " -d \ 356 | '{ 357 | "add": [ 358 | { 359 | "value": "from:awscloud OR from:AWSSecurityInfo OR from:AmazonScience", 360 | "tag": "news" 361 | } 362 | ] 363 | }' 364 | ``` 365 | 366 | ## Cost 367 | 368 | You are responsible for the cost of the AWS services used while running this Guidance. 369 | 370 | As of May 2024, the cost for running this Guidance continuously for one month, with the default settings in the US East (N.Virginia) Region, and processing 1000 posts a day is approximately $150 per month. 371 | 372 | After the stack is destroyed, you will stop incurring in costs. 373 | 374 | The table below shows the resources provisioned by this CDK stack, and their respective cost. The table below does not consider free tier. 
375 | 376 | |Resource|Description|Approximate Cost| 377 | |--------|-----------|----------| 378 | | AWS Lambda | Functions for processing the text | 2 USD | 379 | | Amazon Simple Queue Service | Queue to get text to process | 1 USD | 380 | | Amazon Location Service | Pinpoint the locations mentioned in the text | 15 USD | 381 | | Amazon Simple Storage Service | Store the processed short text | 1 USD | 382 | | Amazon Athena | Query processed text and its metadata | 49 USD | 383 | | Amazon QuickSight | Visualize the processed text and metadata | 23 USD | 384 | | Amazon Lookout for Metrics | Detect anomalies in the trends mentioned in the text | 7 USD | 385 | | AWS Step Functions | Manage the text processing pipeline | 4 USD | 386 | | Amazon Bedrock (Claude 3 Haiku) | Extract information from the text | 45 USD | 387 | | Total | | 147 USD | 388 | 389 | ### Optional costs 390 | 391 | The following costs will only be incurred if you deploy the resources in the optional section [Stream real X.com posts to the application](#optional-stream-real-xcom-posts-to-the-application). 392 | 393 | |Resource|Description|Approximate Cost| 394 | |--------|-----------|----------| 395 | | AWS Fargate | Run the ECS container on serverless infrastructure | 37 USD | 396 | | Total | | 37 USD | 397 | 398 | ## Visualize your insights with Amazon QuickSight 399 | 400 | To create some example visualizations from the processed text data, follow the instructions in the [Creating visualizations with QuickSight.pdf](Creating_visualizations_with_QuickSight.pdf) file. 401 | 402 | ## Clean up 403 | 404 | If you don't want to continue using the sample, clean up its resources to avoid further charges. 405 | 406 | Start by deleting the backend AWS CloudFormation stack, which will, in turn, remove the underlying resources it created. Run the following command (replace `<stack-name>` with the name of your backend stack): 407 | 408 | ``` 409 | sam delete --stack-name <stack-name> 410 | ``` 411 | 412 | Additionally, if you deployed the X.com streaming application, delete the resources AWS Copilot set up for the container application with: 413 | 414 | ``` 415 | copilot svc delete --name stream-getter 416 | copilot env delete --name test 417 | copilot app delete 418 | ``` 419 | 420 | ## License 421 | 422 | This project is licensed under the MIT-0 License. See the [LICENSE](LICENSE) file. -------------------------------------------------------------------------------- /architecture.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/ai-powered-text-insights/ace836af93c6e0c16de8eab606bf0b83d7868156/architecture.pdf -------------------------------------------------------------------------------- /architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/ai-powered-text-insights/ace836af93c6e0c16de8eab606bf0b83d7868156/architecture.png -------------------------------------------------------------------------------- /architecture.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/ai-powered-text-insights/ace836af93c6e0c16de8eab606bf0b83d7868156/architecture.pptx -------------------------------------------------------------------------------- /backend/lambdas/custom_resources/lookout4metrics/lambda.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
2 | # SPDX-License-Identifier: MIT-0 3 | 4 | import boto3 5 | import logging 6 | import uuid 7 | import cfnresponse 8 | from crhelper import CfnResource 9 | 10 | logger = logging.getLogger(__name__) 11 | helper = CfnResource(log_level="INFO") 12 | L4M = boto3.client("lookoutmetrics") 13 | 14 | 15 | def create_detector(project_name, frequency): 16 | response_create = '' 17 | 18 | response_create = L4M.create_anomaly_detector( 19 | AnomalyDetectorName=project_name + "-detector-" + str(uuid.uuid1())[:6], 20 | AnomalyDetectorDescription="Text insights anomaly detector", 21 | AnomalyDetectorConfig={ 22 | "AnomalyDetectorFrequency": frequency, 23 | }, 24 | ) 25 | 26 | logger.info(response_create) 27 | 28 | return response_create 29 | 30 | 31 | def define_dataset(detector_arn, project_name, frequency, athena_role_arn, athena_config): 32 | params = { 33 | "AnomalyDetectorArn": detector_arn, 34 | "MetricSetName": project_name + '-metric-set-1', 35 | "MetricList": [ 36 | { 37 | "MetricName": "count", 38 | "AggregationFunction": "AVG", 39 | } 40 | ], 41 | 42 | "DimensionList": ["platform", "topic", "sentiment"], 43 | "Offset": 60, 44 | 45 | "TimestampColumn": { 46 | "ColumnName": "partition_timestamp", 47 | "ColumnFormat": "yyyy-MM-dd HH:mm:ss", 48 | }, 49 | 50 | # "Delay" : 120, # seconds the detector will wait before attempting to read latest data per current time and detection frequency below 51 | "MetricSetFrequency": frequency, 52 | 53 | "MetricSource": { 54 | "AthenaSourceConfig": { 55 | "RoleArn": athena_role_arn, 56 | "DatabaseName": athena_config['db_name'], 57 | "DataCatalog": athena_config['data_catalog'], 58 | "TableName": athena_config['table_name'], 59 | "WorkGroupName": athena_config['workgroup_name'], 60 | } 61 | }, 62 | } 63 | 64 | return params 65 | 66 | 67 | @helper.create 68 | @helper.update 69 | def create(event, context): 70 | logger.info(event) 71 | 72 | cfn_input = event["ResourceProperties"] 73 | target = cfn_input["Target"] 74 | 75 | try: 76 | 77 | athena_role_arn = target['AthenaRoleArn'] 78 | athena_config = { 79 | 'db_name': target['GlueDbName'], 80 | 'data_catalog': target['AwsDataCatalog'], 81 | 'table_name': target['GlueTableName'], 82 | 'workgroup_name': target['AthenaWorkgroupName'] 83 | } 84 | sns_role_arn = target['SnsRoleArn'] 85 | topic_arn = target['SnsTopicArn'] 86 | alert_threshold = target['AlertThreshold'] 87 | 88 | project = 'ai-text-insights' 89 | frequency = target['DetectorFrequency'] 90 | 91 | l4m_detector = create_detector(project, frequency) 92 | 93 | anomaly_detector_arn = l4m_detector["AnomalyDetectorArn"] 94 | dataset = define_dataset(anomaly_detector_arn, project, frequency, athena_role_arn, athena_config) 95 | 96 | L4M.create_metric_set(**dataset) 97 | 98 | L4M.activate_anomaly_detector(AnomalyDetectorArn=anomaly_detector_arn) 99 | 100 | L4M.create_alert( 101 | Action={ 102 | "SNSConfiguration": { 103 | "RoleArn": sns_role_arn, 104 | "SnsTopicArn": topic_arn 105 | } 106 | }, 107 | AlertDescription="Text insights alert", 108 | AlertName=project + "-alert-all", 109 | AnomalyDetectorArn=anomaly_detector_arn, 110 | AlertSensitivityThreshold=int(alert_threshold) 111 | ) 112 | 113 | cfnresponse.send(event, context, cfnresponse.SUCCESS, {'Data': 'Created L4M resource'}, anomaly_detector_arn) 114 | 115 | except Exception as e: 116 | 117 | logger.error(e) 118 | cfnresponse.send(event, context, cfnresponse.FAILED, {'Data': 'Failed to create L4M resource'}, '') 119 | 120 | 121 | @helper.delete 122 | def delete(event, _): 123 | logger.info(event) 124 | 125 | 
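# The handler below is the Lambda entry point for this CloudFormation custom resource.
# The crhelper CfnResource instance dispatches the incoming event to the functions
# decorated with @helper.create, @helper.update and @helper.delete above, based on the
# request type. Note that the delete hook only logs the event, so the anomaly detector
# created by this resource is not removed automatically when the stack is deleted.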
126 | def handler(_event, _context): 127 | helper(_event, _context) 128 | -------------------------------------------------------------------------------- /backend/lambdas/custom_resources/lookout4metrics/requirements.txt: -------------------------------------------------------------------------------- 1 | boto3==1.24.20 2 | botocore==1.27.20 3 | crhelper==2.0.11 4 | cfnresponse==1.1.2 -------------------------------------------------------------------------------- /backend/lambdas/locate_post/lambda.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | import boto3 5 | import os 6 | import logging 7 | 8 | logging.getLogger().setLevel(os.environ.get('LOG_LEVEL', 'WARNING').upper()) 9 | location_service = boto3.client('location') 10 | 11 | def handler(event, context): 12 | 13 | item = event 14 | 15 | logging.info('Item is located in ' + item['location']) 16 | res_locations = location_service.search_place_index_for_text(FilterCountries=[os.environ['GEO_REGION']], 17 | IndexName=os.environ['PLACE_INDEX_NAME'], 18 | Language=os.environ['LANGUAGE'], 19 | MaxResults=1, 20 | Text=item['location'] 21 | ) 22 | logging.info(res_locations) 23 | 24 | if len(res_locations['Results']) >= 1: 25 | print(res_locations['Results'][0]) 26 | item['longitude'] = res_locations['Results'][0]['Place']['Geometry']['Point'][0] 27 | item['latitude'] = res_locations['Results'][0]['Place']['Geometry']['Point'][1] 28 | 29 | return item 30 | 31 | -------------------------------------------------------------------------------- /backend/lambdas/process_post/examples/examples_eng.py: -------------------------------------------------------------------------------- 1 | examples_eng = [ 2 | 3 | { 4 | "text": 5 | """ 6 | Six months ago, Wall Street Journal reporter Evan Gershkovich was detained in Russia during a reporting trip. 7 | He remains in a Moscow prison. 8 | 9 | We’re offering resources for those who want to show their support for him. #IStandWithEvan https://wsj.com/Evan 10 | """, 11 | "extraction": 12 | """ 13 | { 14 | "topic": "detention of a reporter", 15 | "location": "Moscow", 16 | "entities": ["Evan Gershkovich", "Wall Street Journal"], 17 | "keyphrases": ["reporter", "detained", "prison"], 18 | "sentiment": "negative", 19 | "links": ["https://wsj.com/Evan"], 20 | } 21 | """ 22 | }, 23 | { 24 | "text": 25 | """ 26 | We’re living an internal war": Once-peaceful Ecuador has become engulfed in the cocaine trade, and the bodies are piling up. 27 | """, 28 | "extraction": 29 | """ 30 | { 31 | "topic": "drug war", 32 | "location": "Ecuador", 33 | "entities": ["Ecuador"], 34 | "keyphrases": ["drug war", "cocaine trade"], 35 | "sentiment": "negative", 36 | "links": [], 37 | } 38 | """ 39 | }, 40 | { 41 | "text": 42 | """ 43 | House Democrats will soon face a difficult decision: Are they better off keeping Kevin McCarthy \ 44 | as House speaker, or taking chances with someone else? 45 | """, 46 | "extraction": 47 | """ 48 | { 49 | "topic": "house speaker choice", 50 | "location": "", 51 | "entities": ["Kevin McCarthy", "House Democrats"], 52 | "keyphrases": ["house speaker", "house democrats"], 53 | "sentiment": "neutral", 54 | "links": [], 55 | } 56 | """ 57 | }, 58 | { 59 | "text": 60 | """ 61 | A postpandemic hiring spree has left airports vulnerable to security gaps as new staff gain access to secure areas, \ 62 | creating an opening for criminal groups. 
63 | """, 64 | "extraction": 65 | """ 66 | { 67 | "topic": "airport security vulnerabilities", 68 | "location": "", 69 | "entities": [], 70 | "keyphrases": ["security gaps", "secure areas", "criminal groups"], 71 | "sentiment": "negative", 72 | "links": [], 73 | } 74 | """ 75 | } 76 | ] -------------------------------------------------------------------------------- /backend/lambdas/process_post/examples/examples_es.py: -------------------------------------------------------------------------------- 1 | examples_es = [ 2 | { 3 | "text": 4 | """ 5 | Inspírese para innovar en #AWSreInvent 6 | 7 | Asista a la ponencia principal y escuche a los principales líderes de #AWS revelar los últimos lanzamientos de productos y compartir sus 8 | opiniones sobre las últimas tendencias en #CloudComputing 9 | 10 | Regístrese: https://go.aws/3RBhhRq 11 | """, 12 | "extraction": 13 | """ 14 | { 15 | "topic": "ponencia principal de reinvent", 16 | "location": "", 17 | "entities": ["AWS", "AWSreInvent"], 18 | "keyphrases": ["ponencia principal", "lanzamientos de productos", "#CloudComputing", "#AWSreInvent"], 19 | "sentiment": "neutral" 20 | "links: ["https://go.aws/3RBhhRq"] 21 | } 22 | """ 23 | }, 24 | { 25 | "text": 26 | """ 27 | Esta guerra fue forzada sobre nosotros por un enemigo horrendo", dice Netanyahu 28 | 29 | Más información: https://cnn.it/3RKT5MK 30 | """, 31 | "extraction": 32 | """ 33 | { 34 | "topic": "Guerra", 35 | "location": "", 36 | "entities": ["Netanyahu", "enemigo horrendo"], 37 | "keyphrases": ["enemigo horrendo", "guerra"], 38 | "sentiment": "negativo" 39 | "links": ["https://cnn.it/3RKT5MK"] 40 | } 41 | """ 42 | }, 43 | { 44 | "text": 45 | """ 46 | En medio de la cobertura del conflicto entre Israel y Gaza, el equipo de CNN, que se encuentra en Ashdod, Israel, 47 | tuvo que resguardarse de un "bombardeo masivo de cohetes". 48 | """, 49 | "extraction": 50 | """ 51 | { 52 | "topic": "equipo de CNN se resguardo de un bombardeo", 53 | "location": "Ashdod, Israel", 54 | "entities": ["Israel", "Gaza", "CNN"], 55 | "keyphrases": ["bombardeo masivo de cohetes", "conflicto entre Israel y Gaza"], 56 | "sentiment": "negativo", 57 | "links":[] 58 | } 59 | """ 60 | }, 61 | { 62 | "text": 63 | """ 64 | El Consejo de la FIFA acordó por unanimidad celebrar el centenario de la Copa Mundial de la FIFA de la manera más apropiada. \ 65 | Tres países sudamericanos (Uruguay, Argentina y Paraguay) organizarán un partido cada uno de la Copa Mundial de la FIFA 2030. 66 | """, 67 | "extraction": 68 | """ 69 | { 70 | "topic": "Anfitriones copa del mundo 2030", 71 | "location": "", 72 | "entities": ["Uruguay", "Argentina", "Paraguay", "FIFA", "Copa Mundial de la FIFA"], 73 | "keyphrases": ["centenario de la Copa Mundial de la FIFA"], 74 | "sentiment": "positivo", 75 | "links":[] 76 | } 77 | """ 78 | } 79 | ] -------------------------------------------------------------------------------- /backend/lambdas/process_post/lambda.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 
2 | # SPDX-License-Identifier: MIT-0 3 | import copy 4 | import os 5 | import traceback 6 | 7 | import demoji 8 | 9 | import boto3 10 | import langchain_core 11 | from langchain_aws import ChatBedrock 12 | 13 | from output_models.models import ExtractedInformation, TopicMatch, TextWithInsights 14 | from prompt_selector import get_information_extraction_prompt_selector, get_topic_match_prompt_selector 15 | 16 | from aws_lambda_powertools.utilities.typing import LambdaContext 17 | from aws_lambda_powertools import Logger 18 | 19 | MODEL_ID = os.environ['MODEL_ID'] 20 | AWS_REGION = os.environ['AWS_REGION'] 21 | LANGUAGE_CODE = os.environ['LANGUAGE_CODE'] 22 | META_TOPICS_STR = os.environ['LABELS'] 23 | META_SENTIMENTS_STR = os.environ['SENTIMENT_LABELS'] 24 | 25 | INFORMATION_EXTRACTION_PROMPT_SELECTOR = get_information_extraction_prompt_selector(LANGUAGE_CODE) 26 | TOPIC_MATCH_PROMPT_SELECTOR = get_topic_match_prompt_selector(LANGUAGE_CODE) 27 | 28 | TOPIC_MATCH_MODEL_PARAMETERS = { 29 | "max_tokens": 500, 30 | "temperature": 0.1, 31 | "top_k": 20, 32 | } 33 | 34 | INFORMATION_EXTRACTION_MODEL_PARAMETERS = { 35 | "max_tokens": 1500, 36 | "temperature": 0.1, 37 | "top_k": 20, 38 | } 39 | 40 | bedrock_runtime = boto3.client( 41 | service_name="bedrock-runtime", 42 | region_name=AWS_REGION 43 | ) 44 | 45 | logger = Logger() 46 | 47 | langchain_core.globals.set_debug(True) 48 | 49 | def remove_unknown_values(extracted_info: ExtractedInformation): 50 | 51 | text_insights = copy.deepcopy(extracted_info) 52 | 53 | topic = extracted_info.topic.strip() 54 | location = extracted_info.location.strip() 55 | sentiment = extracted_info.sentiment.strip() 56 | 57 | text_insights.topic = topic if topic != "" else "" 58 | text_insights.location = location if location != "" else "" 59 | text_insights.sentiment = sentiment if sentiment != "" else "" 60 | 61 | return text_insights 62 | 63 | def text_topic_match( 64 | meta_topics: str, 65 | text: str, 66 | ) -> TopicMatch: 67 | 68 | bedrock_llm = ChatBedrock( 69 | model_id=MODEL_ID, 70 | model_kwargs=TOPIC_MATCH_MODEL_PARAMETERS, 71 | client=bedrock_runtime, 72 | ) 73 | 74 | claude_topic_match_prompt_template = TOPIC_MATCH_PROMPT_SELECTOR.get_prompt(MODEL_ID) 75 | 76 | structured_llm = bedrock_llm.with_structured_output(TopicMatch) 77 | 78 | structured_topic_match_chain = claude_topic_match_prompt_template | structured_llm 79 | 80 | topic_match_obj = structured_topic_match_chain.invoke({"meta_topics": meta_topics, "text": text}) 81 | 82 | return topic_match_obj 83 | 84 | def text_information_extraction( 85 | sentiments: str, 86 | text: str 87 | ) -> ExtractedInformation: 88 | 89 | bedrock_llm = ChatBedrock( 90 | model_id=MODEL_ID, 91 | model_kwargs=INFORMATION_EXTRACTION_MODEL_PARAMETERS, 92 | client=bedrock_runtime, 93 | ) 94 | 95 | claude_information_extraction_prompt_template = INFORMATION_EXTRACTION_PROMPT_SELECTOR.get_prompt(MODEL_ID) 96 | 97 | structured_llm = bedrock_llm.with_structured_output(ExtractedInformation) 98 | 99 | structured_chain = claude_information_extraction_prompt_template | structured_llm 100 | 101 | information_extraction_obj = structured_chain.invoke({ 102 | "text": text, 103 | "sentiments": sentiments 104 | }) 105 | 106 | return information_extraction_obj 107 | 108 | @logger.inject_lambda_context(log_event=True) 109 | def handler(event, _context: LambdaContext): 110 | 111 | item = event 112 | text = item['text'] 113 | 114 | clean_text = demoji.replace(text, "") 115 | 116 | # Attemp to extract information from text 117 | try: 118 | 119 
| topic_match = text_topic_match(META_TOPICS_STR, clean_text) 120 | 121 | if topic_match.is_match and len(topic_match.related_topics) > 0: 122 | 123 | try: 124 | 125 | insights = text_information_extraction(META_SENTIMENTS_STR, clean_text) 126 | logger.info(f'Text insights:') 127 | logger.info(insights) 128 | 129 | if insights is not None: 130 | 131 | logger.info("Removing values") 132 | logger.debug(insights) 133 | 134 | insights = remove_unknown_values(insights) 135 | 136 | logger.info(" removed") 137 | logger.debug(insights) 138 | 139 | # Create output object 140 | text_insights = TextWithInsights( 141 | text=item["text"], 142 | user=item["user"], 143 | created_at=item["created_at"], 144 | source=item["source"], 145 | platform=item["platform"], 146 | text_clean=clean_text, 147 | meta_topics=topic_match.related_topics, 148 | topic=insights.topic, 149 | location=insights.location, 150 | entities=insights.entities, 151 | keyphrases=insights.keyphrases, 152 | sentiment=insights.sentiment, 153 | links=insights.links, 154 | model_id=MODEL_ID, 155 | process_post=True, 156 | process_location=True if len(insights.location) > 0 else False # Process location only if there is one 157 | ) 158 | 159 | logger.info(f'Invoking next function') 160 | logger.debug(item) 161 | else: 162 | 163 | # Create output object 164 | text_insights = TextWithInsights( 165 | text=item["text"], 166 | user=item["user"], 167 | created_at=item["created_at"], 168 | source=item["source"], 169 | platform=item["platform"], 170 | text_clean=clean_text, 171 | model_id=MODEL_ID, 172 | process_post=False 173 | ) 174 | 175 | logger.info(f'Could not extract information from this text') 176 | logger.debug(item) 177 | 178 | return text_insights.dict() 179 | 180 | except Exception as e: 181 | 182 | logger.error("Unable to extract data from text") 183 | logger.error(traceback.format_exc()) 184 | 185 | raise Exception("Unable to extract data from text") 186 | 187 | else: 188 | logger.info(f'Topic not matched') 189 | return {'process_post': False} 190 | 191 | except Exception as e: 192 | 193 | logger.error(traceback.format_exc()) 194 | 195 | raise Exception("Unable to match topics on text") -------------------------------------------------------------------------------- /backend/lambdas/process_post/output_models/models.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 
2 | # SPDX-License-Identifier: MIT-0 3 | 4 | from langchain_core.pydantic_v1 import BaseModel, Field 5 | from typing import List, Optional, Literal 6 | 7 | class ExtractedInformation(BaseModel): 8 | """Contains information extracted from the text""" 9 | topic: str = Field("", description="The main topic of the text") 10 | location: str = Field("", description="The location where the events occur, empty if no location can be inferred") 11 | entities: List[str] = Field([], description="The entities involved in the text") 12 | keyphrases: List[str] = Field([], description="The keyphrases in the short text") 13 | sentiment: str = Field("neutral", description="The overall sentiment of the text") 14 | links: List[str] = Field([], description="Any links found within the text") 15 | 16 | class TopicMatch(BaseModel): 17 | """Information regarding the match of a topic to a set of predefined topics""" 18 | understanding: str = Field(description="In your own words the topics to which this text is related") 19 | related_topics: List[str] = Field(description="The list of topics into which the input text can be classified") 20 | is_match: bool = Field(description="true if the text matches one of your topics of interest, false otherwise") 21 | 22 | 23 | class TextWithInsights(ExtractedInformation): 24 | """A text with its extracted insights""" 25 | text: str = Field(description="The original text as written by the user") 26 | user: str = Field(description="The user that wrote the text") 27 | created_at: str = Field(description="The datetime when the text was created") 28 | source: str = Field(description="The source platform from where this text was generated") 29 | platform: str = Field(description="The platform from where this text was generated") 30 | text_clean: str = Field(description="The original text after being pre-processed") 31 | process_post: bool = Field(description="Whether the text should be further processed or not") 32 | meta_topics: List[str] = Field([], description="A list of topics in which the text can be classified") 33 | process_location: bool = Field(False, description="Whether the text has a location that should be further processed") 34 | model_id: str = Field(description="The model id that was used to extract these insights") 35 | -------------------------------------------------------------------------------- /backend/lambdas/process_post/prompt_selector.py: -------------------------------------------------------------------------------- 1 | from langchain.chains.prompt_selector import ConditionalPromptSelector 2 | 3 | from langchain_core.prompts.chat import ChatPromptTemplate, HumanMessagePromptTemplate, SystemMessagePromptTemplate, \ 4 | AIMessagePromptTemplate 5 | from langchain_core.prompts.few_shot import FewShotChatMessagePromptTemplate 6 | 7 | from examples.examples_eng import examples_eng 8 | from examples.examples_es import examples_es 9 | 10 | from typing import Callable 11 | 12 | # ANTHROPIC CLAUDE 3 PROMPT TEMPLATES 13 | 14 | # English Prompts 15 | 16 | claude_information_extraction_system_prompt_en = """You are an information extraction system. Your task is to extract key information from the text that will be presented to you.
17 | 18 | Here are some basic rules for the task: 19 | 20 | - You can reason about the task and the information, take your time to think 21 | - You must classify the sentiment of the text in one of the following: {sentiments} 22 | - NEVER fill a value of your own for entries where you are unsure of and rather use default values 23 | 24 | Here are some examples of how to extract information from text: 25 | """ 26 | 27 | claude_information_extraction_user_prompt_en = """ 28 | Extract the information from the following text: 29 | 30 | {text} 31 | """ 32 | 33 | claude_topic_match_system_prompt_en = """ 34 | You are a highly accurate text classification system. Your task is to classify text into one or more categories. Even though you can classify text in any category \ 35 | for this task you are only interested in classifying the text in some of the topics listed below: 36 | 37 | 38 | {meta_topics} 39 | 40 | 41 | Here are some rules for the text classification you will perform: 42 | 43 | - You should not be forced to classify the text in any of the existing categories, its ok if you cant fit a text into the topics of interest 44 | - Only classify a text into a category if you are really sure it belongs to it 45 | - Use only the categories in the list for your classification 46 | - You may classify a text in any number of topics (including zero) but your result must always be a list (even if its empty) 47 | - You can classify text in multiple categories at once 48 | - You can reason about your task, take your time to think 49 | """ 50 | 51 | claude_topic_match_user_prompt_en = user_prompt = """ 52 | Classify the following text into one or more categories of your interest: 53 | 54 | {text} 55 | """ 56 | 57 | examples_prompt_template_eng = ChatPromptTemplate.from_messages( 58 | [ 59 | HumanMessagePromptTemplate.from_template("{text}", input_variables=["text"], validate_template=True), 60 | AIMessagePromptTemplate.from_template("{extraction}", input_variables=["extraction"], validate_template=True) 61 | ] 62 | ) 63 | 64 | few_shot_chat_prompt_eng = FewShotChatMessagePromptTemplate( 65 | example_prompt=examples_prompt_template_eng, 66 | examples=examples_eng, 67 | ) 68 | 69 | CLAUDE_INFORMATION_EXTRACTION_PROMPT_TEMPLATE_EN = ChatPromptTemplate.from_messages([ 70 | SystemMessagePromptTemplate.from_template(claude_information_extraction_system_prompt_en, input_variables=["sentiments"], validate_template=True), 71 | few_shot_chat_prompt_eng, 72 | HumanMessagePromptTemplate.from_template(claude_information_extraction_user_prompt_en, input_variables=["text"], validate_template=True), 73 | ]) 74 | 75 | CLAUDE_TOPIC_MATCH_PROMPT_TEMPLATE_EN = ChatPromptTemplate.from_messages([ 76 | SystemMessagePromptTemplate.from_template(claude_topic_match_system_prompt_en, input_variables=["meta_topics"], validate_template=True), 77 | HumanMessagePromptTemplate.from_template(claude_topic_match_user_prompt_en, input_variables=["text"], validate_template=True), 78 | ]) 79 | 80 | # Spanish Prompts 81 | 82 | claude_information_extraction_system_prompt_es = """ 83 | Eres un sistema de extraccion de informacion. Tu tarea consiste en extraer informacion clave del texto que te sera presentado. 
84 | 85 | Estas son algunas reglas basicas que debes seguir: 86 | 87 | - Puedes razonar sobre la tarea y la informacion, tomate un tiempo para pensar 88 | - Debes clasificar el sentimiento en uno de los siguientes: {sentiments} 89 | - NUNCA acompletes un valor del cual no estas seguro con valores tuyos, en su lugar emplea los valores por defecto para cada campo 90 | 91 | Aqui hay algunos ejemplos de como extraer informacion de texto: 92 | """ 93 | 94 | claude_information_extraction_user_prompt_es = """ 95 | Extrae la informacion del siguiente texto: 96 | 97 | {text} 98 | """ 99 | 100 | claude_topic_match_system_prompt_es = """ 101 | Eres un sistema de clasificacion de texto altamente preciso. Tu tarea es clasificar texto en una o mas categorias. Aunque eres capaz de clasificar texto en cualquier \ 102 | categoria para esta tarea solo estas interesado en clasificar el texto en algunas de las categorias listadas aqui: 103 | 104 | 105 | {meta_topics} 106 | 107 | 108 | Aqui hay algunas reglas para la clasificacion de texto que vas a realizar: 109 | 110 | - No te debes ver forzado a clasificar el texto en ninguna de las categorias existentes, esta bien si no puedes clasificar el texto en ninguno de los temas de interes 111 | - Solo clasifica el texto en una categoria si estas completamente seguro que pertenece a esa categoria 112 | - Usa solo las categorias en la lista para tu clasificacion 113 | - Puedes clasificar el texto en cualquier numero de categorias (incluso cero) pero tu respuesta siempre debe ser una lista (aunque sea una lista vacia) 114 | - Puedes clasificar el texto en multiples categorias a la vez 115 | - Puedes razonar sobre tu tarea, tomate tu tiempo para pensar 116 | """ 117 | 118 | claude_topic_match_user_prompt_es = """ 119 | Clasifica el siguiente texto en una o mas categorias de tu interes: 120 | 121 | {text} 122 | """ 123 | 124 | examples_prompt_template_es = ChatPromptTemplate.from_messages( 125 | [ 126 | HumanMessagePromptTemplate.from_template("{text}", input_variables=["text"], validate_template=True), 127 | AIMessagePromptTemplate.from_template("{extraction}", input_variables=["extraction"], validate_template=True) 128 | ] 129 | ) 130 | 131 | few_shot_chat_prompt_es = FewShotChatMessagePromptTemplate( 132 | example_prompt=examples_prompt_template_es, 133 | examples=examples_es, 134 | ) 135 | 136 | CLAUDE_INFORMATION_EXTRACTION_PROMPT_TEMPLATE_ES = ChatPromptTemplate.from_messages([ 137 | SystemMessagePromptTemplate.from_template(claude_information_extraction_system_prompt_es, input_variables=["sentiments"], validate_template=True), 138 | few_shot_chat_prompt_es, 139 | HumanMessagePromptTemplate.from_template(claude_information_extraction_user_prompt_es, 140 | input_variables=["text"], validate_template=True), 141 | ]) 142 | 143 | CLAUDE_TOPIC_MATCH_PROMPT_TEMPLATE_ES = ChatPromptTemplate.from_messages([ 144 | SystemMessagePromptTemplate.from_template(claude_topic_match_system_prompt_es, input_variables=["meta_topics"], 145 | validate_template=True), 146 | HumanMessagePromptTemplate.from_template(claude_topic_match_user_prompt_es, input_variables=["text"], 147 | validate_template=True), 148 | ]) 149 | 150 | 151 | def is_es(language: str) -> bool: 152 | return "es" == language 153 | 154 | 155 | def is_en(language: str) -> bool: 156 | return "en" == language 157 | 158 | 159 | def is_claude(model_id: str) -> bool: 160 | return "claude" in model_id 161 | 162 | 163 | def is_es_claude(language: str) -> Callable[[str], bool]: 164 | return lambda model_id: is_es(language) and
is_claude(model_id) 165 | 166 | 167 | def is_en_claude(language: str) -> Callable[[str], bool]: 168 | return lambda model_id: is_en(language) and is_claude(model_id) 169 | 170 | 171 | def get_information_extraction_prompt_selector(lang: str) -> ConditionalPromptSelector: 172 | return ConditionalPromptSelector( 173 | default_prompt=CLAUDE_INFORMATION_EXTRACTION_PROMPT_TEMPLATE_EN, 174 | conditionals=[ 175 | (is_es_claude(lang), CLAUDE_INFORMATION_EXTRACTION_PROMPT_TEMPLATE_ES), 176 | ] 177 | ) 178 | 179 | 180 | def get_topic_match_prompt_selector(lang: str) -> ConditionalPromptSelector: 181 | return ConditionalPromptSelector( 182 | default_prompt=CLAUDE_TOPIC_MATCH_PROMPT_TEMPLATE_EN, 183 | conditionals=[ 184 | (is_es_claude(lang), CLAUDE_TOPIC_MATCH_PROMPT_TEMPLATE_ES) 185 | ] 186 | ) -------------------------------------------------------------------------------- /backend/lambdas/process_post/requirements.txt: -------------------------------------------------------------------------------- 1 | demoji==1.1.0 2 | boto3 3 | langchain==0.2.6 4 | langchain_aws==0.1.9 5 | langchain-core==0.2.10 6 | aws-lambda-powertools -------------------------------------------------------------------------------- /backend/lambdas/process_post/text_preprocessing.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | import re 5 | 6 | class TweetPreprocessor: 7 | 8 | @staticmethod 9 | def remove_links(tweet): 10 | """Takes a string and removes web links from it""" 11 | tweet = re.sub(r'http\S+', '', tweet) # remove http links 12 | tweet = re.sub(r'bit.ly/\S+', '', tweet) # remove bitly links 13 | tweet = re.sub(r'pic.twitter\S+', '', tweet) 14 | return tweet 15 | 16 | @staticmethod 17 | def remove_users(tweet): 18 | """Takes a string and removes retweet and @user information""" 19 | tweet = re.sub('(RT\s@[A-Za-z]+[A-Za-z0-9-_]+):*', '', tweet) # remove re-tweet 20 | tweet = re.sub('(@[A-Za-z]+[A-Za-z0-9-_]+):*', '', tweet) # remove tweeted at 21 | return tweet 22 | 23 | @staticmethod 24 | def remove_hashtags(tweet): 25 | """Takes a string and removes any hash tags""" 26 | tweet = re.sub('(#[A-Za-z]+[A-Za-z0-9-_]+)', '', tweet) # remove hash tags 27 | return tweet 28 | 29 | @staticmethod 30 | def remove_av(tweet): 31 | """Takes a string and removes AUDIO/VIDEO tags or labels""" 32 | tweet = re.sub('VIDEO:', '', tweet) # remove 'VIDEO:' from start of tweet 33 | tweet = re.sub('AUDIO:', '', tweet) # remove 'AUDIO:' from start of tweet 34 | return tweet 35 | 36 | @staticmethod 37 | def preprocess(tweet): 38 | #tweet = tweet.encode('latin1', 'ignore').decode('latin1') 39 | tweet = tweet.lower() 40 | #tweet = TweetPreprocessor.remove_users(tweet) 41 | tweet = TweetPreprocessor.remove_links(tweet) 42 | #tweet = TweetPreprocessor.remove_hashtags(tweet) 43 | tweet = TweetPreprocessor.remove_av(tweet) 44 | tweet = ' '.join(tweet.split()) # Remove extra spaces 45 | return tweet.strip() 46 | 47 | @staticmethod 48 | def get_hash_tags(tweet): 49 | return re.findall(r"#(\w+)", tweet) -------------------------------------------------------------------------------- /backend/lambdas/save_post/lambda.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 
2 | # SPDX-License-Identifier: MIT-0 3 | 4 | import boto3 5 | import os 6 | import logging 7 | import datetime 8 | import uuid 9 | import io 10 | import json 11 | import jsonlines 12 | 13 | logging.getLogger().setLevel(os.environ.get('LOG_LEVEL', 'WARNING').upper()) 14 | s3 = boto3.client('s3') 15 | 16 | 17 | def create_multi_records(insights, key, dest_key): 18 | items = io.StringIO() 19 | item = {} 20 | 21 | item['created_at'] = insights['created_at'] 22 | item['timestamp'] = insights['timestamp'] 23 | item['user'] = insights['user'] 24 | item['platform'] = insights['platform'] 25 | item['text_clean'] = insights['text_clean'] 26 | item['count'] = 1 27 | 28 | if key == 'meta_topics': 29 | item['sentiment'] = insights['sentiment'].lower().strip() 30 | 31 | if insights['process_location']: 32 | item['location'] = insights['location'] 33 | item['longitude'] = insights['longitude'] 34 | item['latitude'] = insights['latitude'] 35 | 36 | # Write as JSON lines so that Athena can read them 37 | with jsonlines.Writer(items) as writer: 38 | for insights_key in insights[key]: 39 | if key == 'meta_topics': 40 | item[dest_key] = insights_key.lower().strip() 41 | else: 42 | item[dest_key] = insights_key 43 | writer.write(item) 44 | 45 | return items.getvalue() 46 | 47 | 48 | def handler(event, context): 49 | 50 | item = event 51 | 52 | utc_now = datetime.datetime.now(datetime.timezone.utc) 53 | item['timestamp'] = utc_now.strftime('%Y-%m-%d %H:%M:%S.%f') 54 | 55 | # Add count for anomaly detection 56 | item['count'] = 1 57 | 58 | created_at = datetime.datetime.strptime(item['created_at'], '%Y-%m-%dT%H:%M:%S.%fz') 59 | item['created_at'] = created_at.strftime('%Y-%m-%d %H:%M:%S.%f') 60 | 61 | logging.info("Item to be saved") 62 | logging.info(item) 63 | 64 | #Retrive objects 1:N 65 | phrases_json_lines = create_multi_records(item, key='keyphrases', dest_key='phrase') 66 | meta_topics_json_lines = create_multi_records(item, key='meta_topics', dest_key='topic') 67 | links_json_lines = create_multi_records(item, key='links', dest_key='link') 68 | 69 | # Create partition key for S3 70 | partition_timestamp = datetime.datetime(created_at.year, created_at.month, created_at.day, 0, 0, 0) 71 | key = f"{partition_timestamp.strftime('%Y-%m-%d %H:%M:%S')}/{str(uuid.uuid1())}.json" 72 | 73 | logging.info('Key phrases') 74 | logging.info(phrases_json_lines) 75 | 76 | logging.info('Meta topics') 77 | logging.info(meta_topics_json_lines) 78 | 79 | logging.info('Links') 80 | logging.info(links_json_lines) 81 | 82 | # Save item 83 | if item['process_location']: 84 | post = {'text': item['text'], 'user': item['user'], 'created_at': item['created_at'], 'source': item['source'], 85 | 'platform': item['platform'], 'text_clean': item['text_clean'], 'topic': item['topic'].lower().strip(), 86 | 'model': item['model_id'], 'sentiment': item['sentiment'].lower().strip(), 'location': item['location'], 87 | 'longitude': item['longitude'], 'latitude': item['latitude'], 'timestamp': item['timestamp'], 'count': item['count']} 88 | else: 89 | post = {'text': item['text'], 'user': item['user'], 'created_at': item['created_at'], 'source': item['source'], 90 | 'platform': item['platform'], 'text_clean': item['text_clean'], 'topic': item['topic'].lower().strip(), 91 | 'model': item['model_id'], 'sentiment': item['sentiment'].lower().strip(), 'timestamp': item['timestamp'], 92 | 'count': item['count']} 93 | 94 | s3.put_object( 95 | Body=json.dumps(post, ensure_ascii=False), 96 | Bucket=os.environ['POSTS_BUCKET'], 97 | Key='posts/' + key 98 
| ) 99 | 100 | # Save 1:N objectsº 101 | 102 | if phrases_json_lines: 103 | # Save key phrases 104 | s3.put_object( 105 | Body=phrases_json_lines, 106 | Bucket=os.environ['POSTS_BUCKET'], 107 | Key='phrases/' + key 108 | ) 109 | 110 | if meta_topics_json_lines: 111 | # Save meta topics 112 | s3.put_object( 113 | Body=meta_topics_json_lines, 114 | Bucket=os.environ['POSTS_BUCKET'], 115 | Key='topics/' + key 116 | ) 117 | 118 | if links_json_lines: 119 | # Save links 120 | s3.put_object( 121 | Body=links_json_lines, 122 | Bucket=os.environ['POSTS_BUCKET'], 123 | Key='links/' + key 124 | ) 125 | 126 | return {'success': True} 127 | -------------------------------------------------------------------------------- /backend/lambdas/save_post/requirements.txt: -------------------------------------------------------------------------------- 1 | jsonlines==3.1.0 -------------------------------------------------------------------------------- /backend/lambdas/workflow_from_sqs/lambda.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | import boto3 5 | import logging 6 | import os 7 | 8 | from aws_lambda_powertools.utilities.batch import BatchProcessor, EventType, batch_processor 9 | from aws_lambda_powertools.utilities.data_classes.sqs_event import SQSRecord 10 | 11 | processor = BatchProcessor(event_type=EventType.SQS) 12 | 13 | logging.getLogger().setLevel(os.environ.get('LOG_LEVEL', 'WARNING').upper()) 14 | 15 | def record_handler(record: SQSRecord): 16 | 17 | message_body = record.body 18 | 19 | if message_body: 20 | logging.info('SQS message body:') 21 | logging.info(message_body) 22 | #item = json.loads(message_body, strict=False) 23 | 24 | logging.info('Starting step function processing:') 25 | 26 | client = boto3.client('stepfunctions') 27 | response = client.start_execution( 28 | stateMachineArn=os.environ['PROCESS_POST_STATE_MACHINE'], 29 | input=message_body 30 | ) 31 | 32 | logging.info('Step function response:') 33 | logging.info(response) 34 | 35 | return response 36 | 37 | 38 | @batch_processor(record_handler=record_handler, processor=processor) 39 | def handler(event, context): 40 | return processor.response() 41 | 42 | 43 | 44 | -------------------------------------------------------------------------------- /backend/lambdas/workflow_from_sqs/requirements.txt: -------------------------------------------------------------------------------- 1 | aws-lambda-powertools==1.24.1 -------------------------------------------------------------------------------- /backend/state_machine/process_post.asl.json: -------------------------------------------------------------------------------- 1 | { 2 | "Comment": "A description of my state machine", 3 | "StartAt": "Extract Insights", 4 | "States": { 5 | "Extract Insights": { 6 | "Type": "Task", 7 | "Resource": "arn:aws:states:::lambda:invoke", 8 | "OutputPath": "$.Payload", 9 | "Parameters": { 10 | "Payload.$": "$", 11 | "FunctionName": "${ExtractInisghtsFunctionArn}" 12 | }, 13 | "Retry": [ 14 | { 15 | "ErrorEquals": [ 16 | "Lambda.ServiceException", 17 | "Lambda.AWSLambdaException", 18 | "Lambda.SdkClientException" 19 | ], 20 | "IntervalSeconds": 2, 21 | "MaxAttempts": 6, 22 | "BackoffRate": 2 23 | } 24 | ], 25 | "Next": "ProcessPost?" 
26 | }, 27 | "ProcessPost?": { 28 | "Type": "Choice", 29 | "Choices": [ 30 | { 31 | "Variable": "$.process_post", 32 | "BooleanEquals": true, 33 | "Next": "LocatePost?" 34 | } 35 | ], 36 | "Default": "Success" 37 | }, 38 | "Success": { 39 | "Type": "Succeed" 40 | }, 41 | "LocatePost?": { 42 | "Type": "Choice", 43 | "Choices": [ 44 | { 45 | "Variable": "$.process_location", 46 | "BooleanEquals": true, 47 | "Next": "Locate Post" 48 | } 49 | ], 50 | "Default": "Save Post" 51 | }, 52 | "Locate Post": { 53 | "Type": "Task", 54 | "Resource": "arn:aws:states:::lambda:invoke", 55 | "OutputPath": "$.Payload", 56 | "Parameters": { 57 | "Payload.$": "$", 58 | "FunctionName": "${AddLocationFunctionArn}" 59 | }, 60 | "Retry": [ 61 | { 62 | "ErrorEquals": [ 63 | "Lambda.ServiceException", 64 | "Lambda.AWSLambdaException", 65 | "Lambda.SdkClientException" 66 | ], 67 | "IntervalSeconds": 2, 68 | "MaxAttempts": 6, 69 | "BackoffRate": 2 70 | } 71 | ], 72 | "Next": "Save Post" 73 | }, 74 | "Save Post": { 75 | "Type": "Task", 76 | "Resource": "arn:aws:states:::lambda:invoke", 77 | "OutputPath": "$.Payload", 78 | "Parameters": { 79 | "Payload.$": "$", 80 | "FunctionName": "${SavePostFunctionArn}" 81 | }, 82 | "Retry": [ 83 | { 84 | "ErrorEquals": [ 85 | "Lambda.ServiceException", 86 | "Lambda.AWSLambdaException", 87 | "Lambda.SdkClientException" 88 | ], 89 | "IntervalSeconds": 2, 90 | "MaxAttempts": 6, 91 | "BackoffRate": 2 92 | } 93 | ], 94 | "End": true 95 | } 96 | } 97 | } -------------------------------------------------------------------------------- /backend/template.yaml: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | AWSTemplateFormatVersion: 2010-09-09 5 | Transform: AWS::Serverless-2016-10-31 6 | Description: "(SO9130) AI Powered Text Insights" 7 | 8 | Parameters: 9 | AthenaProjectionRangeStart: 10 | Type: String 11 | Description: Start date of Athena Partition Projection. Only Posts after this date appear in Athena history table. 
12 | Default: 2022-01-01 00:00:00 13 | AllowedPattern: '\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}' 14 | ConstraintDescription: Must be a valid date in the yyyy-MM-dd HH:mm:SS format 15 | Labels: 16 | Type: String 17 | Default: 'Health and fitness, Humor, Personal growth, Philanthropy, Recreation and leisure, Family, friends, and, relationships, Career, Finance, Education and training, Time management' 18 | SentimentCategories: 19 | Type: String 20 | Default: 'Positive, Negative, Neutral' 21 | ModelId: 22 | Type: String 23 | Default: 'anthropic.claude-3-sonnet-20240229-v1:0' 24 | AllowedValues: 25 | - 'anthropic.claude-3-haiku-20240307-v1:0' 26 | - 'anthropic.claude-3-sonnet-20240229-v1:0' 27 | Language: 28 | Type: String 29 | Description: The language that the text to be processed is in 30 | Default: "en" 31 | Region: 32 | Type: String 33 | Description: The region for the localization services 34 | Default: "USA" 35 | AnomalyAlertThreshold: 36 | Type: Number 37 | Description: An anomaly must be beyond this threshold to trigger an alert 38 | Default: 50 39 | AnomalyDetectionFrequency: 40 | Type: String 41 | Description: The frequency at which the detector will look for anomalies 42 | Default: PT1H 43 | 44 | Resources: 45 | 46 | ########################## 47 | # Logs for state machine # 48 | ########################## 49 | ExpressLogGroup: 50 | Type: AWS::Logs::LogGroup 51 | Properties: 52 | RetentionInDays: 7 53 | 54 | ################## 55 | # KMS Athena Key # 56 | ################## 57 | AthenaSourceKmsKey: 58 | Type: 'AWS::KMS::Key' 59 | Properties: 60 | Description: Used to encrypt athena data and ADs 61 | EnableKeyRotation: true 62 | KeyPolicy: 63 | Version: '2012-10-17' 64 | Statement: 65 | - Sid: Enable IAM User Permissions 66 | Effect: Allow 67 | Principal: 68 | AWS: !Sub arn:${AWS::Partition}:iam::${AWS::AccountId}:root 69 | Action: "kms:*" 70 | Resource: '*' 71 | 72 | ############## 73 | # Amazon SNS # 74 | ############## 75 | AlertTopic: 76 | Type: AWS::SNS::Topic 77 | Properties: 78 | KmsMasterKeyId: alias/aws/sns 79 | AlertTopicPolicy: 80 | Type: AWS::SNS::TopicPolicy 81 | Properties: 82 | PolicyDocument: 83 | Statement: 84 | - Sid: AllowPublishThroughSSLOnly 85 | Action: sns:Publish 86 | Effect: Deny 87 | Principal: '*' 88 | Resource: 89 | - !Ref AlertTopic 90 | Condition: 91 | Bool: 92 | aws:SecureTransport: false 93 | Topics: 94 | - !Ref AlertTopic 95 | 96 | ############## 97 | # SQS queues # 98 | ############## 99 | PostsQueue: 100 | Type: AWS::SQS::Queue 101 | Properties: 102 | KmsMasterKeyId: alias/aws/sqs 103 | VisibilityTimeout: 200 104 | RedrivePolicy: 105 | deadLetterTargetArn: !GetAtt PostsDeadLetterQueue.Arn 106 | maxReceiveCount: 5 107 | PostsDeadLetterQueue: 108 | Type: AWS::SQS::Queue 109 | Properties: 110 | KmsMasterKeyId: alias/aws/sqs 111 | PostsQueuePolicy: 112 | Type: AWS::SQS::QueuePolicy 113 | Properties: 114 | Queues: 115 | - !Ref PostsQueue 116 | PolicyDocument: 117 | Statement: 118 | - Effect: Deny 119 | Principal: '*' 120 | Action: sqs:SendMessage 121 | Resource: !GetAtt PostsQueue.Arn 122 | Condition: 123 | Bool: 124 | aws:SecureTransport: false 125 | PostsDeadLetterQueuePolicy: 126 | Type: AWS::SQS::QueuePolicy 127 | Properties: 128 | Queues: 129 | - !Ref PostsDeadLetterQueue 130 | PolicyDocument: 131 | Statement: 132 | - Effect: Deny 133 | Principal: '*' 134 | Action: sqs:SendMessage 135 | Resource: !GetAtt PostsDeadLetterQueue.Arn 136 | Condition: 137 | Bool: 138 | aws:SecureTransport: false 139 | 140 | ############## 141 | # S3 buckets # 142 | 
############## 143 | PostsBucket: 144 | Type: AWS::S3::Bucket 145 | DeletionPolicy: Retain 146 | Properties: 147 | VersioningConfiguration: 148 | Status: Enabled 149 | BucketEncryption: 150 | ServerSideEncryptionConfiguration: 151 | - ServerSideEncryptionByDefault: 152 | SSEAlgorithm: AES256 153 | PublicAccessBlockConfiguration: 154 | BlockPublicAcls: true 155 | BlockPublicPolicy: true 156 | IgnorePublicAcls: true 157 | RestrictPublicBuckets: true 158 | OwnershipControls: 159 | Rules: 160 | - ObjectOwnership: ObjectWriter 161 | LoggingConfiguration: 162 | DestinationBucketName: !Ref LoggingBucket 163 | LogFilePrefix: "posts-bucket/" 164 | PostsBucketPolicy: 165 | Type: AWS::S3::BucketPolicy 166 | Properties: 167 | Bucket: !Ref PostsBucket 168 | PolicyDocument: 169 | Statement: 170 | - Effect: Deny 171 | Principal: "*" 172 | Action: "*" 173 | Resource: 174 | - !Sub "arn:aws:s3:::${PostsBucket}/*" 175 | - !Sub "arn:aws:s3:::${PostsBucket}" 176 | Condition: 177 | Bool: 178 | aws:SecureTransport: false 179 | AthenaResultsBucket: 180 | Type: AWS::S3::Bucket 181 | DeletionPolicy: Retain 182 | Properties: 183 | VersioningConfiguration: 184 | Status: Enabled 185 | BucketEncryption: 186 | ServerSideEncryptionConfiguration: 187 | - ServerSideEncryptionByDefault: 188 | SSEAlgorithm: AES256 189 | PublicAccessBlockConfiguration: 190 | BlockPublicAcls: true 191 | BlockPublicPolicy: true 192 | IgnorePublicAcls: true 193 | RestrictPublicBuckets: true 194 | OwnershipControls: 195 | Rules: 196 | - ObjectOwnership: ObjectWriter 197 | LoggingConfiguration: 198 | DestinationBucketName: !Ref LoggingBucket 199 | LogFilePrefix: "athena-results-bucket/" 200 | AthenaResultsBucketPolicy: 201 | Type: AWS::S3::BucketPolicy 202 | Properties: 203 | Bucket: !Ref AthenaResultsBucket 204 | PolicyDocument: 205 | Statement: 206 | - Effect: Deny 207 | Principal: "*" 208 | Action: "*" 209 | Resource: 210 | - !Sub "arn:aws:s3:::${AthenaResultsBucket}/*" 211 | - !Sub "arn:aws:s3:::${AthenaResultsBucket}" 212 | Condition: 213 | Bool: 214 | aws:SecureTransport: false 215 | LoggingBucket: 216 | Type: AWS::S3::Bucket 217 | DeletionPolicy: Retain 218 | Properties: 219 | VersioningConfiguration: 220 | Status: Enabled 221 | PublicAccessBlockConfiguration: 222 | BlockPublicAcls: true 223 | BlockPublicPolicy: true 224 | IgnorePublicAcls: true 225 | RestrictPublicBuckets: true 226 | OwnershipControls: 227 | Rules: 228 | - ObjectOwnership: ObjectWriter 229 | AccessControl: LogDeliveryWrite 230 | BucketEncryption: 231 | ServerSideEncryptionConfiguration: 232 | - ServerSideEncryptionByDefault: 233 | SSEAlgorithm: AES256 234 | Metadata: 235 | cfn_nag: 236 | rules_to_suppress: 237 | - id: W35 238 | reason: S3 Bucket access logging not needed here 239 | LoggingBucketPolicy: 240 | Type: AWS::S3::BucketPolicy 241 | Properties: 242 | Bucket: !Ref LoggingBucket 243 | PolicyDocument: 244 | Statement: 245 | - Effect: Deny 246 | Principal: "*" 247 | Action: "*" 248 | Resource: 249 | - !Sub "arn:aws:s3:::${LoggingBucket}/*" 250 | - !Sub "arn:aws:s3:::${LoggingBucket}" 251 | Condition: 252 | Bool: 253 | aws:SecureTransport: false 254 | 255 | ################################### 256 | # State Machine to classify posts # 257 | ################################### 258 | ProcessPostStateMachine: 259 | Type: AWS::Serverless::StateMachine 260 | Properties: 261 | Type: EXPRESS 262 | Logging: 263 | Level: ALL 264 | IncludeExecutionData: True 265 | Destinations: 266 | - CloudWatchLogsLogGroup: 267 | LogGroupArn: !GetAtt ExpressLogGroup.Arn 268 | DefinitionUri: 
state_machine/process_post.asl.json 269 | DefinitionSubstitutions: 270 | ExtractInisghtsFunctionArn: !GetAtt ExtractInsightsFunction.Arn 271 | AddLocationFunctionArn: !GetAtt AddLocationFunction.Arn 272 | SavePostFunctionArn: !GetAtt SavePostFunction.Arn 273 | Policies: 274 | - LambdaInvokePolicy: 275 | FunctionName: !Ref ExtractInsightsFunction 276 | - LambdaInvokePolicy: 277 | FunctionName: !Ref AddLocationFunction 278 | - LambdaInvokePolicy: 279 | FunctionName: !Ref SavePostFunction 280 | - Version: 2012-10-17 281 | Statement: 282 | - Effect: Allow 283 | Action: "logs:*" 284 | Resource: "*" 285 | 286 | ################################################################## 287 | # SQS queue Lambda function consumer (invokes the state machine) # 288 | ################################################################## 289 | TriggerOnSQSFunction: 290 | Type: AWS::Serverless::Function 291 | Properties: 292 | Runtime: python3.12 293 | Timeout: 180 294 | Handler: lambda.handler 295 | CodeUri: lambdas/workflow_from_sqs/ 296 | Policies: 297 | - SQSPollerPolicy: 298 | QueueName: !GetAtt PostsQueue.QueueName 299 | - StepFunctionsExecutionPolicy: 300 | StateMachineName: !GetAtt ProcessPostStateMachine.Name 301 | Environment: 302 | Variables: 303 | LOG_LEVEL: info 304 | PROCESS_POST_STATE_MACHINE: !Ref ProcessPostStateMachine 305 | Events: 306 | SQSEvent: 307 | Type: SQS 308 | Properties: 309 | Queue: !GetAtt PostsQueue.Arn 310 | BatchSize: 3 311 | FunctionResponseTypes: 312 | - ReportBatchItemFailures 313 | 314 | ######################################################## 315 | # Function to process the post # 316 | ######################################################## 317 | ExtractInsightsFunction: 318 | Type: AWS::Serverless::Function 319 | Properties: 320 | Runtime: python3.12 321 | Timeout: 180 322 | Handler: lambda.handler 323 | CodeUri: lambdas/process_post/ 324 | Policies: 325 | - Version: 2012-10-17 326 | Statement: 327 | - Effect: Allow 328 | Action: "bedrock:InvokeModel" 329 | Resource: "*" 330 | Environment: 331 | Variables: 332 | MODEL_ID: !Ref ModelId 333 | LABELS: !Ref Labels 334 | LOG_LEVEL: info 335 | LANGUAGE_CODE: !Ref Language 336 | SENTIMENT_LABELS: !Ref SentimentCategories 337 | 338 | ################################################ 339 | # Function to obtain coordinates from the post # 340 | ################################################ 341 | AddLocationFunction: 342 | Type: AWS::Serverless::Function 343 | Properties: 344 | Runtime: python3.12 345 | Timeout: 180 346 | Handler: lambda.handler 347 | CodeUri: lambdas/locate_post/ 348 | Policies: 349 | - Version: 2012-10-17 350 | Statement: 351 | - Effect: Allow 352 | Action: "geo:SearchPlaceIndexForText" 353 | Resource: 354 | - !GetAtt PlaceIndex.Arn 355 | Environment: 356 | Variables: 357 | LANGUAGE: !Ref Language 358 | GEO_REGION: !Ref Region 359 | PLACE_INDEX_NAME: !Ref PlaceIndex 360 | LOG_LEVEL: info 361 | 362 | ############################################## 363 | # Function to save the post and its metadata # 364 | ############################################## 365 | SavePostFunction: 366 | Type: AWS::Serverless::Function 367 | Properties: 368 | Runtime: python3.12 369 | Timeout: 180 370 | Handler: lambda.handler 371 | CodeUri: lambdas/save_post/ 372 | Policies: 373 | - S3WritePolicy: 374 | BucketName: !Ref PostsBucket 375 | Environment: 376 | Variables: 377 | POSTS_BUCKET: !Ref PostsBucket 378 | LOG_LEVEL: info 379 | 380 | ######################################################################## 381 | # Lambda to execute 
custom resource to create the L4M anomaly detector # 382 | ######################################################################## 383 | CreateL4MDetectorFunction: 384 | Type: AWS::Serverless::Function 385 | Properties: 386 | Runtime: python3.12 387 | Timeout: 180 388 | Handler: lambda.handler 389 | CodeUri: lambdas/custom_resources/lookout4metrics/ 390 | Policies: 391 | # Lookout for metrics policy (required for creating the detector) 392 | - Version: 2012-10-17 393 | Statement: 394 | - Effect: Allow 395 | Action: 396 | - lookoutmetrics:CreateAnomalyDetector 397 | - lookoutmetrics:CreateAlert 398 | - lookoutmetrics:ActivateAnomalyDetector 399 | - lookoutmetrics:CreateMetricSet 400 | Resource: 401 | - !Sub "arn:aws:lookoutmetrics:*:${AWS::AccountId}:MetricSet/*/*" 402 | - !Sub "arn:aws:lookoutmetrics:*:${AWS::AccountId}:Alert:*" 403 | - !Sub "arn:aws:lookoutmetrics:*:${AWS::AccountId}:AnomalyDetector:*" 404 | - !Sub "arn:aws:lookoutmetrics:*:${AWS::AccountId}:*" 405 | - Effect: Allow 406 | Action: "iam:PassRole" 407 | Resource: 408 | - !GetAtt AthenaSourceAccessRole.Arn 409 | - !GetAtt SnsPublishRole.Arn 410 | 411 | ####################################################################### 412 | # Custom resource to create anomaly detector with lookout for metrics # 413 | ####################################################################### 414 | CreateLookoutMetricsResource: 415 | Type: Custom::CreateLookoutMetrics 416 | DependsOn: 417 | - AthenaSourceAccessRole 418 | - SnsPublishRole 419 | - AlertTopic 420 | - AthenaWorkGroup 421 | - GlueDatabase 422 | - GlueTopicsTable 423 | Properties: 424 | ServiceToken: !GetAtt CreateL4MDetectorFunction.Arn 425 | Target: 426 | AthenaRoleArn: !GetAtt AthenaSourceAccessRole.Arn 427 | SnsRoleArn: !GetAtt SnsPublishRole.Arn 428 | SnsTopicArn: !Ref AlertTopic 429 | AlertThreshold: !Ref AnomalyAlertThreshold 430 | AthenaWorkgroupName: !Ref AthenaWorkGroup 431 | GlueDbName: !Ref GlueDatabase 432 | GlueTableName: !Ref GlueTopicsTable 433 | AwsDataCatalog: AwsDataCatalog 434 | DetectorFrequency: !Ref AnomalyDetectionFrequency 435 | Body: | 436 | 437 | 438 | ################ 439 | # Athena setup # 440 | ################ 441 | AthenaWorkGroup: 442 | Type: AWS::Athena::WorkGroup 443 | Properties: 444 | Name: !Sub "${AWS::StackName}-PostsWorkGroup" 445 | WorkGroupConfiguration: 446 | EnforceWorkGroupConfiguration: false 447 | ResultConfiguration: 448 | EncryptionConfiguration: 449 | EncryptionOption: SSE_KMS 450 | KmsKey: !Ref AthenaSourceKmsKey 451 | OutputLocation: !Sub "s3://${AthenaResultsBucket}/" 452 | GlueDatabase: 453 | Type: AWS::Glue::Database 454 | Properties: 455 | CatalogId: !Ref "AWS::AccountId" 456 | DatabaseInput: 457 | Name: !Sub "${AWS::StackName}_db" 458 | GluePostsTable: 459 | Type: AWS::Glue::Table 460 | Properties: 461 | CatalogId: !Ref "AWS::AccountId" 462 | DatabaseName: !Ref GlueDatabase 463 | TableInput: 464 | Name: posts 465 | PartitionKeys: 466 | - Name: partition_timestamp 467 | Type: timestamp 468 | StorageDescriptor: 469 | Columns: 470 | - Name: longitude 471 | Type: float 472 | - Name: latitude 473 | Type: float 474 | - Name: location 475 | Type: string 476 | - Name: topic 477 | Type: string 478 | - Name: sentiment 479 | Type: string 480 | - Name: created_at 481 | Type: timestamp 482 | - Name: model 483 | Type: string 484 | - Name: notification 485 | Type: boolean 486 | - Name: timestamp 487 | Type: timestamp 488 | - Name: text 489 | Type: string 490 | - Name: text_clean 491 | Type: string 492 | - Name: user 493 | Type: string 494 
| - Name: source 495 | Type: string 496 | - Name: count 497 | Type: tinyint 498 | - Name: platform 499 | Type: string 500 | Compressed: False 501 | Location: !Sub "s3://${PostsBucket}/posts" 502 | InputFormat: org.apache.hadoop.mapred.TextInputFormat 503 | OutputFormat: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat 504 | SerdeInfo: 505 | SerializationLibrary: org.apache.hive.hcatalog.data.JsonSerDe 506 | Parameters: 507 | "projection.enabled": "true" 508 | "projection.partition_timestamp.type": "date" 509 | "projection.partition_timestamp.format": "yyyy-MM-dd HH:mm:SS" 510 | "projection.partition_timestamp.range": !Sub "${AthenaProjectionRangeStart},NOW+1DAY" 511 | "projection.partition_timestamp.interval": "15" 512 | "projection.partition_timestamp.interval.unit": "MINUTES" 513 | "storage.location.template": !Sub "s3://${PostsBucket}/posts/${!partition_timestamp}/" 514 | TableType: EXTERNAL_TABLE 515 | GluePhrasesTable: 516 | Type: AWS::Glue::Table 517 | Properties: 518 | CatalogId: !Ref "AWS::AccountId" 519 | DatabaseName: !Ref GlueDatabase 520 | TableInput: 521 | Name: phrases 522 | PartitionKeys: 523 | - Name: partition_timestamp 524 | Type: timestamp 525 | StorageDescriptor: 526 | Columns: 527 | - Name: created_at 528 | Type: timestamp 529 | - Name: timestamp 530 | Type: timestamp 531 | - Name: text_clean 532 | Type: string 533 | - Name: user 534 | Type: string 535 | - Name: platform 536 | Type: string 537 | - Name: phrase 538 | Type: string 539 | - Name: count 540 | Type: tinyint 541 | Compressed: False 542 | Location: !Sub "s3://${PostsBucket}/phrases" 543 | InputFormat: org.apache.hadoop.mapred.TextInputFormat 544 | OutputFormat: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat 545 | SerdeInfo: 546 | SerializationLibrary: org.apache.hive.hcatalog.data.JsonSerDe 547 | Parameters: 548 | "projection.enabled": "true" 549 | "projection.partition_timestamp.type": "date" 550 | "projection.partition_timestamp.format": "yyyy-MM-dd HH:mm:SS" 551 | "projection.partition_timestamp.range": !Sub "${AthenaProjectionRangeStart},NOW+1DAY" 552 | "projection.partition_timestamp.interval": "1" 553 | "projection.partition_timestamp.interval.unit": "DAYS" 554 | "storage.location.template": !Sub "s3://${PostsBucket}/phrases/${!partition_timestamp}/" 555 | TableType: EXTERNAL_TABLE 556 | GlueTopicsTable: 557 | Type: AWS::Glue::Table 558 | Properties: 559 | CatalogId: !Ref "AWS::AccountId" 560 | DatabaseName: !Ref GlueDatabase 561 | TableInput: 562 | Name: topics 563 | PartitionKeys: 564 | - Name: partition_timestamp 565 | Type: timestamp 566 | StorageDescriptor: 567 | Columns: 568 | - Name: created_at 569 | Type: timestamp 570 | - Name: timestamp 571 | Type: timestamp 572 | - Name: text_clean 573 | Type: string 574 | - Name: user 575 | Type: string 576 | - Name: platform 577 | Type: string 578 | - Name: topic 579 | Type: string 580 | - Name: sentiment 581 | Type: string 582 | - Name: longitude 583 | Type: float 584 | - Name: latitude 585 | Type: float 586 | - Name: location 587 | Type: string 588 | - Name: count 589 | Type: tinyint 590 | Compressed: False 591 | Location: !Sub "s3://${PostsBucket}/topics" 592 | InputFormat: org.apache.hadoop.mapred.TextInputFormat 593 | OutputFormat: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat 594 | SerdeInfo: 595 | SerializationLibrary: org.apache.hive.hcatalog.data.JsonSerDe 596 | Parameters: 597 | "projection.enabled": "true" 598 | "projection.partition_timestamp.type": "date" 599 | "projection.partition_timestamp.format": "yyyy-MM-dd HH:mm:SS" 
600 | "projection.partition_timestamp.range": !Sub "${AthenaProjectionRangeStart},NOW+1DAY" 601 | "projection.partition_timestamp.interval": "1" 602 | "projection.partition_timestamp.interval.unit": "DAYS" 603 | "storage.location.template": !Sub "s3://${PostsBucket}/topics/${!partition_timestamp}/" 604 | TableType: EXTERNAL_TABLE 605 | GlueLinksTable: 606 | Type: AWS::Glue::Table 607 | Properties: 608 | CatalogId: !Ref "AWS::AccountId" 609 | DatabaseName: !Ref GlueDatabase 610 | TableInput: 611 | Name: links 612 | PartitionKeys: 613 | - Name: partition_timestamp 614 | Type: timestamp 615 | StorageDescriptor: 616 | Columns: 617 | - Name: created_at 618 | Type: timestamp 619 | - Name: timestamp 620 | Type: timestamp 621 | - Name: text_clean 622 | Type: string 623 | - Name: user 624 | Type: string 625 | - Name: platform 626 | Type: string 627 | - Name: link 628 | Type: string 629 | - Name: count 630 | Type: tinyint 631 | Compressed: False 632 | Location: !Sub "s3://${PostsBucket}/links" 633 | InputFormat: org.apache.hadoop.mapred.TextInputFormat 634 | OutputFormat: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat 635 | SerdeInfo: 636 | SerializationLibrary: org.apache.hive.hcatalog.data.JsonSerDe 637 | Parameters: 638 | "projection.enabled": "true" 639 | "projection.partition_timestamp.type": "date" 640 | "projection.partition_timestamp.format": "yyyy-MM-dd HH:mm:SS" 641 | "projection.partition_timestamp.range": !Sub "${AthenaProjectionRangeStart},NOW+1DAY" 642 | "projection.partition_timestamp.interval": "1" 643 | "projection.partition_timestamp.interval.unit": "DAYS" 644 | "storage.location.template": !Sub "s3://${PostsBucket}/links/${!partition_timestamp}/" 645 | TableType: EXTERNAL_TABLE 646 | 647 | ######################################### 648 | # Athena policy for Lookout for Metrics # 649 | ######################################### 650 | AthenaSourceAccessRole: 651 | Type: AWS::IAM::Role 652 | Properties: 653 | AssumeRolePolicyDocument: 654 | Statement: 655 | - Action: ['sts:AssumeRole'] 656 | Effect: Allow 657 | Principal: 658 | Service: ['lookoutmetrics.amazonaws.com'] 659 | AWS: !Sub "${AWS::AccountId}" 660 | Version: '2012-10-17' 661 | Policies: 662 | - PolicyDocument: 663 | Statement: 664 | - Action: 665 | - 'glue:GetCatalogImportStatus' 666 | - 'glue:GetDatabase' 667 | - 'glue:GetTable' 668 | Effect: Allow 669 | Resource: 670 | - !Sub "arn:${AWS::Partition}:glue:${AWS::Region}:${AWS::AccountId}:catalog" 671 | - !Sub "arn:${AWS::Partition}:glue:${AWS::Region}:${AWS::AccountId}:database/${GlueDatabase}" 672 | - !Sub "arn:${AWS::Partition}:glue:${AWS::Region}:${AWS::AccountId}:table/${GlueDatabase}/*" 673 | - Action: 674 | - 's3:GetObject' 675 | - 's3:ListBucket' 676 | - 's3:PutObject' 677 | - 's3:GetBucketLocation' 678 | - 's3:ListBucketMultipartUploads' 679 | - 's3:ListMultipartUploadParts' 680 | - 's3:AbortMultipartUpload' 681 | Effect: Allow 682 | Resource: 683 | - !Sub "arn:${AWS::Partition}:s3:::${AthenaResultsBucket}" 684 | - !Sub "arn:${AWS::Partition}:s3:::${AthenaResultsBucket}/*" 685 | - !Sub "arn:${AWS::Partition}:s3:::${PostsBucket}" 686 | - !Sub "arn:${AWS::Partition}:s3:::${PostsBucket}/*" 687 | - Action: 688 | - 'athena:CreatePreparedStatement' # On WG 689 | - 'athena:DeletePreparedStatement' # On WG 690 | - 'athena:GetDatabase' 691 | - 'athena:GetPreparedStatement' # On WG 692 | - 'athena:GetQueryExecution' 693 | - 'athena:GetQueryResults' # On WG 694 | - 'athena:GetQueryResultsStream' 695 | - 'athena:GetTableMetadata' 696 | - 'athena:GetWorkGroup' 697 | - 
'athena:StartQueryExecution' 698 | Effect: Allow 699 | Resource: 700 | - !Sub "arn:${AWS::Partition}:athena:${AWS::Region}:${AWS::AccountId}:datacatalog/*" 701 | - !Sub "arn:${AWS::Partition}:athena:${AWS::Region}:${AWS::AccountId}:workgroup/primary" 702 | - !Sub "arn:${AWS::Partition}:athena:${AWS::Region}:${AWS::AccountId}:workgroup/${AthenaWorkGroup}" 703 | - Action: [ 'kms:GenerateDataKey', 'kms:Decrypt' ] 704 | Effect: Allow 705 | Resource: 706 | - !GetAtt AthenaSourceKmsKey.Arn 707 | Version: '2012-10-17' 708 | PolicyName: 'AthenaAccessPolicy' 709 | 710 | ###################################### 711 | # SNS policy for Lookout for Metrics # 712 | ###################################### 713 | SnsPublishRole: 714 | Type: AWS::IAM::Role 715 | Properties: 716 | AssumeRolePolicyDocument: 717 | Statement: 718 | - Action: ['sts:AssumeRole'] 719 | Effect: Allow 720 | Principal: 721 | Service: ['lookoutmetrics.amazonaws.com'] 722 | AWS: !Sub "${AWS::AccountId}" 723 | Version: '2012-10-17' 724 | Policies: 725 | - PolicyDocument: 726 | Statement: 727 | - Action: 728 | - 'sns:Publish' 729 | Effect: Allow 730 | Resource: 731 | - !Ref AlertTopic 732 | Version: '2012-10-17' 733 | PolicyName: 'AthenaAccessPolicy' 734 | 735 | ############################################ 736 | # Place index creation (Location services) # 737 | ############################################ 738 | PlaceIndex: 739 | Type: AWS::Location::PlaceIndex 740 | Properties: 741 | DataSource: Esri 742 | Description: Place Index 743 | IndexName: !Sub 'PlaceIndex-${AWS::StackName}' 744 | 745 | 746 | Outputs: 747 | AwsRegion: 748 | Value: !Ref 'AWS::Region' 749 | PostsQueueUrl: 750 | Value: !Ref PostsQueue 751 | PostsQueueArn: 752 | Value: !GetAtt PostsQueue.Arn 753 | AlertTopicArn: 754 | Value: !Ref AlertTopic 755 | PostsBucketName: 756 | Value: !Ref PostsBucket 757 | PlaceIndexName: 758 | Value: !Ref PlaceIndex 759 | AthenaAccessRoleARN: 760 | Value: !GetAtt AthenaSourceAccessRole.Arn 761 | SnsPublishRoleARN: 762 | Value: !GetAtt SnsPublishRole.Arn 763 | KMSAthenaKey: 764 | Value: !Ref AthenaSourceKmsKey 765 | -------------------------------------------------------------------------------- /data-streammer/mexico_big_cities.csv: -------------------------------------------------------------------------------- 1 | Ciudad,Entidad Federativa 2 | , 3 | Ciudad de México, Distrito Federal 4 | Ecatepec,México 5 | Guadalajara,Jalisco 6 | Puebla,Puebla 7 | Juárez,Chihuahua 8 | Tijuana,Baja California 9 | León,Guanajuato 10 | Zapopan,Jalisco 11 | Monterrey,Nuevo León 12 | Nezahualcóyotl,México 13 | Chihuahua,Chihuahua 14 | Naucalpan,México 15 | Mérida,Yucatán 16 | San Luis Potosí,San Luis Potosí 17 | Aguascalientes,Aguascalientes 18 | Hermosillo,Sonora 19 | Saltillo,Coahuila 20 | Mexicali,Baja California 21 | Culiacán,Sinaloa 22 | Guadalupe,Nuevo León 23 | Acapulco,Guerrero 24 | Tlalnepantla,México 25 | Cancún,Quintana Roo 26 | Querétaro,Querétaro 27 | Chimalhuacán,México 28 | Torreón,Coahuila 29 | Morelia,Michoacán 30 | Reynosa,Tamaulipas 31 | Tlaquepaque,Jalisco 32 | Tuxtla,Chiapas 33 | Durango,Durango 34 | Toluca,México 35 | Ciudad López Mateos,México 36 | Cuautitlán Izcalli,México 37 | Apodaca,Nuevo León 38 | Matamoros,Tamaulipas 39 | San Nicolás de los Garza,Nuevo León 40 | Veracruz,Veracruz 41 | Xalapa,Veracruz 42 | Tonalá,Jalisco 43 | Mazatlán,Sinaloa 44 | Irapuato,Guanajuato 45 | Nuevo Laredo,Tamaulipas 46 | Xico,México 47 | Villahermosa,Tabasco 48 | General Escobedo,Nuevo León 49 | Celaya,Guanajuato 50 | Cuernavaca,Morelos 51 | 
Tepic,Nayarit 52 | Ixtapaluca,México 53 | Ciudad Victoria,Tamaulipas 54 | Ciudad Obregón,Sonora 55 | Tampico,Tamaulipas 56 | Villa Nicolás Romero,México 57 | Ensenada,Baja California 58 | San Francisco Coacalco,México 59 | Santa Catarina,Nuevo León 60 | Uruapan,Michoacán 61 | Gómez Palacio,Durango 62 | Los Mochis,Sinaloa 63 | Pachuca,Hidalgo 64 | Oaxaca,Oaxaca 65 | Soledad de Graciano Sánchez,San Luis Potosí 66 | Tehuacán,Puebla 67 | Ojo de Agua,México 68 | Coatzacoalcos,Veracruz 69 | Campeche,Campeche 70 | Monclova,Coahuila 71 | La Paz,Baja California Sur 72 | Nogales,Sonora 73 | Buenavista,México 74 | Puerto Vallarta,Jalisco 75 | Tapachula,Chiapas 76 | Ciudad Madero,Tamaulipas 77 | San Pablo de las Salinas,México 78 | Chilpancingo,Guerrero 79 | Poza Rica de Hidalgo,Veracruz 80 | Chicoloapan,México 81 | Ciudad del Carmen,Campeche 82 | Chalco,México 83 | Jiutepec,Morelos 84 | Salamanca,Guanajuato 85 | San Luis Río Colorado,Sonora 86 | San Cristóbal de las Casas,Chiapas 87 | Minatitlán,Veracruz 88 | Cuautla,Morelos 89 | Juárez,Nuevo León 90 | Chetumal,Quintana Roo 91 | Piedras Negras,Coahuila 92 | Playa del Carmen,Quintana Roo 93 | Zamora,Michoacán 94 | Córdoba,Veracruz 95 | San Juan del Río,Querétaro 96 | Colima,Colima 97 | Ciudad Acuña,Coahuila 98 | Manzanillo,Colima 99 | Zacatecas,Zacatecas 100 | Veracruz,Veracruz 101 | Ciudad Valles,San Luis Potosí 102 | Guadalupe,Zacatecas 103 | San Pedro Garza García,Nuevo León 104 | Naucalpan de Juárez,México 105 | Fresnillo,Zacatecas 106 | Orizaba,Veracruz 107 | Miramar,Tamaulipas 108 | Iguala,Guerrero 109 | Delicias,Chihuahua 110 | Villa de Álvarez,Colima 111 | Cuauhtémoc,Chihuahua 112 | Navojoa,Sonora 113 | Guaymas,Sonora 114 | Cuautitlán,México 115 | Texcoco de Mora,México 116 | Hidalgo del Parral,Chihuahua 117 | Tepexpan,México 118 | Tulancingo de Bravo,Hidalgo 119 | San Juan Bautista Tuxtepec,Oaxaca 120 | , 121 | , 122 | , 123 | , 124 | -------------------------------------------------------------------------------- /data-streammer/stream_posts.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | import json 5 | import time 6 | import boto3 7 | import argparse 8 | import pandas as pd 9 | import random 10 | from datetime import datetime 11 | 12 | parser = argparse.ArgumentParser(description='Stream sample posts to an Amazon SQS Queue for processing') 13 | parser.add_argument('--queue_url', type=str, help='Amazon SQS Queue URL') 14 | parser.add_argument('--region', type=str, help='The region where the AI Powered Text Insights is deployed') 15 | 16 | sample_cities = ['New York', 'Chicago', 'Kentucky', 'Arkansas', 'Miami', 'Los Angeles', 17 | 'San Francisco', 'New Jersey', 'San Diego', 'Orlando', 'Arlington', 'Washington DC', 18 | 'Las Vegas', 'Portland', 'Seattle', 'Austin', 'Phoenix'] 19 | 20 | # Create a function to send a JSON object to an SQS queue 21 | def send_sqs_message(queue_url, message): 22 | response = sqs.send_message( 23 | QueueUrl=queue_url, 24 | MessageBody=json.dumps(message) 25 | ) 26 | return response 27 | 28 | 29 | def send_batch(posts_batch, queue_url): 30 | 31 | for text in posts_batch: 32 | 33 | utc_now_str = datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%S.%fz") 34 | 35 | # Add a location randomly. 36 | if random.uniform(0, 1) < 0.1: 37 | text = text + '. 
From ' + random.choice(sample_cities) 38 | 39 | message = { 40 | "text": text, 41 | "user": "test_user", 42 | "created_at": utc_now_str, 43 | "source": "test", 44 | "platform": "test client" 45 | } 46 | 47 | print(message) 48 | response = send_sqs_message(queue_url, message) 49 | print(response) 50 | 51 | 52 | if __name__ == '__main__': 53 | 54 | args = parser.parse_args() 55 | 56 | queue_url = args.queue_url 57 | region = args.region 58 | 59 | sqs = boto3.client('sqs') 60 | translate = boto3.client(service_name='translate', region_name=region, use_ssl=True) 61 | 62 | new_years_resolutions_df = pd.read_csv('new_years_resolutions_tweets.csv') 63 | new_years_resolutions_df['text'] = new_years_resolutions_df['text'].map(lambda x: x.encode(encoding='UTF-8', errors='replace').decode()) 64 | 65 | resolutions_text = new_years_resolutions_df['text'].values.tolist() 66 | 67 | #Use Amazon translate to translate the text into Spanish 68 | delta_i = 20 69 | for i in range(0, len(resolutions_text), delta_i): 70 | 71 | end_i = i + delta_i if i + delta_i < len(resolutions_text) else len(resolutions_text) 72 | text_batch = resolutions_text[i:end_i] 73 | 74 | print(f'Streaming from {i} to {end_i}') 75 | 76 | send_batch(text_batch, queue_url) 77 | 78 | time.sleep(60) -------------------------------------------------------------------------------- /sample-files/phrases/2022-09-12/cb1bc907-32b8-11ed-a39b-010c818c1d51.json: -------------------------------------------------------------------------------- 1 | {"created_at": "2022-09-12 16:33:45.000000", "timestamp": "2022-09-12 16:34:38.130057", "user": "AmazonScience", "platform": "Twitter", "text_clean": "rt @parmidabeigi: amazon's 20b language model has fewer than 1/8 the number of parameters of gpt-3, yet outperforms it in several nlp tasks…", "count": 1, "phrase": "the number"} 2 | {"created_at": "2022-09-12 16:33:45.000000", "timestamp": "2022-09-12 16:34:38.130057", "user": "AmazonScience", "platform": "Twitter", "text_clean": "rt @parmidabeigi: amazon's 20b language model has fewer than 1/8 the number of parameters of gpt-3, yet outperforms it in several nlp tasks…", "count": 1, "phrase": "amazon"} 3 | {"created_at": "2022-09-12 16:33:45.000000", "timestamp": "2022-09-12 16:34:38.130057", "user": "AmazonScience", "platform": "Twitter", "text_clean": "rt @parmidabeigi: amazon's 20b language model has fewer than 1/8 the number of parameters of gpt-3, yet outperforms it in several nlp tasks…", "count": 1, "phrase": "20b language model"} 4 | {"created_at": "2022-09-12 16:33:45.000000", "timestamp": "2022-09-12 16:34:38.130057", "user": "AmazonScience", "platform": "Twitter", "text_clean": "rt @parmidabeigi: amazon's 20b language model has fewer than 1/8 the number of parameters of gpt-3, yet outperforms it in several nlp tasks…", "count": 1, "phrase": "gpt-3"} 5 | {"created_at": "2022-09-12 16:33:45.000000", "timestamp": "2022-09-12 16:34:38.130057", "user": "AmazonScience", "platform": "Twitter", "text_clean": "rt @parmidabeigi: amazon's 20b language model has fewer than 1/8 the number of parameters of gpt-3, yet outperforms it in several nlp tasks…", "count": 1, "phrase": "parameters"} 6 | {"created_at": "2022-09-12 16:33:45.000000", "timestamp": "2022-09-12 16:34:38.130057", "user": "AmazonScience", "platform": "Twitter", "text_clean": "rt @parmidabeigi: amazon's 20b language model has fewer than 1/8 the number of parameters of gpt-3, yet outperforms it in several nlp tasks…", "count": 1, "phrase": "tasks…"} 7 | {"created_at": "2022-09-12 
16:33:45.000000", "timestamp": "2022-09-12 16:34:38.130057", "user": "AmazonScience", "platform": "Twitter", "text_clean": "rt @parmidabeigi: amazon's 20b language model has fewer than 1/8 the number of parameters of gpt-3, yet outperforms it in several nlp tasks…", "count": 1, "phrase": "fewer than 1/8"} 8 | {"created_at": "2022-09-12 16:33:45.000000", "timestamp": "2022-09-12 16:34:38.130057", "user": "AmazonScience", "platform": "Twitter", "text_clean": "rt @parmidabeigi: amazon's 20b language model has fewer than 1/8 the number of parameters of gpt-3, yet outperforms it in several nlp tasks…", "count": 1, "phrase": "rt @parmidabeigi"} 9 | -------------------------------------------------------------------------------- /sample-files/phrases/2022-09-12/f4491b0d-32dd-11ed-96e7-1dd85b3a04d7.json: -------------------------------------------------------------------------------- 1 | {"created_at": "2022-09-12 20:59:49.000000", "timestamp": "2022-09-12 21:00:38.580243", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "what's with aws security services? #awsfirewallmanager adds support for aws waf custom requests and responses #awssecurityhub launches new security best practice control details on these & more security feature updates:", "count": 1, "phrase": "support"} 2 | {"created_at": "2022-09-12 20:59:49.000000", "timestamp": "2022-09-12 21:00:38.580243", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "what's with aws security services? #awsfirewallmanager adds support for aws waf custom requests and responses #awssecurityhub launches new security best practice control details on these & more security feature updates:", "count": 1, "phrase": "awssecurityhub"} 3 | {"created_at": "2022-09-12 20:59:49.000000", "timestamp": "2022-09-12 21:00:38.580243", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "what's with aws security services? #awsfirewallmanager adds support for aws waf custom requests and responses #awssecurityhub launches new security best practice control details on these & more security feature updates:", "count": 1, "phrase": "aws security services"} 4 | {"created_at": "2022-09-12 20:59:49.000000", "timestamp": "2022-09-12 21:00:38.580243", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "what's with aws security services? #awsfirewallmanager adds support for aws waf custom requests and responses #awssecurityhub launches new security best practice control details on these & more security feature updates:", "count": 1, "phrase": "new security"} 5 | {"created_at": "2022-09-12 20:59:49.000000", "timestamp": "2022-09-12 21:00:38.580243", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "what's with aws security services? #awsfirewallmanager adds support for aws waf custom requests and responses #awssecurityhub launches new security best practice control details on these & more security feature updates:", "count": 1, "phrase": "these & more security feature updates"} 6 | {"created_at": "2022-09-12 20:59:49.000000", "timestamp": "2022-09-12 21:00:38.580243", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "what's with aws security services? 
#awsfirewallmanager adds support for aws waf custom requests and responses #awssecurityhub launches new security best practice control details on these & more security feature updates:", "count": 1, "phrase": "best practice control details"} 7 | {"created_at": "2022-09-12 20:59:49.000000", "timestamp": "2022-09-12 21:00:38.580243", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "what's with aws security services? #awsfirewallmanager adds support for aws waf custom requests and responses #awssecurityhub launches new security best practice control details on these & more security feature updates:", "count": 1, "phrase": "awsfirewallmanager"} 8 | {"created_at": "2022-09-12 20:59:49.000000", "timestamp": "2022-09-12 21:00:38.580243", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "what's with aws security services? #awsfirewallmanager adds support for aws waf custom requests and responses #awssecurityhub launches new security best practice control details on these & more security feature updates:", "count": 1, "phrase": "aws waf custom requests and responses"} 9 | {"created_at": "2022-09-12 20:59:49.000000", "timestamp": "2022-09-12 21:00:38.580243", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "what's with aws security services? #awsfirewallmanager adds support for aws waf custom requests and responses #awssecurityhub launches new security best practice control details on these & more security feature updates:", "count": 1, "phrase": "#awsfirewallmanager"} 10 | -------------------------------------------------------------------------------- /sample-files/phrases/2022-09-13/0e79b260-33a6-11ed-a6f9-adab8308bd5b.json: -------------------------------------------------------------------------------- 1 | {"created_at": "2022-09-13 20:52:06.000000", "timestamp": "2022-09-13 20:53:01.877685", "user": "davidlaredo1", "platform": "Twitter", "text_clean": "accelerate innovation without sacrificing security. learn how migrating to the cloud can help decrease time to market, limit downtime & reduce cost. #aws #cloudcomputing #developer", "count": 1, "phrase": "cloudcomputing"} 2 | {"created_at": "2022-09-13 20:52:06.000000", "timestamp": "2022-09-13 20:53:01.877685", "user": "davidlaredo1", "platform": "Twitter", "text_clean": "accelerate innovation without sacrificing security. learn how migrating to the cloud can help decrease time to market, limit downtime & reduce cost. #aws #cloudcomputing #developer", "count": 1, "phrase": "market"} 3 | {"created_at": "2022-09-13 20:52:06.000000", "timestamp": "2022-09-13 20:53:01.877685", "user": "davidlaredo1", "platform": "Twitter", "text_clean": "accelerate innovation without sacrificing security. learn how migrating to the cloud can help decrease time to market, limit downtime & reduce cost. #aws #cloudcomputing #developer", "count": 1, "phrase": "the cloud"} 4 | {"created_at": "2022-09-13 20:52:06.000000", "timestamp": "2022-09-13 20:53:01.877685", "user": "davidlaredo1", "platform": "Twitter", "text_clean": "accelerate innovation without sacrificing security. learn how migrating to the cloud can help decrease time to market, limit downtime & reduce cost. #aws #cloudcomputing #developer", "count": 1, "phrase": "decrease time"} 5 | {"created_at": "2022-09-13 20:52:06.000000", "timestamp": "2022-09-13 20:53:01.877685", "user": "davidlaredo1", "platform": "Twitter", "text_clean": "accelerate innovation without sacrificing security. 
learn how migrating to the cloud can help decrease time to market, limit downtime & reduce cost. #aws #cloudcomputing #developer", "count": 1, "phrase": "developer"} 6 | {"created_at": "2022-09-13 20:52:06.000000", "timestamp": "2022-09-13 20:53:01.877685", "user": "davidlaredo1", "platform": "Twitter", "text_clean": "accelerate innovation without sacrificing security. learn how migrating to the cloud can help decrease time to market, limit downtime & reduce cost. #aws #cloudcomputing #developer", "count": 1, "phrase": "security"} 7 | {"created_at": "2022-09-13 20:52:06.000000", "timestamp": "2022-09-13 20:53:01.877685", "user": "davidlaredo1", "platform": "Twitter", "text_clean": "accelerate innovation without sacrificing security. learn how migrating to the cloud can help decrease time to market, limit downtime & reduce cost. #aws #cloudcomputing #developer", "count": 1, "phrase": "downtime &"} 8 | {"created_at": "2022-09-13 20:52:06.000000", "timestamp": "2022-09-13 20:53:01.877685", "user": "davidlaredo1", "platform": "Twitter", "text_clean": "accelerate innovation without sacrificing security. learn how migrating to the cloud can help decrease time to market, limit downtime & reduce cost. #aws #cloudcomputing #developer", "count": 1, "phrase": "innovation"} 9 | {"created_at": "2022-09-13 20:52:06.000000", "timestamp": "2022-09-13 20:53:01.877685", "user": "davidlaredo1", "platform": "Twitter", "text_clean": "accelerate innovation without sacrificing security. learn how migrating to the cloud can help decrease time to market, limit downtime & reduce cost. #aws #cloudcomputing #developer", "count": 1, "phrase": "cost"} 10 | {"created_at": "2022-09-13 20:52:06.000000", "timestamp": "2022-09-13 20:53:01.877685", "user": "davidlaredo1", "platform": "Twitter", "text_clean": "accelerate innovation without sacrificing security. learn how migrating to the cloud can help decrease time to market, limit downtime & reduce cost. #aws #cloudcomputing #developer", "count": 1, "phrase": "aws"} 11 | -------------------------------------------------------------------------------- /sample-files/phrases/2022-09-13/14a2cce4-33b3-11ed-9dd4-d752f0bb0867.json: -------------------------------------------------------------------------------- 1 | {"created_at": "2022-09-13 22:25:25.000000", "timestamp": "2022-09-13 22:26:15.670808", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "protect applications running on an amazon eks cluster by filtering outbound traffic based on the approved hostnames that are provided by sni in the network firewall allow list. learn how:", "count": 1, "phrase": "sni"} 2 | {"created_at": "2022-09-13 22:25:25.000000", "timestamp": "2022-09-13 22:26:15.670808", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "protect applications running on an amazon eks cluster by filtering outbound traffic based on the approved hostnames that are provided by sni in the network firewall allow list. learn how:", "count": 1, "phrase": "the approved hostnames"} 3 | {"created_at": "2022-09-13 22:25:25.000000", "timestamp": "2022-09-13 22:26:15.670808", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "protect applications running on an amazon eks cluster by filtering outbound traffic based on the approved hostnames that are provided by sni in the network firewall allow list. 
learn how:", "count": 1, "phrase": "applications"} 4 | {"created_at": "2022-09-13 22:25:25.000000", "timestamp": "2022-09-13 22:26:15.670808", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "protect applications running on an amazon eks cluster by filtering outbound traffic based on the approved hostnames that are provided by sni in the network firewall allow list. learn how:", "count": 1, "phrase": "the network firewall"} 5 | {"created_at": "2022-09-13 22:25:25.000000", "timestamp": "2022-09-13 22:26:15.670808", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "protect applications running on an amazon eks cluster by filtering outbound traffic based on the approved hostnames that are provided by sni in the network firewall allow list. learn how:", "count": 1, "phrase": "outbound traffic"} 6 | {"created_at": "2022-09-13 22:25:25.000000", "timestamp": "2022-09-13 22:26:15.670808", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "protect applications running on an amazon eks cluster by filtering outbound traffic based on the approved hostnames that are provided by sni in the network firewall allow list. learn how:", "count": 1, "phrase": "an amazon eks cluster"} 7 | {"created_at": "2022-09-13 22:25:25.000000", "timestamp": "2022-09-13 22:26:15.670808", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "protect applications running on an amazon eks cluster by filtering outbound traffic based on the approved hostnames that are provided by sni in the network firewall allow list. learn how:", "count": 1, "phrase": "list"} 8 | -------------------------------------------------------------------------------- /sample-files/phrases/2022-09-13/88fd7722-33af-11ed-a356-b52a4d4d8f68.json: -------------------------------------------------------------------------------- 1 | {"created_at": "2022-09-13 22:00:00.000000", "timestamp": "2022-09-13 22:00:52.893886", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "protect applications running on amazon eks cluster by filtering outbound traffic based on the approved hostnames that are provided by sni in the network firewall allow list. learn how:", "count": 1, "phrase": "the approved hostnames"} 2 | {"created_at": "2022-09-13 22:00:00.000000", "timestamp": "2022-09-13 22:00:52.893886", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "protect applications running on amazon eks cluster by filtering outbound traffic based on the approved hostnames that are provided by sni in the network firewall allow list. learn how:", "count": 1, "phrase": "the network firewall"} 3 | {"created_at": "2022-09-13 22:00:00.000000", "timestamp": "2022-09-13 22:00:52.893886", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "protect applications running on amazon eks cluster by filtering outbound traffic based on the approved hostnames that are provided by sni in the network firewall allow list. learn how:", "count": 1, "phrase": "amazon eks cluster"} 4 | {"created_at": "2022-09-13 22:00:00.000000", "timestamp": "2022-09-13 22:00:52.893886", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "protect applications running on amazon eks cluster by filtering outbound traffic based on the approved hostnames that are provided by sni in the network firewall allow list. 
learn how:", "count": 1, "phrase": "sni"} 5 | {"created_at": "2022-09-13 22:00:00.000000", "timestamp": "2022-09-13 22:00:52.893886", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "protect applications running on amazon eks cluster by filtering outbound traffic based on the approved hostnames that are provided by sni in the network firewall allow list. learn how:", "count": 1, "phrase": "outbound traffic"} 6 | {"created_at": "2022-09-13 22:00:00.000000", "timestamp": "2022-09-13 22:00:52.893886", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "protect applications running on amazon eks cluster by filtering outbound traffic based on the approved hostnames that are provided by sni in the network firewall allow list. learn how:", "count": 1, "phrase": "list"} 7 | {"created_at": "2022-09-13 22:00:00.000000", "timestamp": "2022-09-13 22:00:52.893886", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "protect applications running on amazon eks cluster by filtering outbound traffic based on the approved hostnames that are provided by sni in the network firewall allow list. learn how:", "count": 1, "phrase": "applications"} 8 | -------------------------------------------------------------------------------- /sample-files/tweets/2022-09-12/cb1bc907-32b8-11ed-a39b-010c818c1d51.json: -------------------------------------------------------------------------------- 1 | {"text": "RT @ParmidaBeigi: Amazon's 20B language model has fewer than 1/8 the number of parameters of GPT-3, yet outperforms it in several NLP tasks…", "user": "AmazonScience", "created_at": "2022-09-12 16:33:45.000000", "source": "Twitter for iPhone", "platform": "Twitter", "text_clean": "rt @parmidabeigi: amazon's 20b language model has fewer than 1/8 the number of parameters of gpt-3, yet outperforms it in several nlp tasks…", "category_type": "machine learning", "category_type_score": 0.6919503211975098, "category_type_model_result": "{\"machine learning\": 0.6919503211975098, \"compute\": 0.18625091016292572, \"security\": 0.06642784178256989, \"database\": 0.02945137396454811, \"storage\": 0.025919586420059204}", "model": "SageMakerEndpoint-vRHkKJsUqRTs", "sentiment": "NEUTRAL", "timestamp": "2022-09-12 16:34:38.130057", "count": 1} -------------------------------------------------------------------------------- /sample-files/tweets/2022-09-12/f4491b0d-32dd-11ed-96e7-1dd85b3a04d7.json: -------------------------------------------------------------------------------- 1 | {"text": "What's 🆕 with AWS Security services? \n\n✔️ #AWSFirewallManager adds support for AWS WAF custom requests and responses\n✔️ #AWSSecurityHub launches new security best practice control\n\nDetails on these & more security feature updates: https://t.co/smwxXg4M5x https://t.co/GggLnb2HeK", "user": "AWSSecurityInfo", "created_at": "2022-09-12 20:59:49.000000", "source": "Sprinklr", "platform": "Twitter", "text_clean": "what's with aws security services? 
#awsfirewallmanager adds support for aws waf custom requests and responses #awssecurityhub launches new security best practice control details on these & more security feature updates:", "category_type": "security", "category_type_score": 0.8688084483146667, "category_type_model_result": "{\"security\": 0.8688084483146667, \"compute\": 0.049058590084314346, \"machine learning\": 0.029557108879089355, \"storage\": 0.027179645374417305, \"database\": 0.025396248325705528}", "model": "SageMakerEndpoint-vRHkKJsUqRTs", "sentiment": "NEUTRAL", "timestamp": "2022-09-12 21:00:38.580243", "count": 1} -------------------------------------------------------------------------------- /sample-files/tweets/2022-09-13/0e79b260-33a6-11ed-a6f9-adab8308bd5b.json: -------------------------------------------------------------------------------- 1 | {"text": "Accelerate innovation without sacrificing security. 🏎🔐☁️ Learn how migrating to the cloud can help decrease time to market, limit downtime & reduce cost. #AWS #CloudComputing #Developer https://t.co/qZ8XWJdz77", "user": "davidlaredo1", "created_at": "2022-09-13 20:52:06.000000", "source": "Twitter Web App", "platform": "Twitter", "text_clean": "accelerate innovation without sacrificing security. learn how migrating to the cloud can help decrease time to market, limit downtime & reduce cost. #aws #cloudcomputing #developer", "category_type": "security", "category_type_score": 0.7972265481948853, "category_type_model_result": "{\"security\": 0.7972265481948853, \"compute\": 0.16822879016399384, \"database\": 0.012693053111433983, \"storage\": 0.012405334040522575, \"machine learning\": 0.009446307085454464}", "model": "SageMakerEndpoint-vRHkKJsUqRTs", "sentiment": "NEUTRAL", "timestamp": "2022-09-13 20:53:01.877685", "count": 1} -------------------------------------------------------------------------------- /sample-files/tweets/2022-09-13/14a2cce4-33b3-11ed-9dd4-d752f0bb0867.json: -------------------------------------------------------------------------------- 1 | {"text": "Protect applications running on an Amazon EKS cluster by filtering outbound traffic based on the approved hostnames that are provided by SNI in the Network Firewall Allow list. Learn how: https://t.co/tlZeKspc49 https://t.co/YvQ2rkfREI", "user": "AWSSecurityInfo", "created_at": "2022-09-13 22:25:25.000000", "source": "Twitter Web App", "platform": "Twitter", "text_clean": "protect applications running on an amazon eks cluster by filtering outbound traffic based on the approved hostnames that are provided by sni in the network firewall allow list. learn how:", "category_type": "security", "category_type_score": 0.9138131737709045, "category_type_model_result": "{\"security\": 0.9138131737709045, \"compute\": 0.03599965199828148, \"machine learning\": 0.021898740902543068, \"storage\": 0.01451999880373478, \"database\": 0.013768448494374752}", "model": "SageMakerEndpoint-vRHkKJsUqRTs", "sentiment": "NEUTRAL", "timestamp": "2022-09-13 22:26:15.670808", "count": 1} -------------------------------------------------------------------------------- /sample-files/tweets/2022-09-13/88fd7722-33af-11ed-a356-b52a4d4d8f68.json: -------------------------------------------------------------------------------- 1 | {"text": "Protect applications running on Amazon EKS cluster by filtering outbound traffic based on the approved hostnames that are provided by SNI in the Network Firewall Allow list. 
Learn how: https://t.co/tlZeKsoEeB https://t.co/wibkmsEno2", "user": "AWSSecurityInfo", "created_at": "2022-09-13 22:00:00.000000", "source": "Sprinklr", "platform": "Twitter", "text_clean": "protect applications running on amazon eks cluster by filtering outbound traffic based on the approved hostnames that are provided by sni in the network firewall allow list. learn how:", "category_type": "security", "category_type_score": 0.9045998454093933, "category_type_model_result": "{\"security\": 0.9045998454093933, \"compute\": 0.039112430065870285, \"machine learning\": 0.02197249047458172, \"database\": 0.01722169667482376, \"storage\": 0.017093613743782043}", "model": "SageMakerEndpoint-vRHkKJsUqRTs", "sentiment": "NEUTRAL", "timestamp": "2022-09-13 22:00:52.893886", "count": 1} -------------------------------------------------------------------------------- /stream-getter/.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # OS X extensions 10 | *.DS_Store 11 | 12 | # Distribution / packaging 13 | .Python 14 | build/ 15 | develop-eggs/ 16 | dist/ 17 | downloads/ 18 | eggs/ 19 | .eggs/ 20 | lib/ 21 | lib64/ 22 | parts/ 23 | sdist/ 24 | var/ 25 | wheels/ 26 | share/python-wheels/ 27 | *.egg-info/ 28 | .installed.cfg 29 | *.egg 30 | MANIFEST 31 | 32 | # PyInstaller 33 | # Usually these files are written by a python script from a template 34 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 35 | *.manifest 36 | *.spec 37 | 38 | # Installer logs 39 | pip-log.txt 40 | pip-delete-this-directory.txt 41 | 42 | # Unit test / coverage reports 43 | htmlcov/ 44 | .tox/ 45 | .nox/ 46 | .coverage 47 | .coverage.* 48 | .cache 49 | nosetests.xml 50 | coverage.xml 51 | *.cover 52 | *.py,cover 53 | .hypothesis/ 54 | .pytest_cache/ 55 | cover/ 56 | 57 | # Translations 58 | *.mo 59 | *.pot 60 | 61 | # Django stuff: 62 | *.log 63 | local_settings.py 64 | db.sqlite3 65 | db.sqlite3-journal 66 | 67 | # Flask stuff: 68 | instance/ 69 | .webassets-cache 70 | 71 | # Scrapy stuff: 72 | .scrapy 73 | 74 | # Sphinx documentation 75 | docs/_build/ 76 | 77 | # PyBuilder 78 | .pybuilder/ 79 | target/ 80 | 81 | # Jupyter Notebook 82 | .ipynb_checkpoints 83 | 84 | # IPython 85 | profile_default/ 86 | ipython_config.py 87 | 88 | # pyenv 89 | # For a library or package, you might want to ignore these files since the code is 90 | # intended to run in multiple environments; otherwise, check them in: 91 | # .python-version 92 | 93 | # pipenv 94 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 95 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 96 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 97 | # install all needed dependencies. 98 | #Pipfile.lock 99 | 100 | # poetry 101 | # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. 102 | # This is especially recommended for binary packages to ensure reproducibility, and is more 103 | # commonly ignored for libraries. 104 | # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control 105 | #poetry.lock 106 | 107 | # pdm 108 | # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. 
109 | #pdm.lock 110 | # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it 111 | # in version control. 112 | # https://pdm.fming.dev/#use-with-ide 113 | .pdm.toml 114 | 115 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm 116 | __pypackages__/ 117 | 118 | # Celery stuff 119 | celerybeat-schedule 120 | celerybeat.pid 121 | 122 | # SageMath parsed files 123 | *.sage.py 124 | 125 | # Environments 126 | .env 127 | .venv 128 | env/ 129 | venv/ 130 | ENV/ 131 | env.bak/ 132 | venv.bak/ 133 | 134 | # Spyder project settings 135 | .spyderproject 136 | .spyproject 137 | 138 | # Rope project settings 139 | .ropeproject 140 | 141 | # mkdocs documentation 142 | /site 143 | 144 | # mypy 145 | .mypy_cache/ 146 | .dmypy.json 147 | dmypy.json 148 | 149 | # Pyre type checker 150 | .pyre/ 151 | 152 | # pytype static type analyzer 153 | .pytype/ 154 | 155 | # Cython debug symbols 156 | cython_debug/ 157 | 158 | # PyCharm 159 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can 160 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore 161 | # and can be added to the global gitignore or merged into this file. For a more nuclear 162 | # option (not recommended) you can uncomment the following to ignore the entire idea folder. 163 | #.idea/ -------------------------------------------------------------------------------- /stream-getter/Dockerfile: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | FROM python:3 5 | 6 | WORKDIR /usr/src/app 7 | 8 | COPY requirements.txt ./ 9 | RUN pip install --no-cache-dir -r requirements.txt 10 | 11 | COPY *.py ./ 12 | 13 | CMD [ "python", "./main.py" ] 14 | -------------------------------------------------------------------------------- /stream-getter/backoff.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | import logging 5 | import time 6 | 7 | import requests 8 | 9 | 10 | class Backoff: 11 | 12 | def __init__(self): 13 | self.wait_time = 0 14 | 15 | def wait_on_exception(self, exception): 16 | if isinstance(exception, requests.exceptions.HTTPError): 17 | if exception.response.status_code == 429: 18 | # Rate limit exceeded: 19 | # Start with a 1 minute wait and double each attempt. 20 | self.update_wait_time(60, 2, 3600) 21 | else: 22 | # Other HTTP errors: 23 | # Start with a 5 second wait, doubling each attempt, up to 320 seconds. 24 | self.update_wait_time(5, 2, 320) 25 | elif isinstance(exception, requests.exceptions.RequestException): 26 | # Increase the delay in reconnects by 250ms each attempt, up to 16 seconds. 
27 | self.update_wait_time(.25, 1, 16) 28 | else: 29 | self.update_wait_time(1, 1, 1) 30 | logging.info(f'Sleeping for {self.wait_time} seconds...') 31 | time.sleep(self.wait_time) 32 | 33 | def update_wait_time(self, interval, multiplier, max_wait_time): 34 | if self.wait_time == 0: 35 | self.wait_time = interval 36 | else: 37 | self.wait_time = self.wait_time * multiplier if multiplier != 1 else self.wait_time + interval 38 | self.wait_time = self.wait_time if self.wait_time <= max_wait_time else max_wait_time 39 | 40 | def reset_wait_time(self): 41 | self.wait_time = 0 42 | -------------------------------------------------------------------------------- /stream-getter/main.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | import logging 5 | import os 6 | import queue 7 | import threading 8 | 9 | import requests 10 | 11 | from backoff import Backoff 12 | from sqs_helper import SqsHelper 13 | from stream_match import StreamMatch 14 | 15 | BEARER_TOKEN = os.environ.get('BEARER_TOKEN') 16 | REQUEST_CONNECT_TIMEOUT_SECONDS = 4 17 | REQUEST_READ_TIMEOUT_SECONDS = 30 # Twitter sends 20-second keep alive heartbeats 18 | STREAM_QUERY_PARAMS = 'tweet.fields=created_at,source&expansions=author_id&user.fields=username' 19 | 20 | matches_queue = queue.SimpleQueue() 21 | 22 | 23 | def bearer_oauth(r): 24 | r.headers['Authorization'] = f'Bearer {BEARER_TOKEN}' 25 | r.headers['User-Agent'] = 'v2FilteredStreamPython' 26 | return r 27 | 28 | 29 | def get_tweets_from_twitter(): 30 | backoff = Backoff() 31 | while True: 32 | try: 33 | response = requests.get( 34 | f'https://api.twitter.com/2/tweets/search/stream?{STREAM_QUERY_PARAMS}', 35 | auth=bearer_oauth, 36 | stream=True, 37 | timeout=(REQUEST_CONNECT_TIMEOUT_SECONDS, REQUEST_READ_TIMEOUT_SECONDS) 38 | ) 39 | response.raise_for_status() 40 | logging.info('Connected to the Twitter stream') 41 | backoff.reset_wait_time() 42 | for line in response.iter_lines(): 43 | if line: 44 | logging.info('New match!') 45 | decoded_line = line.decode('utf-8') 46 | logging.info(decoded_line) 47 | matches_queue.put(StreamMatch(decoded_line)) 48 | except Exception as exception: 49 | logging.error(exception, exc_info=True) 50 | backoff.wait_on_exception(exception) 51 | continue 52 | 53 | 54 | def send_tweets_to_sqs(sqs_helper): 55 | backoff = Backoff() 56 | while True: 57 | try: 58 | stream_match = matches_queue.get() 59 | if stream_match.has_errors(): 60 | logging.info('Skipping error') 61 | else: 62 | logging.info('Sending tweet to SQS') 63 | sqs_helper.send_tweet_to_sqs(stream_match) 64 | logging.info('Tweet sent to SQS') 65 | except Exception as exception: 66 | logging.error(exception, exc_info=True) 67 | backoff.wait_on_exception(exception) 68 | continue 69 | 70 | 71 | def main(): 72 | sqs_helper = SqsHelper(os.environ.get('SQS_QUEUE_URL')) 73 | producer = threading.Thread(target=get_tweets_from_twitter) 74 | consumer = threading.Thread(target=send_tweets_to_sqs, args=[sqs_helper]) 75 | producer.start() 76 | consumer.start() 77 | producer.join() 78 | consumer.join() 79 | 80 | 81 | if __name__ == '__main__': 82 | logging.basicConfig(level=os.environ.get('LOG_LEVEL', 'WARNING').upper()) 83 | main() 84 | -------------------------------------------------------------------------------- /stream-getter/requirements.txt: -------------------------------------------------------------------------------- 1 | backoff==1.11.1 2 | 
boto3==1.20.11 3 | botocore==1.23.11 4 | requests==2.27.0 5 | -------------------------------------------------------------------------------- /stream-getter/sqs_helper.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | import boto3 5 | 6 | 7 | class SqsHelper: 8 | 9 | def __init__(self, sqs_queue_url): 10 | self.sqs_client = boto3.client('sqs') 11 | self.sqs_queue_url = sqs_queue_url 12 | 13 | def send_tweet_to_sqs(self, stream_match): 14 | self.sqs_client.send_message( 15 | QueueUrl=self.sqs_queue_url, 16 | MessageBody=stream_match.to_tweet_json(), 17 | MessageAttributes={ 18 | 'matching_rule': { 19 | 'StringValue': stream_match.get_matching_rule(), 20 | 'DataType': 'String' 21 | } 22 | } 23 | ) 24 | -------------------------------------------------------------------------------- /stream-getter/stream_match.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | import json 5 | import datetime 6 | 7 | class StreamMatch: 8 | 9 | def __init__(self, content): 10 | self.content = content 11 | 12 | def to_tweet_json(self): 13 | input_object = json.loads(self.content) 14 | user = next(filter(lambda x: x['id'] == input_object['data']['author_id'], input_object['includes']['users'])) 15 | output_object = { 16 | 'text': input_object['data']['text'], 17 | 'user': user['username'], 18 | 'created_at': input_object['data']['created_at'] if 'created_at' in input_object['data'] else datetime.datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%S.000Z'), 19 | 'source': input_object['data']['source'] if 'source' in input_object['data'] else 'Undefined', 20 | 'platform': 'Twitter' 21 | } 22 | return json.dumps(output_object, ensure_ascii=False) 23 | 24 | def get_matching_rule(self): 25 | input_object = json.loads(self.content) 26 | matching_rule = input_object['matching_rules'][0]['tag'] 27 | return matching_rule 28 | 29 | def has_errors(self): 30 | return 'errors' in json.loads(self.content) 31 | --------------------------------------------------------------------------------
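
Usage note (not part of the repository): the snippet below is a minimal, illustrative sketch of how stream-getter's StreamMatch class can be exercised on its own. The payload shape is inferred only from the fields the class reads (data.text, data.author_id, data.created_at, data.source, includes.users, matching_rules); the concrete values are made up for demonstration.

    # Hypothetical Twitter v2 filtered-stream payload, containing only the
    # fields that stream_match.py actually reads. All values are illustrative.
    import json

    from stream_match import StreamMatch

    raw_line = json.dumps({
        "data": {
            "id": "1",
            "author_id": "42",
            "text": "Example tweet about #AWS",
            "created_at": "2022-09-13T22:00:00.000Z",
            "source": "Twitter Web App"
        },
        "includes": {"users": [{"id": "42", "username": "example_user"}]},
        "matching_rules": [{"id": "1", "tag": "aws"}]
    })

    match = StreamMatch(raw_line)
    if not match.has_errors():
        print(match.get_matching_rule())  # -> "aws"
        print(match.to_tweet_json())      # flat tweet object sent to SQS

Here to_tweet_json() flattens the nested stream payload into the single object (text, user, created_at, source, platform) that SqsHelper.send_tweet_to_sqs() sends as the SQS message body, with the matching rule tag attached as a message attribute.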