├── .gitignore ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── Creating_visualizations_with_QuickSight.pdf ├── LICENSE ├── README.md ├── architecture.pdf ├── architecture.png ├── architecture.pptx ├── backend ├── lambdas │ ├── custom_resources │ │ └── lookout4metrics │ │ │ ├── lambda.py │ │ │ └── requirements.txt │ ├── locate_post │ │ └── lambda.py │ ├── process_post │ │ ├── examples │ │ │ ├── examples_eng.py │ │ │ └── examples_es.py │ │ ├── lambda.py │ │ ├── output_models │ │ │ └── models.py │ │ ├── prompt_selector.py │ │ ├── requirements.txt │ │ └── text_preprocessing.py │ ├── save_post │ │ ├── lambda.py │ │ └── requirements.txt │ └── workflow_from_sqs │ │ ├── lambda.py │ │ └── requirements.txt ├── state_machine │ └── process_post.asl.json └── template.yaml ├── data-streammer ├── mexico_big_cities.csv ├── new_years_resolutions_tweets.csv └── stream_posts.py ├── sample-files ├── phrases │ ├── 2022-09-12 │ │ ├── cb1bc907-32b8-11ed-a39b-010c818c1d51.json │ │ └── f4491b0d-32dd-11ed-96e7-1dd85b3a04d7.json │ └── 2022-09-13 │ │ ├── 0e79b260-33a6-11ed-a6f9-adab8308bd5b.json │ │ ├── 14a2cce4-33b3-11ed-9dd4-d752f0bb0867.json │ │ └── 88fd7722-33af-11ed-a356-b52a4d4d8f68.json └── tweets │ ├── 2022-09-12 │ ├── cb1bc907-32b8-11ed-a39b-010c818c1d51.json │ └── f4491b0d-32dd-11ed-96e7-1dd85b3a04d7.json │ └── 2022-09-13 │ ├── 0e79b260-33a6-11ed-a6f9-adab8308bd5b.json │ ├── 14a2cce4-33b3-11ed-9dd4-d752f0bb0867.json │ └── 88fd7722-33af-11ed-a356-b52a4d4d8f68.json └── stream-getter ├── .gitignore ├── Dockerfile ├── backoff.py ├── main.py ├── requirements.txt ├── sqs_helper.py └── stream_match.py /.gitignore: -------------------------------------------------------------------------------- 1 | # OS X extensions 2 | *.DS_Store 3 | 4 | # PyCharm 5 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can 6 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore 7 | # and can be added to the global gitignore or merged into this file. For a more nuclear 8 | # option (not recommended) you can uncomment the following to ignore the entire idea folder. 9 | .idea/ -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 
13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | -------------------------------------------------------------------------------- /Creating_visualizations_with_QuickSight.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/ai-powered-text-insights/ace836af93c6e0c16de8eab606bf0b83d7868156/Creating_visualizations_with_QuickSight.pdf -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 
2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | this software and associated documentation files (the "Software"), to deal in 5 | the Software without restriction, including without limitation the rights to 6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | the Software, and to permit persons to whom the Software is furnished to do so. 8 | 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # AI Powered Text Insights 2 | 3 | # Table of Contents 4 | 5 | 1. [Overview](#overview) 6 | 2. [Architecture](#architecture) 7 | 3. [Deployment](#deployment) 8 | - [Prerequisites](#prerequisites) 9 | - [Backend resources](#backend-resources) 10 | - [Test the application by streaming sample X.com posts](#optional-test-the-application-by-streaming-sample-xcom-posts) 11 | - [Stream real X.com posts to the application](#optional-stream-real-xcom-posts-to-the-application) 12 | 4. [Cost](#cost) 13 | - [Optional Costs](#optional-costs) 14 | 5. [Visualize your insights with Amazon QuickSight](#visualize-your-insights-with-amazon-quicksight) 15 | 6. [Cleanup](#clean-up) 16 | 7. [License](#license) 17 | 18 | 19 | 20 | ## Overview 21 | 22 | This package includes the code of a prototype that helps you gain insights into how your customers interact with you or your brand on social media. By leveraging Large Language Models (LLMs) on Amazon Bedrock, we are able to extract real-time insights (such as topic, entities, sentiment, and more) from short text of any kind (including posts on social media). We then use these insights to create rich visualizations in Amazon QuickSight and configure automated alerts using Amazon Lookout for Metrics.
The solution consists of a text processing pipeline (using AWS Lambda) that extracts the following insights from posts on social media: 23 | 24 | - Topic of the post 25 | - Sentiment of the post 26 | - Entities involved in the post 27 | - Location of the post (if present) 28 | - Links in the post (if present) 29 | - Keyphrases in the post 30 | 31 | The extraction is performed by an LLM (Claude 3 Haiku), which extracts this information and stores it in a JSON object as described below: 32 | 33 | ```json 34 | { 35 | "type": "object", 36 | "properties": { 37 | "topic": { 38 | "description": "the main topic of the post", 39 | "type": "string", 40 | "default": "" 41 | }, 42 | "location": { 43 | "description": "the location, if it exists, where the events occur", 44 | "type": "string", 45 | "default": "" 46 | }, 47 | "entities": { 48 | "description": "the entities involved in the post", 49 | "type": "list", 50 | "default": [] 51 | }, 52 | "keyphrases": { 53 | "description": "the keyphrases in the post", 54 | "type": "list", 55 | "default": [] 56 | }, 57 | "sentiment": { 58 | "description": "the sentiment of the post", 59 | "type": "string", 60 | "default": "" 61 | }, 62 | "links": { 63 | "description": "any links found within the post", 64 | "type": "list", 65 | "default": [] 66 | } 67 | } 68 | } 69 | 70 | ``` 71 | 72 | If a processed text contains a location, the coordinates of that location are obtained using Amazon Location Service. Processed posts are stored in an Amazon S3 bucket. Data stored in S3 is queried using Amazon Athena. Anomaly detection is performed on the volume of posts per category per period of time using Amazon Lookout for Metrics, and notifications are sent when anomalies are detected. All insights can be presented in a QuickSight dashboard. 73 | 74 | The application includes the following resources (directories): 75 | 76 | - `backend`: Contains all the code for the application and the deployment of the resources described in the Architecture diagram 77 | - `data-streamer`: A sample application that streams sample X.com posts about New Year's resolutions to the AI Powered Text Insights application 78 | - `stream-getter`: A sample application that streams posts from a real X.com account to the AI Powered Text Insights application 79 | - `sample-files`: Example JSON objects of processed posts. 80 | 81 | 82 | ## Architecture 83 | 84 | Deploying the sample application builds the following environment in the AWS Cloud: 85 | 86 | ![architecture](architecture.png) 87 | 88 | 1. An [Amazon Elastic Container Service](https://aws.amazon.com/ecs/) (Amazon ECS) task runs on serverless infrastructure managed by [AWS Fargate](https://aws.amazon.com/fargate/) and maintains an open connection to the social media platform. 89 | 2. The social media access tokens are securely stored in [AWS Systems Manager Parameter Store](https://aws.amazon.com/systems-manager/), and the container image is hosted on [Amazon Elastic Container Registry](https://aws.amazon.com/ecr/) (Amazon ECR). 90 | 3. When a new post arrives, it’s placed into an [Amazon Simple Queue Service](https://aws.amazon.com/sqs/) (SQS) queue. 91 | 4. The logic of the solution resides in [AWS Lambda](https://aws.amazon.com/lambda/) function microservices, coordinated by [AWS Step Functions](https://aws.amazon.com/step-functions/). 92 | 5. The post is processed in real time by one of the Large Language Models (LLMs) supported by [Amazon Bedrock](https://aws.amazon.com/bedrock). 93 | 6.
[Amazon Location Service](https://aws.amazon.com/location/) transforms a location name into coordinates. 94 | 7. The post and metadata (insights) are sent to [Amazon Simple Storage Service](https://aws.amazon.com/s3/) (Amazon S3), and [Amazon Athena](https://aws.amazon.com/athena/) queries the processed posts with standard SQL. 95 | 8. [Amazon Lookout for Metrics](https://aws.amazon.com/lookout-for-metrics/) looks for anomalies in the volume of mentions per category. [Amazon Simple Notification Service](https://aws.amazon.com/sns/) (Amazon SNS) sends an alert to users when an anomaly is detected. 96 | 9. We recommend setting up an [Amazon QuickSight](https://aws.amazon.com/quicksight/) dashboard so that business users can easily visualize insights. 97 | 98 | ## Deployment 99 | 100 | ### Prerequisites 101 | 102 | * AWS CLI. Refer to [Installing the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html) 103 | * AWS Credentials configured in your environment. Refer to 104 | [Configuration and credential file settings](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html) 105 | * AWS SAM CLI. Refer 106 | to [Installing the AWS SAM CLI](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-sam-cli-install.html) 107 | * AWS Copilot CLI. Refer to 108 | [Install Copilot](https://aws.github.io/copilot-cli/docs/getting-started/install/) 109 | * Docker. Refer to [Docker](https://www.docker.com/products/docker-desktop) 110 | * Get access to the Claude 3 Haiku model on Amazon Bedrock. Follow the instructions in the [model access guide](https://docs.aws.amazon.com/bedrock/latest/userguide/model-access.html). 111 | * Authenticate to the Amazon ECR public registry. Follow this [guide to authenticate](https://docs.aws.amazon.com/AmazonECR/latest/public/public-registries.html) 112 | * [Optional] Twitter application Bearer token. Refer 113 | to [OAuth 2.0 Bearer Token - Prerequisites](https://developer.twitter.com/en/docs/authentication/oauth-2-0/bearer-tokens) 114 | * [Optional] Twitter Filtered stream rules configured. Refer to the examples at the end of this document and to 115 | [Building rules for filtered stream](https://developer.twitter.com/en/docs/twitter-api/tweets/filtered-stream/integrate/build-a-rule) 116 | 117 | ### Backend resources 118 | 119 | Run the command below, from within the `backend/` directory, to deploy the backend: 120 | 121 | ``` 122 | sam build --use-container && sam deploy --guided 123 | ``` 124 | 125 | Follow the prompts. NOTE: Due to a constraint on database naming in Lookout for Metrics, please name your stack using the following regular expression pattern: [a-zA-Z0-9_]+ 126 | 127 | The command above deploys an AWS CloudFormation stack in your AWS account. You will need the stack's output values to deploy 128 | the Twitter stream getter container. 129 | 130 | #### 1. Data format 131 | 132 | This solution writes the post insights, stored as JSON files, into four S3 locations, **/posts**, **/links**, **/topics**, and **/phrases**, in the results bucket whose name is specified by the CloudFormation stack's outputs under "PostsBucketName". Under each subfolder, data is organized by day following the **YYYY-MM-dd 00:00:00** datetime format. 133 | 134 | Sample output files can be found in this repository under the **/sample-files** folder. 135 |
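For reference, a single record under the **/posts** prefix looks roughly like the sketch below. The field names follow what the `save_post` Lambda writes; all values are illustrative only (including the model identifier), and `location`, `longitude`, and `latitude` appear only when a location was detected. See the files under the **/sample-files** folder for authoritative examples.

```json
{
  "text": "Excited to start the new year with a 5k run! #NewYearsResolution",
  "user": "example_user",
  "created_at": "2022-09-12 10:15:00.000000",
  "source": "stream-getter",
  "platform": "twitter",
  "text_clean": "excited to start the new year with a 5k run! #newyearsresolution",
  "topic": "new year's resolutions",
  "model": "anthropic.claude-3-haiku-20240307-v1:0",
  "sentiment": "positive",
  "location": "Mexico City",
  "longitude": -99.1332,
  "latitude": 19.4326,
  "timestamp": "2022-09-12 10:15:02.123456",
  "count": 1
}
```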
136 | #### 2. Activate the Lookout for Metrics detector 137 | 138 | To allow you to provide historical data to the anomaly detector and [reduce the detector’s learning time](https://docs.aws.amazon.com/lookoutmetrics/latest/dev/services-athena.html), the prototype is deployed with the anomaly detector disabled. 139 | 140 | If you have historical data with the same format as the data generated by this solution, you may move it to the data S3 bucket generated by deploying the backend (PostsBucketName). Make sure to follow the format of the files in the **/sample-files** folder. 141 | 142 | [Follow the instructions](https://docs.aws.amazon.com/lookoutmetrics/latest/dev/gettingstarted-detector.html) to activate your detector; the detector’s name can be found in the CloudFormation stack’s outputs. 143 | 144 | Optionally, you can configure alerts for your anomaly detector. [Follow the instructions](https://docs.aws.amazon.com/lookoutmetrics/latest/dev/gettingstarted-detector.html) to create an alert that sends a notification to SNS; the SNS topic name is part of the CloudFormation stack’s outputs. 145 | 146 | ### (Optional) Test the application by streaming sample X.com posts 147 | 148 | This section is entirely optional. It will show you how to stream sample X.com posts to the Amazon SQS queue (2 in the architecture diagram) locally from your computer. 149 | 150 | Navigate to the ``data-streamer`` folder and run: 151 | 152 | ``` 153 | python stream_posts.py \ 154 | --queue_url <queue-url> \ 155 | --region <region> 156 | ``` 157 | 158 | ### (Optional) Stream real X.com posts to the application 159 | 160 | This section is entirely optional. It will show you how to deploy the assets under the **stream-getter** folder, which create an application (1 in the architecture diagram) that gets X.com posts using the [streaming API](https://developer.twitter.com/en/docs/tutorials/stream-tweets-in-real-time). 161 | 162 | Run the commands below, from within the `stream-getter/` directory, to deploy the container application: 163 | 164 | #### 1. Create application 165 | 166 | ``` 167 | copilot app init twitter-app 168 | ``` 169 | 170 | #### 2. Create environment 171 | 172 | ``` 173 | copilot env init --name test --region <region> 174 | ``` 175 | 176 | Replace `<region>` with the same region to which you deployed the backend resources previously. 177 | 178 | Follow the prompts, accepting the default values. 179 | 180 | The above command provisions the required network infrastructure (VPC, subnets, security groups, and more). In its 181 | default configuration, Copilot 182 | follows [AWS best practices](https://aws.amazon.com/blogs/containers/amazon-ecs-availability-best-practices/) and 183 | creates a VPC with two public and two private subnets in different Availability Zones (AZs). For security reasons, we'll 184 | soon configure the placement of the service as _private_. Because of that, the service will run on the private subnets 185 | and Copilot will automatically add NAT Gateways, but NAT Gateways increase the overall cost. In case you decide to run 186 | the application in a single AZ to have only one NAT Gateway **(not recommended)**, you can run the following command 187 | instead: 188 | 189 | ``` 190 | copilot env init --name test --region <region> \ 191 | --override-vpc-cidr 10.0.0.0/16 --override-public-cidrs 10.0.0.0/24 --override-private-cidrs 10.0.1.0/24 192 | ``` 193 | 194 | **Note:** The current implementation is designed to run only one container at a time.
Not only your Twitter account 195 | should allow you to have more than one Twitter's stream connection at a time, but the application also must be modified 196 | to handle other complexities such as duplicates (learn more 197 | in [Recovery and redundancy features](https://developer.twitter.com/en/docs/twitter-api/tweets/filtered-stream/integrate/recovery-and-redundancy-features)). 198 | Even though there will be only one container running at a time, having two AZs is still recommended, because in case 199 | one AZ is down, ECS can run the application in the other AZ. 200 | 201 | #### 3. Deploy the environment 202 | 203 | ``` 204 | copilot env deploy --name test 205 | ``` 206 | 207 | #### 4. Create service 208 | 209 | ``` 210 | copilot svc init --name stream-getter --svc-type "Backend Service" --dockerfile ./Dockerfile 211 | ``` 212 | 213 | #### 5. Create secret to store the Twitter Bearer token 214 | 215 | ``` 216 | copilot secret init --name TwitterBearerToken 217 | ``` 218 | 219 | When prompted to provide the secret, paste the Twitter Bearer token. 220 | 221 | #### 6. Edit service manifest 222 | 223 | Open the file `copilot/stream-getter/manifest.yml` and change its content to the following: 224 | 225 | ``` 226 | name: stream-getter 227 | type: Backend Service 228 | 229 | image: 230 | build: Dockerfile 231 | 232 | cpu: 256 233 | memory: 512 234 | count: 1 235 | exec: true 236 | 237 | network: 238 | vpc: 239 | placement: private 240 | 241 | variables: 242 | SQS_QUEUE_URL: 243 | LOG_LEVEL: info 244 | 245 | secrets: 246 | BEARER_TOKEN: /copilot/${COPILOT_APPLICATION_NAME}/${COPILOT_ENVIRONMENT_NAME}/secrets/TwitterBearerToken 247 | ``` 248 | 249 | Replace `` with the URL of the SQS queue deployed in your AWS account. 250 | 251 | You can use the following command to get the value from the backend AWS CloudFormation stack outputs 252 | (replace `` with the name of your backend stack): 253 | 254 | ``` 255 | aws cloudformation describe-stacks --stack-name \ 256 | --query "Stacks[].Outputs[?OutputKey=='TweetsQueueUrl'][] | [0].OutputValue" 257 | ``` 258 | 259 | #### 7. Add permission to write to the queue 260 | 261 | Create a new file in `copilot/stream-getter/addons/` called `sqs-policy.yaml` with the following content: 262 | 263 | ``` 264 | Parameters: 265 | App: 266 | Type: String 267 | Description: Your application's name. 268 | Env: 269 | Type: String 270 | Description: The environment name your service, job, or workflow is being deployed to. 271 | Name: 272 | Type: String 273 | Description: The name of the service, job, or workflow being deployed. 274 | 275 | Resources: 276 | QueuePolicy: 277 | Type: AWS::IAM::ManagedPolicy 278 | Properties: 279 | PolicyDocument: 280 | Version: 2012-10-17 281 | Statement: 282 | - Sid: SqsActions 283 | Effect: Allow 284 | Action: 285 | - sqs:SendMessage 286 | Resource: 287 | 288 | Outputs: 289 | QueuePolicyArn: 290 | Description: The ARN of the ManagedPolicy to attach to the task role. 291 | Value: !Ref QueuePolicy 292 | 293 | ``` 294 | 295 | Replace `` with the ARN of the SQS queue deployed in your AWS account. 296 | 297 | You can use the following command to get the value from the backend AWS CloudFormation stack outputs 298 | (replace `` with the name of your backend stack): 299 | 300 | ``` 301 | aws cloudformation describe-stacks --stack-name \ 302 | --query "Stacks[].Outputs[?OutputKey=='TweetsQueueArn'][] | [0].OutputValue" 303 | ``` 304 | 305 | After that, your directory should look like the following: 306 | 307 | ``` 308 | . 
309 | ├── Dockerfile 310 | ├── backoff.py 311 | ├── copilot 312 | │ ├── stream-getter 313 | │ │ ├── addons 314 | │ │ │ └── sqs-policy.yaml 315 | │ │ └── manifest.yml 316 | │ └── environments 317 | │ └── test 318 | │ └── manifest.yml 319 | ├── main.py 320 | ├── requirements.txt 321 | ├── sqs_helper.py 322 | └── stream_match.py 323 | ``` 324 | 325 | #### 8. Deploy service 326 | 327 | > **IMPORTANT:** The container will connect to the Twitter stream as soon as it starts, after deploying the service. You need your Twitter stream rules configured before connecting to the stream. Therefore, if you haven't configured the rules yet, configure them before proceeding. 328 | 329 | ``` 330 | copilot svc deploy --name stream-getter --env test 331 | ``` 332 | 333 | When the deployment finishes, you should have the container running inside ECS. To check the logs, run the following: 334 | 335 | ``` 336 | copilot svc logs --follow 337 | ``` 338 | 339 | #### 9. Rules examples for filtered stream 340 | 341 | Twitter provides endpoints that enable you to create and manage rules, and apply those rules to filter a stream of 342 | real-time tweets that will return matching public tweets. 343 | 344 | For instance, following is a rule that returns tweets from the accounts `@awscloud`, `@AWSSecurityInfo`, and `@AmazonScience`: 345 | 346 | ``` 347 | from:awscloud OR from:AWSSecurityInfo OR from:AmazonScience 348 | ``` 349 | 350 | To add that rule, issue a request like the following, replacing `` with the Twitter Bearer token: 351 | 352 | ``` 353 | curl -X POST 'https://api.twitter.com/2/tweets/search/stream/rules' \ 354 | -H "Content-type: application/json" \ 355 | -H "Authorization: Bearer " -d \ 356 | '{ 357 | "add": [ 358 | { 359 | "value": "from:awscloud OR from:AWSSecurityInfo OR from:AmazonScience", 360 | "tag": "news" 361 | } 362 | ] 363 | }' 364 | ``` 365 | 366 | ## Cost 367 | 368 | You are responsible for the cost of the AWS services used while running this Guidance. 369 | 370 | As of May 2024, the cost for running this Guidance continuously for one month, with the default settings in the US East (N.Virginia) Region, and processing 1000 posts a day is approximately $150 per month. 371 | 372 | After the stack is destroyed, you will stop incurring in costs. 373 | 374 | The table below shows the resources provisioned by this CDK stack, and their respective cost. The table below does not consider free tier. 
375 | 376 | |Resource|Description|Approximate Cost| 377 | |--------|-----------|----------| 378 | | AWS Lambda | Functions for processing the text | 2 USD | 379 | | Amazon Simple Queue Service | Queue to get text to process | 1 USD | 380 | | Amazon Location Service | Pinpoint the locations mentioned in the text | 15 USD | 381 | | Amazon Simple Storage Service | Store the processed short text | 1 USD | 382 | | Amazon Athena | Query processed text and its metadata | 49 USD | 383 | | Amazon QuickSight | Visualize the processed text and metadata | 23 USD | 384 | | Amazon Lookout for Metrics | Detect anomalies in the trends mentioned in the text | 7 USD | 385 | | AWS Step Functions | Manage the text processing pipeline | 4 USD | 386 | | Amazon Bedrock (Claude 3 Haiku) | Extract information from the text | 45 USD | 387 | | Total | | 147 USD | 388 | 389 | ### Optional costs 390 | 391 | The following costs will only be incurred if you deploy the resources in the optional section [Stream real X.com posts to the application](#optional-stream-real-xcom-posts-to-the-application). 392 | 393 | |Resource|Description|Approximate Cost| 394 | |--------|-----------|----------| 395 | | AWS Fargate | Run the ECS container on serverless infrastructure | 37 USD | 396 | | Total | | 37 USD | 397 | 398 | ## Visualize your insights with Amazon QuickSight 399 | 400 | To create some example visualizations from the processed text data, follow the instructions in the [Creating visualizations with QuickSight.pdf](Creating_visualizations_with_QuickSight.pdf) file. 401 | 402 | ## Clean up 403 | 404 | If you don't want to continue using the sample, clean up its resources to avoid further charges. 405 | 406 | Start by deleting the backend AWS CloudFormation stack, which will, in turn, remove the underlying resources it created. Run the following command (replace `<stack-name>` with the name of your backend stack): 407 | 408 | ``` 409 | sam delete --stack-name <stack-name> 410 | ``` 411 | 412 | Additionally, if you deployed the X.com streaming application, delete the resources AWS Copilot set up for the container application with: 413 | 414 | ``` 415 | copilot svc delete --name stream-getter 416 | copilot env delete --name test 417 | copilot app delete 418 | ``` 419 | 420 | ## License 421 | 422 | This project is licensed under the MIT-0 License. See the [LICENSE](LICENSE) file. -------------------------------------------------------------------------------- /architecture.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/ai-powered-text-insights/ace836af93c6e0c16de8eab606bf0b83d7868156/architecture.pdf -------------------------------------------------------------------------------- /architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/ai-powered-text-insights/ace836af93c6e0c16de8eab606bf0b83d7868156/architecture.png -------------------------------------------------------------------------------- /architecture.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/ai-powered-text-insights/ace836af93c6e0c16de8eab606bf0b83d7868156/architecture.pptx -------------------------------------------------------------------------------- /backend/lambdas/custom_resources/lookout4metrics/lambda.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
2 | # SPDX-License-Identifier: MIT-0 3 | 4 | import boto3 5 | import logging 6 | import uuid 7 | import cfnresponse 8 | from crhelper import CfnResource 9 | 10 | logger = logging.getLogger(__name__) 11 | helper = CfnResource(log_level="INFO") 12 | L4M = boto3.client("lookoutmetrics") 13 | 14 | 15 | def create_detector(project_name, frequency): 16 | response_create = '' 17 | 18 | response_create = L4M.create_anomaly_detector( 19 | AnomalyDetectorName=project_name + "-detector-" + str(uuid.uuid1())[:6], 20 | AnomalyDetectorDescription="Text insights anomaly detector", 21 | AnomalyDetectorConfig={ 22 | "AnomalyDetectorFrequency": frequency, 23 | }, 24 | ) 25 | 26 | logger.info(response_create) 27 | 28 | return response_create 29 | 30 | 31 | def define_dataset(detector_arn, project_name, frequency, athena_role_arn, athena_config): 32 | params = { 33 | "AnomalyDetectorArn": detector_arn, 34 | "MetricSetName": project_name + '-metric-set-1', 35 | "MetricList": [ 36 | { 37 | "MetricName": "count", 38 | "AggregationFunction": "AVG", 39 | } 40 | ], 41 | 42 | "DimensionList": ["platform", "topic", "sentiment"], 43 | "Offset": 60, 44 | 45 | "TimestampColumn": { 46 | "ColumnName": "partition_timestamp", 47 | "ColumnFormat": "yyyy-MM-dd HH:mm:ss", 48 | }, 49 | 50 | # "Delay" : 120, # seconds the detector will wait before attempting to read latest data per current time and detection frequency below 51 | "MetricSetFrequency": frequency, 52 | 53 | "MetricSource": { 54 | "AthenaSourceConfig": { 55 | "RoleArn": athena_role_arn, 56 | "DatabaseName": athena_config['db_name'], 57 | "DataCatalog": athena_config['data_catalog'], 58 | "TableName": athena_config['table_name'], 59 | "WorkGroupName": athena_config['workgroup_name'], 60 | } 61 | }, 62 | } 63 | 64 | return params 65 | 66 | 67 | @helper.create 68 | @helper.update 69 | def create(event, context): 70 | logger.info(event) 71 | 72 | cfn_input = event["ResourceProperties"] 73 | target = cfn_input["Target"] 74 | 75 | try: 76 | 77 | athena_role_arn = target['AthenaRoleArn'] 78 | athena_config = { 79 | 'db_name': target['GlueDbName'], 80 | 'data_catalog': target['AwsDataCatalog'], 81 | 'table_name': target['GlueTableName'], 82 | 'workgroup_name': target['AthenaWorkgroupName'] 83 | } 84 | sns_role_arn = target['SnsRoleArn'] 85 | topic_arn = target['SnsTopicArn'] 86 | alert_threshold = target['AlertThreshold'] 87 | 88 | project = 'ai-text-insights' 89 | frequency = target['DetectorFrequency'] 90 | 91 | l4m_detector = create_detector(project, frequency) 92 | 93 | anomaly_detector_arn = l4m_detector["AnomalyDetectorArn"] 94 | dataset = define_dataset(anomaly_detector_arn, project, frequency, athena_role_arn, athena_config) 95 | 96 | L4M.create_metric_set(**dataset) 97 | 98 | L4M.activate_anomaly_detector(AnomalyDetectorArn=anomaly_detector_arn) 99 | 100 | L4M.create_alert( 101 | Action={ 102 | "SNSConfiguration": { 103 | "RoleArn": sns_role_arn, 104 | "SnsTopicArn": topic_arn 105 | } 106 | }, 107 | AlertDescription="Text insights alert", 108 | AlertName=project + "-alert-all", 109 | AnomalyDetectorArn=anomaly_detector_arn, 110 | AlertSensitivityThreshold=int(alert_threshold) 111 | ) 112 | 113 | cfnresponse.send(event, context, cfnresponse.SUCCESS, {'Data': 'Created L4M resource'}, anomaly_detector_arn) 114 | 115 | except Exception as e: 116 | 117 | logger.error(e) 118 | cfnresponse.send(event, context, cfnresponse.FAILED, {'Data': 'Failed to create L4M resource'}, '') 119 | 120 | 121 | @helper.delete 122 | def delete(event, _): 123 | logger.info(event) 124 | 125 | 
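# The handler below is the Lambda entry point for this CloudFormation custom resource.
# The crhelper CfnResource instance dispatches the incoming event to the functions
# decorated with @helper.create, @helper.update and @helper.delete above, based on the
# request type. Note that the delete hook only logs the event, so the anomaly detector
# created by this resource is not removed automatically when the stack is deleted.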
126 | def handler(_event, _context): 127 | helper(_event, _context) 128 | -------------------------------------------------------------------------------- /backend/lambdas/custom_resources/lookout4metrics/requirements.txt: -------------------------------------------------------------------------------- 1 | boto3==1.24.20 2 | botocore==1.27.20 3 | crhelper==2.0.11 4 | cfnresponse==1.1.2 -------------------------------------------------------------------------------- /backend/lambdas/locate_post/lambda.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | import boto3 5 | import os 6 | import logging 7 | 8 | logging.getLogger().setLevel(os.environ.get('LOG_LEVEL', 'WARNING').upper()) 9 | location_service = boto3.client('location') 10 | 11 | def handler(event, context): 12 | 13 | item = event 14 | 15 | logging.info('Item is located in ' + item['location']) 16 | res_locations = location_service.search_place_index_for_text(FilterCountries=[os.environ['GEO_REGION']], 17 | IndexName=os.environ['PLACE_INDEX_NAME'], 18 | Language=os.environ['LANGUAGE'], 19 | MaxResults=1, 20 | Text=item['location'] 21 | ) 22 | logging.info(res_locations) 23 | 24 | if len(res_locations['Results']) >= 1: 25 | print(res_locations['Results'][0]) 26 | item['longitude'] = res_locations['Results'][0]['Place']['Geometry']['Point'][0] 27 | item['latitude'] = res_locations['Results'][0]['Place']['Geometry']['Point'][1] 28 | 29 | return item 30 | 31 | -------------------------------------------------------------------------------- /backend/lambdas/process_post/examples/examples_eng.py: -------------------------------------------------------------------------------- 1 | examples_eng = [ 2 | 3 | { 4 | "text": 5 | """ 6 | Six months ago, Wall Street Journal reporter Evan Gershkovich was detained in Russia during a reporting trip. 7 | He remains in a Moscow prison. 8 | 9 | We’re offering resources for those who want to show their support for him. #IStandWithEvan https://wsj.com/Evan 10 | """, 11 | "extraction": 12 | """ 13 | { 14 | "topic": "detention of a reporter", 15 | "location": "Moscow", 16 | "entities": ["Evan Gershkovich", "Wall Street Journal"], 17 | "keyphrases": ["reporter", "detained", "prison"], 18 | "sentiment": "negative", 19 | "links": ["https://wsj.com/Evan"], 20 | } 21 | """ 22 | }, 23 | { 24 | "text": 25 | """ 26 | We’re living an internal war": Once-peaceful Ecuador has become engulfed in the cocaine trade, and the bodies are piling up. 27 | """, 28 | "extraction": 29 | """ 30 | { 31 | "topic": "drug war", 32 | "location": "Ecuador", 33 | "entities": ["Ecuador"], 34 | "keyphrases": ["drug war", "cocaine trade"], 35 | "sentiment": "negative", 36 | "links": [], 37 | } 38 | """ 39 | }, 40 | { 41 | "text": 42 | """ 43 | House Democrats will soon face a difficult decision: Are they better off keeping Kevin McCarthy \ 44 | as House speaker, or taking chances with someone else? 45 | """, 46 | "extraction": 47 | """ 48 | { 49 | "topic": "house speaker choice", 50 | "location": "", 51 | "entities": ["Kevin McCarthy", "House Democrats"], 52 | "keyphrases": ["house speaker", "house democrats"], 53 | "sentiment": "neutral", 54 | "links": [], 55 | } 56 | """ 57 | }, 58 | { 59 | "text": 60 | """ 61 | A postpandemic hiring spree has left airports vulnerable to security gaps as new staff gain access to secure areas, \ 62 | creating an opening for criminal groups. 
63 | """, 64 | "extraction": 65 | """ 66 | { 67 | "topic": "airport security vulnerabilities", 68 | "location": "", 69 | "entities": [], 70 | "keyphrases": ["security gaps", "secure areas", "criminal groups"], 71 | "sentiment": "negative", 72 | "links": [], 73 | } 74 | """ 75 | } 76 | ] -------------------------------------------------------------------------------- /backend/lambdas/process_post/examples/examples_es.py: -------------------------------------------------------------------------------- 1 | examples_es = [ 2 | { 3 | "text": 4 | """ 5 | Inspírese para innovar en #AWSreInvent 6 | 7 | Asista a la ponencia principal y escuche a los principales líderes de #AWS revelar los últimos lanzamientos de productos y compartir sus 8 | opiniones sobre las últimas tendencias en #CloudComputing 9 | 10 | Regístrese: https://go.aws/3RBhhRq 11 | """, 12 | "extraction": 13 | """ 14 | { 15 | "topic": "ponencia principal de reinvent", 16 | "location": "", 17 | "entities": ["AWS", "AWSreInvent"], 18 | "keyphrases": ["ponencia principal", "lanzamientos de productos", "#CloudComputing", "#AWSreInvent"], 19 | "sentiment": "neutral" 20 | "links: ["https://go.aws/3RBhhRq"] 21 | } 22 | """ 23 | }, 24 | { 25 | "text": 26 | """ 27 | Esta guerra fue forzada sobre nosotros por un enemigo horrendo", dice Netanyahu 28 | 29 | Más información: https://cnn.it/3RKT5MK 30 | """, 31 | "extraction": 32 | """ 33 | { 34 | "topic": "Guerra", 35 | "location": "", 36 | "entities": ["Netanyahu", "enemigo horrendo"], 37 | "keyphrases": ["enemigo horrendo", "guerra"], 38 | "sentiment": "negativo" 39 | "links": ["https://cnn.it/3RKT5MK"] 40 | } 41 | """ 42 | }, 43 | { 44 | "text": 45 | """ 46 | En medio de la cobertura del conflicto entre Israel y Gaza, el equipo de CNN, que se encuentra en Ashdod, Israel, 47 | tuvo que resguardarse de un "bombardeo masivo de cohetes". 48 | """, 49 | "extraction": 50 | """ 51 | { 52 | "topic": "equipo de CNN se resguardo de un bombardeo", 53 | "location": "Ashdod, Israel", 54 | "entities": ["Israel", "Gaza", "CNN"], 55 | "keyphrases": ["bombardeo masivo de cohetes", "conflicto entre Israel y Gaza"], 56 | "sentiment": "negativo", 57 | "links":[] 58 | } 59 | """ 60 | }, 61 | { 62 | "text": 63 | """ 64 | El Consejo de la FIFA acordó por unanimidad celebrar el centenario de la Copa Mundial de la FIFA de la manera más apropiada. \ 65 | Tres países sudamericanos (Uruguay, Argentina y Paraguay) organizarán un partido cada uno de la Copa Mundial de la FIFA 2030. 66 | """, 67 | "extraction": 68 | """ 69 | { 70 | "topic": "Anfitriones copa del mundo 2030", 71 | "location": "", 72 | "entities": ["Uruguay", "Argentina", "Paraguay", "FIFA", "Copa Mundial de la FIFA"], 73 | "keyphrases": ["centenario de la Copa Mundial de la FIFA"], 74 | "sentiment": "positivo", 75 | "links":[] 76 | } 77 | """ 78 | } 79 | ] -------------------------------------------------------------------------------- /backend/lambdas/process_post/lambda.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 
2 | # SPDX-License-Identifier: MIT-0 3 | import copy 4 | import os 5 | import traceback 6 | 7 | import demoji 8 | 9 | import boto3 10 | import langchain_core 11 | from langchain_aws import ChatBedrock 12 | 13 | from output_models.models import ExtractedInformation, TopicMatch, TextWithInsights 14 | from prompt_selector import get_information_extraction_prompt_selector, get_topic_match_prompt_selector 15 | 16 | from aws_lambda_powertools.utilities.typing import LambdaContext 17 | from aws_lambda_powertools import Logger 18 | 19 | MODEL_ID = os.environ['MODEL_ID'] 20 | AWS_REGION = os.environ['AWS_REGION'] 21 | LANGUAGE_CODE = os.environ['LANGUAGE_CODE'] 22 | META_TOPICS_STR = os.environ['LABELS'] 23 | META_SENTIMENTS_STR = os.environ['SENTIMENT_LABELS'] 24 | 25 | INFORMATION_EXTRACTION_PROMPT_SELECTOR = get_information_extraction_prompt_selector(LANGUAGE_CODE) 26 | TOPIC_MATCH_PROMPT_SELECTOR = get_topic_match_prompt_selector(LANGUAGE_CODE) 27 | 28 | TOPIC_MATCH_MODEL_PARAMETERS = { 29 | "max_tokens": 500, 30 | "temperature": 0.1, 31 | "top_k": 20, 32 | } 33 | 34 | INFORMATION_EXTRACTION_MODEL_PARAMETERS = { 35 | "max_tokens": 1500, 36 | "temperature": 0.1, 37 | "top_k": 20, 38 | } 39 | 40 | bedrock_runtime = boto3.client( 41 | service_name="bedrock-runtime", 42 | region_name=AWS_REGION 43 | ) 44 | 45 | logger = Logger() 46 | 47 | langchain_core.globals.set_debug(True) 48 | 49 | def remove_unknown_values(extracted_info: ExtractedInformation): 50 | 51 | text_insights = copy.deepcopy(extracted_info) 52 | 53 | topic = extracted_info.topic.strip() 54 | location = extracted_info.location.strip() 55 | sentiment = extracted_info.sentiment.strip() 56 | 57 | text_insights.topic = topic if topic != "" else "" 58 | text_insights.location = location if location != "" else "" 59 | text_insights.sentiment = sentiment if sentiment != "" else "" 60 | 61 | return text_insights 62 | 63 | def text_topic_match( 64 | meta_topics: str, 65 | text: str, 66 | ) -> TopicMatch: 67 | 68 | bedrock_llm = ChatBedrock( 69 | model_id=MODEL_ID, 70 | model_kwargs=TOPIC_MATCH_MODEL_PARAMETERS, 71 | client=bedrock_runtime, 72 | ) 73 | 74 | claude_topic_match_prompt_template = TOPIC_MATCH_PROMPT_SELECTOR.get_prompt(MODEL_ID) 75 | 76 | structured_llm = bedrock_llm.with_structured_output(TopicMatch) 77 | 78 | structured_topic_match_chain = claude_topic_match_prompt_template | structured_llm 79 | 80 | topic_match_obj = structured_topic_match_chain.invoke({"meta_topics": meta_topics, "text": text}) 81 | 82 | return topic_match_obj 83 | 84 | def text_information_extraction( 85 | sentiments: str, 86 | text: str 87 | ) -> ExtractedInformation: 88 | 89 | bedrock_llm = ChatBedrock( 90 | model_id=MODEL_ID, 91 | model_kwargs=INFORMATION_EXTRACTION_MODEL_PARAMETERS, 92 | client=bedrock_runtime, 93 | ) 94 | 95 | claude_information_extraction_prompt_template = INFORMATION_EXTRACTION_PROMPT_SELECTOR.get_prompt(MODEL_ID) 96 | 97 | structured_llm = bedrock_llm.with_structured_output(ExtractedInformation) 98 | 99 | structured_chain = claude_information_extraction_prompt_template | structured_llm 100 | 101 | information_extraction_obj = structured_chain.invoke({ 102 | "text": text, 103 | "sentiments": sentiments 104 | }) 105 | 106 | return information_extraction_obj 107 | 108 | @logger.inject_lambda_context(log_event=True) 109 | def handler(event, _context: LambdaContext): 110 | 111 | item = event 112 | text = item['text'] 113 | 114 | clean_text = demoji.replace(text, "") 115 | 116 | # Attemp to extract information from text 117 | try: 118 | 119 
| topic_match = text_topic_match(META_TOPICS_STR, clean_text) 120 | 121 | if topic_match.is_match and len(topic_match.related_topics) > 0: 122 | 123 | try: 124 | 125 | insights = text_information_extraction(META_SENTIMENTS_STR, clean_text) 126 | logger.info(f'Text insights:') 127 | logger.info(insights) 128 | 129 | if insights is not None: 130 | 131 | logger.info("Removing values") 132 | logger.debug(insights) 133 | 134 | insights = remove_unknown_values(insights) 135 | 136 | logger.info(" removed") 137 | logger.debug(insights) 138 | 139 | # Create output object 140 | text_insights = TextWithInsights( 141 | text=item["text"], 142 | user=item["user"], 143 | created_at=item["created_at"], 144 | source=item["source"], 145 | platform=item["platform"], 146 | text_clean=clean_text, 147 | meta_topics=topic_match.related_topics, 148 | topic=insights.topic, 149 | location=insights.location, 150 | entities=insights.entities, 151 | keyphrases=insights.keyphrases, 152 | sentiment=insights.sentiment, 153 | links=insights.links, 154 | model_id=MODEL_ID, 155 | process_post=True, 156 | process_location=True if len(insights.location) > 0 else False # Process location only if there is one 157 | ) 158 | 159 | logger.info(f'Invoking next function') 160 | logger.debug(item) 161 | else: 162 | 163 | # Create output object 164 | text_insights = TextWithInsights( 165 | text=item["text"], 166 | user=item["user"], 167 | created_at=item["created_at"], 168 | source=item["source"], 169 | platform=item["platform"], 170 | text_clean=clean_text, 171 | model_id=MODEL_ID, 172 | process_post=False 173 | ) 174 | 175 | logger.info(f'Could not extract information from this text') 176 | logger.debug(item) 177 | 178 | return text_insights.dict() 179 | 180 | except Exception as e: 181 | 182 | logger.error("Unable to extract data from text") 183 | logger.error(traceback.format_exc()) 184 | 185 | raise Exception("Unable to extract data from text") 186 | 187 | else: 188 | logger.info(f'Topic not matched') 189 | return {'process_post': False} 190 | 191 | except Exception as e: 192 | 193 | logger.error(traceback.format_exc()) 194 | 195 | raise Exception("Unable to match topics on text") -------------------------------------------------------------------------------- /backend/lambdas/process_post/output_models/models.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 
2 | # SPDX-License-Identifier: MIT-0 3 | 4 | from langchain_core.pydantic_v1 import BaseModel, Field 5 | from typing import List, Optional, Literal 6 | 7 | class ExtractedInformation(BaseModel): 8 | """Contains information extracted from the text""" 9 | topic: str = Field("", description="The main topic of the text") 10 | location: str = Field("", description="The location where the events occur, empty if no location can be inferred") 11 | entities: List[str] = Field([], description="The entities involved in the text") 12 | keyphrases: List[str] = Field([], description="The keyphrases in the short text") 13 | sentiment: str = Field("neutral", description="The overall sentiment of the text") 14 | links: List[str] = Field([], description="Any links found within the text") 15 | 16 | class TopicMatch(BaseModel): 17 | """Information regarding the match of a topic to a set of predefined topics""" 18 | understanding: str = Field(description="In your own words the topics to which this text is related") 19 | related_topics: List[str] = Field(description="The list of topics into which the input text can be classified") 20 | is_match: bool = Field(description="true if the text matches one of your topics of interest, false otherwise") 21 | 22 | 23 | class TextWithInsights(ExtractedInformation): 24 | """A text with its extracted insights""" 25 | text: str = Field(description="The original text as written by the user") 26 | user: str = Field(description="The user that wrote the text") 27 | created_at: str = Field(description="The datetime when the text was created") 28 | source: str = Field(description="The source platform from where this text was generated") 29 | platform: str = Field(description="The platform from where this text was generated") 30 | text_clean: str = Field(description="The original text after being pre-processed") 31 | process_post: bool = Field(description="Whether the text should be further processed or not") 32 | meta_topics: List[str] = Field([], description="A list of topics in which the text can be classified") 33 | process_location: bool = Field(False, description="Whether the text has a location that should be further processed") 34 | model_id: str = Field(description="The model id that was used to extract these insights") 35 | -------------------------------------------------------------------------------- /backend/lambdas/process_post/prompt_selector.py: -------------------------------------------------------------------------------- 1 | from langchain.chains.prompt_selector import ConditionalPromptSelector 2 | 3 | from langchain_core.prompts.chat import ChatPromptTemplate, HumanMessagePromptTemplate, SystemMessagePromptTemplate, \ 4 | AIMessagePromptTemplate 5 | from langchain_core.prompts.few_shot import FewShotChatMessagePromptTemplate 6 | 7 | from examples.examples_eng import examples_eng 8 | from examples.examples_es import examples_es 9 | 10 | from typing import Callable 11 | 12 | # ANTHROPIC CLAUDE 3 PROMPT TEMPLATES 13 | 14 | # English Prompts 15 | 16 | claude_information_extraction_system_prompt_en = """You are an information extraction system. Your task is to extract key information from the text that will be presented to you.
17 | 18 | Here are some basic rules for the task: 19 | 20 | - You can reason about the task and the information, take your time to think 21 | - You must classify the sentiment of the text in one of the following: {sentiments} 22 | - NEVER fill a value of your own for entries where you are unsure of and rather use default values 23 | 24 | Here are some examples of how to extract information from text: 25 | """ 26 | 27 | claude_information_extraction_user_prompt_en = """ 28 | Extract the information from the following text: 29 | 30 | {text} 31 | """ 32 | 33 | claude_topic_match_system_prompt_en = """ 34 | You are a highly accurate text classification system. Your task is to classify text into one or more categories. Even though you can classify text in any category \ 35 | for this task you are only interested in classifying the text in some of the topics listed below: 36 | 37 | 38 | {meta_topics} 39 | 40 | 41 | Here are some rules for the text classification you will perform: 42 | 43 | - You should not be forced to classify the text in any of the existing categories, its ok if you cant fit a text into the topics of interest 44 | - Only classify a text into a category if you are really sure it belongs to it 45 | - Use only the categories in the list for your classification 46 | - You may classify a text in any number of topics (including zero) but your result must always be a list (even if its empty) 47 | - You can classify text in multiple categories at once 48 | - You can reason about your task, take your time to think 49 | """ 50 | 51 | claude_topic_match_user_prompt_en = user_prompt = """ 52 | Classify the following text into one or more categories of your interest: 53 | 54 | {text} 55 | """ 56 | 57 | examples_prompt_template_eng = ChatPromptTemplate.from_messages( 58 | [ 59 | HumanMessagePromptTemplate.from_template("{text}", input_variables=["text"], validate_template=True), 60 | AIMessagePromptTemplate.from_template("{extraction}", input_variables=["extraction"], validate_template=True) 61 | ] 62 | ) 63 | 64 | few_shot_chat_prompt_eng = FewShotChatMessagePromptTemplate( 65 | example_prompt=examples_prompt_template_eng, 66 | examples=examples_eng, 67 | ) 68 | 69 | CLAUDE_INFORMATION_EXTRACTION_PROMPT_TEMPLATE_EN = ChatPromptTemplate.from_messages([ 70 | SystemMessagePromptTemplate.from_template(claude_information_extraction_system_prompt_en, input_variables=["sentiments"], validate_template=True), 71 | few_shot_chat_prompt_eng, 72 | HumanMessagePromptTemplate.from_template(claude_information_extraction_user_prompt_en, input_variables=["text"], validate_template=True), 73 | ]) 74 | 75 | CLAUDE_TOPIC_MATCH_PROMPT_TEMPLATE_EN = ChatPromptTemplate.from_messages([ 76 | SystemMessagePromptTemplate.from_template(claude_topic_match_system_prompt_en, input_variables=["meta_topics"], validate_template=True), 77 | HumanMessagePromptTemplate.from_template(claude_topic_match_user_prompt_en, input_variables=["text"], validate_template=True), 78 | ]) 79 | 80 | # Spanish Prompts 81 | 82 | claude_information_extraction_system_prompt_es = """ 83 | Eres un sistema de extraccion de informacion. Tu tarea consiste en extraer informacion clave del texto que te sera presentado. 
84 | 85 | Estas son algunas reglas basicas que debes seguir: 86 | 87 | - Puedes razonar sobre la tarea y la informacion, tomate un tiempo para pensar 88 | - Debes clasificar el sentimiento en uno de los siguientes: {sentiments} 89 | - NUNCA acompletes un valor del cual no estas seguro con valores tuyos, en su lugar emplea los valores por defecto para cada campo 90 | 91 | Aqui hay algunos ejemplos de como extraer informacion de texto: 92 | """ 93 | 94 | claude_information_extraction_user_prompt_es = """ 95 | Extrae la informacion del siguiente texto: 96 | 97 | {text} 98 | """ 99 | 100 | claude_topic_match_system_prompt_es = """ 101 | Eres un sistema de clasificacion de texto altamente preciso. Tu tarea es clasificar texto en una o mas categorias. Aunque eres capaz de clasificar texto en cualquier \ 102 | categoria para esta tarea solo estas interesado en clasificar el texto en algunas de las categorias listadas aqui: 103 | 104 | 105 | {meta_topics} 106 | 107 | 108 | Aqui hay algunas reglas para la clasificacion de texto que vas a realizar: 109 | 110 | - No te debes ver forzado a clasificar el texto en ninguna de las categorias existentes, esta bien si no puedes clasificar el texto en ninguno de los temas de interes 111 | - Solo clasifica el texto en una categoria si estas completamente seguro que pertenece a esa categoria 112 | - Usa solo las categorias en la lista para tu clasificacion 113 | - Puedes clasificar el texto en cualquier numero de categorias (incluso cero) pero tu respuesta siempre debe ser una lista (aunque sea una lista vacia) 114 | - Puedes clasificar el texto en multiples categorias a la vez 115 | - Puedes razonar sobre tu tarea, tomate tu tiempo para pensar 116 | """ 117 | 118 | claude_topic_match_user_prompt_es = """ 119 | Clasifica el siguiente texto en una o mas categorias de tu interes: 120 | 121 | {text} 122 | """ 123 | 124 | examples_prompt_template_es = ChatPromptTemplate.from_messages( 125 | [ 126 | HumanMessagePromptTemplate.from_template("{text}", input_variables=["text"], validate_template=True), 127 | AIMessagePromptTemplate.from_template("{extraction}", input_variables=["extraction"], validate_template=True) 128 | ] 129 | ) 130 | 131 | few_shot_chat_prompt_es = FewShotChatMessagePromptTemplate( 132 | example_prompt=examples_prompt_template_es, 133 | examples=examples_es, 134 | ) 135 | 136 | CLAUDE_INFORMATION_EXTRACTION_PROMPT_TEMPLATE_ES = ChatPromptTemplate.from_messages([ 137 | SystemMessagePromptTemplate.from_template(claude_information_extraction_system_prompt_es, input_variables=["sentiments"], validate_template=True), 138 | few_shot_chat_prompt_es, 139 | HumanMessagePromptTemplate.from_template(claude_information_extraction_user_prompt_es, 140 | input_variables=["text"], validate_template=True), 141 | ]) 142 | 143 | CLAUDE_TOPIC_MATCH_PROMPT_TEMPLATE_ES = ChatPromptTemplate.from_messages([ 144 | SystemMessagePromptTemplate.from_template(claude_topic_match_system_prompt_es, input_variables=["meta_topics"], 145 | validate_template=True), 146 | HumanMessagePromptTemplate.from_template(claude_topic_match_user_prompt_es, input_variables=["text"], 147 | validate_template=True), 148 | ]) 149 | 150 | 151 | def is_es(language: str) -> bool: 152 | return "es" == language 153 | 154 | 155 | def is_en(language: str) -> bool: 156 | return "en" == language 157 | 158 | 159 | def is_claude(model_id: str) -> bool: 160 | return "claude" in model_id 161 | 162 | 163 | def is_es_claude(language: str) -> Callable[[str], bool]: 164 | return lambda model_id: is_es(language) and
is_claude(model_id) 165 | 166 | 167 | def is_en_claude(language: str) -> Callable[[str], bool]: 168 | return lambda model_id: is_en(language) and is_claude(model_id) 169 | 170 | 171 | def get_information_extraction_prompt_selector(lang: str) -> ConditionalPromptSelector: 172 | return ConditionalPromptSelector( 173 | default_prompt=CLAUDE_INFORMATION_EXTRACTION_PROMPT_TEMPLATE_EN, 174 | conditionals=[ 175 | (is_es_claude(lang), CLAUDE_INFORMATION_EXTRACTION_PROMPT_TEMPLATE_ES), 176 | ] 177 | ) 178 | 179 | 180 | def get_topic_match_prompt_selector(lang: str) -> ConditionalPromptSelector: 181 | return ConditionalPromptSelector( 182 | default_prompt=CLAUDE_TOPIC_MATCH_PROMPT_TEMPLATE_EN, 183 | conditionals=[ 184 | (is_es_claude(lang), CLAUDE_TOPIC_MATCH_PROMPT_TEMPLATE_ES) 185 | ] 186 | ) -------------------------------------------------------------------------------- /backend/lambdas/process_post/requirements.txt: -------------------------------------------------------------------------------- 1 | demoji==1.1.0 2 | boto3 3 | langchain==0.2.6 4 | langchain_aws==0.1.9 5 | langchain-core==0.2.10 6 | aws-lambda-powertools -------------------------------------------------------------------------------- /backend/lambdas/process_post/text_preprocessing.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | import re 5 | 6 | class TweetPreprocessor: 7 | 8 | @staticmethod 9 | def remove_links(tweet): 10 | """Takes a string and removes web links from it""" 11 | tweet = re.sub(r'http\S+', '', tweet) # remove http links 12 | tweet = re.sub(r'bit.ly/\S+', '', tweet) # remove bitly links 13 | tweet = re.sub(r'pic.twitter\S+', '', tweet) 14 | return tweet 15 | 16 | @staticmethod 17 | def remove_users(tweet): 18 | """Takes a string and removes retweet and @user information""" 19 | tweet = re.sub('(RT\s@[A-Za-z]+[A-Za-z0-9-_]+):*', '', tweet) # remove re-tweet 20 | tweet = re.sub('(@[A-Za-z]+[A-Za-z0-9-_]+):*', '', tweet) # remove tweeted at 21 | return tweet 22 | 23 | @staticmethod 24 | def remove_hashtags(tweet): 25 | """Takes a string and removes any hash tags""" 26 | tweet = re.sub('(#[A-Za-z]+[A-Za-z0-9-_]+)', '', tweet) # remove hash tags 27 | return tweet 28 | 29 | @staticmethod 30 | def remove_av(tweet): 31 | """Takes a string and removes AUDIO/VIDEO tags or labels""" 32 | tweet = re.sub('VIDEO:', '', tweet) # remove 'VIDEO:' from start of tweet 33 | tweet = re.sub('AUDIO:', '', tweet) # remove 'AUDIO:' from start of tweet 34 | return tweet 35 | 36 | @staticmethod 37 | def preprocess(tweet): 38 | #tweet = tweet.encode('latin1', 'ignore').decode('latin1') 39 | tweet = tweet.lower() 40 | #tweet = TweetPreprocessor.remove_users(tweet) 41 | tweet = TweetPreprocessor.remove_links(tweet) 42 | #tweet = TweetPreprocessor.remove_hashtags(tweet) 43 | tweet = TweetPreprocessor.remove_av(tweet) 44 | tweet = ' '.join(tweet.split()) # Remove extra spaces 45 | return tweet.strip() 46 | 47 | @staticmethod 48 | def get_hash_tags(tweet): 49 | return re.findall(r"#(\w+)", tweet) -------------------------------------------------------------------------------- /backend/lambdas/save_post/lambda.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 
2 | # SPDX-License-Identifier: MIT-0 3 | 4 | import boto3 5 | import os 6 | import logging 7 | import datetime 8 | import uuid 9 | import io 10 | import json 11 | import jsonlines 12 | 13 | logging.getLogger().setLevel(os.environ.get('LOG_LEVEL', 'WARNING').upper()) 14 | s3 = boto3.client('s3') 15 | 16 | 17 | def create_multi_records(insights, key, dest_key): 18 | items = io.StringIO() 19 | item = {} 20 | 21 | item['created_at'] = insights['created_at'] 22 | item['timestamp'] = insights['timestamp'] 23 | item['user'] = insights['user'] 24 | item['platform'] = insights['platform'] 25 | item['text_clean'] = insights['text_clean'] 26 | item['count'] = 1 27 | 28 | if key == 'meta_topics': 29 | item['sentiment'] = insights['sentiment'].lower().strip() 30 | 31 | if insights['process_location']: 32 | item['location'] = insights['location'] 33 | item['longitude'] = insights['longitude'] 34 | item['latitude'] = insights['latitude'] 35 | 36 | # Write as JSON lines so that Athena can read them 37 | with jsonlines.Writer(items) as writer: 38 | for insights_key in insights[key]: 39 | if key == 'meta_topics': 40 | item[dest_key] = insights_key.lower().strip() 41 | else: 42 | item[dest_key] = insights_key 43 | writer.write(item) 44 | 45 | return items.getvalue() 46 | 47 | 48 | def handler(event, context): 49 | 50 | item = event 51 | 52 | utc_now = datetime.datetime.now(datetime.timezone.utc) 53 | item['timestamp'] = utc_now.strftime('%Y-%m-%d %H:%M:%S.%f') 54 | 55 | # Add count for anomaly detection 56 | item['count'] = 1 57 | 58 | created_at = datetime.datetime.strptime(item['created_at'], '%Y-%m-%dT%H:%M:%S.%fz') 59 | item['created_at'] = created_at.strftime('%Y-%m-%d %H:%M:%S.%f') 60 | 61 | logging.info("Item to be saved") 62 | logging.info(item) 63 | 64 | #Retrive objects 1:N 65 | phrases_json_lines = create_multi_records(item, key='keyphrases', dest_key='phrase') 66 | meta_topics_json_lines = create_multi_records(item, key='meta_topics', dest_key='topic') 67 | links_json_lines = create_multi_records(item, key='links', dest_key='link') 68 | 69 | # Create partition key for S3 70 | partition_timestamp = datetime.datetime(created_at.year, created_at.month, created_at.day, 0, 0, 0) 71 | key = f"{partition_timestamp.strftime('%Y-%m-%d %H:%M:%S')}/{str(uuid.uuid1())}.json" 72 | 73 | logging.info('Key phrases') 74 | logging.info(phrases_json_lines) 75 | 76 | logging.info('Meta topics') 77 | logging.info(meta_topics_json_lines) 78 | 79 | logging.info('Links') 80 | logging.info(links_json_lines) 81 | 82 | # Save item 83 | if item['process_location']: 84 | post = {'text': item['text'], 'user': item['user'], 'created_at': item['created_at'], 'source': item['source'], 85 | 'platform': item['platform'], 'text_clean': item['text_clean'], 'topic': item['topic'].lower().strip(), 86 | 'model': item['model_id'], 'sentiment': item['sentiment'].lower().strip(), 'location': item['location'], 87 | 'longitude': item['longitude'], 'latitude': item['latitude'], 'timestamp': item['timestamp'], 'count': item['count']} 88 | else: 89 | post = {'text': item['text'], 'user': item['user'], 'created_at': item['created_at'], 'source': item['source'], 90 | 'platform': item['platform'], 'text_clean': item['text_clean'], 'topic': item['topic'].lower().strip(), 91 | 'model': item['model_id'], 'sentiment': item['sentiment'].lower().strip(), 'timestamp': item['timestamp'], 92 | 'count': item['count']} 93 | 94 | s3.put_object( 95 | Body=json.dumps(post, ensure_ascii=False), 96 | Bucket=os.environ['POSTS_BUCKET'], 97 | Key='posts/' + key 98 
| ) 99 | 100 | # Save 1:N objectsº 101 | 102 | if phrases_json_lines: 103 | # Save key phrases 104 | s3.put_object( 105 | Body=phrases_json_lines, 106 | Bucket=os.environ['POSTS_BUCKET'], 107 | Key='phrases/' + key 108 | ) 109 | 110 | if meta_topics_json_lines: 111 | # Save meta topics 112 | s3.put_object( 113 | Body=meta_topics_json_lines, 114 | Bucket=os.environ['POSTS_BUCKET'], 115 | Key='topics/' + key 116 | ) 117 | 118 | if links_json_lines: 119 | # Save links 120 | s3.put_object( 121 | Body=links_json_lines, 122 | Bucket=os.environ['POSTS_BUCKET'], 123 | Key='links/' + key 124 | ) 125 | 126 | return {'success': True} 127 | -------------------------------------------------------------------------------- /backend/lambdas/save_post/requirements.txt: -------------------------------------------------------------------------------- 1 | jsonlines==3.1.0 -------------------------------------------------------------------------------- /backend/lambdas/workflow_from_sqs/lambda.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | import boto3 5 | import logging 6 | import os 7 | 8 | from aws_lambda_powertools.utilities.batch import BatchProcessor, EventType, batch_processor 9 | from aws_lambda_powertools.utilities.data_classes.sqs_event import SQSRecord 10 | 11 | processor = BatchProcessor(event_type=EventType.SQS) 12 | 13 | logging.getLogger().setLevel(os.environ.get('LOG_LEVEL', 'WARNING').upper()) 14 | 15 | def record_handler(record: SQSRecord): 16 | 17 | message_body = record.body 18 | 19 | if message_body: 20 | logging.info('SQS message body:') 21 | logging.info(message_body) 22 | #item = json.loads(message_body, strict=False) 23 | 24 | logging.info('Starting step function processing:') 25 | 26 | client = boto3.client('stepfunctions') 27 | response = client.start_execution( 28 | stateMachineArn=os.environ['PROCESS_POST_STATE_MACHINE'], 29 | input=message_body 30 | ) 31 | 32 | logging.info('Step function response:') 33 | logging.info(response) 34 | 35 | return response 36 | 37 | 38 | @batch_processor(record_handler=record_handler, processor=processor) 39 | def handler(event, context): 40 | return processor.response() 41 | 42 | 43 | 44 | -------------------------------------------------------------------------------- /backend/lambdas/workflow_from_sqs/requirements.txt: -------------------------------------------------------------------------------- 1 | aws-lambda-powertools==1.24.1 -------------------------------------------------------------------------------- /backend/state_machine/process_post.asl.json: -------------------------------------------------------------------------------- 1 | { 2 | "Comment": "A description of my state machine", 3 | "StartAt": "Extract Insights", 4 | "States": { 5 | "Extract Insights": { 6 | "Type": "Task", 7 | "Resource": "arn:aws:states:::lambda:invoke", 8 | "OutputPath": "$.Payload", 9 | "Parameters": { 10 | "Payload.$": "$", 11 | "FunctionName": "${ExtractInisghtsFunctionArn}" 12 | }, 13 | "Retry": [ 14 | { 15 | "ErrorEquals": [ 16 | "Lambda.ServiceException", 17 | "Lambda.AWSLambdaException", 18 | "Lambda.SdkClientException" 19 | ], 20 | "IntervalSeconds": 2, 21 | "MaxAttempts": 6, 22 | "BackoffRate": 2 23 | } 24 | ], 25 | "Next": "ProcessPost?" 
26 | }, 27 | "ProcessPost?": { 28 | "Type": "Choice", 29 | "Choices": [ 30 | { 31 | "Variable": "$.process_post", 32 | "BooleanEquals": true, 33 | "Next": "LocatePost?" 34 | } 35 | ], 36 | "Default": "Success" 37 | }, 38 | "Success": { 39 | "Type": "Succeed" 40 | }, 41 | "LocatePost?": { 42 | "Type": "Choice", 43 | "Choices": [ 44 | { 45 | "Variable": "$.process_location", 46 | "BooleanEquals": true, 47 | "Next": "Locate Post" 48 | } 49 | ], 50 | "Default": "Save Post" 51 | }, 52 | "Locate Post": { 53 | "Type": "Task", 54 | "Resource": "arn:aws:states:::lambda:invoke", 55 | "OutputPath": "$.Payload", 56 | "Parameters": { 57 | "Payload.$": "$", 58 | "FunctionName": "${AddLocationFunctionArn}" 59 | }, 60 | "Retry": [ 61 | { 62 | "ErrorEquals": [ 63 | "Lambda.ServiceException", 64 | "Lambda.AWSLambdaException", 65 | "Lambda.SdkClientException" 66 | ], 67 | "IntervalSeconds": 2, 68 | "MaxAttempts": 6, 69 | "BackoffRate": 2 70 | } 71 | ], 72 | "Next": "Save Post" 73 | }, 74 | "Save Post": { 75 | "Type": "Task", 76 | "Resource": "arn:aws:states:::lambda:invoke", 77 | "OutputPath": "$.Payload", 78 | "Parameters": { 79 | "Payload.$": "$", 80 | "FunctionName": "${SavePostFunctionArn}" 81 | }, 82 | "Retry": [ 83 | { 84 | "ErrorEquals": [ 85 | "Lambda.ServiceException", 86 | "Lambda.AWSLambdaException", 87 | "Lambda.SdkClientException" 88 | ], 89 | "IntervalSeconds": 2, 90 | "MaxAttempts": 6, 91 | "BackoffRate": 2 92 | } 93 | ], 94 | "End": true 95 | } 96 | } 97 | } -------------------------------------------------------------------------------- /backend/template.yaml: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | AWSTemplateFormatVersion: 2010-09-09 5 | Transform: AWS::Serverless-2016-10-31 6 | Description: "(SO9130) AI Powered Text Insights" 7 | 8 | Parameters: 9 | AthenaProjectionRangeStart: 10 | Type: String 11 | Description: Start date of Athena Partition Projection. Only Posts after this date appear in Athena history table. 
12 | Default: 2022-01-01 00:00:00 13 | AllowedPattern: '\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}' 14 | ConstraintDescription: Must be a valid date in the yyyy-MM-dd HH:mm:SS format 15 | Labels: 16 | Type: String 17 | Default: 'Health and fitness, Humor, Personal growth, Philanthropy, Recreation and leisure, Family, friends, and, relationships, Career, Finance, Education and training, Time management' 18 | SentimentCategories: 19 | Type: String 20 | Default: 'Positive, Negative, Neutral' 21 | ModelId: 22 | Type: String 23 | Default: 'anthropic.claude-3-sonnet-20240229-v1:0' 24 | AllowedValues: 25 | - 'anthropic.claude-3-haiku-20240307-v1:0' 26 | - 'anthropic.claude-3-sonnet-20240229-v1:0' 27 | Language: 28 | Type: String 29 | Description: The language that the text to be processed is in 30 | Default: "en" 31 | Region: 32 | Type: String 33 | Description: The region for the localization services 34 | Default: "USA" 35 | AnomalyAlertThreshold: 36 | Type: Number 37 | Description: An anomaly must be beyond this threshold to trigger an alert 38 | Default: 50 39 | AnomalyDetectionFrequency: 40 | Type: String 41 | Description: The frequency at which the detector will look for anomalies 42 | Default: PT1H 43 | 44 | Resources: 45 | 46 | ########################## 47 | # Logs for state machine # 48 | ########################## 49 | ExpressLogGroup: 50 | Type: AWS::Logs::LogGroup 51 | Properties: 52 | RetentionInDays: 7 53 | 54 | ################## 55 | # KMS Athena Key # 56 | ################## 57 | AthenaSourceKmsKey: 58 | Type: 'AWS::KMS::Key' 59 | Properties: 60 | Description: Used to encrypt athena data and ADs 61 | EnableKeyRotation: true 62 | KeyPolicy: 63 | Version: '2012-10-17' 64 | Statement: 65 | - Sid: Enable IAM User Permissions 66 | Effect: Allow 67 | Principal: 68 | AWS: !Sub arn:${AWS::Partition}:iam::${AWS::AccountId}:root 69 | Action: "kms:*" 70 | Resource: '*' 71 | 72 | ############## 73 | # Amazon SNS # 74 | ############## 75 | AlertTopic: 76 | Type: AWS::SNS::Topic 77 | Properties: 78 | KmsMasterKeyId: alias/aws/sns 79 | AlertTopicPolicy: 80 | Type: AWS::SNS::TopicPolicy 81 | Properties: 82 | PolicyDocument: 83 | Statement: 84 | - Sid: AllowPublishThroughSSLOnly 85 | Action: sns:Publish 86 | Effect: Deny 87 | Principal: '*' 88 | Resource: 89 | - !Ref AlertTopic 90 | Condition: 91 | Bool: 92 | aws:SecureTransport: false 93 | Topics: 94 | - !Ref AlertTopic 95 | 96 | ############## 97 | # SQS queues # 98 | ############## 99 | PostsQueue: 100 | Type: AWS::SQS::Queue 101 | Properties: 102 | KmsMasterKeyId: alias/aws/sqs 103 | VisibilityTimeout: 200 104 | RedrivePolicy: 105 | deadLetterTargetArn: !GetAtt PostsDeadLetterQueue.Arn 106 | maxReceiveCount: 5 107 | PostsDeadLetterQueue: 108 | Type: AWS::SQS::Queue 109 | Properties: 110 | KmsMasterKeyId: alias/aws/sqs 111 | PostsQueuePolicy: 112 | Type: AWS::SQS::QueuePolicy 113 | Properties: 114 | Queues: 115 | - !Ref PostsQueue 116 | PolicyDocument: 117 | Statement: 118 | - Effect: Deny 119 | Principal: '*' 120 | Action: sqs:SendMessage 121 | Resource: !GetAtt PostsQueue.Arn 122 | Condition: 123 | Bool: 124 | aws:SecureTransport: false 125 | PostsDeadLetterQueuePolicy: 126 | Type: AWS::SQS::QueuePolicy 127 | Properties: 128 | Queues: 129 | - !Ref PostsDeadLetterQueue 130 | PolicyDocument: 131 | Statement: 132 | - Effect: Deny 133 | Principal: '*' 134 | Action: sqs:SendMessage 135 | Resource: !GetAtt PostsDeadLetterQueue.Arn 136 | Condition: 137 | Bool: 138 | aws:SecureTransport: false 139 | 140 | ############## 141 | # S3 buckets # 142 | 
############## 143 | PostsBucket: 144 | Type: AWS::S3::Bucket 145 | DeletionPolicy: Retain 146 | Properties: 147 | VersioningConfiguration: 148 | Status: Enabled 149 | BucketEncryption: 150 | ServerSideEncryptionConfiguration: 151 | - ServerSideEncryptionByDefault: 152 | SSEAlgorithm: AES256 153 | PublicAccessBlockConfiguration: 154 | BlockPublicAcls: true 155 | BlockPublicPolicy: true 156 | IgnorePublicAcls: true 157 | RestrictPublicBuckets: true 158 | OwnershipControls: 159 | Rules: 160 | - ObjectOwnership: ObjectWriter 161 | LoggingConfiguration: 162 | DestinationBucketName: !Ref LoggingBucket 163 | LogFilePrefix: "posts-bucket/" 164 | PostsBucketPolicy: 165 | Type: AWS::S3::BucketPolicy 166 | Properties: 167 | Bucket: !Ref PostsBucket 168 | PolicyDocument: 169 | Statement: 170 | - Effect: Deny 171 | Principal: "*" 172 | Action: "*" 173 | Resource: 174 | - !Sub "arn:aws:s3:::${PostsBucket}/*" 175 | - !Sub "arn:aws:s3:::${PostsBucket}" 176 | Condition: 177 | Bool: 178 | aws:SecureTransport: false 179 | AthenaResultsBucket: 180 | Type: AWS::S3::Bucket 181 | DeletionPolicy: Retain 182 | Properties: 183 | VersioningConfiguration: 184 | Status: Enabled 185 | BucketEncryption: 186 | ServerSideEncryptionConfiguration: 187 | - ServerSideEncryptionByDefault: 188 | SSEAlgorithm: AES256 189 | PublicAccessBlockConfiguration: 190 | BlockPublicAcls: true 191 | BlockPublicPolicy: true 192 | IgnorePublicAcls: true 193 | RestrictPublicBuckets: true 194 | OwnershipControls: 195 | Rules: 196 | - ObjectOwnership: ObjectWriter 197 | LoggingConfiguration: 198 | DestinationBucketName: !Ref LoggingBucket 199 | LogFilePrefix: "athena-results-bucket/" 200 | AthenaResultsBucketPolicy: 201 | Type: AWS::S3::BucketPolicy 202 | Properties: 203 | Bucket: !Ref AthenaResultsBucket 204 | PolicyDocument: 205 | Statement: 206 | - Effect: Deny 207 | Principal: "*" 208 | Action: "*" 209 | Resource: 210 | - !Sub "arn:aws:s3:::${AthenaResultsBucket}/*" 211 | - !Sub "arn:aws:s3:::${AthenaResultsBucket}" 212 | Condition: 213 | Bool: 214 | aws:SecureTransport: false 215 | LoggingBucket: 216 | Type: AWS::S3::Bucket 217 | DeletionPolicy: Retain 218 | Properties: 219 | VersioningConfiguration: 220 | Status: Enabled 221 | PublicAccessBlockConfiguration: 222 | BlockPublicAcls: true 223 | BlockPublicPolicy: true 224 | IgnorePublicAcls: true 225 | RestrictPublicBuckets: true 226 | OwnershipControls: 227 | Rules: 228 | - ObjectOwnership: ObjectWriter 229 | AccessControl: LogDeliveryWrite 230 | BucketEncryption: 231 | ServerSideEncryptionConfiguration: 232 | - ServerSideEncryptionByDefault: 233 | SSEAlgorithm: AES256 234 | Metadata: 235 | cfn_nag: 236 | rules_to_suppress: 237 | - id: W35 238 | reason: S3 Bucket access logging not needed here 239 | LoggingBucketPolicy: 240 | Type: AWS::S3::BucketPolicy 241 | Properties: 242 | Bucket: !Ref LoggingBucket 243 | PolicyDocument: 244 | Statement: 245 | - Effect: Deny 246 | Principal: "*" 247 | Action: "*" 248 | Resource: 249 | - !Sub "arn:aws:s3:::${LoggingBucket}/*" 250 | - !Sub "arn:aws:s3:::${LoggingBucket}" 251 | Condition: 252 | Bool: 253 | aws:SecureTransport: false 254 | 255 | ################################### 256 | # State Machine to classify posts # 257 | ################################### 258 | ProcessPostStateMachine: 259 | Type: AWS::Serverless::StateMachine 260 | Properties: 261 | Type: EXPRESS 262 | Logging: 263 | Level: ALL 264 | IncludeExecutionData: True 265 | Destinations: 266 | - CloudWatchLogsLogGroup: 267 | LogGroupArn: !GetAtt ExpressLogGroup.Arn 268 | DefinitionUri: 
state_machine/process_post.asl.json 269 | DefinitionSubstitutions: 270 | ExtractInisghtsFunctionArn: !GetAtt ExtractInsightsFunction.Arn 271 | AddLocationFunctionArn: !GetAtt AddLocationFunction.Arn 272 | SavePostFunctionArn: !GetAtt SavePostFunction.Arn 273 | Policies: 274 | - LambdaInvokePolicy: 275 | FunctionName: !Ref ExtractInsightsFunction 276 | - LambdaInvokePolicy: 277 | FunctionName: !Ref AddLocationFunction 278 | - LambdaInvokePolicy: 279 | FunctionName: !Ref SavePostFunction 280 | - Version: 2012-10-17 281 | Statement: 282 | - Effect: Allow 283 | Action: "logs:*" 284 | Resource: "*" 285 | 286 | ################################################################## 287 | # SQS queue Lambda function consumer (invokes the state machine) # 288 | ################################################################## 289 | TriggerOnSQSFunction: 290 | Type: AWS::Serverless::Function 291 | Properties: 292 | Runtime: python3.12 293 | Timeout: 180 294 | Handler: lambda.handler 295 | CodeUri: lambdas/workflow_from_sqs/ 296 | Policies: 297 | - SQSPollerPolicy: 298 | QueueName: !GetAtt PostsQueue.QueueName 299 | - StepFunctionsExecutionPolicy: 300 | StateMachineName: !GetAtt ProcessPostStateMachine.Name 301 | Environment: 302 | Variables: 303 | LOG_LEVEL: info 304 | PROCESS_POST_STATE_MACHINE: !Ref ProcessPostStateMachine 305 | Events: 306 | SQSEvent: 307 | Type: SQS 308 | Properties: 309 | Queue: !GetAtt PostsQueue.Arn 310 | BatchSize: 3 311 | FunctionResponseTypes: 312 | - ReportBatchItemFailures 313 | 314 | ######################################################## 315 | # Function to process the post # 316 | ######################################################## 317 | ExtractInsightsFunction: 318 | Type: AWS::Serverless::Function 319 | Properties: 320 | Runtime: python3.12 321 | Timeout: 180 322 | Handler: lambda.handler 323 | CodeUri: lambdas/process_post/ 324 | Policies: 325 | - Version: 2012-10-17 326 | Statement: 327 | - Effect: Allow 328 | Action: "bedrock:InvokeModel" 329 | Resource: "*" 330 | Environment: 331 | Variables: 332 | MODEL_ID: !Ref ModelId 333 | LABELS: !Ref Labels 334 | LOG_LEVEL: info 335 | LANGUAGE_CODE: !Ref Language 336 | SENTIMENT_LABELS: !Ref SentimentCategories 337 | 338 | ################################################ 339 | # Function to obtain coordinates from the post # 340 | ################################################ 341 | AddLocationFunction: 342 | Type: AWS::Serverless::Function 343 | Properties: 344 | Runtime: python3.12 345 | Timeout: 180 346 | Handler: lambda.handler 347 | CodeUri: lambdas/locate_post/ 348 | Policies: 349 | - Version: 2012-10-17 350 | Statement: 351 | - Effect: Allow 352 | Action: "geo:SearchPlaceIndexForText" 353 | Resource: 354 | - !GetAtt PlaceIndex.Arn 355 | Environment: 356 | Variables: 357 | LANGUAGE: !Ref Language 358 | GEO_REGION: !Ref Region 359 | PLACE_INDEX_NAME: !Ref PlaceIndex 360 | LOG_LEVEL: info 361 | 362 | ############################################## 363 | # Function to save the post and its metadata # 364 | ############################################## 365 | SavePostFunction: 366 | Type: AWS::Serverless::Function 367 | Properties: 368 | Runtime: python3.12 369 | Timeout: 180 370 | Handler: lambda.handler 371 | CodeUri: lambdas/save_post/ 372 | Policies: 373 | - S3WritePolicy: 374 | BucketName: !Ref PostsBucket 375 | Environment: 376 | Variables: 377 | POSTS_BUCKET: !Ref PostsBucket 378 | LOG_LEVEL: info 379 | 380 | ######################################################################## 381 | # Lambda to execute 
custom resource to create the L4M anomaly detector # 382 | ######################################################################## 383 | CreateL4MDetectorFunction: 384 | Type: AWS::Serverless::Function 385 | Properties: 386 | Runtime: python3.12 387 | Timeout: 180 388 | Handler: lambda.handler 389 | CodeUri: lambdas/custom_resources/lookout4metrics/ 390 | Policies: 391 | # Lookout for metrics policy (required for creating the detector) 392 | - Version: 2012-10-17 393 | Statement: 394 | - Effect: Allow 395 | Action: 396 | - lookoutmetrics:CreateAnomalyDetector 397 | - lookoutmetrics:CreateAlert 398 | - lookoutmetrics:ActivateAnomalyDetector 399 | - lookoutmetrics:CreateMetricSet 400 | Resource: 401 | - !Sub "arn:aws:lookoutmetrics:*:${AWS::AccountId}:MetricSet/*/*" 402 | - !Sub "arn:aws:lookoutmetrics:*:${AWS::AccountId}:Alert:*" 403 | - !Sub "arn:aws:lookoutmetrics:*:${AWS::AccountId}:AnomalyDetector:*" 404 | - !Sub "arn:aws:lookoutmetrics:*:${AWS::AccountId}:*" 405 | - Effect: Allow 406 | Action: "iam:PassRole" 407 | Resource: 408 | - !GetAtt AthenaSourceAccessRole.Arn 409 | - !GetAtt SnsPublishRole.Arn 410 | 411 | ####################################################################### 412 | # Custom resource to create anomaly detector with lookout for metrics # 413 | ####################################################################### 414 | CreateLookoutMetricsResource: 415 | Type: Custom::CreateLookoutMetrics 416 | DependsOn: 417 | - AthenaSourceAccessRole 418 | - SnsPublishRole 419 | - AlertTopic 420 | - AthenaWorkGroup 421 | - GlueDatabase 422 | - GlueTopicsTable 423 | Properties: 424 | ServiceToken: !GetAtt CreateL4MDetectorFunction.Arn 425 | Target: 426 | AthenaRoleArn: !GetAtt AthenaSourceAccessRole.Arn 427 | SnsRoleArn: !GetAtt SnsPublishRole.Arn 428 | SnsTopicArn: !Ref AlertTopic 429 | AlertThreshold: !Ref AnomalyAlertThreshold 430 | AthenaWorkgroupName: !Ref AthenaWorkGroup 431 | GlueDbName: !Ref GlueDatabase 432 | GlueTableName: !Ref GlueTopicsTable 433 | AwsDataCatalog: AwsDataCatalog 434 | DetectorFrequency: !Ref AnomalyDetectionFrequency 435 | Body: | 436 | 437 | 438 | ################ 439 | # Athena setup # 440 | ################ 441 | AthenaWorkGroup: 442 | Type: AWS::Athena::WorkGroup 443 | Properties: 444 | Name: !Sub "${AWS::StackName}-PostsWorkGroup" 445 | WorkGroupConfiguration: 446 | EnforceWorkGroupConfiguration: false 447 | ResultConfiguration: 448 | EncryptionConfiguration: 449 | EncryptionOption: SSE_KMS 450 | KmsKey: !Ref AthenaSourceKmsKey 451 | OutputLocation: !Sub "s3://${AthenaResultsBucket}/" 452 | GlueDatabase: 453 | Type: AWS::Glue::Database 454 | Properties: 455 | CatalogId: !Ref "AWS::AccountId" 456 | DatabaseInput: 457 | Name: !Sub "${AWS::StackName}_db" 458 | GluePostsTable: 459 | Type: AWS::Glue::Table 460 | Properties: 461 | CatalogId: !Ref "AWS::AccountId" 462 | DatabaseName: !Ref GlueDatabase 463 | TableInput: 464 | Name: posts 465 | PartitionKeys: 466 | - Name: partition_timestamp 467 | Type: timestamp 468 | StorageDescriptor: 469 | Columns: 470 | - Name: longitude 471 | Type: float 472 | - Name: latitude 473 | Type: float 474 | - Name: location 475 | Type: string 476 | - Name: topic 477 | Type: string 478 | - Name: sentiment 479 | Type: string 480 | - Name: created_at 481 | Type: timestamp 482 | - Name: model 483 | Type: string 484 | - Name: notification 485 | Type: boolean 486 | - Name: timestamp 487 | Type: timestamp 488 | - Name: text 489 | Type: string 490 | - Name: text_clean 491 | Type: string 492 | - Name: user 493 | Type: string 494 
| - Name: source 495 | Type: string 496 | - Name: count 497 | Type: tinyint 498 | - Name: platform 499 | Type: string 500 | Compressed: False 501 | Location: !Sub "s3://${PostsBucket}/posts" 502 | InputFormat: org.apache.hadoop.mapred.TextInputFormat 503 | OutputFormat: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat 504 | SerdeInfo: 505 | SerializationLibrary: org.apache.hive.hcatalog.data.JsonSerDe 506 | Parameters: 507 | "projection.enabled": "true" 508 | "projection.partition_timestamp.type": "date" 509 | "projection.partition_timestamp.format": "yyyy-MM-dd HH:mm:SS" 510 | "projection.partition_timestamp.range": !Sub "${AthenaProjectionRangeStart},NOW+1DAY" 511 | "projection.partition_timestamp.interval": "15" 512 | "projection.partition_timestamp.interval.unit": "MINUTES" 513 | "storage.location.template": !Sub "s3://${PostsBucket}/posts/${!partition_timestamp}/" 514 | TableType: EXTERNAL_TABLE 515 | GluePhrasesTable: 516 | Type: AWS::Glue::Table 517 | Properties: 518 | CatalogId: !Ref "AWS::AccountId" 519 | DatabaseName: !Ref GlueDatabase 520 | TableInput: 521 | Name: phrases 522 | PartitionKeys: 523 | - Name: partition_timestamp 524 | Type: timestamp 525 | StorageDescriptor: 526 | Columns: 527 | - Name: created_at 528 | Type: timestamp 529 | - Name: timestamp 530 | Type: timestamp 531 | - Name: text_clean 532 | Type: string 533 | - Name: user 534 | Type: string 535 | - Name: platform 536 | Type: string 537 | - Name: phrase 538 | Type: string 539 | - Name: count 540 | Type: tinyint 541 | Compressed: False 542 | Location: !Sub "s3://${PostsBucket}/phrases" 543 | InputFormat: org.apache.hadoop.mapred.TextInputFormat 544 | OutputFormat: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat 545 | SerdeInfo: 546 | SerializationLibrary: org.apache.hive.hcatalog.data.JsonSerDe 547 | Parameters: 548 | "projection.enabled": "true" 549 | "projection.partition_timestamp.type": "date" 550 | "projection.partition_timestamp.format": "yyyy-MM-dd HH:mm:SS" 551 | "projection.partition_timestamp.range": !Sub "${AthenaProjectionRangeStart},NOW+1DAY" 552 | "projection.partition_timestamp.interval": "1" 553 | "projection.partition_timestamp.interval.unit": "DAYS" 554 | "storage.location.template": !Sub "s3://${PostsBucket}/phrases/${!partition_timestamp}/" 555 | TableType: EXTERNAL_TABLE 556 | GlueTopicsTable: 557 | Type: AWS::Glue::Table 558 | Properties: 559 | CatalogId: !Ref "AWS::AccountId" 560 | DatabaseName: !Ref GlueDatabase 561 | TableInput: 562 | Name: topics 563 | PartitionKeys: 564 | - Name: partition_timestamp 565 | Type: timestamp 566 | StorageDescriptor: 567 | Columns: 568 | - Name: created_at 569 | Type: timestamp 570 | - Name: timestamp 571 | Type: timestamp 572 | - Name: text_clean 573 | Type: string 574 | - Name: user 575 | Type: string 576 | - Name: platform 577 | Type: string 578 | - Name: topic 579 | Type: string 580 | - Name: sentiment 581 | Type: string 582 | - Name: longitude 583 | Type: float 584 | - Name: latitude 585 | Type: float 586 | - Name: location 587 | Type: string 588 | - Name: count 589 | Type: tinyint 590 | Compressed: False 591 | Location: !Sub "s3://${PostsBucket}/topics" 592 | InputFormat: org.apache.hadoop.mapred.TextInputFormat 593 | OutputFormat: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat 594 | SerdeInfo: 595 | SerializationLibrary: org.apache.hive.hcatalog.data.JsonSerDe 596 | Parameters: 597 | "projection.enabled": "true" 598 | "projection.partition_timestamp.type": "date" 599 | "projection.partition_timestamp.format": "yyyy-MM-dd HH:mm:SS" 
600 | "projection.partition_timestamp.range": !Sub "${AthenaProjectionRangeStart},NOW+1DAY" 601 | "projection.partition_timestamp.interval": "1" 602 | "projection.partition_timestamp.interval.unit": "DAYS" 603 | "storage.location.template": !Sub "s3://${PostsBucket}/topics/${!partition_timestamp}/" 604 | TableType: EXTERNAL_TABLE 605 | GlueLinksTable: 606 | Type: AWS::Glue::Table 607 | Properties: 608 | CatalogId: !Ref "AWS::AccountId" 609 | DatabaseName: !Ref GlueDatabase 610 | TableInput: 611 | Name: links 612 | PartitionKeys: 613 | - Name: partition_timestamp 614 | Type: timestamp 615 | StorageDescriptor: 616 | Columns: 617 | - Name: created_at 618 | Type: timestamp 619 | - Name: timestamp 620 | Type: timestamp 621 | - Name: text_clean 622 | Type: string 623 | - Name: user 624 | Type: string 625 | - Name: platform 626 | Type: string 627 | - Name: link 628 | Type: string 629 | - Name: count 630 | Type: tinyint 631 | Compressed: False 632 | Location: !Sub "s3://${PostsBucket}/links" 633 | InputFormat: org.apache.hadoop.mapred.TextInputFormat 634 | OutputFormat: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat 635 | SerdeInfo: 636 | SerializationLibrary: org.apache.hive.hcatalog.data.JsonSerDe 637 | Parameters: 638 | "projection.enabled": "true" 639 | "projection.partition_timestamp.type": "date" 640 | "projection.partition_timestamp.format": "yyyy-MM-dd HH:mm:SS" 641 | "projection.partition_timestamp.range": !Sub "${AthenaProjectionRangeStart},NOW+1DAY" 642 | "projection.partition_timestamp.interval": "1" 643 | "projection.partition_timestamp.interval.unit": "DAYS" 644 | "storage.location.template": !Sub "s3://${PostsBucket}/links/${!partition_timestamp}/" 645 | TableType: EXTERNAL_TABLE 646 | 647 | ######################################### 648 | # Athena policy for Lookout for Metrics # 649 | ######################################### 650 | AthenaSourceAccessRole: 651 | Type: AWS::IAM::Role 652 | Properties: 653 | AssumeRolePolicyDocument: 654 | Statement: 655 | - Action: ['sts:AssumeRole'] 656 | Effect: Allow 657 | Principal: 658 | Service: ['lookoutmetrics.amazonaws.com'] 659 | AWS: !Sub "${AWS::AccountId}" 660 | Version: '2012-10-17' 661 | Policies: 662 | - PolicyDocument: 663 | Statement: 664 | - Action: 665 | - 'glue:GetCatalogImportStatus' 666 | - 'glue:GetDatabase' 667 | - 'glue:GetTable' 668 | Effect: Allow 669 | Resource: 670 | - !Sub "arn:${AWS::Partition}:glue:${AWS::Region}:${AWS::AccountId}:catalog" 671 | - !Sub "arn:${AWS::Partition}:glue:${AWS::Region}:${AWS::AccountId}:database/${GlueDatabase}" 672 | - !Sub "arn:${AWS::Partition}:glue:${AWS::Region}:${AWS::AccountId}:table/${GlueDatabase}/*" 673 | - Action: 674 | - 's3:GetObject' 675 | - 's3:ListBucket' 676 | - 's3:PutObject' 677 | - 's3:GetBucketLocation' 678 | - 's3:ListBucketMultipartUploads' 679 | - 's3:ListMultipartUploadParts' 680 | - 's3:AbortMultipartUpload' 681 | Effect: Allow 682 | Resource: 683 | - !Sub "arn:${AWS::Partition}:s3:::${AthenaResultsBucket}" 684 | - !Sub "arn:${AWS::Partition}:s3:::${AthenaResultsBucket}/*" 685 | - !Sub "arn:${AWS::Partition}:s3:::${PostsBucket}" 686 | - !Sub "arn:${AWS::Partition}:s3:::${PostsBucket}/*" 687 | - Action: 688 | - 'athena:CreatePreparedStatement' # On WG 689 | - 'athena:DeletePreparedStatement' # On WG 690 | - 'athena:GetDatabase' 691 | - 'athena:GetPreparedStatement' # On WG 692 | - 'athena:GetQueryExecution' 693 | - 'athena:GetQueryResults' # On WG 694 | - 'athena:GetQueryResultsStream' 695 | - 'athena:GetTableMetadata' 696 | - 'athena:GetWorkGroup' 697 | - 
'athena:StartQueryExecution' 698 | Effect: Allow 699 | Resource: 700 | - !Sub "arn:${AWS::Partition}:athena:${AWS::Region}:${AWS::AccountId}:datacatalog/*" 701 | - !Sub "arn:${AWS::Partition}:athena:${AWS::Region}:${AWS::AccountId}:workgroup/primary" 702 | - !Sub "arn:${AWS::Partition}:athena:${AWS::Region}:${AWS::AccountId}:workgroup/${AthenaWorkGroup}" 703 | - Action: [ 'kms:GenerateDataKey', 'kms:Decrypt' ] 704 | Effect: Allow 705 | Resource: 706 | - !GetAtt AthenaSourceKmsKey.Arn 707 | Version: '2012-10-17' 708 | PolicyName: 'AthenaAccessPolicy' 709 | 710 | ###################################### 711 | # SNS policy for Lookout for Metrics # 712 | ###################################### 713 | SnsPublishRole: 714 | Type: AWS::IAM::Role 715 | Properties: 716 | AssumeRolePolicyDocument: 717 | Statement: 718 | - Action: ['sts:AssumeRole'] 719 | Effect: Allow 720 | Principal: 721 | Service: ['lookoutmetrics.amazonaws.com'] 722 | AWS: !Sub "${AWS::AccountId}" 723 | Version: '2012-10-17' 724 | Policies: 725 | - PolicyDocument: 726 | Statement: 727 | - Action: 728 | - 'sns:Publish' 729 | Effect: Allow 730 | Resource: 731 | - !Ref AlertTopic 732 | Version: '2012-10-17' 733 | PolicyName: 'AthenaAccessPolicy' 734 | 735 | ############################################ 736 | # Place index creation (Location services) # 737 | ############################################ 738 | PlaceIndex: 739 | Type: AWS::Location::PlaceIndex 740 | Properties: 741 | DataSource: Esri 742 | Description: Place Index 743 | IndexName: !Sub 'PlaceIndex-${AWS::StackName}' 744 | 745 | 746 | Outputs: 747 | AwsRegion: 748 | Value: !Ref 'AWS::Region' 749 | PostsQueueUrl: 750 | Value: !Ref PostsQueue 751 | PostsQueueArn: 752 | Value: !GetAtt PostsQueue.Arn 753 | AlertTopicArn: 754 | Value: !Ref AlertTopic 755 | PostsBucketName: 756 | Value: !Ref PostsBucket 757 | PlaceIndexName: 758 | Value: !Ref PlaceIndex 759 | AthenaAccessRoleARN: 760 | Value: !GetAtt AthenaSourceAccessRole.Arn 761 | SnsPublishRoleARN: 762 | Value: !GetAtt SnsPublishRole.Arn 763 | KMSAthenaKey: 764 | Value: !Ref AthenaSourceKmsKey 765 | -------------------------------------------------------------------------------- /data-streammer/mexico_big_cities.csv: -------------------------------------------------------------------------------- 1 | Ciudad,Entidad Federativa 2 | , 3 | Ciudad de México, Distrito Federal 4 | Ecatepec,México 5 | Guadalajara,Jalisco 6 | Puebla,Puebla 7 | Juárez,Chihuahua 8 | Tijuana,Baja California 9 | León,Guanajuato 10 | Zapopan,Jalisco 11 | Monterrey,Nuevo León 12 | Nezahualcóyotl,México 13 | Chihuahua,Chihuahua 14 | Naucalpan,México 15 | Mérida,Yucatán 16 | San Luis Potosí,San Luis Potosí 17 | Aguascalientes,Aguascalientes 18 | Hermosillo,Sonora 19 | Saltillo,Coahuila 20 | Mexicali,Baja California 21 | Culiacán,Sinaloa 22 | Guadalupe,Nuevo León 23 | Acapulco,Guerrero 24 | Tlalnepantla,México 25 | Cancún,Quintana Roo 26 | Querétaro,Querétaro 27 | Chimalhuacán,México 28 | Torreón,Coahuila 29 | Morelia,Michoacán 30 | Reynosa,Tamaulipas 31 | Tlaquepaque,Jalisco 32 | Tuxtla,Chiapas 33 | Durango,Durango 34 | Toluca,México 35 | Ciudad López Mateos,México 36 | Cuautitlán Izcalli,México 37 | Apodaca,Nuevo León 38 | Matamoros,Tamaulipas 39 | San Nicolás de los Garza,Nuevo León 40 | Veracruz,Veracruz 41 | Xalapa,Veracruz 42 | Tonalá,Jalisco 43 | Mazatlán,Sinaloa 44 | Irapuato,Guanajuato 45 | Nuevo Laredo,Tamaulipas 46 | Xico,México 47 | Villahermosa,Tabasco 48 | General Escobedo,Nuevo León 49 | Celaya,Guanajuato 50 | Cuernavaca,Morelos 51 | 
Tepic,Nayarit 52 | Ixtapaluca,México 53 | Ciudad Victoria,Tamaulipas 54 | Ciudad Obregón,Sonora 55 | Tampico,Tamaulipas 56 | Villa Nicolás Romero,México 57 | Ensenada,Baja California 58 | San Francisco Coacalco,México 59 | Santa Catarina,Nuevo León 60 | Uruapan,Michoacán 61 | Gómez Palacio,Durango 62 | Los Mochis,Sinaloa 63 | Pachuca,Hidalgo 64 | Oaxaca,Oaxaca 65 | Soledad de Graciano Sánchez,San Luis Potosí 66 | Tehuacán,Puebla 67 | Ojo de Agua,México 68 | Coatzacoalcos,Veracruz 69 | Campeche,Campeche 70 | Monclova,Coahuila 71 | La Paz,Baja California Sur 72 | Nogales,Sonora 73 | Buenavista,México 74 | Puerto Vallarta,Jalisco 75 | Tapachula,Chiapas 76 | Ciudad Madero,Tamaulipas 77 | San Pablo de las Salinas,México 78 | Chilpancingo,Guerrero 79 | Poza Rica de Hidalgo,Veracruz 80 | Chicoloapan,México 81 | Ciudad del Carmen,Campeche 82 | Chalco,México 83 | Jiutepec,Morelos 84 | Salamanca,Guanajuato 85 | San Luis Río Colorado,Sonora 86 | San Cristóbal de las Casas,Chiapas 87 | Minatitlán,Veracruz 88 | Cuautla,Morelos 89 | Juárez,Nuevo León 90 | Chetumal,Quintana Roo 91 | Piedras Negras,Coahuila 92 | Playa del Carmen,Quintana Roo 93 | Zamora,Michoacán 94 | Córdoba,Veracruz 95 | San Juan del Río,Querétaro 96 | Colima,Colima 97 | Ciudad Acuña,Coahuila 98 | Manzanillo,Colima 99 | Zacatecas,Zacatecas 100 | Veracruz,Veracruz 101 | Ciudad Valles,San Luis Potosí 102 | Guadalupe,Zacatecas 103 | San Pedro Garza García,Nuevo León 104 | Naucalpan de Juárez,México 105 | Fresnillo,Zacatecas 106 | Orizaba,Veracruz 107 | Miramar,Tamaulipas 108 | Iguala,Guerrero 109 | Delicias,Chihuahua 110 | Villa de Álvarez,Colima 111 | Cuauhtémoc,Chihuahua 112 | Navojoa,Sonora 113 | Guaymas,Sonora 114 | Cuautitlán,México 115 | Texcoco de Mora,México 116 | Hidalgo del Parral,Chihuahua 117 | Tepexpan,México 118 | Tulancingo de Bravo,Hidalgo 119 | San Juan Bautista Tuxtepec,Oaxaca 120 | , 121 | , 122 | , 123 | , 124 | -------------------------------------------------------------------------------- /data-streammer/stream_posts.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | import json 5 | import time 6 | import boto3 7 | import argparse 8 | import pandas as pd 9 | import random 10 | from datetime import datetime 11 | 12 | parser = argparse.ArgumentParser(description='Stream sample posts to an Amazon SQS Queue for processing') 13 | parser.add_argument('--queue_url', type=str, help='Amazon SQS Queue URL') 14 | parser.add_argument('--region', type=str, help='The region where the AI Powered Text Insights is deployed') 15 | 16 | sample_cities = ['New York', 'Chicago', 'Kentucky', 'Arkansas', 'Miami', 'Los Angeles', 17 | 'San Francisco', 'New Jersey', 'San Diego', 'Orlando', 'Arlington', 'Washington DC', 18 | 'Las Vegas', 'Portland', 'Seattle', 'Austin', 'Phoenix'] 19 | 20 | # Create a function to send a JSON object to an SQS queue 21 | def send_sqs_message(queue_url, message): 22 | response = sqs.send_message( 23 | QueueUrl=queue_url, 24 | MessageBody=json.dumps(message) 25 | ) 26 | return response 27 | 28 | 29 | def send_batch(posts_batch, queue_url): 30 | 31 | for text in posts_batch: 32 | 33 | utc_now_str = datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%S.%fz") 34 | 35 | # Add a location randomly. 36 | if random.uniform(0, 1) < 0.1: 37 | text = text + '. 
From ' + random.choice(sample_cities) 38 | 39 | message = { 40 | "text": text, 41 | "user": "test_user", 42 | "created_at": utc_now_str, 43 | "source": "test", 44 | "platform": "test client" 45 | } 46 | 47 | print(message) 48 | response = send_sqs_message(queue_url, message) 49 | print(response) 50 | 51 | 52 | if __name__ == '__main__': 53 | 54 | args = parser.parse_args() 55 | 56 | queue_url = args.queue_url 57 | region = args.region 58 | 59 | sqs = boto3.client('sqs') 60 | translate = boto3.client(service_name='translate', region_name=region, use_ssl=True) 61 | 62 | new_years_resolutions_df = pd.read_csv('new_years_resolutions_tweets.csv') 63 | new_years_resolutions_df['text'] = new_years_resolutions_df['text'].map(lambda x: x.encode(encoding='UTF-8', errors='replace').decode()) 64 | 65 | resolutions_text = new_years_resolutions_df['text'].values.tolist() 66 | 67 | #Use Amazon translate to translate the text into Spanish 68 | delta_i = 20 69 | for i in range(0, len(resolutions_text), delta_i): 70 | 71 | end_i = i + delta_i if i + delta_i < len(resolutions_text) else len(resolutions_text) 72 | text_batch = resolutions_text[i:end_i] 73 | 74 | print(f'Streaming from {i} to {end_i}') 75 | 76 | send_batch(text_batch, queue_url) 77 | 78 | time.sleep(60) -------------------------------------------------------------------------------- /sample-files/phrases/2022-09-12/cb1bc907-32b8-11ed-a39b-010c818c1d51.json: -------------------------------------------------------------------------------- 1 | {"created_at": "2022-09-12 16:33:45.000000", "timestamp": "2022-09-12 16:34:38.130057", "user": "AmazonScience", "platform": "Twitter", "text_clean": "rt @parmidabeigi: amazon's 20b language model has fewer than 1/8 the number of parameters of gpt-3, yet outperforms it in several nlp tasks…", "count": 1, "phrase": "the number"} 2 | {"created_at": "2022-09-12 16:33:45.000000", "timestamp": "2022-09-12 16:34:38.130057", "user": "AmazonScience", "platform": "Twitter", "text_clean": "rt @parmidabeigi: amazon's 20b language model has fewer than 1/8 the number of parameters of gpt-3, yet outperforms it in several nlp tasks…", "count": 1, "phrase": "amazon"} 3 | {"created_at": "2022-09-12 16:33:45.000000", "timestamp": "2022-09-12 16:34:38.130057", "user": "AmazonScience", "platform": "Twitter", "text_clean": "rt @parmidabeigi: amazon's 20b language model has fewer than 1/8 the number of parameters of gpt-3, yet outperforms it in several nlp tasks…", "count": 1, "phrase": "20b language model"} 4 | {"created_at": "2022-09-12 16:33:45.000000", "timestamp": "2022-09-12 16:34:38.130057", "user": "AmazonScience", "platform": "Twitter", "text_clean": "rt @parmidabeigi: amazon's 20b language model has fewer than 1/8 the number of parameters of gpt-3, yet outperforms it in several nlp tasks…", "count": 1, "phrase": "gpt-3"} 5 | {"created_at": "2022-09-12 16:33:45.000000", "timestamp": "2022-09-12 16:34:38.130057", "user": "AmazonScience", "platform": "Twitter", "text_clean": "rt @parmidabeigi: amazon's 20b language model has fewer than 1/8 the number of parameters of gpt-3, yet outperforms it in several nlp tasks…", "count": 1, "phrase": "parameters"} 6 | {"created_at": "2022-09-12 16:33:45.000000", "timestamp": "2022-09-12 16:34:38.130057", "user": "AmazonScience", "platform": "Twitter", "text_clean": "rt @parmidabeigi: amazon's 20b language model has fewer than 1/8 the number of parameters of gpt-3, yet outperforms it in several nlp tasks…", "count": 1, "phrase": "tasks…"} 7 | {"created_at": "2022-09-12 
16:33:45.000000", "timestamp": "2022-09-12 16:34:38.130057", "user": "AmazonScience", "platform": "Twitter", "text_clean": "rt @parmidabeigi: amazon's 20b language model has fewer than 1/8 the number of parameters of gpt-3, yet outperforms it in several nlp tasks…", "count": 1, "phrase": "fewer than 1/8"} 8 | {"created_at": "2022-09-12 16:33:45.000000", "timestamp": "2022-09-12 16:34:38.130057", "user": "AmazonScience", "platform": "Twitter", "text_clean": "rt @parmidabeigi: amazon's 20b language model has fewer than 1/8 the number of parameters of gpt-3, yet outperforms it in several nlp tasks…", "count": 1, "phrase": "rt @parmidabeigi"} 9 | -------------------------------------------------------------------------------- /sample-files/phrases/2022-09-12/f4491b0d-32dd-11ed-96e7-1dd85b3a04d7.json: -------------------------------------------------------------------------------- 1 | {"created_at": "2022-09-12 20:59:49.000000", "timestamp": "2022-09-12 21:00:38.580243", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "what's with aws security services? #awsfirewallmanager adds support for aws waf custom requests and responses #awssecurityhub launches new security best practice control details on these & more security feature updates:", "count": 1, "phrase": "support"} 2 | {"created_at": "2022-09-12 20:59:49.000000", "timestamp": "2022-09-12 21:00:38.580243", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "what's with aws security services? #awsfirewallmanager adds support for aws waf custom requests and responses #awssecurityhub launches new security best practice control details on these & more security feature updates:", "count": 1, "phrase": "awssecurityhub"} 3 | {"created_at": "2022-09-12 20:59:49.000000", "timestamp": "2022-09-12 21:00:38.580243", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "what's with aws security services? #awsfirewallmanager adds support for aws waf custom requests and responses #awssecurityhub launches new security best practice control details on these & more security feature updates:", "count": 1, "phrase": "aws security services"} 4 | {"created_at": "2022-09-12 20:59:49.000000", "timestamp": "2022-09-12 21:00:38.580243", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "what's with aws security services? #awsfirewallmanager adds support for aws waf custom requests and responses #awssecurityhub launches new security best practice control details on these & more security feature updates:", "count": 1, "phrase": "new security"} 5 | {"created_at": "2022-09-12 20:59:49.000000", "timestamp": "2022-09-12 21:00:38.580243", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "what's with aws security services? #awsfirewallmanager adds support for aws waf custom requests and responses #awssecurityhub launches new security best practice control details on these & more security feature updates:", "count": 1, "phrase": "these & more security feature updates"} 6 | {"created_at": "2022-09-12 20:59:49.000000", "timestamp": "2022-09-12 21:00:38.580243", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "what's with aws security services? 
#awsfirewallmanager adds support for aws waf custom requests and responses #awssecurityhub launches new security best practice control details on these & more security feature updates:", "count": 1, "phrase": "best practice control details"} 7 | {"created_at": "2022-09-12 20:59:49.000000", "timestamp": "2022-09-12 21:00:38.580243", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "what's with aws security services? #awsfirewallmanager adds support for aws waf custom requests and responses #awssecurityhub launches new security best practice control details on these & more security feature updates:", "count": 1, "phrase": "awsfirewallmanager"} 8 | {"created_at": "2022-09-12 20:59:49.000000", "timestamp": "2022-09-12 21:00:38.580243", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "what's with aws security services? #awsfirewallmanager adds support for aws waf custom requests and responses #awssecurityhub launches new security best practice control details on these & more security feature updates:", "count": 1, "phrase": "aws waf custom requests and responses"} 9 | {"created_at": "2022-09-12 20:59:49.000000", "timestamp": "2022-09-12 21:00:38.580243", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "what's with aws security services? #awsfirewallmanager adds support for aws waf custom requests and responses #awssecurityhub launches new security best practice control details on these & more security feature updates:", "count": 1, "phrase": "#awsfirewallmanager"} 10 | -------------------------------------------------------------------------------- /sample-files/phrases/2022-09-13/0e79b260-33a6-11ed-a6f9-adab8308bd5b.json: -------------------------------------------------------------------------------- 1 | {"created_at": "2022-09-13 20:52:06.000000", "timestamp": "2022-09-13 20:53:01.877685", "user": "davidlaredo1", "platform": "Twitter", "text_clean": "accelerate innovation without sacrificing security. learn how migrating to the cloud can help decrease time to market, limit downtime & reduce cost. #aws #cloudcomputing #developer", "count": 1, "phrase": "cloudcomputing"} 2 | {"created_at": "2022-09-13 20:52:06.000000", "timestamp": "2022-09-13 20:53:01.877685", "user": "davidlaredo1", "platform": "Twitter", "text_clean": "accelerate innovation without sacrificing security. learn how migrating to the cloud can help decrease time to market, limit downtime & reduce cost. #aws #cloudcomputing #developer", "count": 1, "phrase": "market"} 3 | {"created_at": "2022-09-13 20:52:06.000000", "timestamp": "2022-09-13 20:53:01.877685", "user": "davidlaredo1", "platform": "Twitter", "text_clean": "accelerate innovation without sacrificing security. learn how migrating to the cloud can help decrease time to market, limit downtime & reduce cost. #aws #cloudcomputing #developer", "count": 1, "phrase": "the cloud"} 4 | {"created_at": "2022-09-13 20:52:06.000000", "timestamp": "2022-09-13 20:53:01.877685", "user": "davidlaredo1", "platform": "Twitter", "text_clean": "accelerate innovation without sacrificing security. learn how migrating to the cloud can help decrease time to market, limit downtime & reduce cost. #aws #cloudcomputing #developer", "count": 1, "phrase": "decrease time"} 5 | {"created_at": "2022-09-13 20:52:06.000000", "timestamp": "2022-09-13 20:53:01.877685", "user": "davidlaredo1", "platform": "Twitter", "text_clean": "accelerate innovation without sacrificing security. 
learn how migrating to the cloud can help decrease time to market, limit downtime & reduce cost. #aws #cloudcomputing #developer", "count": 1, "phrase": "developer"} 6 | {"created_at": "2022-09-13 20:52:06.000000", "timestamp": "2022-09-13 20:53:01.877685", "user": "davidlaredo1", "platform": "Twitter", "text_clean": "accelerate innovation without sacrificing security. learn how migrating to the cloud can help decrease time to market, limit downtime & reduce cost. #aws #cloudcomputing #developer", "count": 1, "phrase": "security"} 7 | {"created_at": "2022-09-13 20:52:06.000000", "timestamp": "2022-09-13 20:53:01.877685", "user": "davidlaredo1", "platform": "Twitter", "text_clean": "accelerate innovation without sacrificing security. learn how migrating to the cloud can help decrease time to market, limit downtime & reduce cost. #aws #cloudcomputing #developer", "count": 1, "phrase": "downtime &"} 8 | {"created_at": "2022-09-13 20:52:06.000000", "timestamp": "2022-09-13 20:53:01.877685", "user": "davidlaredo1", "platform": "Twitter", "text_clean": "accelerate innovation without sacrificing security. learn how migrating to the cloud can help decrease time to market, limit downtime & reduce cost. #aws #cloudcomputing #developer", "count": 1, "phrase": "innovation"} 9 | {"created_at": "2022-09-13 20:52:06.000000", "timestamp": "2022-09-13 20:53:01.877685", "user": "davidlaredo1", "platform": "Twitter", "text_clean": "accelerate innovation without sacrificing security. learn how migrating to the cloud can help decrease time to market, limit downtime & reduce cost. #aws #cloudcomputing #developer", "count": 1, "phrase": "cost"} 10 | {"created_at": "2022-09-13 20:52:06.000000", "timestamp": "2022-09-13 20:53:01.877685", "user": "davidlaredo1", "platform": "Twitter", "text_clean": "accelerate innovation without sacrificing security. learn how migrating to the cloud can help decrease time to market, limit downtime & reduce cost. #aws #cloudcomputing #developer", "count": 1, "phrase": "aws"} 11 | -------------------------------------------------------------------------------- /sample-files/phrases/2022-09-13/14a2cce4-33b3-11ed-9dd4-d752f0bb0867.json: -------------------------------------------------------------------------------- 1 | {"created_at": "2022-09-13 22:25:25.000000", "timestamp": "2022-09-13 22:26:15.670808", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "protect applications running on an amazon eks cluster by filtering outbound traffic based on the approved hostnames that are provided by sni in the network firewall allow list. learn how:", "count": 1, "phrase": "sni"} 2 | {"created_at": "2022-09-13 22:25:25.000000", "timestamp": "2022-09-13 22:26:15.670808", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "protect applications running on an amazon eks cluster by filtering outbound traffic based on the approved hostnames that are provided by sni in the network firewall allow list. learn how:", "count": 1, "phrase": "the approved hostnames"} 3 | {"created_at": "2022-09-13 22:25:25.000000", "timestamp": "2022-09-13 22:26:15.670808", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "protect applications running on an amazon eks cluster by filtering outbound traffic based on the approved hostnames that are provided by sni in the network firewall allow list. 
learn how:", "count": 1, "phrase": "applications"} 4 | {"created_at": "2022-09-13 22:25:25.000000", "timestamp": "2022-09-13 22:26:15.670808", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "protect applications running on an amazon eks cluster by filtering outbound traffic based on the approved hostnames that are provided by sni in the network firewall allow list. learn how:", "count": 1, "phrase": "the network firewall"} 5 | {"created_at": "2022-09-13 22:25:25.000000", "timestamp": "2022-09-13 22:26:15.670808", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "protect applications running on an amazon eks cluster by filtering outbound traffic based on the approved hostnames that are provided by sni in the network firewall allow list. learn how:", "count": 1, "phrase": "outbound traffic"} 6 | {"created_at": "2022-09-13 22:25:25.000000", "timestamp": "2022-09-13 22:26:15.670808", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "protect applications running on an amazon eks cluster by filtering outbound traffic based on the approved hostnames that are provided by sni in the network firewall allow list. learn how:", "count": 1, "phrase": "an amazon eks cluster"} 7 | {"created_at": "2022-09-13 22:25:25.000000", "timestamp": "2022-09-13 22:26:15.670808", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "protect applications running on an amazon eks cluster by filtering outbound traffic based on the approved hostnames that are provided by sni in the network firewall allow list. learn how:", "count": 1, "phrase": "list"} 8 | -------------------------------------------------------------------------------- /sample-files/phrases/2022-09-13/88fd7722-33af-11ed-a356-b52a4d4d8f68.json: -------------------------------------------------------------------------------- 1 | {"created_at": "2022-09-13 22:00:00.000000", "timestamp": "2022-09-13 22:00:52.893886", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "protect applications running on amazon eks cluster by filtering outbound traffic based on the approved hostnames that are provided by sni in the network firewall allow list. learn how:", "count": 1, "phrase": "the approved hostnames"} 2 | {"created_at": "2022-09-13 22:00:00.000000", "timestamp": "2022-09-13 22:00:52.893886", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "protect applications running on amazon eks cluster by filtering outbound traffic based on the approved hostnames that are provided by sni in the network firewall allow list. learn how:", "count": 1, "phrase": "the network firewall"} 3 | {"created_at": "2022-09-13 22:00:00.000000", "timestamp": "2022-09-13 22:00:52.893886", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "protect applications running on amazon eks cluster by filtering outbound traffic based on the approved hostnames that are provided by sni in the network firewall allow list. learn how:", "count": 1, "phrase": "amazon eks cluster"} 4 | {"created_at": "2022-09-13 22:00:00.000000", "timestamp": "2022-09-13 22:00:52.893886", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "protect applications running on amazon eks cluster by filtering outbound traffic based on the approved hostnames that are provided by sni in the network firewall allow list. 
learn how:", "count": 1, "phrase": "sni"} 5 | {"created_at": "2022-09-13 22:00:00.000000", "timestamp": "2022-09-13 22:00:52.893886", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "protect applications running on amazon eks cluster by filtering outbound traffic based on the approved hostnames that are provided by sni in the network firewall allow list. learn how:", "count": 1, "phrase": "outbound traffic"} 6 | {"created_at": "2022-09-13 22:00:00.000000", "timestamp": "2022-09-13 22:00:52.893886", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "protect applications running on amazon eks cluster by filtering outbound traffic based on the approved hostnames that are provided by sni in the network firewall allow list. learn how:", "count": 1, "phrase": "list"} 7 | {"created_at": "2022-09-13 22:00:00.000000", "timestamp": "2022-09-13 22:00:52.893886", "user": "AWSSecurityInfo", "platform": "Twitter", "text_clean": "protect applications running on amazon eks cluster by filtering outbound traffic based on the approved hostnames that are provided by sni in the network firewall allow list. learn how:", "count": 1, "phrase": "applications"} 8 | -------------------------------------------------------------------------------- /sample-files/tweets/2022-09-12/cb1bc907-32b8-11ed-a39b-010c818c1d51.json: -------------------------------------------------------------------------------- 1 | {"text": "RT @ParmidaBeigi: Amazon's 20B language model has fewer than 1/8 the number of parameters of GPT-3, yet outperforms it in several NLP tasks…", "user": "AmazonScience", "created_at": "2022-09-12 16:33:45.000000", "source": "Twitter for iPhone", "platform": "Twitter", "text_clean": "rt @parmidabeigi: amazon's 20b language model has fewer than 1/8 the number of parameters of gpt-3, yet outperforms it in several nlp tasks…", "category_type": "machine learning", "category_type_score": 0.6919503211975098, "category_type_model_result": "{\"machine learning\": 0.6919503211975098, \"compute\": 0.18625091016292572, \"security\": 0.06642784178256989, \"database\": 0.02945137396454811, \"storage\": 0.025919586420059204}", "model": "SageMakerEndpoint-vRHkKJsUqRTs", "sentiment": "NEUTRAL", "timestamp": "2022-09-12 16:34:38.130057", "count": 1} -------------------------------------------------------------------------------- /sample-files/tweets/2022-09-12/f4491b0d-32dd-11ed-96e7-1dd85b3a04d7.json: -------------------------------------------------------------------------------- 1 | {"text": "What's 🆕 with AWS Security services? \n\n✔️ #AWSFirewallManager adds support for AWS WAF custom requests and responses\n✔️ #AWSSecurityHub launches new security best practice control\n\nDetails on these & more security feature updates: https://t.co/smwxXg4M5x https://t.co/GggLnb2HeK", "user": "AWSSecurityInfo", "created_at": "2022-09-12 20:59:49.000000", "source": "Sprinklr", "platform": "Twitter", "text_clean": "what's with aws security services? 
#awsfirewallmanager adds support for aws waf custom requests and responses #awssecurityhub launches new security best practice control details on these & more security feature updates:", "category_type": "security", "category_type_score": 0.8688084483146667, "category_type_model_result": "{\"security\": 0.8688084483146667, \"compute\": 0.049058590084314346, \"machine learning\": 0.029557108879089355, \"storage\": 0.027179645374417305, \"database\": 0.025396248325705528}", "model": "SageMakerEndpoint-vRHkKJsUqRTs", "sentiment": "NEUTRAL", "timestamp": "2022-09-12 21:00:38.580243", "count": 1} -------------------------------------------------------------------------------- /sample-files/tweets/2022-09-13/0e79b260-33a6-11ed-a6f9-adab8308bd5b.json: -------------------------------------------------------------------------------- 1 | {"text": "Accelerate innovation without sacrificing security. 🏎🔐☁️ Learn how migrating to the cloud can help decrease time to market, limit downtime & reduce cost. #AWS #CloudComputing #Developer https://t.co/qZ8XWJdz77", "user": "davidlaredo1", "created_at": "2022-09-13 20:52:06.000000", "source": "Twitter Web App", "platform": "Twitter", "text_clean": "accelerate innovation without sacrificing security. learn how migrating to the cloud can help decrease time to market, limit downtime & reduce cost. #aws #cloudcomputing #developer", "category_type": "security", "category_type_score": 0.7972265481948853, "category_type_model_result": "{\"security\": 0.7972265481948853, \"compute\": 0.16822879016399384, \"database\": 0.012693053111433983, \"storage\": 0.012405334040522575, \"machine learning\": 0.009446307085454464}", "model": "SageMakerEndpoint-vRHkKJsUqRTs", "sentiment": "NEUTRAL", "timestamp": "2022-09-13 20:53:01.877685", "count": 1} -------------------------------------------------------------------------------- /sample-files/tweets/2022-09-13/14a2cce4-33b3-11ed-9dd4-d752f0bb0867.json: -------------------------------------------------------------------------------- 1 | {"text": "Protect applications running on an Amazon EKS cluster by filtering outbound traffic based on the approved hostnames that are provided by SNI in the Network Firewall Allow list. Learn how: https://t.co/tlZeKspc49 https://t.co/YvQ2rkfREI", "user": "AWSSecurityInfo", "created_at": "2022-09-13 22:25:25.000000", "source": "Twitter Web App", "platform": "Twitter", "text_clean": "protect applications running on an amazon eks cluster by filtering outbound traffic based on the approved hostnames that are provided by sni in the network firewall allow list. learn how:", "category_type": "security", "category_type_score": 0.9138131737709045, "category_type_model_result": "{\"security\": 0.9138131737709045, \"compute\": 0.03599965199828148, \"machine learning\": 0.021898740902543068, \"storage\": 0.01451999880373478, \"database\": 0.013768448494374752}", "model": "SageMakerEndpoint-vRHkKJsUqRTs", "sentiment": "NEUTRAL", "timestamp": "2022-09-13 22:26:15.670808", "count": 1} -------------------------------------------------------------------------------- /sample-files/tweets/2022-09-13/88fd7722-33af-11ed-a356-b52a4d4d8f68.json: -------------------------------------------------------------------------------- 1 | {"text": "Protect applications running on Amazon EKS cluster by filtering outbound traffic based on the approved hostnames that are provided by SNI in the Network Firewall Allow list. 
Learn how: https://t.co/tlZeKsoEeB https://t.co/wibkmsEno2", "user": "AWSSecurityInfo", "created_at": "2022-09-13 22:00:00.000000", "source": "Sprinklr", "platform": "Twitter", "text_clean": "protect applications running on amazon eks cluster by filtering outbound traffic based on the approved hostnames that are provided by sni in the network firewall allow list. learn how:", "category_type": "security", "category_type_score": 0.9045998454093933, "category_type_model_result": "{\"security\": 0.9045998454093933, \"compute\": 0.039112430065870285, \"machine learning\": 0.02197249047458172, \"database\": 0.01722169667482376, \"storage\": 0.017093613743782043}", "model": "SageMakerEndpoint-vRHkKJsUqRTs", "sentiment": "NEUTRAL", "timestamp": "2022-09-13 22:00:52.893886", "count": 1} -------------------------------------------------------------------------------- /stream-getter/.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # OS X extensions 10 | *.DS_Store 11 | 12 | # Distribution / packaging 13 | .Python 14 | build/ 15 | develop-eggs/ 16 | dist/ 17 | downloads/ 18 | eggs/ 19 | .eggs/ 20 | lib/ 21 | lib64/ 22 | parts/ 23 | sdist/ 24 | var/ 25 | wheels/ 26 | share/python-wheels/ 27 | *.egg-info/ 28 | .installed.cfg 29 | *.egg 30 | MANIFEST 31 | 32 | # PyInstaller 33 | # Usually these files are written by a python script from a template 34 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 35 | *.manifest 36 | *.spec 37 | 38 | # Installer logs 39 | pip-log.txt 40 | pip-delete-this-directory.txt 41 | 42 | # Unit test / coverage reports 43 | htmlcov/ 44 | .tox/ 45 | .nox/ 46 | .coverage 47 | .coverage.* 48 | .cache 49 | nosetests.xml 50 | coverage.xml 51 | *.cover 52 | *.py,cover 53 | .hypothesis/ 54 | .pytest_cache/ 55 | cover/ 56 | 57 | # Translations 58 | *.mo 59 | *.pot 60 | 61 | # Django stuff: 62 | *.log 63 | local_settings.py 64 | db.sqlite3 65 | db.sqlite3-journal 66 | 67 | # Flask stuff: 68 | instance/ 69 | .webassets-cache 70 | 71 | # Scrapy stuff: 72 | .scrapy 73 | 74 | # Sphinx documentation 75 | docs/_build/ 76 | 77 | # PyBuilder 78 | .pybuilder/ 79 | target/ 80 | 81 | # Jupyter Notebook 82 | .ipynb_checkpoints 83 | 84 | # IPython 85 | profile_default/ 86 | ipython_config.py 87 | 88 | # pyenv 89 | # For a library or package, you might want to ignore these files since the code is 90 | # intended to run in multiple environments; otherwise, check them in: 91 | # .python-version 92 | 93 | # pipenv 94 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 95 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 96 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 97 | # install all needed dependencies. 98 | #Pipfile.lock 99 | 100 | # poetry 101 | # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. 102 | # This is especially recommended for binary packages to ensure reproducibility, and is more 103 | # commonly ignored for libraries. 104 | # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control 105 | #poetry.lock 106 | 107 | # pdm 108 | # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. 
109 | #pdm.lock 110 | # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it 111 | # in version control. 112 | # https://pdm.fming.dev/#use-with-ide 113 | .pdm.toml 114 | 115 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm 116 | __pypackages__/ 117 | 118 | # Celery stuff 119 | celerybeat-schedule 120 | celerybeat.pid 121 | 122 | # SageMath parsed files 123 | *.sage.py 124 | 125 | # Environments 126 | .env 127 | .venv 128 | env/ 129 | venv/ 130 | ENV/ 131 | env.bak/ 132 | venv.bak/ 133 | 134 | # Spyder project settings 135 | .spyderproject 136 | .spyproject 137 | 138 | # Rope project settings 139 | .ropeproject 140 | 141 | # mkdocs documentation 142 | /site 143 | 144 | # mypy 145 | .mypy_cache/ 146 | .dmypy.json 147 | dmypy.json 148 | 149 | # Pyre type checker 150 | .pyre/ 151 | 152 | # pytype static type analyzer 153 | .pytype/ 154 | 155 | # Cython debug symbols 156 | cython_debug/ 157 | 158 | # PyCharm 159 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can 160 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore 161 | # and can be added to the global gitignore or merged into this file. For a more nuclear 162 | # option (not recommended) you can uncomment the following to ignore the entire idea folder. 163 | #.idea/ -------------------------------------------------------------------------------- /stream-getter/Dockerfile: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | FROM python:3 5 | 6 | WORKDIR /usr/src/app 7 | 8 | COPY requirements.txt ./ 9 | RUN pip install --no-cache-dir -r requirements.txt 10 | 11 | COPY *.py ./ 12 | 13 | CMD [ "python", "./main.py" ] 14 | -------------------------------------------------------------------------------- /stream-getter/backoff.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | import logging 5 | import time 6 | 7 | import requests 8 | 9 | 10 | class Backoff: 11 | 12 | def __init__(self): 13 | self.wait_time = 0 14 | 15 | def wait_on_exception(self, exception): 16 | if isinstance(exception, requests.exceptions.HTTPError): 17 | if exception.response.status_code == 429: 18 | # Rate limit exceeded: 19 | # Start with a 1 minute wait and double each attempt. 20 | self.update_wait_time(60, 2, 3600) 21 | else: 22 | # Other HTTP errors: 23 | # Start with a 5 second wait, doubling each attempt, up to 320 seconds. 24 | self.update_wait_time(5, 2, 320) 25 | elif isinstance(exception, requests.exceptions.RequestException): 26 | # Increase the delay in reconnects by 250ms each attempt, up to 16 seconds. 
27 | self.update_wait_time(.25, 1, 16) 28 | else: 29 | self.update_wait_time(1, 1, 1) 30 | logging.info(f'Sleeping for {self.wait_time} seconds...') 31 | time.sleep(self.wait_time) 32 | 33 | def update_wait_time(self, interval, multiplier, max_wait_time): 34 | if self.wait_time == 0: 35 | self.wait_time = interval 36 | else: 37 | self.wait_time = self.wait_time * multiplier if multiplier != 1 else self.wait_time + interval 38 | self.wait_time = self.wait_time if self.wait_time <= max_wait_time else max_wait_time 39 | 40 | def reset_wait_time(self): 41 | self.wait_time = 0 42 | -------------------------------------------------------------------------------- /stream-getter/main.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | import logging 5 | import os 6 | import queue 7 | import threading 8 | 9 | import requests 10 | 11 | from backoff import Backoff 12 | from sqs_helper import SqsHelper 13 | from stream_match import StreamMatch 14 | 15 | BEARER_TOKEN = os.environ.get('BEARER_TOKEN') 16 | REQUEST_CONNECT_TIMEOUT_SECONDS = 4 17 | REQUEST_READ_TIMEOUT_SECONDS = 30 # Twitter sends 20-second keep alive heartbeats 18 | STREAM_QUERY_PARAMS = 'tweet.fields=created_at,source&expansions=author_id&user.fields=username' 19 | 20 | matches_queue = queue.SimpleQueue() 21 | 22 | 23 | def bearer_oauth(r): 24 | r.headers['Authorization'] = f'Bearer {BEARER_TOKEN}' 25 | r.headers['User-Agent'] = 'v2FilteredStreamPython' 26 | return r 27 | 28 | 29 | def get_tweets_from_twitter(): 30 | backoff = Backoff() 31 | while True: 32 | try: 33 | response = requests.get( 34 | f'https://api.twitter.com/2/tweets/search/stream?{STREAM_QUERY_PARAMS}', 35 | auth=bearer_oauth, 36 | stream=True, 37 | timeout=(REQUEST_CONNECT_TIMEOUT_SECONDS, REQUEST_READ_TIMEOUT_SECONDS) 38 | ) 39 | response.raise_for_status() 40 | logging.info('Connected to the Twitter stream') 41 | backoff.reset_wait_time() 42 | for line in response.iter_lines(): 43 | if line: 44 | logging.info('New match!') 45 | decoded_line = line.decode('utf-8') 46 | logging.info(decoded_line) 47 | matches_queue.put(StreamMatch(decoded_line)) 48 | except Exception as exception: 49 | logging.error(exception, exc_info=True) 50 | backoff.wait_on_exception(exception) 51 | continue 52 | 53 | 54 | def send_tweets_to_sqs(sqs_helper): 55 | backoff = Backoff() 56 | while True: 57 | try: 58 | stream_match = matches_queue.get() 59 | if stream_match.has_errors(): 60 | logging.info('Skipping error') 61 | else: 62 | logging.info('Sending tweet to SQS') 63 | sqs_helper.send_tweet_to_sqs(stream_match) 64 | logging.info('Tweet sent to SQS') 65 | except Exception as exception: 66 | logging.error(exception, exc_info=True) 67 | backoff.wait_on_exception(exception) 68 | continue 69 | 70 | 71 | def main(): 72 | sqs_helper = SqsHelper(os.environ.get('SQS_QUEUE_URL')) 73 | producer = threading.Thread(target=get_tweets_from_twitter) 74 | consumer = threading.Thread(target=send_tweets_to_sqs, args=[sqs_helper]) 75 | producer.start() 76 | consumer.start() 77 | producer.join() 78 | consumer.join() 79 | 80 | 81 | if __name__ == '__main__': 82 | logging.basicConfig(level=os.environ.get('LOG_LEVEL', 'WARNING').upper()) 83 | main() 84 | -------------------------------------------------------------------------------- /stream-getter/requirements.txt: -------------------------------------------------------------------------------- 1 | backoff==1.11.1 2 | 
boto3==1.20.11 3 | botocore==1.23.11 4 | requests==2.27.0 5 | -------------------------------------------------------------------------------- /stream-getter/sqs_helper.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | import boto3 5 | 6 | 7 | class SqsHelper: 8 | 9 | def __init__(self, sqs_queue_url): 10 | self.sqs_client = boto3.client('sqs') 11 | self.sqs_queue_url = sqs_queue_url 12 | 13 | def send_tweet_to_sqs(self, stream_match): 14 | self.sqs_client.send_message( 15 | QueueUrl=self.sqs_queue_url, 16 | MessageBody=stream_match.to_tweet_json(), 17 | MessageAttributes={ 18 | 'matching_rule': { 19 | 'StringValue': stream_match.get_matching_rule(), 20 | 'DataType': 'String' 21 | } 22 | } 23 | ) 24 | -------------------------------------------------------------------------------- /stream-getter/stream_match.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | import json 5 | import datetime 6 | 7 | class StreamMatch: 8 | 9 | def __init__(self, content): 10 | self.content = content 11 | 12 | def to_tweet_json(self): 13 | input_object = json.loads(self.content) 14 | user = next(filter(lambda x: x['id'] == input_object['data']['author_id'], input_object['includes']['users'])) 15 | output_object = { 16 | 'text': input_object['data']['text'], 17 | 'user': user['username'], 18 | 'created_at': input_object['data']['created_at'] if 'created_at' in input_object['data'] else datetime.datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%S.000Z'), 19 | 'source': input_object['data']['source'] if 'source' in input_object['data'] else 'Undefined', 20 | 'platform': 'Twitter' 21 | } 22 | return json.dumps(output_object, ensure_ascii=False) 23 | 24 | def get_matching_rule(self): 25 | input_object = json.loads(self.content) 26 | matching_rule = input_object['matching_rules'][0]['tag'] 27 | return matching_rule 28 | 29 | def has_errors(self): 30 | return 'errors' in json.loads(self.content) 31 | --------------------------------------------------------------------------------
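
Usage note (not part of the repository): the snippet below is a minimal, illustrative sketch of how stream-getter's StreamMatch class can be exercised on its own. The payload shape is inferred only from the fields the class reads (data.text, data.author_id, data.created_at, data.source, includes.users, matching_rules); the concrete values are made up for demonstration.

    # Hypothetical Twitter v2 filtered-stream payload, containing only the
    # fields that stream_match.py actually reads. All values are illustrative.
    import json

    from stream_match import StreamMatch

    raw_line = json.dumps({
        "data": {
            "id": "1",
            "author_id": "42",
            "text": "Example tweet about #AWS",
            "created_at": "2022-09-13T22:00:00.000Z",
            "source": "Twitter Web App"
        },
        "includes": {"users": [{"id": "42", "username": "example_user"}]},
        "matching_rules": [{"id": "1", "tag": "aws"}]
    })

    match = StreamMatch(raw_line)
    if not match.has_errors():
        print(match.get_matching_rule())  # -> "aws"
        print(match.to_tweet_json())      # flat tweet object sent to SQS

Here to_tweet_json() flattens the nested stream payload into the single object (text, user, created_at, source, platform) that SqsHelper.send_tweet_to_sqs() sends as the SQS message body, with the matching rule tag attached as a message attribute.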