├── .gitignore ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── NOTICE ├── README.md ├── appdemo.gif ├── application ├── backend │ ├── pipeline_definitions │ │ ├── aws-search-generative.haystack-pipeline.yml │ │ └── aws-search.haystack-pipeline.yml │ └── search-api.Dockerfile └── frontend │ ├── search-ui.Dockerfile │ ├── utils.py │ └── webapp.py ├── cloud9 └── resize.sh ├── documentation ├── aws-cloud9-deployment.md ├── clean-up-ingestion-resources.md ├── ingest-aws-documentation.md ├── ingest-custom-local-documents.md ├── ingest-documents-from-url.md └── ingest-wait-for-completion.md ├── infrastructure ├── data.tf ├── locals.tf ├── main.tf ├── output.tf ├── providers.tf ├── terraform.tfvars ├── tf-backends.tf └── variable.tf ├── ingestion ├── .terraform.lock.hcl ├── Dockerfile ├── awsdocs.tfvars ├── awsdocs │ ├── .gitignore │ ├── data │ │ └── empty.txt │ ├── requirements.txt │ ├── scripts │ │ ├── 0_setup_env.sh │ │ ├── clone_awsdocs.sh │ │ ├── run-opensearch.sh │ │ ├── run_ingestion_awsdocs.sh │ │ ├── run_ingestion_local.sh │ │ └── run_ingestion_url.sh │ └── src │ │ ├── get_faqs.py │ │ ├── ingest-pagerank.py │ │ └── ingest.py ├── conda-env.yaml ├── main.tf ├── mydocs.tfvars ├── output.tf ├── run_ingestion_job_ecs.sh ├── urldocs.tfvars └── variable.tf ├── semantic-search-arch-application.png ├── semantic-search-arch-ingestion.png └── semantic-search-architecture.drawio /.gitignore: -------------------------------------------------------------------------------- 1 | /application/.env 2 | /application/env.sh 3 | /ingestion/conda-env 4 | /ingestion/data/awsdocs/ 5 | /ingestion/data/ 6 | # Local .terraform directories 7 | **/.terraform/* 8 | # .tfstate files 9 | *.tfstate 10 | *.tfstate.* 11 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. 
Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | this software and associated documentation files (the "Software"), to deal in 5 | the Software without restriction, including without limitation the rights to 6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | the Software, and to permit persons to whom the Software is furnished to do so. 8 | 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR 12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | -------------------------------------------------------------------------------- /NOTICE: -------------------------------------------------------------------------------- 1 | The Semantic Search on AWS Docs repository includes the following third-party software/licensing: 2 | - application/frontend/utils.py 3 | - application/frontend/webapp.py 4 | are taken and modified from Haystack (https://github.com/deepset-ai/haystack) 5 | - https://github.com/deepset-ai/haystack/tree/main/ui/ui/utils.py 6 | - https://github.com/deepset-ai/haystack/tree/main/ui/ui/webapp.py 7 | Copyright 2021 deepset GmbH. Licensed under the Apache License, Version 2.0 8 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Semantic Search on AWS Docs or Custom Documents 2 | 3 | This sample project demonstrates how to set up AWS infrastructure to perform semantic search and [question answering](https://en.wikipedia.org/wiki/Question_answering) on documents using a transformer machine learning models like BERT, RoBERTa, or GPT (via the [Haystack](https://github.com/deepset-ai/haystack) open source framework). 4 | 5 | As an example, users can type questions about AWS services and find answers from the AWS documentation or custom local documents. 6 | 7 | The deployed solution support 2 answering styles: 8 | - `extractive question answering` will find the semantically closest 9 | documents to the questions and highlight the most likeliest answer(s) in these documents. 10 | - `generative question answering`, also referred to as long form question answering (LFQA), will find the semantically closest documents to the question and generate a formulated answer. 11 | 12 | Please note that this project is intended for demo purposes, see disclaimers below. 
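Under the hood, both answering styles are served by the Haystack REST API that this project deploys alongside the Streamlit frontend. As a rough sketch of what the frontend sends to the backend, the request below posts a question to the API's `/query` endpoint (the endpoint name and parameter layout follow the Haystack REST API usage in `application/frontend/utils.py`; the host and port are placeholders, since the search API is reachable from inside the deployed VPC rather than through the public load balancer):

```shell
# Placeholder address: replace with a search API endpoint reachable from your network.
SEARCH_API=http://localhost:8000

# Ask a question; "Retriever"/"Reader" match the node names in the pipeline YAML files.
curl -s -X POST "$SEARCH_API/query" \
  -H "Content-Type: application/json" \
  -d '{"query": "How can I protect against DDoS attacks?",
       "params": {"Retriever": {"top_k": 5}, "Reader": {"top_k": 3}}}'
```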
13 | 14 | ![](appdemo.gif?raw=true) 15 | 16 | ## Architecture 17 | 18 | ![](semantic-search-arch-application.png?raw=true) 19 | 20 | The main components of this project are: 21 | 22 | * [Amazon OpenSearch Service](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/what-is.html) to store and search documents 23 | * The [AWS Documentation](https://github.com/awsdocs/) as a sample dataset loaded in the document store 24 | * The [Haystack framework](https://www.deepset.ai/haystack) to set up an extractive [Question Answering pipeline](https://haystack.deepset.ai/tutorials/first-qa-system) with: 25 | * A [Retriever](https://haystack.deepset.ai/pipeline_nodes/retriever) that searches all the documents and returns only the most relevant ones 26 | * Retriever used: [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) 27 | * A [Reader](https://haystack.deepset.ai/pipeline_nodes/reader) that uses the documents returned by the Retriever and selects a text span which is likely to contain the matching answer to the query 28 | * Reader used: [deepset/roberta-base-squad2](https://huggingface.co/deepset/roberta-base-squad2) 29 | * [Streamlit](https://streamlit.io/) to set up a frontend 30 | * [Terraform](https://www.terraform.io/) to automate the infrastructure deployment on AWS 31 | 32 | ## How to deploy the solution 33 | 34 | ### Deploy with AWS Cloud9 35 | Follow our [step-by-step deployment instructions](documentation/aws-cloud9-deployment.md) to deploy the semantic search application if you are new to AWS, Terraform, semantic search, or you prefer detailed setp-by-step instructions. 36 | 37 | For more general deployment instructions follow the sections below. 38 | 39 | ### General Deployment Instructions 40 | The backend folder contains a Terraform project that deploys an OpenSearch domain and 2 ECS services: 41 | 42 | * frontend: Streamlit-based UI built by Haystack ([repo](https://github.com/deepset-ai/haystack-demos/tree/main/explore_the_world)) 43 | * search API: REST API built by Haystack 44 | 45 | The main steps to deploy the solution are: 46 | 47 | * Deploy the terraform stack 48 | * Optional: Ingest the AWS documentation 49 | 50 | #### Pre-requisites 51 | 52 | * Terraform v1.0+ ([getting started guide](https://learn.hashicorp.com/collections/terraform/aws-get-started)) 53 | * Docker installed and running ([getting started guide](https://www.docker.com/get-started/)) 54 | * AWS CLI v2 installed and configured ([getting started guide](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html)) 55 | * An [EC2 Service Limit of at least 8 cores for G-instance type](https://aws.amazon.com/premiumsupport/knowledge-center/ec2-instance-limit/) if you want to deploy this solution with GPU acceleration. 56 | Alternatively, you can switch to a CPU instance by changing the `instance_type = "g4dn.2xlarge"` to a CPU instance in the `infrastructure/main.tf` file. 57 | 58 | #### Deploy the application infrastructure terraform stack 59 | 60 | * git clone this repository 61 | * **Configure** 62 | Configure and change the infrastructure region, subnets, availability zones in the `infrastructure/terraform.tfvars` file as needed 63 | * **Initialize** 64 | In this example the Terrform state is stored remotely and managed through a backend using S3 and a dynamodb table to acquire the state lock. This allows collaboration on the same Terraform infrastructure from different machines. 
65 | (If you prefer to use local state instead, just remove the `terraform { backend "s3" { ...}}` block from the `infrastructure/tf-backends.tf` file and run `terraform init` directly.) 66 | * Create an S3 bucket and a DynamoDB table to store the Terraform [state backend](https://www.terraform.io/language/settings/backends/s3) in a region of your choice. 67 | ```shell 68 | STATE_REGION= 69 | ``` 70 | ```shell 71 | S3_BUCKET= 72 | aws s3 mb s3://$S3_BUCKET --region=$STATE_REGION 73 | ``` 74 | ```shell 75 | SYNC_TABLE= 76 | aws dynamodb create-table --table-name $SYNC_TABLE --attribute-definitions AttributeName=LockID,AttributeType=S --key-schema AttributeName=LockID,KeyType=HASH --billing-mode PAY_PER_REQUEST --region=$STATE_REGION 77 | ``` 78 | * Change to the directory containing the application infrastructure's `infrastructure/main.tf` file 79 | ```shell 80 | cd infrastructure 81 | ``` 82 | * Initialize Terraform with the S3 remote state backend by running 83 | ```shell 84 | terraform init \ 85 | -backend-config="bucket=$S3_BUCKET" \ 86 | -backend-config="region=$STATE_REGION" \ 87 | -backend-config="dynamodb_table=$SYNC_TABLE" 88 | ``` 89 | 90 | * **Deploy** 91 | Run `terraform apply` and approve the changes by typing `yes`. 92 | ```shell 93 | terraform apply 94 | ``` 95 | ***Please note:*** _deployment can take a long time to push the container depending on the upload bandwidth of your machine. 96 | For faster deployment you can run the Terraform deployment from a development environment hosted inside the same AWS region, for example by using the [AWS Cloud9](https://aws.amazon.com/cloud9/) IDE._ 97 | * **Use** 98 | Once deployment is completed, browse to the output URL (`loadbalancer_url`) from the Terraform output to see the application. 99 | However, searches won't return any results until you ingest documents. 100 | * **Clean up** 101 | To remove all created resources of the application's infrastructure again, use 102 | ```shell 103 | terraform destroy 104 | ``` 105 | (If you used the ingestion Terraform stack below, make sure to destroy the ingestion resources first to avoid conflicts.) 106 | 107 | #### Ingest the AWS documentation 108 | 109 | This second Terraform stack builds, pushes, and runs a Docker container as an ECS task. 110 | The ingestion container downloads either a single awsdocs repo (e.g. `amazon-ec2-user-guide`) or all awsdocs repos (`full`, currently 256) and converts the .md files into .txt using pandoc. 111 | The .txt documents are then ingested into the application's OpenSearch cluster in the required Haystack format and become available for search. 112 | 113 | ![](semantic-search-arch-ingestion.png?raw=true) 114 | 115 | * Change from the `infrastructure` directory to the directory containing the ingestion's `ingestion/main.tf` 116 | ```shell 117 | cd ../ingestion 118 | ``` 119 | * Initialize Terraform 120 | (here we are using local state instead of a remote S3 backend for simplicity) 121 | ```shell 122 | terraform init 123 | ``` 124 | * Run the ingestion as a Terraform deployment. 125 | The S3 remote state file from the previous infrastructure deployment is needed here as input. 126 | It is used as a data source to read the infrastructure's output variables, such as the OpenSearch endpoint or private subnets.
127 | You can set the S3 bucket and its region either in the `infrastructure/terraform.tfvars` file or by passing the input variables via 128 | ```shell 129 | terraform apply \ 130 | -var="infra_region=$STATE_REGION" \ 131 | -var="infra_tf_state_s3_bucket=$S3_BUCKET" 132 | ``` 133 | ***Please note:*** _deployment can take a long time to push the container depending on the upload bandwidth of your machine. For faster deployment you can build and push the container in AWS, for example by using the [AWS Cloud9](https://aws.amazon.com/cloud9/) IDE._ 134 | * Once the previous step finishes, the ECS ingestion task is started. You can check its progress in the AWS console, for example in Amazon CloudWatch under the log group name `semantic-search` by checking the `ingestion-job` logs. After the task has finished successfully, the ingested documents are searchable via the application. 135 | * After running the ingestion job, you can remove the created ingestion resources (e.g. the ECR repository or task definition) by running 136 | ```shell 137 | terraform destroy \ 138 | -var="infra_region=$STATE_REGION" \ 139 | -var="infra_tf_state_s3_bucket=$S3_BUCKET" 140 | ``` 141 | 142 | #### Ingesting your own documents 143 | 144 | Take a look at `ingestion/awsdocs/src/ingest.py` to see how to adapt the ingestion script for your own documents. In brief, you can ingest local or downloaded files via: 145 | ```python 146 | # Create a wrapper for the existing OpenSearch document store 147 | document_store = OpenSearchDocumentStore(...) 148 | 149 | # Convert local files 150 | dicts_aws = convert_files_to_docs(dir_path=..., ...) 151 | 152 | # Write the documents to the OpenSearch document store 153 | document_store.write_documents(dicts_aws, index=...) 154 | 155 | # Compute and update the embeddings for each document with a transformer ML model. 156 | # An embedding is the vector representation that is learned by the transformer and that 157 | # allows us to capture and compare the semantic meaning of documents via this 158 | # vector representation. 159 | # Be sure to use the same model that you want to use later in the search pipeline. 160 | retriever = EmbeddingRetriever( 161 | document_store=document_store, 162 | model_format = "sentence_transformers", 163 | embedding_model = "all-mpnet-base-v2" 164 | ) 165 | document_store.update_embeddings(retriever) 166 | ``` 167 | 168 | ## Security 169 | 170 | See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information. 171 | 172 | ## Contributing 173 | 174 | If you want to contribute to Haystack, check out their [GitHub repository](https://github.com/deepset-ai/haystack). 175 | 176 | ## License 177 | 178 | This library is licensed under the MIT-0 License. See the LICENSE file. 179 | 180 | ## Disclaimer 181 | 182 | This solution is intended to demonstrate the functionality of using machine learning models for semantic search and question answering. It is not intended for production deployment as is. 183 | 184 | For best practices on modifying this solution for production use cases, please follow the [AWS well-architected guidance](https://aws.amazon.com/architecture/well-architected/).
185 | -------------------------------------------------------------------------------- /appdemo.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/semantic-search-aws-docs/6ae2b3b907cdb93666ec01e73acb8f03be245736/appdemo.gif -------------------------------------------------------------------------------- /application/backend/pipeline_definitions/aws-search-generative.haystack-pipeline.yml: -------------------------------------------------------------------------------- 1 | # To allow your IDE to autocomplete and validate your YAML pipelines, name them as .haystack-pipeline.yml 2 | 3 | version: "1.16.0" 4 | 5 | components: # define all the building-blocks for Pipeline 6 | - name: DocumentStore 7 | type: OpenSearchDocumentStore 8 | params: 9 | host: openSearchDomain.REGION.es.amazonaws.com 10 | index: indexname 11 | password: openSearchPassword 12 | port: 443 13 | return_embedding: false 14 | username: openSearchUsername 15 | similarity: cosine 16 | - name: Retriever 17 | type: EmbeddingRetriever 18 | params: 19 | document_store: DocumentStore # params can reference other components defined in the YAML 20 | top_k: 5 21 | embedding_model: sentence-transformers/all-mpnet-base-v2 22 | model_format: sentence_transformers 23 | - name: BM25Retriever 24 | type: BM25Retriever 25 | params: 26 | document_store: DocumentStore # params can reference other components defined in the YAML 27 | top_k: 5 28 | - name: Reader 29 | type: Seq2SeqGenerator 30 | params: 31 | model_name_or_path: vblagoje/bart_lfqa 32 | - name: TextFileConverter 33 | type: TextConverter 34 | - name: PDFFileConverter 35 | type: PDFToTextConverter 36 | - name: Preprocessor 37 | type: PreProcessor 38 | params: 39 | split_by: word 40 | split_length: 1000 41 | - name: FileTypeClassifier 42 | type: FileTypeClassifier 43 | 44 | pipelines: 45 | - name: query # a sample extractive-qa Pipeline 46 | nodes: 47 | - name: Retriever 48 | inputs: [Query] 49 | - name: Reader 50 | inputs: [Retriever] 51 | - name: indexing 52 | nodes: 53 | - name: FileTypeClassifier 54 | inputs: [File] 55 | - name: TextFileConverter 56 | inputs: [FileTypeClassifier.output_1] 57 | - name: PDFFileConverter 58 | inputs: [FileTypeClassifier.output_2] 59 | - name: Preprocessor 60 | inputs: [PDFFileConverter, TextFileConverter] 61 | - name: Retriever 62 | inputs: [Preprocessor] 63 | - name: DocumentStore 64 | inputs: [Retriever] 65 | 66 | 67 | -------------------------------------------------------------------------------- /application/backend/pipeline_definitions/aws-search.haystack-pipeline.yml: -------------------------------------------------------------------------------- 1 | # To allow your IDE to autocomplete and validate your YAML pipelines, name them as .haystack-pipeline.yml 2 | version: "1.16.0" 3 | 4 | components: # define all the building-blocks for Pipeline 5 | - name: DocumentStore 6 | type: OpenSearchDocumentStore 7 | params: 8 | host: openSearchDomain.REGION.es.amazonaws.com 9 | index: indexname 10 | password: openSearchPassword 11 | port: 443 12 | return_embedding: false 13 | username: openSearchUsername 14 | similarity: cosine 15 | - name: Retriever 16 | type: EmbeddingRetriever 17 | params: 18 | document_store: DocumentStore # params can reference other components defined in the YAML 19 | top_k: 5 20 | embedding_model: sentence-transformers/all-mpnet-base-v2 21 | model_format: sentence_transformers 22 | - name: BM25Retriever 23 | type: BM25Retriever 24 | params: 25 | 
document_store: DocumentStore # params can reference other components defined in the YAML 26 | top_k: 5 27 | - name: Reader # custom-name for the component; helpful for visualization & debugging 28 | type: FARMReader # Haystack Class name for the component 29 | params: 30 | model_name_or_path: deepset/roberta-base-squad2 31 | context_window_size: 500 32 | return_no_answer: true 33 | - name: TextFileConverter 34 | type: TextConverter 35 | - name: PDFFileConverter 36 | type: PDFToTextConverter 37 | - name: Preprocessor 38 | type: PreProcessor 39 | params: 40 | split_by: word 41 | split_length: 1000 42 | - name: FileTypeClassifier 43 | type: FileTypeClassifier 44 | 45 | pipelines: 46 | - name: query # a sample extractive-qa Pipeline 47 | nodes: 48 | - name: Retriever 49 | inputs: [Query] 50 | - name: Reader 51 | inputs: [Retriever] 52 | - name: indexing 53 | nodes: 54 | - name: FileTypeClassifier 55 | inputs: [File] 56 | - name: TextFileConverter 57 | inputs: [FileTypeClassifier.output_1] 58 | - name: PDFFileConverter 59 | inputs: [FileTypeClassifier.output_2] 60 | - name: Preprocessor 61 | inputs: [PDFFileConverter, TextFileConverter] 62 | - name: Retriever 63 | inputs: [Preprocessor] 64 | - name: DocumentStore 65 | inputs: [Retriever] -------------------------------------------------------------------------------- /application/backend/search-api.Dockerfile: -------------------------------------------------------------------------------- 1 | FROM deepset/haystack:gpu-v1.16.0 2 | 3 | COPY pipeline_definitions/* /home/user/rest_api/pipeline/ -------------------------------------------------------------------------------- /application/frontend/search-ui.Dockerfile: -------------------------------------------------------------------------------- 1 | FROM deepset/haystack-streamlit-ui@sha256:3584978ff23c7eb5f19596bde6f3eeca1bab65a8997758634db5563241e2b1cb 2 | 3 | COPY utils.py /home/user/ui 4 | COPY webapp.py /home/user/ui -------------------------------------------------------------------------------- /application/frontend/utils.py: -------------------------------------------------------------------------------- 1 | # Modified from https://github.com/deepset-ai/haystack/blob/main/ui/utils.py 2 | # commit 1a0197839c6ee0a90e0f562af5edf57a891d473a 3 | # under Apache-2.0 license 4 | ########################################################################################################### 5 | 6 | from typing import List, Dict, Any, Tuple, Optional 7 | 8 | import os 9 | import logging 10 | from time import sleep 11 | 12 | import requests 13 | import streamlit as st 14 | 15 | 16 | API_ENDPOINT = os.getenv("API_ENDPOINT", "http://localhost:8000") 17 | API_ENDPOINT_GENERATIVE = os.getenv("API_ENDPOINT_GENERATIVE", "http://localhost:9000") 18 | STATUS = "initialized" 19 | HS_VERSION = "hs_version" 20 | DOC_REQUEST = "query" 21 | DOC_FEEDBACK = "feedback" 22 | DOC_UPLOAD = "file-upload" 23 | 24 | 25 | def haystack_is_ready(answer_style): 26 | """ 27 | Used to show the "Haystack is loading..." 
message 28 | """ 29 | url = f"{API_ENDPOINT}/{STATUS}" 30 | if(answer_style=='Generative'): 31 | url = f"{API_ENDPOINT_GENERATIVE}/{STATUS}" 32 | try: 33 | if requests.get(url).status_code < 400: 34 | return True 35 | except Exception as e: 36 | logging.exception(e) 37 | sleep(1) # To avoid spamming a non-existing endpoint at startup 38 | return False 39 | 40 | 41 | @st.cache 42 | def haystack_version(answer_style): 43 | """ 44 | Get the Haystack version from the REST API 45 | """ 46 | url = f"{API_ENDPOINT}/{HS_VERSION}" 47 | if(answer_style=='Generative'): 48 | url = f"{API_ENDPOINT_GENERATIVE}/{HS_VERSION}" 49 | return requests.get(url, timeout=0.1).json()["hs_version"] 50 | 51 | 52 | def query(query, filters={}, top_k_reader=5, top_k_retriever=5, answer_style='Extractive', debug = False) -> Tuple[List[Dict[str, Any]], Dict[str, str]]: 53 | """ 54 | Send a query to the REST API and parse the answer. 55 | Returns both a ready-to-use representation of the results and the raw JSON. 56 | """ 57 | 58 | url = f"{API_ENDPOINT}/{DOC_REQUEST}" 59 | params = {"filters": filters, "Retriever": {"top_k": top_k_retriever}, "Reader": {"top_k": top_k_reader}, 'Query': {'debug': debug}} 60 | 61 | if(answer_style=='Generative'): 62 | url = f"{API_ENDPOINT_GENERATIVE}/{DOC_REQUEST}" 63 | params = {"filters": filters, "Retriever": {"top_k": top_k_retriever}, 'Query': {'debug': debug}} 64 | req = {"query": query, "params": params, "debug": debug} 65 | logging.info("req") 66 | logging.info(req) 67 | 68 | response_raw = requests.post(url, json=req) 69 | 70 | if response_raw.status_code >= 400 and response_raw.status_code != 503: 71 | raise Exception(f"{vars(response_raw)}") 72 | 73 | response = response_raw.json() 74 | if "errors" in response: 75 | raise Exception(", ".join(response["errors"])) 76 | logging.info("response") 77 | logging.info(response_raw) 78 | logging.info(response) 79 | 80 | # Format response 81 | results = [] 82 | answers = response["answers"] 83 | for answer in answers: 84 | if answer.get("answer", None): 85 | if(answer["type"]=="generative"): 86 | documents = [doc for doc in response["documents"] if doc["id"] in answer["document_ids"]] 87 | document_names = map(lambda doc: doc["meta"]["name"], documents) 88 | results.append( 89 | { 90 | "context": "", 91 | "answer": answer.get("answer", None), 92 | "source": ', '.join(document_names), 93 | "relevance": sum(map(lambda doc: doc["score"], documents))/len(documents), 94 | "document": documents[0], 95 | "offset_start_in_doc": 0, 96 | "_raw": answer, 97 | } 98 | ) 99 | else: 100 | results.append( 101 | { 102 | "context": "..." 
+ answer["context"] + "...", 103 | "answer": answer.get("answer", None), 104 | "source": answer["meta"]["name"], 105 | "relevance": round(answer["score"] * 100, 2), 106 | "document": [doc for doc in response["documents"] if doc["id"] == answer["document_ids"][0]][0] if answer["document_ids"] and len(answer["document_ids"]) > 0 else [], 107 | "offset_start_in_doc": answer["offsets_in_document"][0]["start"], 108 | "_raw": answer, 109 | } 110 | ) 111 | else: 112 | results.append( 113 | { 114 | "context": None, 115 | "answer": None, 116 | "document": None, 117 | "relevance": round(answer["score"] * 100, 2), 118 | "_raw": answer, 119 | } 120 | ) 121 | return results, response 122 | 123 | 124 | def send_feedback(query, answer_obj, is_correct_answer, is_correct_document, document) -> None: 125 | """ 126 | Send a feedback (label) to the REST API 127 | """ 128 | url = f"{API_ENDPOINT}/{DOC_FEEDBACK}" 129 | req = { 130 | "query": query, 131 | "document": document, 132 | "is_correct_answer": is_correct_answer, 133 | "is_correct_document": is_correct_document, 134 | "origin": "user-feedback", 135 | "answer": answer_obj, 136 | } 137 | response_raw = requests.post(url, json=req) 138 | if response_raw.status_code >= 400: 139 | raise ValueError(f"An error was returned [code {response_raw.status_code}]: {response_raw.json()}") 140 | 141 | 142 | def upload_doc(file): 143 | url = f"{API_ENDPOINT}/{DOC_UPLOAD}" 144 | files = [("files", file)] 145 | response = requests.post(url, files=files).json() 146 | return response 147 | 148 | 149 | def get_backlink(result) -> Tuple[Optional[str], Optional[str]]: 150 | if result.get("document", None): 151 | doc = result["document"] 152 | if isinstance(doc, dict): 153 | if doc.get("meta", None): 154 | if isinstance(doc["meta"], dict): 155 | if doc["meta"].get("url", None) and doc["meta"].get("title", None): 156 | return doc["meta"]["url"], doc["meta"]["title"] 157 | return None, None 158 | -------------------------------------------------------------------------------- /application/frontend/webapp.py: -------------------------------------------------------------------------------- 1 | # Modified from https://github.com/deepset-ai/haystack/blob/main/ui/webapp.py 2 | # commit 1a0197839c6ee0a90e0f562af5edf57a891d473a 3 | # under Apache-2.0 license 4 | ########################################################################################################### 5 | 6 | import os 7 | import sys 8 | import logging 9 | import random 10 | from pathlib import Path 11 | from json import JSONDecodeError 12 | 13 | import pandas as pd 14 | import streamlit as st 15 | from annotated_text import annotation 16 | from markdown import markdown 17 | 18 | from ui.utils import haystack_is_ready, query, send_feedback, upload_doc, haystack_version, get_backlink 19 | 20 | 21 | # Adjust to a question that you would like users to see in the search bar when they load the UI: 22 | DEFAULT_QUESTION_AT_STARTUP = os.getenv("DEFAULT_QUESTION_AT_STARTUP", "How to protect against DDoS attacks?") 23 | DEFAULT_ANSWER_AT_STARTUP = os.getenv("DEFAULT_ANSWER_AT_STARTUP", "AWS Shield is a managed Distributed Denial of Service (DDoS) protection service that safeguards applications running on AWS. 
AWS Shield provides always-on detection and automatic inline mitigations that minimize application downtime and latency, so there is no need to engage AWS Support to benefit from DDoS protection") 24 | 25 | # Sliders 26 | DEFAULT_DOCS_FROM_RETRIEVER = int(os.getenv("DEFAULT_DOCS_FROM_RETRIEVER", "3")) 27 | DEFAULT_NUMBER_OF_ANSWERS = int(os.getenv("DEFAULT_NUMBER_OF_ANSWERS", "3")) 28 | 29 | 30 | 31 | def set_state_if_absent(key, value): 32 | if key not in st.session_state: 33 | st.session_state[key] = value 34 | 35 | 36 | def main(): 37 | 38 | st.set_page_config(page_title="Semantic Search - AWS Docs", page_icon="https://a0.awsstatic.com/libra-css/images/site/fav/favicon.ico") 39 | 40 | # Persistent state 41 | set_state_if_absent("question", DEFAULT_QUESTION_AT_STARTUP) 42 | set_state_if_absent("answer", DEFAULT_ANSWER_AT_STARTUP) 43 | set_state_if_absent("results", None) 44 | set_state_if_absent("raw_json", None) 45 | set_state_if_absent("random_question_requested", False) 46 | 47 | # Small callback to reset the interface in case the text of the question changes 48 | def reset_results(*args): 49 | st.session_state.answer = None 50 | st.session_state.results = None 51 | st.session_state.raw_json = None 52 | 53 | # Title 54 | st.write("# Semantic Search on AWS") 55 | st.markdown( 56 | """ 57 | Ask any question on about the documents to see if we can find the correct answer to your query! 58 | *Note: do not use keywords, but full-fledged questions.* The demo is not optimized to deal with keyword queries and might misunderstand you. 59 | """, 60 | unsafe_allow_html=True, 61 | ) 62 | 63 | # Sidebar 64 | st.sidebar.header("Options") 65 | 66 | answer_style = st.sidebar.radio( 67 | "Answer Style:", 68 | ('Extractive', 'Generative'), 69 | index=0, 70 | on_change=reset_results 71 | ) 72 | 73 | 74 | top_k_reader = st.sidebar.slider( 75 | "Max. number of answers", 76 | min_value=1, 77 | max_value=10, 78 | value=DEFAULT_NUMBER_OF_ANSWERS, 79 | step=1, 80 | on_change=reset_results, 81 | ) 82 | top_k_retriever = st.sidebar.slider( 83 | "Max. number of documents from retriever", 84 | min_value=1, 85 | max_value=10, 86 | value=DEFAULT_DOCS_FROM_RETRIEVER, 87 | step=1, 88 | on_change=reset_results, 89 | ) 90 | debug = st.sidebar.checkbox("Show debug info") 91 | 92 | 93 | hs_version = "" 94 | try: 95 | hs_version = f" (v{haystack_version(answer_style=answer_style)})" 96 | except Exception: 97 | pass 98 | 99 | st.sidebar.markdown( 100 | f""" 101 | 116 | 121 | """, 122 | unsafe_allow_html=True, 123 | ) 124 | 125 | # Search bar 126 | question = st.text_input("", value=st.session_state.question, max_chars=100, on_change=reset_results) 127 | col1, col2 = st.columns(2) 128 | col1.markdown("", unsafe_allow_html=True) 129 | col2.markdown("", unsafe_allow_html=True) 130 | 131 | # Run button 132 | run_pressed = col1.button("Run") 133 | 134 | 135 | example_questions = [ 136 | "What is Amazon SageMaker?", 137 | "Why is it a best practice to use multiple availability zones (AZs)?", 138 | "How to protect against DDoS on AWS?" 
139 | ] 140 | 141 | # Get next random question from the CSV 142 | if col2.button("Random question"): 143 | reset_results() 144 | st.session_state.question = random.choice(example_questions) 145 | st.session_state.answer = "" 146 | st.session_state.random_question_requested = True 147 | # Re-runs the script setting the random question as the textbox value 148 | # Unfortunately necessary as the Random Question button is _below_ the textbox 149 | raise st.scriptrunner.script_runner.RerunException(st.scriptrunner.script_requests.RerunData(None)) 150 | st.session_state.random_question_requested = False 151 | 152 | run_query = ( 153 | run_pressed or question != st.session_state.question 154 | ) and not st.session_state.random_question_requested 155 | 156 | # Check the connection 157 | with st.spinner("⌛️    Backend is starting..."): 158 | if not haystack_is_ready(answer_style=answer_style): 159 | st.error("🚫    Connection Error. Is the backend running?") 160 | run_query = False 161 | reset_results() 162 | 163 | # Get results for query 164 | if run_query and question: 165 | reset_results() 166 | st.session_state.question = question 167 | 168 | with st.spinner( 169 | "🧠    Performing neural search on documents... \n " 170 | "Do you want to optimize speed or accuracy? \n" 171 | "Check out the docs: https://haystack.deepset.ai/usage/optimization " 172 | ): 173 | try: 174 | st.session_state.results, st.session_state.raw_json = query( 175 | question, top_k_reader=top_k_reader, top_k_retriever=top_k_retriever, answer_style=answer_style, debug=debug 176 | ) 177 | except JSONDecodeError as je: 178 | st.error("👓    An error occurred reading the results. Is the document store working?") 179 | return 180 | except Exception as e: 181 | logging.exception(e) 182 | if "The server is busy processing requests" in str(e) or "503" in str(e): 183 | st.error("🧑‍🌾    All our workers are busy! Try again later.") 184 | else: 185 | st.error("🐞    An error occurred during the request.") 186 | return 187 | 188 | if st.session_state.results: 189 | 190 | st.write("## Results:") 191 | 192 | for count, result in enumerate(st.session_state.results): 193 | if result["answer"]: 194 | answer, context = result["answer"], result["context"] 195 | start_idx = context.find(answer) 196 | end_idx = start_idx + len(answer) 197 | # Hack due to this bug: https://github.com/streamlit/streamlit/issues/3190 198 | st.write( 199 | markdown(context[:start_idx] + str(annotation(answer, "ANSWER", "#8ef")) + context[end_idx:]), 200 | unsafe_allow_html=True, 201 | ) 202 | source = "" 203 | url, title = get_backlink(result) 204 | if url and title: 205 | source = f"[{result['document']['meta']['title']}]({result['document']['meta']['url']})" 206 | else: 207 | source = f"{result['source']}" 208 | st.markdown(f"**Relevance:** {result['relevance']} - **Source:** {source}") 209 | 210 | else: 211 | st.info( 212 | "🤔    Haystack is unsure whether any of the documents contain an answer to your question. Try to reformulate it!" 
213 | ) 214 | st.write("**Relevance:** ", result["relevance"]) 215 | 216 | st.write("___") 217 | if debug: 218 | st.subheader("REST API JSON response") 219 | st.write(st.session_state.raw_json) 220 | 221 | main() -------------------------------------------------------------------------------- /cloud9/resize.sh: -------------------------------------------------------------------------------- 1 | 2 | #!/bin/bash 3 | 4 | # https://docs.aws.amazon.com/cloud9/latest/user-guide/move-environment.html#move-environment-resize 5 | 6 | # Specify the desired volume size in GiB as a command line argument. If not specified, default to 20 GiB. 7 | SIZE=${1:-20} 8 | 9 | # Get the ID of the environment host Amazon EC2 instance. 10 | TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 60") 11 | INSTANCEID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" -v http://169.254.169.254/latest/meta-data/instance-id 2> /dev/null) 12 | REGION=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" -v http://169.254.169.254/latest/meta-data/placement/region 2> /dev/null) 13 | 14 | # Get the ID of the Amazon EBS volume associated with the instance. 15 | VOLUMEID=$(aws ec2 describe-instances \ 16 | --instance-id $INSTANCEID \ 17 | --query "Reservations[0].Instances[0].BlockDeviceMappings[0].Ebs.VolumeId" \ 18 | --output text \ 19 | --region $REGION) 20 | 21 | # Resize the EBS volume. 22 | aws ec2 modify-volume --volume-id $VOLUMEID --size $SIZE 23 | 24 | # Wait for the resize to finish. 25 | while [ \ 26 | "$(aws ec2 describe-volumes-modifications \ 27 | --volume-id $VOLUMEID \ 28 | --filters Name=modification-state,Values="optimizing","completed" \ 29 | --query "length(VolumesModifications)"\ 30 | --output text)" != "1" ]; do 31 | sleep 1 32 | done 33 | 34 | # Check if we're on an NVMe filesystem 35 | if [[ -e "/dev/xvda" && $(readlink -f /dev/xvda) = "/dev/xvda" ]] 36 | then 37 | # Rewrite the partition table so that the partition takes up all the space that it can. 38 | sudo growpart /dev/xvda 1 39 | # Expand the size of the file system. 40 | # Check if we're on AL2 41 | STR=$(cat /etc/os-release) 42 | SUB="VERSION_ID=\"2\"" 43 | if [[ "$STR" == *"$SUB"* ]] 44 | then 45 | sudo xfs_growfs -d / 46 | else 47 | sudo resize2fs /dev/xvda1 48 | fi 49 | 50 | else 51 | # Rewrite the partition table so that the partition takes up all the space that it can. 52 | sudo growpart /dev/nvme0n1 1 53 | 54 | # Expand the size of the file system. 55 | # Check if we're on AL2 56 | STR=$(cat /etc/os-release) 57 | SUB="VERSION_ID=\"2\"" 58 | if [[ "$STR" == *"$SUB"* ]] 59 | then 60 | sudo xfs_growfs -d / 61 | else 62 | sudo resize2fs /dev/nvme0n1p1 63 | fi 64 | fi 65 | -------------------------------------------------------------------------------- /documentation/aws-cloud9-deployment.md: -------------------------------------------------------------------------------- 1 | # AWS Cloud9 Deployment 2 | This document walks through the deployment of Semantic Search on AWS using AWS Cloud9. Following this guide deploys the semantic search application with default configurations. 3 | 4 | ## Create AWS Cloud9 environment 5 | 1. Open [AWS Cloud9 in AWS](https://console.aws.amazon.com/cloud9control/home) 6 | 2. Use the **Create environment** button to create a new AWS Cloud9 IDE. 7 | 1. Give your AWS Cloud9 environment a name, for example `semantic-search-deployment`. 8 | 2. Select t3.small or m5.large and leave the other configurations as the defaults. 9 | 3. **Create** the AWS Cloud9 environment. 
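If you prefer not to click through the console, you can also create an equivalent AWS Cloud9 environment from a terminal with the AWS CLI. This is only a sketch: the exact flags (in particular `--image-id`) depend on your AWS CLI version and region, so double-check them against the `aws cloud9 create-environment-ec2` documentation before running it.

```shell
# Sketch: creates a Cloud9 environment similar to the console steps above.
# The --image-id value is an assumption; newer CLI versions require it explicitly.
aws cloud9 create-environment-ec2 \
  --name semantic-search-deployment \
  --instance-type m5.large \
  --image-id amazonlinux-2-x86_64 \
  --automatic-stop-time-minutes 60
```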
10 | 11 | ## Open the AWS Cloud9 environment 12 | 1. Go to your [AWS Cloud9 Dashboard in AWS](https://console.aws.amazon.com/cloud9control/home) 13 | 2. In the list of AWS Cloud9 environments **open** your AWS Cloud9 environment that you created before. The **open** button opens your AWS Cloud9 IDE in a new browser tab. Wait for the Cloud9 IDE to connect. This may take a minute. 14 | 15 | ## Semantic Search Infrastructure Code 16 | 1. Execute `git clone https://github.com/aws-samples/semantic-search-aws-docs.git` in the terminal of your AWS Cloud9 IDE. 17 | 18 | ## Modify AWS Cloud9 EC2 Instance 19 | Once your AWS Cloud9 environment creation completes you will need to increase the underlying EC2 instance storage to be able to pull the container images that the semantic search application deploys. 20 | 1. In the Terminal of your AWS Cloud9 IDE make the resize bash script executable `chmod +x ~/environment/semantic-search-aws-docs/cloud9/resize.sh` 21 | 2. Increase the Amazon EBS volume size of your AWS Cloud9 environment to 50 GB `~/environment/semantic-search-aws-docs/cloud9/resize.sh 50` 22 | 23 | ## AWS CLI Credentials 24 | Terraform needs to access AWS resources from within your AWS Cloud9 environment to deploy the semantic search application. The recommended way to access AWS resources from within the AWS Cloud9 environment is to use [AWS managed temporary credentials](https://docs.aws.amazon.com/cloud9/latest/user-guide/security-iam.html#auth-and-access-control-temporary-managed-credentials-supported). For the semantic search deployment the AWS managed temporary credentials do not have sufficient permissions to create an Amazon EC2 instance profile. 25 | 1. In your AWS Cloud9 IDE at the top left click on **AWS Cloud9** and then on **Preferences**. 26 | 2. On the Preferences page navigate to **AWS Settings**, **Credentials**, and disable **AWS managed temporary credentials**. 27 | 28 | You need to authenticate directly with the AWS CLI in your AWS Cloud9 IDE. To do so you need to follow one of the AWS CLI [Authentication and access credentials methods](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-authentication.html). 29 | 30 | One easy way to setup authentication is by [setting environment variables ](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html#:~:text=supported%20environment%20variables-,How%20to%20set%20environment%20variables,-The%20following%20examples) 31 | 32 | e.g. similar to the snippet below, but with your own access key and access key. 33 | ``` 34 | export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE 35 | export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY 36 | export AWS_DEFAULT_REGION=us-east-1 37 | ``` 38 | 39 | Once you authenticated with the AWS CLI you can go on to the next step. 40 | 41 | ## GPU or CPU deployment 42 | The default configuration uses GPU instances for the semantic search application. If you want to deploy this solution with GPU acceleration you will need to increase the _Running On-Demand G and VT instances_ [EC2 Service quota to at least 8](https://aws.amazon.com/premiumsupport/knowledge-center/ec2-instance-limit/). 43 | 44 | To deploy the semantic search application without GPU instance open `semantic-search-aws-docs/infrastructure/main.tf` in your AWS Cloud9 IDE. Search for _Using a CPU instance_ in the Terraform file. Uncomment the CPU `image_id` and `instance_type` and add comments before the GPU `image_id` and `instance_type`. 
The code should now look like the following: 45 | ``` 46 | ### Using a CPU instance ### 47 | image_id = data.aws_ssm_parameter.ami.value #AMI name like amzn2-ami-ecs-hvm-2.0.20220520-x86_64-ebs 48 | instance_type = "c6i.2xlarge" 49 | 50 | ### Using a GPU instance ### 51 | #image_id = data.aws_ssm_parameter.ami_gpu.value #AMI name like amzn2-ami-ecs-gpu-hvm-2.0.20220520-x86_64-ebs 52 | #instance_type = "g4dn.xlarge" 53 | ``` 54 | 55 | ## Deploy Semantic Search Infrastructure 56 | 1. In your AWS Cloud9 environment terminal navigate to `cd ~/environment/semantic-search-aws-docs/infrastructure`. 57 | 2. Set the following environment variables. Change their value if you are using a different region other than `us-east-1` or if you want to give the Terraform state Amazon S3 bucket and state sync Amazon DynamoDB table different names. 58 | 59 | ```bash 60 | export REGION=us-east-1 61 | export S3_BUCKET="terraform-semantic-search-state-$(aws sts get-caller-identity --query Account --output text)" 62 | export SYNC_TABLE="terraform-semantic-search-state-sync" 63 | ``` 64 | 65 | 3. Create the Terraform state bucket in Amazon S3
66 | `aws s3 mb s3://$S3_BUCKET --region=$REGION` 67 | 4. Create the Terraform state sync table in Amazon DynamoDB
68 | `aws dynamodb create-table --table-name $SYNC_TABLE --attribute-definitions AttributeName=LockID,AttributeType=S --key-schema AttributeName=LockID,KeyType=HASH --billing-mode PAY_PER_REQUEST --region=$REGION` 69 | 5. Initialize Terraform for the infrastructure deployment
`terraform init -backend-config="bucket=$S3_BUCKET" -backend-config="region=$REGION" -backend-config="dynamodb_table=$SYNC_TABLE"` 70 | 6. Deploy the Semantic Search infrastructure with Terraform
`terraform apply -var="region=$REGION" -var="index_name=awsdocs"` 71 | * Change the Terraform variable `index_name` if you want to change the name of your [index](https://opensearch.org/docs/latest/dashboards/im-dashboards/index-management/) in the Amazon OpenSearch cluster. The search API uses this variable to search for documents. 72 | * Enter `yes` when Terraform prompts you _"to perform these actions"_. 73 | * The deployment will take 10–20 minutes. Wait for completion before moving on with the document ingestion deployment. 74 | 7. You can retrieve the frontend URL with the following command. Note that it may take some time until the ECS tasks are in a running state:
75 | `terraform output loadbalancer_url` 76 | ## Deploy and Run Semantic Search Ingestion 77 | In your AWS Cloud9 environment's terminal navigate to
78 | `cd ~/environment/semantic-search-aws-docs/ingestion`. 79 | * If you want to ingest the AWS documentation, follow the
**[Ingest AWS Documentation instructions](./ingest-aws-documentation.md)**. 80 | * If you want to ingest your documents from a URL (for example from Amazon S3), follow the
**[Ingest Documents from URL instructions](./ingest-documents-from-url.md)**. 81 | * If you would instead like to make local documents searchable, follow the
**[Local Documents Ingestion instructions](./ingest-custom-local-documents.md)**. 82 | 83 | ### Clean up Ingestion 84 | After ingesting your documents you can remove the ingestion resources. Follow the [Clean up Ingestion Resources instructions](./clean-up-ingestion-resources.md) to clean up the ingestion resources. 85 | 86 | ## Clean up Infrastructure 87 | Destroy the resources that were deployed for the infrastructure of the semantic search application if you are not using the application anymore. 88 | 1. In your AWS Cloud9 IDE navigate to the infrastructure directory `cd ~/environment/semantic-search-aws-docs/infrastructure` 89 | 2. Clean up the semantic search application infrastructure with the `terraform destroy -var="region=$REGION"` command. 90 | 1. Run `eval REGION=$(terraform output region)` if your `REGION` variable is not set anymore. 91 | 3. Enter `yes` when Terraform prompts you _"Do you really want to destroy all resources?"_. 92 | -------------------------------------------------------------------------------- /documentation/clean-up-ingestion-resources.md: -------------------------------------------------------------------------------- 1 | # Clean up Ingestion Resources 2 | You can clean up the ingestion resources immediately after ingesting the documents into your OpenSearch index. The ingestion executes as a one-off Amazon ECS task. For a production scenario with changes to the source documents, you should consider [scheduling Amazon ECS tasks](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/scheduling_tasks.html) to ingest the latest version of the documents on a schedule. 3 | 1. In your terminal navigate to the ingestion directory `cd ~/semantic-search-aws-docs/ingestion` in this repository. 4 | 2. Run `terraform destroy -var-file="awsdocs.tfvars" -var="infra_region=$REGION" -var="infra_tf_state_s3_bucket=$S3_BUCKET" -var="docs_src=$(sed -e 's/^"//' -e 's/"$//' <<< "$(terraform output docs_src)")"` to clean up the ingestion resources. 5 | 3. Enter `yes` when Terraform prompts you _"Do you really want to destroy all resources?"_. -------------------------------------------------------------------------------- /documentation/ingest-aws-documentation.md: -------------------------------------------------------------------------------- 1 | # Ingest AWS Documentation 2 | The next steps guide you through ingesting the [Amazon EC2 User Guide for Linux](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide) into your Amazon OpenSearch index to make the AWS documentation searchable. To learn how to ingest your own documents instead, take a look at [Ingesting your own documents](ingest-custom-local-documents.md). 3 | 1. In your terminal navigate to
4 | `cd ~/semantic-search-aws-docs/ingestion` in this repository. 5 | 2. Initialize Terraform `terraform init` 6 | 3. Run `AWS_DOCS=amazon-ec2-user-guide`.
`amazon-ec2-user-guide` references the name of the [GitHub repository that contains the Amazon EC2 User Guide for Linux](https://github.com/awsdocs/amazon-ec2-user-guide). You can replace `amazon-ec2-user-guide` with any of the AWS documentation repository names from the [AWS Docs GitHub](https://github.com/awsdocs), for example `amazon-eks-user-guide` or `full` to ingest all AWS Docs repos. 7 | 3. Deploy the ingestion resources
8 | ```terraform apply -var-file="awsdocs.tfvars" -var="infra_region=$REGION" -var="infra_tf_state_s3_bucket=$S3_BUCKET" -var="docs_src=$AWS_DOCS"``` 9 | * Enter `yes` when Terraform prompts you _"to perform these actions"_. 10 | 4. After the successful deployment of the ingestion resources, you need to wait until the ingestion task completes. Follow the [Wait for Ingestion to Complete instructions to check for completion from your terminal](ingest-wait-for-completion.md). 11 | 12 | ## Clean up Ingestion Resources 13 | After ingesting your documents you can remove the ingestion resources. Follow the [Clean up Ingestion Resources instructions](./clean-up-ingestion-resources.md) to clean up the ingestion resources. -------------------------------------------------------------------------------- /documentation/ingest-custom-local-documents.md: -------------------------------------------------------------------------------- 1 | # Local Documents Ingestion 2 | This document explains step-by-step how to ingest local documents to make them searchable using this AWS semantic search solution. 3 | 4 | ## Deploy Semantic Search Ingestion for Local Documents 5 | The next steps guide you through ingesting documents from your local storage into the Amazon OpenSearch index. 6 | 1. In your terminal navigate to the ingestion directory
`cd ~/semantic-search-aws-docs/ingestion` 7 | 2. Initialize Terraform
`terraform init` 8 | 2. Set `DOCS_DIR` variable to the path of the directory that contains the documents (e.g. `*.txt` files) that you want to ingest. The path needs to be relative to the [Dockerfile](/ingestion/Dockerfile) directory. For example use `DOCS_DIR=mydocs/data` to ingest all the documents located in the [ingestion/mydocs/data](/ingestion/mydocs/data/) directory. 9 | 3. Deploy the ingestion resources
`terraform apply -var-file="mydocs.tfvars" -var="infra_region=$REGION" -var="infra_tf_state_s3_bucket=$S3_BUCKET" -var="docs_src=$DOCS_DIR"` 10 | 1. Enter `yes` when Terraform prompts you _"to perform these actions"_. 11 | 4. After the successful deployment of the ingestion resources you need to wait until the ingestion task completes. Follow the [Wait for Ingestion to Complete instructions to check for completion from you terminal](ingest-wait-for-completion.md). 12 | 13 | ## Clean up Ingestion Resources 14 | After ingesting your documents you can remove the ingestion resources. Follow the [Clean up Ingestion Resources instructions](./clean-up-ingestion-resources.md) to clean up the ingestion resources. -------------------------------------------------------------------------------- /documentation/ingest-documents-from-url.md: -------------------------------------------------------------------------------- 1 | # Ingest Documents from URL 2 | This document explains step-by-step how to ingest documents from an archive at a URL to make them searchable using this AWS semantic search solution. 3 | 4 | If you are on MacOS make sure that when compressing your documents into the archive that it does not add AppleDouble blobs `*_` files, see also [this StackExchange answer for the question Create tar archive of a directory, except for hidden files?](https://unix.stackexchange.com/a/9865) 5 | 6 | ## Deploy Semantic Search Ingestion for Documents from URL 7 | The next steps guide you through ingesting documents from a URL into the Amazon OpenSearch index. The URL needs to point to an archive (zip, gz or tar.gz) and has to be accessible from the `NLPSearchPrivateSubnet` subnet. The subnet can reach the internet through a NAT gateway. 8 | 1. In your terminal navigate to `cd ~/semantic-search-aws-docs/ingestion` 9 | 2. Initialize Terraform `terraform init` 10 | 2. Set `DOCS_SRC` variable to the URL from which you want to ingest the documents from, for example if your documents are in Amazon S3 then you could create a [presigned URL](https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-presigned-url.html) for the archive in Amazon S3 and assign `DOCS_SRC=https://.s3..amazonaws.com/data.zip?response-content-disposition=inline&X-Amz-Security-Token=[...]`. 11 | 3. Deploy the ingestion resources `terraform apply -var-file="urldocs.tfvars" -var="infra_region=$REGION" -var="infra_tf_state_s3_bucket=$S3_BUCKET" -var="docs_src=$DOCS_SRC"` 12 | 1. Enter `yes` when Terraform prompts you _"to perform these actions"_. 13 | 4. After the successful deployment of the ingestion resources you need to wait until the ingestion task completes. Follow the [Wait for Ingestion to Complete instructions to check for completion from you terminal](ingest-wait-for-completion.md). 14 | 15 | ## Clean up Ingestion Resources 16 | After ingesting your documents you can remove the ingestion resources. Follow the [Clean up Ingestion Resources instructions](./clean-up-ingestion-resources.md) to clean up the ingestion resources. -------------------------------------------------------------------------------- /documentation/ingest-wait-for-completion.md: -------------------------------------------------------------------------------- 1 | # Wait for Ingestion to Complete 2 | After deploying the ingestion Terraform resources you will need to wait for the ingestion to complete before being able to search the documents. 
Check the [Amazon Elastic Container Service console](https://console.aws.amazon.com/ecs) to see when the task completes, or use below AWS CLI commands to wait for the task to complete. 3 | 4 | ## Wait in AWS Console 5 | Go to the Tasks page of your Amazon ECS Cluster that the infrastructure stack deployed. The default name of the cluster is *NLPSearchECSCluster*. In Tasks list look for the task that has `ingestion-job` as the Task definition. Wait until the status of the `ingestion-job` task changes from *Running* to *Stopped*. 6 | 7 | ## Use AWS CLI to wait 8 | 1. Navigate to the infrastructure directory `cd ~/semantic-search-aws-docs/infrastructure`. 9 | 1. Run `ECS_CLUSTER_ARN=$(terraform output --raw ecs_cluster_arn)` to get the ARN of your Amazon ECS cluster. 10 | 2. Run `TASK_ARN=$(aws ecs list-tasks --family ingestion-job --region $REGION --cluster $ECS_CLUSTER_ARN --output text --query 'taskArns[0]')` to get the ARN of the ECS task that is ingesting the documents. 11 | 3. Run `aws ecs wait tasks-stopped --region $REGION --cluster $ECS_CLUSTER_ARN --tasks $TASK_ARN` to wait until the ingestion of the documents completes. When the command exits with _Waiter TasksStopped failed: Max attempts exceeded_ as the message run the command again. Once the command exits without any output then the ingestion job completed. 12 | 4. After the ingestion completes run `terraform output loadbalancer_url` to get the URL for the semantic search frontend. -------------------------------------------------------------------------------- /infrastructure/data.tf: -------------------------------------------------------------------------------- 1 | data "aws_caller_identity" "current" {} 2 | 3 | data "aws_ecr_authorization_token" "token" {} -------------------------------------------------------------------------------- /infrastructure/locals.tf: -------------------------------------------------------------------------------- 1 | locals { 2 | aws_ecr_url = "${data.aws_caller_identity.current.account_id}.dkr.ecr.${var.region}.amazonaws.com" 3 | } -------------------------------------------------------------------------------- /infrastructure/main.tf: -------------------------------------------------------------------------------- 1 | provider "aws" { 2 | region = var.region 3 | default_tags { 4 | tags = { 5 | Project = "semantic-search-aws-docs" 6 | } 7 | } 8 | } 9 | 10 | ### Networking ### 11 | resource "aws_vpc" "aws-vpc" { 12 | cidr_block = var.vpc_cidr 13 | 14 | enable_dns_hostnames = true # Required for DNS-based service discovery 15 | enable_dns_support = true # Required for DNS-based service discovery 16 | tags = { 17 | name = "NLPSearchVPC" 18 | } 19 | } 20 | resource "aws_subnet" "private" { 21 | vpc_id = aws_vpc.aws-vpc.id 22 | count = length(var.private_subnets) 23 | cidr_block = element(var.private_subnets, count.index) 24 | availability_zone = element(var.availability_zones, count.index) 25 | tags = { 26 | Name = "NLPSearchPrivateSubnet" 27 | Tier = "Private" 28 | } 29 | } 30 | resource "aws_subnet" "public" { 31 | vpc_id = aws_vpc.aws-vpc.id 32 | cidr_block = element(var.public_subnets, count.index) 33 | availability_zone = element(var.availability_zones, count.index) 34 | count = length(var.public_subnets) 35 | map_public_ip_on_launch = true 36 | tags = { 37 | Name = "NLPSearchPublicSubnet" 38 | Tier = "Public" 39 | } 40 | } 41 | resource "aws_internet_gateway" "main" { 42 | vpc_id = aws_vpc.aws-vpc.id 43 | } 44 | resource "aws_route_table" "public" { 45 | vpc_id = aws_vpc.aws-vpc.id 46 | 47 | 
route { 48 | cidr_block = "0.0.0.0/0" 49 | gateway_id = aws_internet_gateway.main.id 50 | } 51 | } 52 | resource "aws_route_table_association" "public" { 53 | count = length(var.public_subnets) 54 | subnet_id = element(aws_subnet.public.*.id, count.index) 55 | route_table_id = aws_route_table.public.id 56 | } 57 | resource "aws_alb" "main" { 58 | name = "nlp-search-alb" 59 | internal = false 60 | load_balancer_type = "application" 61 | subnets = aws_subnet.public.*.id 62 | security_groups = [aws_security_group.alb.id] 63 | } 64 | resource "aws_lb_target_group" "search_ui" { 65 | name = "nlp-search-alb-target-group" 66 | port = 8501 67 | protocol = "HTTP" 68 | target_type = "ip" 69 | vpc_id = aws_vpc.aws-vpc.id 70 | 71 | health_check { 72 | protocol = "HTTP" 73 | healthy_threshold = "3" 74 | interval = "30" 75 | matcher = "200" 76 | timeout = "10" 77 | path = "/" 78 | unhealthy_threshold = "2" 79 | } 80 | } 81 | resource "aws_lb_listener" "search_ui" { 82 | load_balancer_arn = aws_alb.main.id 83 | port = "80" 84 | protocol = "HTTP" 85 | 86 | default_action { 87 | type = "forward" 88 | target_group_arn = aws_lb_target_group.search_ui.id 89 | } 90 | } 91 | 92 | resource "aws_eip" "nat_gw" { 93 | domain = "vpc" 94 | depends_on = [aws_internet_gateway.main] 95 | } 96 | 97 | resource "aws_nat_gateway" "main" { 98 | allocation_id = aws_eip.nat_gw.id 99 | subnet_id = element(aws_subnet.public.*.id, 0) 100 | 101 | depends_on = [aws_internet_gateway.main] 102 | } 103 | 104 | resource "aws_route_table" "private" { 105 | vpc_id = aws_vpc.aws-vpc.id 106 | route { 107 | cidr_block = "0.0.0.0/0" 108 | nat_gateway_id = aws_nat_gateway.main.id 109 | } 110 | } 111 | 112 | resource "aws_route_table_association" "private" { 113 | count = length(var.private_subnets) 114 | subnet_id = element(aws_subnet.private.*.id, count.index) 115 | route_table_id = aws_route_table.private.id 116 | } 117 | 118 | ### Logs ### 119 | resource "aws_cloudwatch_log_group" "app" { 120 | name = "/semantic-search" 121 | retention_in_days = 30 122 | } 123 | 124 | 125 | ### IAM ### 126 | resource "aws_iam_role" "search_ui" { 127 | name = "NLPSearchSearchUIECSTaskRole" 128 | 129 | assume_role_policy = <> /etc/ecs/ecs.config 396 | echo ECS_ENABLE_CONTAINER_METADATA=true >> /etc/ecs/ecs.config 397 | echo ECS_ENABLED_GPU_SUPPORT=true >> /etc/ecs/ecs.config 398 | EOF 399 | ) 400 | } 401 | 402 | 403 | ### ECS Tasks and Services ### 404 | 405 | resource "aws_ecs_service" "search_ui" { 406 | name = "search_ui" 407 | cluster = aws_ecs_cluster.main.id 408 | task_definition = aws_ecs_task_definition.search_ui.arn 409 | desired_count = 1 410 | launch_type = "FARGATE" 411 | 412 | network_configuration { 413 | subnets = aws_subnet.private.*.id 414 | security_groups = [aws_security_group.search_ui.id] 415 | } 416 | 417 | load_balancer { 418 | target_group_arn = aws_lb_target_group.search_ui.arn 419 | container_name = "search-ui" 420 | container_port = 8501 421 | } 422 | 423 | depends_on = [aws_lb_listener.search_ui, docker_registry_image.search_ui] 424 | 425 | } 426 | 427 | 428 | resource "aws_ecs_task_definition" "search_ui" { 429 | family = "search-ui" 430 | container_definitions = <=0.8.2 3 | markdown>=3.3.7 4 | beautifulsoup4>=4.11.1 5 | opensearch-py>=2.0.0 6 | farm-haystack[opensearch,preprocessing,file-conversion] 7 | rfc3986-validator 8 | pydantic==1.* -------------------------------------------------------------------------------- /ingestion/awsdocs/scripts/0_setup_env.sh: 
-------------------------------------------------------------------------------- 1 | conda env create --prefix ./conda-env --file=conda-env.yaml ||: 2 | eval "$(conda shell.bash hook)" 3 | conda activate ./conda-env 4 | pip install -r requirements.txt -------------------------------------------------------------------------------- /ingestion/awsdocs/scripts/clone_awsdocs.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | REPO=${1:-amazon-ec2-user-guide} 4 | 5 | echo "REPO-Substring=$REPO" 6 | # get number of repos of awsdocs 7 | curl https://api.github.com/orgs/awsdocs | jq .public_repos 8 | 9 | numRepos=$(curl https://api.github.com/orgs/awsdocs | jq .public_repos) 10 | 11 | echo "numRepos: $numRepos" 12 | 13 | if [ "$numRepos" = "null" ] 14 | then 15 | echo "ERROR: $numRepos is NULL. Github rate limit probably exeeded. Please wait some time before trying again." 1>&2 16 | exit 1 # terminate and indicate error 17 | fi 18 | 19 | numPages=$((numRepos / 100 + 1)) 20 | echo "numRepos: $numRepos" 21 | echo "numPages: $numPages" 22 | 23 | # download 100 repos per page and clone the repo in the current directory 24 | for (( c=1; c<=$numPages; c++ )) 25 | do 26 | echo "Page: $c / $numPages" 27 | curl https://api.github.com/orgs/awsdocs/repos\?per_page\=100\&page\=$c | jq '.[].clone_url' | tr -d \" | while read line || [[ -n $line ]]; 28 | do 29 | if [[ "$line" == *$REPO* ]] || [[ "$line" == full ]] ; 30 | then 31 | echo "Found: $line" 32 | if ! git clone -b main $line; then 33 | if ! git clone -b master $line; then 34 | git clone $line || true 35 | fi 36 | fi 37 | fi 38 | done 39 | done -------------------------------------------------------------------------------- /ingestion/awsdocs/scripts/run-opensearch.sh: -------------------------------------------------------------------------------- 1 | # Recommended: Start Elasticsearch using Docker 2 | #! 
docker run -d -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.6.2 3 | 4 | # docker pull opensearchproject/opensearch:1.0.1 5 | #!docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" opensearchproject/opensearch:1.0.1 -------------------------------------------------------------------------------- /ingestion/awsdocs/scripts/run_ingestion_awsdocs.sh: -------------------------------------------------------------------------------- 1 | SCRIPT_DIR="$( cd -- "$( dirname -- "${BASH_SOURCE[0]:-$0}"; )" &> /dev/null && pwd 2> /dev/null; )"; 2 | 3 | #MAIN=dirname "$0" 4 | 5 | echo "script_dir MAIN: $SCRIPT_DIR" 6 | bash $SCRIPT_DIR/clone_awsdocs.sh $1 7 | bash $SCRIPT_DIR/run_ingestion_local.sh "./" $2 -------------------------------------------------------------------------------- /ingestion/awsdocs/scripts/run_ingestion_local.sh: -------------------------------------------------------------------------------- 1 | SCRIPT_DIR="$( cd -- "$( dirname -- "${BASH_SOURCE[0]:-$0}"; )" &> /dev/null && pwd 2> /dev/null; )"; 2 | 3 | #MAIN=dirname "$0" 4 | 5 | echo "script_dir MAIN: $SCRIPT_DIR" 6 | python3.8 $SCRIPT_DIR/../src/ingest.py --src ./ --index_name $2 -------------------------------------------------------------------------------- /ingestion/awsdocs/scripts/run_ingestion_url.sh: -------------------------------------------------------------------------------- 1 | SCRIPT_DIR="$( cd -- "$( dirname -- "${BASH_SOURCE[0]:-$0}"; )" &> /dev/null && pwd 2> /dev/null; )"; 2 | 3 | #MAIN=dirname "$0" 4 | 5 | echo "script_dir MAIN: $SCRIPT_DIR" 6 | python3.8 $SCRIPT_DIR/../src/ingest.py --src $1 --index_name $2 -------------------------------------------------------------------------------- /ingestion/awsdocs/src/get_faqs.py: -------------------------------------------------------------------------------- 1 | import urllib.request, json 2 | from pathlib import Path 3 | import os 4 | import re 5 | url_services="https://aws.amazon.com/api/dirs/items/search?item.directoryId=aws-products&sort_by=item.additionalFields.productNameLowercase&sort_order=asc&size=500&item.locale=en_US&tags.id=!aws-products" 6 | 7 | services_json = "./data/faqs-aws-services.json" 8 | services_dir = "./data/faqs_aws_services" 9 | 10 | import requests 11 | from bs4 import BeautifulSoup 12 | import pandas as pd 13 | #pd.set_option('display.max_rows', 500) 14 | pd.set_option('display.max_columns', 10) 15 | pd.set_option('display.width', 1000) 16 | import glob 17 | import tqdm 18 | 19 | def get_amazon_faqs(url = "https://aws.amazon.com/alexaforbusiness/faqs/"): 20 | """ 21 | crawls the frequently asked questions and answers for a given amazon services. 22 | Assumes paragraphs in a div with class=lb-rtxt and h2 below the div to define the category of faqs. 23 | :param url: 24 | :return: 25 | """ 26 | headers = { 27 | 'Access-Control-Allow-Origin': '*', 28 | 'Access-Control-Allow-Methods': 'GET', 29 | 'Access-Control-Allow-Headers': 'Content-Type', 30 | 'Access-Control-Max-Age': '3600', 31 | 'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0' 32 | } 33 | 34 | req = requests.get(url, headers) 35 | soup = BeautifulSoup(req.content, 'html.parser') 36 | faqs = [] 37 | faqs_category = [] 38 | for div in soup.find_all(class_="lb-rtxt"): 39 | cat = div.find_previous_sibling('h2') 40 | if cat is None: continue 41 | category = cat.getText() 42 | paragraphs = div.find_all("p") 43 | for p in paragraphs: 44 | if "?" 
in p.getText(): 45 | #new question, split 46 | faqs.append([]) 47 | faqs_category.append(category) 48 | if len(faqs)>0: 49 | faqs[-1].append(p) 50 | 51 | rows = [] 52 | for (cat, faq_par) in zip (faqs_category, faqs): 53 | question = faq_par[0].getText().strip() 54 | answer = "\n\n".join([p.getText() for p in faq_par[1:]]).strip() 55 | row= {"type":cat, "question":question, "answer":answer, "url":url} 56 | rows.append(row) 57 | df = pd.DataFrame(rows) 58 | return df 59 | 60 | #get_amazon_faqs() 61 | 62 | 63 | if not os.path.exists(services_json): 64 | with urllib.request.urlopen(url_services) as url: 65 | data_url = json.loads(url.read().decode()) 66 | with open(services_json, 'w', encoding='utf-8') as f: 67 | json.dump(data_url, f, ensure_ascii=False, indent=4) 68 | os.makedirs(services_dir,exist_ok=True) 69 | 70 | import time 71 | with open(services_json) as json_file: 72 | data = json.load(json_file) 73 | items = [item["item"] for item in data["items"]] 74 | for item in tqdm.tqdm(items): 75 | productUrl = item["additionalFields"]["productUrl"] 76 | 77 | re_all = re.findall(r"(.*aws.amazon.com/(.+)/)", productUrl) 78 | if len(re_all) ==0: 79 | print(f"Could not match url, skipping: {productUrl}") 80 | continue 81 | 82 | service = re_all[0][1] 83 | service_name = service.replace("/","-") 84 | 85 | faq_out = os.path.join(services_dir, service_name + "_faqs.csv") 86 | if os.path.exists(faq_out): 87 | continue 88 | 89 | faq_url = re_all[0][0] 90 | faqs = get_amazon_faqs(faq_url + "faqs") 91 | 92 | if faqs.shape[0]==0: 93 | print(f"Getting Faqs failed for: {faq_url} , {service_name}") 94 | faqs = get_amazon_faqs(faq_url + "faq") 95 | if faqs.shape[0] == 0: 96 | continue 97 | 98 | faqs["service"] = service 99 | faqs.to_csv(faq_out, index=False) 100 | print(faqs) 101 | # time.sleep(1.1) 102 | 103 | 104 | files = glob.glob(services_dir+"/*.csv") 105 | df_list = [] 106 | for f in files: 107 | df = pd.read_csv(f) 108 | df_list.append(df) 109 | 110 | df_all = pd.concat(df_list) 111 | 112 | print(df_all.shape) 113 | df_all.to_csv("aws-services-faqs-dataset.csv", index=False) -------------------------------------------------------------------------------- /ingestion/awsdocs/src/ingest-pagerank.py: -------------------------------------------------------------------------------- 1 | from bs4 import BeautifulSoup 2 | from haystack.utils import clean_wiki_text, convert_files_to_docs, fetch_archive_from_http, print_answers 3 | from haystack.nodes import FARMReader, TransformersReader 4 | import re 5 | # 6 | # # Recommended: Start Elasticsearch using Docker via the Haystack utility function 7 | # from haystack.utils import launch_es 8 | # 9 | # launch_es() 10 | 11 | # Connect to Elasticsearch 12 | from tqdm import tqdm 13 | # from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore 14 | # document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document") 15 | 16 | # # Let's first fetch some documents that we want to query 17 | # # Here: 517 Wikipedia articles for Game of Thrones 18 | # doc_dir = "data/article_txt_got" 19 | # s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip" 20 | # fetch_archive_from_http(url=s3_url, output_dir=doc_dir) 21 | # 22 | # # Convert files to dicts 23 | # # You can optionally supply a cleaning function that is applied to each doc (e.g. to remove footers) 24 | # # It must take a str as input, and return a str. 
25 | # dicts = convert_files_to_docs(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True) 26 | 27 | doc_dir_aws = "data/awsdocs/amazon-ec2-user-guide" 28 | doc_dir_aws = "data/awsdocs" 29 | dicts_aws = convert_files_to_docs(dir_path=doc_dir_aws, clean_func=clean_wiki_text, split_paragraphs=True) 30 | 31 | 32 | from pathlib import Path 33 | import markdown 34 | 35 | path = Path(doc_dir_aws) 36 | 37 | references =[] 38 | 39 | doc_to_node = {} 40 | node_count = 0 41 | 42 | doc_to_link = {} 43 | 44 | for p in tqdm(path.rglob("*.md")): 45 | #print("Document: "+p.name) 46 | with open(p) as f: 47 | contents = f.read() 48 | #print(contents) 49 | html = markdown.markdown(contents) 50 | #print(html) 51 | # create soap object 52 | soup = BeautifulSoup(html, 'html.parser') 53 | 54 | # find all the anchor tags with "href" 55 | # attribute starting with "https://" 56 | for link in soup.find_all('a', 57 | attrs={'href': re.compile("^http")}): 58 | # display the actual urls 59 | #print(link.get('href')) 60 | href = link.get('href').strip("/") 61 | htext = link.text 62 | href_suffix = href.split("/")[-1] 63 | if "#" in href_suffix: 64 | href_suffix = href_suffix.split("#")[0] 65 | href_suffix = href_suffix.replace(".html", "") 66 | source = p.stem 67 | target = href_suffix 68 | ref = {"source_md":source, "link_suffix":target, "path":str(p), "link_text":link.text, "link_href":href } 69 | 70 | if target not in doc_to_link: 71 | doc_to_link[source] = href 72 | 73 | # if source not in doc_to_node: 74 | # node_count +=1 75 | # doc_to_node[source] = node_count 76 | # if target not in doc_to_node: 77 | # node_count +=1 78 | # doc_to_node[target] = node_count 79 | 80 | #print(ref) 81 | references.append(ref) 82 | 83 | import pandas as pd 84 | df = pd.DataFrame(ref) 85 | df.to_csv("links.csv") 86 | 87 | 88 | # If your texts come from a different source (e.g. a DB), you can of course skip convert_files_to_dicts() and create the dictionaries yourself. 89 | # The default format here is: 90 | # { 91 | # 'text': "", 92 | # 'meta': {'name': "", ...} 93 | #} 94 | # (Optionally: you can also add more key-value-pairs here, that will be indexed as fields in Elasticsearch and 95 | # can be accessed later for filtering or shown in the responses of the Finder) 96 | 97 | # TODO: add awsdocs url in meta for each md file, e.g. 98 | # data/awsdocs/amazon-ec2-user-guide/doc_source/AmazonEFS.md 99 | # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonEFS.html 100 | # Challenge: How to know the mapping from the file to the public URL 101 | 102 | # Let's have a look at the first 3 entries: 103 | #print(dicts_aws[:3]) 104 | 105 | # Now, let's write the dicts containing documents to our DB. 
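# Before writing, a minimal sketch of the PageRank idea this script's name points to, assuming
# the `networkx` package is available (it is not listed in requirements.txt): turn the link
# graph collected in `references` above into a per-document score and attach it to the
# document metadata.
import networkx as nx

link_graph = nx.DiGraph()
link_graph.add_edges_from((r["source_md"], r["link_suffix"]) for r in references)
pagerank_scores = nx.pagerank(link_graph)  # maps document stem -> PageRank score

for doc in dicts_aws:
    stem = Path(doc.meta["name"]).stem
    doc.meta["pagerank"] = pagerank_scores.get(stem, 0.0)
# The enriched `dicts_aws` could then be written to the document store as in the line below.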
106 | #document_store.write_documents(dicts_aws) -------------------------------------------------------------------------------- /ingestion/awsdocs/src/ingest.py: -------------------------------------------------------------------------------- 1 | from typing import Callable, Dict, List, Optional, Union, Tuple 2 | from pathlib import Path 3 | from bs4 import BeautifulSoup 4 | from sentence_transformers import SentenceTransformer 5 | from opensearchpy import OpenSearch, RequestsHttpConnection 6 | from haystack.nodes.retriever import EmbeddingRetriever 7 | from haystack.document_stores import OpenSearchDocumentStore 8 | from urllib.parse import urlparse, unquote 9 | from os.path import splitext, basename 10 | 11 | import requests 12 | 13 | import io 14 | import gzip 15 | import tarfile 16 | import zipfile 17 | 18 | import argparse 19 | import json 20 | import sys 21 | import os 22 | 23 | from rfc3986_validator import validate_rfc3986 24 | 25 | from haystack.nodes.file_converter import BaseConverter, DocxToTextConverter, PDFToTextConverter, TextConverter, MarkdownConverter 26 | from haystack.schema import Document 27 | from haystack.utils import clean_wiki_text 28 | 29 | import logging 30 | 31 | DEFAULT_DOCS_DIR = "awsdocs/data" 32 | 33 | logger = logging.getLogger(__name__) 34 | 35 | parser = argparse.ArgumentParser( 36 | prog='AWS Semantic Search Ingestion', 37 | description='Ingests documents into Amazon OpenSearch index.', 38 | epilog='Made with ❤️ at AWS') 39 | 40 | parser.add_argument('--src', type=str, 41 | help='Directory or URL where documents are located', default=DEFAULT_DOCS_DIR) 42 | 43 | parser.add_argument('--index_name', type=str, 44 | help='Amazon OpenSearch index name', default="awsdocs") 45 | 46 | # Add markdown conversion 47 | # Licensed under Apache-2.0 license from deepset-ai haystack 48 | # https://github.com/deepset-ai/haystack/blob/ba30971d8d77827da9d2c81d82f7d02bf1917d8c/haystack/utils/preprocessing.py 49 | def convert_files_to_docs( 50 | dir_path: str, 51 | clean_func: Optional[Callable] = None, 52 | split_paragraphs: bool = False, 53 | encoding: Optional[str] = None, 54 | id_hash_keys: Optional[List[str]] = None, 55 | ) -> List[Document]: 56 | """ 57 | Convert all files(.txt, .pdf, .docx) in the sub-directories of the given path to Documents that can be written to a 58 | Document Store. 59 | 60 | :param dir_path: The path of the directory containing the Files. 61 | :param clean_func: A custom cleaning function that gets applied to each Document (input: str, output: str). 62 | :param split_paragraphs: Whether to split text by paragraph. 63 | :param encoding: Character encoding to use when converting pdf documents. 64 | :param id_hash_keys: A list of Document attribute names from which the Document ID should be hashed from. 65 | Useful for generating unique IDs even if the Document contents are identical. 66 | To ensure you don't have duplicate Documents in your Document Store if texts are 67 | not unique, you can modify the metadata and pass [`"content"`, `"meta"`] to this field. 68 | If you do this, the Document ID will be generated by using the content and the defined metadata. 
69 | """ 70 | file_paths = [p for p in Path(dir_path).glob("**/*")] 71 | allowed_suffixes = [".pdf", ".txt", ".docx", ".md"] 72 | suffix2converter: Dict[str, BaseConverter] = {} 73 | 74 | suffix2paths: Dict[str, List[Path]] = {} 75 | for path in file_paths: 76 | file_suffix = path.suffix.lower() 77 | if file_suffix in allowed_suffixes: 78 | if file_suffix not in suffix2paths: 79 | suffix2paths[file_suffix] = [] 80 | suffix2paths[file_suffix].append(path) 81 | elif not path.is_dir(): 82 | logger.warning( 83 | "Skipped file {0} as type {1} is not supported here. " 84 | "See haystack.file_converter for support of more file types".format(path, file_suffix) 85 | ) 86 | 87 | # No need to initialize converter if file type not present 88 | for file_suffix in suffix2paths.keys(): 89 | if file_suffix == ".pdf": 90 | suffix2converter[file_suffix] = PDFToTextConverter() 91 | if file_suffix == ".txt": 92 | suffix2converter[file_suffix] = TextConverter() 93 | if file_suffix == ".docx": 94 | suffix2converter[file_suffix] = DocxToTextConverter() 95 | if file_suffix == ".md": 96 | suffix2converter[file_suffix] = MarkdownConverter() 97 | 98 | documents = [] 99 | for suffix, paths in suffix2paths.items(): 100 | for path in paths: 101 | logger.info("Converting {}".format(path)) 102 | # PDFToTextConverter, TextConverter, and DocxToTextConverter return a list containing a single Document 103 | document = suffix2converter[suffix].convert( 104 | file_path=path, meta=None, encoding=encoding, id_hash_keys=id_hash_keys 105 | )[0] 106 | text = document.content 107 | 108 | if clean_func: 109 | text = clean_func(text) 110 | 111 | if split_paragraphs: 112 | for para in text.split("\n\n"): 113 | if not para.strip(): # skip empty paragraphs 114 | continue 115 | documents.append(Document(content=para, meta={"name": path.name}, id_hash_keys=id_hash_keys)) 116 | else: 117 | documents.append(Document(content=text, meta={"name": path.name}, id_hash_keys=id_hash_keys)) 118 | 119 | return documents 120 | 121 | # Enable downloading archive from Amazon S3 presigned url and other urls that contain data such as query parameters after the file extension. 122 | # Licensed under Apache-2.0 license from deepset-ai haystack 123 | # https://github.com/deepset-ai/haystack/blob/ba30971d8d77827da9d2c81d82f7d02bf1917d8c/haystack/utils/import_utils.py 124 | def fetch_archive_from_http( 125 | url: str, 126 | output_dir: str, 127 | proxies: Optional[Dict[str, str]] = None, 128 | timeout: Union[float, Tuple[float, float]] = 10.0, 129 | ) -> bool: 130 | """ 131 | Fetch an archive (zip, gz or tar.gz) from a url via http and extract content to an output directory. 132 | 133 | :param url: http address 134 | :param output_dir: local path 135 | :param proxies: proxies details as required by requests library 136 | :param timeout: How many seconds to wait for the server to send data before giving up, 137 | as a float, or a :ref:`(connect timeout, read timeout) ` tuple. 138 | Defaults to 10 seconds. 139 | :return: if anything got fetched 140 | """ 141 | # verify & prepare local directory 142 | path = Path(output_dir) 143 | if not path.exists(): 144 | path.mkdir(parents=True) 145 | 146 | is_not_empty = len(list(Path(path).rglob("*"))) > 0 147 | if is_not_empty: 148 | logger.info("Found data stored in '%s'. 
Delete this first if you really want to fetch new data.", output_dir) 149 | return False 150 | else: 151 | logger.info("Fetching from %s to '%s'", url, output_dir) 152 | 153 | parsed = urlparse(url) 154 | root, extension = splitext(parsed.path) 155 | archive_extension = extension[1:] 156 | 157 | request_data = requests.get(url, proxies=proxies, timeout=timeout) 158 | 159 | if archive_extension == "zip": 160 | zip_archive = zipfile.ZipFile(io.BytesIO(request_data.content)) 161 | zip_archive.extractall(output_dir) 162 | elif archive_extension == "gz" and not "tar.gz" in url: 163 | gzip_archive = gzip.GzipFile(fileobj=io.BytesIO(request_data.content)) 164 | file_content = gzip_archive.read() 165 | file_name = unquote(basename(root[1:])) 166 | with open(f"{output_dir}/{file_name}", "wb") as file: 167 | file.write(file_content) 168 | elif archive_extension in ["gz", "bz2", "xz"]: 169 | tar_archive = tarfile.open(fileobj=io.BytesIO(request_data.content), mode="r|*") 170 | tar_archive.extractall(output_dir) 171 | else: 172 | logger.warning( 173 | "Skipped url %s as file type is not supported here. " 174 | "See haystack documentation for support of more file types", 175 | url, 176 | ) 177 | return True 178 | 179 | host = os.environ['OPENSEARCH_HOST'] 180 | password = os.environ['OPENSEARCH_PASSWORD'] 181 | 182 | args = parser.parse_args() 183 | 184 | 185 | docs_src = args.src 186 | index_name = args.index_name 187 | #if len(sys.argv)>1: 188 | # doc_dir_aws = sys.argv[1] 189 | #try: 190 | # is_url = validators.url(docs_src) 191 | #except validators.ValidationFailure: 192 | # is_url = False 193 | 194 | if validate_rfc3986(docs_src): 195 | fetch_archive_from_http(url=docs_src, output_dir=DEFAULT_DOCS_DIR) 196 | doc_dir_aws = DEFAULT_DOCS_DIR 197 | else: 198 | doc_dir_aws = docs_src 199 | 200 | 201 | print(f"doc_dir_aws {doc_dir_aws}") 202 | print(f"index_name {index_name}") 203 | 204 | 205 | document_store = OpenSearchDocumentStore( 206 | host = host, 207 | port = 443, 208 | username = 'admin', 209 | password = password, 210 | scheme = 'https', 211 | verify_certs = False, 212 | similarity='cosine' 213 | ) 214 | 215 | dicts_aws = convert_files_to_docs(dir_path=doc_dir_aws, clean_func=clean_wiki_text, split_paragraphs=True) 216 | 217 | path = Path(doc_dir_aws) 218 | 219 | # Let's have a look at the first 3 entries: 220 | print("First 3 documents to be ingested") 221 | print(dicts_aws[:3]) 222 | 223 | print(f"Starting Ingestion, Documents: {len(dicts_aws)}") 224 | 225 | # Now, let's write the dicts containing documents to our DB. 
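# (Optional sanity check before the write below -- a minimal sketch using the document store's
# get_document_count() API; on a first run the target index may not exist yet, hence the broad
# try/except.)
try:
    print(f"Documents already in index '{index_name}': {document_store.get_document_count(index=index_name)}")
except Exception:
    print(f"Index '{index_name}' not found yet; it should be created by the write below.")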
226 | document_store.write_documents(dicts_aws, index=index_name) 227 | 228 | print(f"Finished Ingestion, Documents: {len(dicts_aws)}") 229 | 230 | print(f"Started Update Embeddings, Documents: {len(dicts_aws)}") 231 | # Calculate and store a dense embedding for each document 232 | retriever = EmbeddingRetriever( 233 | document_store=document_store, 234 | model_format = "sentence_transformers", 235 | embedding_model = "sentence-transformers/all-mpnet-base-v2" 236 | ) 237 | document_store.update_embeddings( 238 | retriever=retriever, 239 | index=index_name 240 | ) 241 | print(f"Finished Update Embeddings, Documents: {len(dicts_aws)}") -------------------------------------------------------------------------------- /ingestion/conda-env.yaml: -------------------------------------------------------------------------------- 1 | name: ingestion 2 | 3 | dependencies: 4 | - python=3.8 -------------------------------------------------------------------------------- /ingestion/main.tf: -------------------------------------------------------------------------------- 1 | provider "aws" { 2 | region = data.terraform_remote_state.infra.outputs.region 3 | default_tags { 4 | tags = { 5 | Project = "semantic-search-aws-docs" 6 | } 7 | } 8 | } 9 | 10 | provider "docker" { 11 | registry_auth { 12 | address = local.aws_ecr_url 13 | username = data.aws_ecr_authorization_token.token.user_name 14 | password = data.aws_ecr_authorization_token.token.password 15 | } 16 | } 17 | 18 | terraform { 19 | required_providers { 20 | docker = { 21 | source = "kreuzwerker/docker" 22 | version = ">=2.16.0" 23 | } 24 | } 25 | } 26 | 27 | data "aws_caller_identity" "current" {} 28 | data "aws_ecr_authorization_token" "token" {} 29 | 30 | locals { 31 | aws_ecr_url = "${data.aws_caller_identity.current.account_id}.dkr.ecr.${data.terraform_remote_state.infra.outputs.region}.amazonaws.com" 32 | build_environment = length(regexall(".*local.*", var.script_name)) > 0 ? 
"local" : "amazonlinux:2" # if the documents are local then we use the Docker container that can handle local documents 33 | } 34 | 35 | resource "aws_ecr_repository" "ingestion_job" { 36 | name = "ingestion-job" 37 | force_delete = true 38 | } 39 | 40 | resource "docker_registry_image" "ingestion_job" { 41 | name = docker_image.ingestion_job_image.name 42 | } 43 | 44 | resource "docker_image" "ingestion_job_image" { 45 | name = "${local.aws_ecr_url}/${aws_ecr_repository.ingestion_job.name}:latest" 46 | build { 47 | context = "${path.cwd}/" 48 | dockerfile = "Dockerfile" 49 | build_args = { 50 | DOCS_SRC = var.docs_src 51 | SCRIPT_NAME = var.script_name 52 | BUILD_ENV = local.build_environment 53 | } 54 | } 55 | } 56 | 57 | 58 | data "terraform_remote_state" "infra" { 59 | backend = "s3" 60 | config = { 61 | bucket = var.infra_tf_state_s3_bucket 62 | key = var.infra_tf_state_s3_key 63 | region = var.infra_region 64 | } 65 | } 66 | 67 | 68 | resource "aws_ecs_task_definition" "ingestion_job" { 69 | family = "ingestion-job" 70 | container_definitions = <7VxZd9o4FP41PDbHO/DImnYmXc4wbadPOcJWQI2xPLIgkF8/ki1jWxJbwIFkSHNO0dViWfd+d9MNDbs3W94SEE8/4wCGDcsIlg2737As03M89h+nrDJK2xaECUFBRjIKwgg9QzEzp85RABNBy0gU45CiuEr0cRRBn1ZogBD8VB32gMOgQojBBCqEkQ9ClfoTBXSaUVuuUdA/QjSZ5k82DdEzA/lgQUimIMBPJZI9aNg9gjHNPs2WPRjyw6uey3BD73pjBEZ0nwnP0SKxhz/un5et7pfvbjjsjb98EKssQDgXL/w9gURsmK7yU0geIfX56xgNu4vnNEQR7K1PnBMfcER7OMQknWCzf0O+je6EgADBoi/CEeTDURjKw/uMnlCCH6E0OADJFAbiQQtIKGL8uQNjGH7DCaIIR6xvjCnFs9KATogmvIPimFGBaPlsL+wF7e6UzkLWNsXeheSZVt4WL88fCZI4e9EHtOT76MYY8VUGC7ZYIhZh7I35hNlywpFwA54S52aepM9SWZWfO9srXJZIgnW3EM8gJSs2RPS229kMASOr3bxxM8pTIZa5VE5LEpnTgADCZL10ISvsgxCXA0THUkSn83PECL0QzwNFgtIjS7fgdtkv21TPaLisp8dbN5YrEeR2s0ow1RZfo0qQ280qwZSXN6Xnm/IGSwSlVVnekJ5vlDbo9jdhaANWSpL6NEUUjmLg81N9YnK2h/QyzUgBexYRa8jCm40JQxAnaLyeRaA/JwlawL9gki1ubBLyCcHzON3+Jz+Fotp7zz7e+1ww7kFIFZyXFUZZM4gD2I7oED6kK7JTQdHkLm317W06qaJPToBNK9f5q9zqGQo0PUeDTdszjwfnh1X3T0Buv/q/jRYZ90eP5Ef3g2Ur6FQgCQNm6UQTEzrFExyBcFBQu4x1UbA+qWLMHeYcSKXpN6R0JcQPzCmuyqt6uNtUSYLnxIdb3ip3AACZQLptnDhr/oZbGUhgCCgT8qrbcGpmqEa2YXkpDAK0YB8n/GNnBp4ZergRz/rGJO/KKezhpQm6NeI4ZEBJLaJl3GEQHLcewxiIfO4P6Ce8spfg9K1mp/P+vARQsO0+ZEy7H+fnfiiC9tdZjqSybEvjT3ganeWdwJ/QosRRUDLopf7E3ffR34O/FGnTGArFqrgdr9fyysJhbuS9zE9JVo30p0aOmFaVJY5qRGwdQ9bW5+QcaSocGVECwSxElJGHhB0QjISKYSe05NBMNc+QKWhAYWMd+hzGtfVRvwWuNatca6tcWzO2zLU18eRca2lwxAkoSijXKkmVYWQeRcxvYsQRBISpb8vofPv07tFmG5fGN8t7ly6bsafPZrmX5bMZu7lR8Xe0Aa42yNUFutpgVw14K8PSEFTzBJmoozVVoqkOy6NWlaij6UJ0ebapmW1KszcHyPu6j6xv2GwNDKfU10cspBVeYMRxofqXfcPtmU2dZntIfy7RydQ6lARmuMzicRbLJ7rI/EFY6fqUrGW4uVe59jNbip6tK2ulx7XqZl5x/YZw3Xbc/tA6DNduxza67v8G18w+Jmx7SerO3SeQLJBfJ8pd++JiSdPVoFxJb3yKJjChkEcwfexzv/joBEyexPkawyj3p1+8mh/OE1pkX3avIOkxxj9aFUNFRGVJnqEgyPxInvYFRT64qgY5krnrKFLD5qacjCZqqEsKnbYkhW2dFJq6NGxtYmhdjc2bNjZXJ3K3sWF686w5MR2gndoArbMrV0BfAf2eAG3VmXazLywetNQk9wuybiVewSjo8LojzvEQ+4+cFI7Tds74VJYAofk44RmxmUMU5uuo2VXH7XqOkuJjDCKrf8qNX7xx4+bN/rLc2V+VW98gQewc09qAw7OtO5N/ubrcfWHr7Jn8K7tyeQZ374SgWO4b18klG9OSZCt7LTGqEC9loiVPzF5TmZjK5fodjghqdPfIsqgW0uezMDBBflU6q7KjsttUObutbmw3Yw/n6zb4H8lqR0r62+0Xst5xpYVkj+N0otCN/3W/Dj4+fJmO494iiBIv9varKBiJyN8yApT4mGmf7KTp9Mggt1zmZXwG8Z7h6DX1U7vxzkqsZmld2AbTrMH7ZmvdlvO3jtlW7HVdmR2t5FsayT9KB8Iloqn55E5p1vyVm3L2ubCevLEqNfS2U9J0O7Xrlsunc2nJdflartyKfN7BetJsSeLTfFU9uU8V3Ass5m4mn83C2fKBey/knLSQ7bRelXNqoH2tLr5WF7/R6uKjjPD6T2EEEpuGpZpgXRbMalo1GWFdidHllurtOP3N6ucyS/W0+20rHMkut7JK4D/w+M2U6J2eW7ZZ9WFftdRLH77tE8nXX+pVQ+poq2e0zdCXfWL9mYkjKqcc9ANtvTQc61m1qpg3vdd1ZfP3v149XK8e3mn2Qi5Iq8EUuMYZC9K2qKsrrt8qrq9ZySML0k6P8vMWpOlRrqsEuhakXU5B2uml8MwFaXoxvNavvGljc3UiDyp
IO0vmp66CND2g1cSPAugKd7USuE0hh5IYKJr6QDmRWfsExyGe4ORmguh0Pt6LcTvq0d2qi+/lS5ztztI4RbrnjDVGe+R2NFebWwT2VQp7LKMqB01TWmLfuzC3VV2o5djyhWjNyZ/1d5JsdR/7+CkKsy99yC7LMidyg3MHZhyO0ThJc7SEQdYybhH9mGPwXfhtOzRFDkTBWFebGm6qqsI+/GaFNYvv+srkovjGNHvwHw== --------------------------------------------------------------------------------
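Once the ingestion task has stopped (see the `ingest-wait-for-completion` guide above), a quick way to confirm that the index is populated and that dense retrieval works is to run a small query with the same Haystack components that `ingestion/awsdocs/src/ingest.py` uses. The sketch below reuses that script's connection settings; the `OPENSEARCH_HOST` and `OPENSEARCH_PASSWORD` environment variables, the `awsdocs` index name, and the sample query are assumptions you may need to adjust.

```python
import os

from haystack.document_stores import OpenSearchDocumentStore
from haystack.nodes.retriever import EmbeddingRetriever

# Connect with the same settings as ingestion/awsdocs/src/ingest.py.
document_store = OpenSearchDocumentStore(
    host=os.environ["OPENSEARCH_HOST"],
    port=443,
    username="admin",
    password=os.environ["OPENSEARCH_PASSWORD"],
    scheme="https",
    verify_certs=False,
    similarity="cosine",
    index="awsdocs",
)

# Same embedding model as the ingestion job, so query and document vectors match.
retriever = EmbeddingRetriever(
    document_store=document_store,
    model_format="sentence_transformers",
    embedding_model="sentence-transformers/all-mpnet-base-v2",
)

# Retrieve the top passages for a sample question and print where they came from.
for doc in retriever.retrieve(query="How do I resize an Amazon EBS volume?", top_k=3, index="awsdocs"):
    print(doc.meta.get("name"), "->", doc.content[:120])
```

If this returns sensible passages, the search frontend reachable via `terraform output loadbalancer_url` should work against the same index.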