├── .gitignore ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── NOTICE ├── README.md ├── appdemo.gif ├── application ├── backend │ ├── pipeline_definitions │ │ ├── aws-search-generative.haystack-pipeline.yml │ │ └── aws-search.haystack-pipeline.yml │ └── search-api.Dockerfile └── frontend │ ├── search-ui.Dockerfile │ ├── utils.py │ └── webapp.py ├── cloud9 └── resize.sh ├── documentation ├── aws-cloud9-deployment.md ├── clean-up-ingestion-resources.md ├── ingest-aws-documentation.md ├── ingest-custom-local-documents.md ├── ingest-documents-from-url.md └── ingest-wait-for-completion.md ├── infrastructure ├── data.tf ├── locals.tf ├── main.tf ├── output.tf ├── providers.tf ├── terraform.tfvars ├── tf-backends.tf └── variable.tf ├── ingestion ├── .terraform.lock.hcl ├── Dockerfile ├── awsdocs.tfvars ├── awsdocs │ ├── .gitignore │ ├── data │ │ └── empty.txt │ ├── requirements.txt │ ├── scripts │ │ ├── 0_setup_env.sh │ │ ├── clone_awsdocs.sh │ │ ├── run-opensearch.sh │ │ ├── run_ingestion_awsdocs.sh │ │ ├── run_ingestion_local.sh │ │ └── run_ingestion_url.sh │ └── src │ │ ├── get_faqs.py │ │ ├── ingest-pagerank.py │ │ └── ingest.py ├── conda-env.yaml ├── main.tf ├── mydocs.tfvars ├── output.tf ├── run_ingestion_job_ecs.sh ├── urldocs.tfvars └── variable.tf ├── semantic-search-arch-application.png ├── semantic-search-arch-ingestion.png └── semantic-search-architecture.drawio /.gitignore: -------------------------------------------------------------------------------- 1 | /application/.env 2 | /application/env.sh 3 | /ingestion/conda-env 4 | /ingestion/data/awsdocs/ 5 | /ingestion/data/ 6 | # Local .terraform directories 7 | **/.terraform/* 8 | # .tfstate files 9 | *.tfstate 10 | *.tfstate.* 11 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. 
Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | this software and associated documentation files (the "Software"), to deal in 5 | the Software without restriction, including without limitation the rights to 6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | the Software, and to permit persons to whom the Software is furnished to do so. 8 | 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR 12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | -------------------------------------------------------------------------------- /NOTICE: -------------------------------------------------------------------------------- 1 | The Semantic Search on AWS Docs repository includes the following third-party software/licensing: 2 | - application/frontend/utils.py 3 | - application/frontend/webapp.py 4 | are taken and modified from Haystack (https://github.com/deepset-ai/haystack) 5 | - https://github.com/deepset-ai/haystack/tree/main/ui/ui/utils.py 6 | - https://github.com/deepset-ai/haystack/tree/main/ui/ui/webapp.py 7 | Copyright 2021 deepset GmbH. Licensed under the Apache License, Version 2.0 8 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Semantic Search on AWS Docs or Custom Documents 2 | 3 | This sample project demonstrates how to set up AWS infrastructure to perform semantic search and [question answering](https://en.wikipedia.org/wiki/Question_answering) on documents using a transformer machine learning models like BERT, RoBERTa, or GPT (via the [Haystack](https://github.com/deepset-ai/haystack) open source framework). 4 | 5 | As an example, users can type questions about AWS services and find answers from the AWS documentation or custom local documents. 6 | 7 | The deployed solution support 2 answering styles: 8 | - `extractive question answering` will find the semantically closest 9 | documents to the questions and highlight the most likeliest answer(s) in these documents. 10 | - `generative question answering`, also referred to as long form question answering (LFQA), will find the semantically closest documents to the question and generate a formulated answer. 11 | 12 | Please note that this project is intended for demo purposes, see disclaimers below. 
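Under the hood, both answering styles are served by the Haystack REST API that this project deploys alongside the Streamlit frontend. As a rough sketch of what the frontend sends to the backend, the request below posts a question to the API's `/query` endpoint (the endpoint name and parameter layout follow the Haystack REST API usage in `application/frontend/utils.py`; the host and port are placeholders, since the search API is reachable from inside the deployed VPC rather than through the public load balancer):

```shell
# Placeholder address: replace with a search API endpoint reachable from your network.
SEARCH_API=http://localhost:8000

# Ask a question; "Retriever"/"Reader" match the node names in the pipeline YAML files.
curl -s -X POST "$SEARCH_API/query" \
  -H "Content-Type: application/json" \
  -d '{"query": "How can I protect against DDoS attacks?",
       "params": {"Retriever": {"top_k": 5}, "Reader": {"top_k": 3}}}'
```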
13 | 14 | ![](appdemo.gif?raw=true) 15 | 16 | ## Architecture 17 | 18 | ![](semantic-search-arch-application.png?raw=true) 19 | 20 | The main components of this project are: 21 | 22 | * [Amazon OpenSearch Service](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/what-is.html) to store and search documents 23 | * The [AWS Documentation](https://github.com/awsdocs/) as a sample dataset loaded in the document store 24 | * The [Haystack framework](https://www.deepset.ai/haystack) to set up an extractive [Question Answering pipeline](https://haystack.deepset.ai/tutorials/first-qa-system) with: 25 | * A [Retriever](https://haystack.deepset.ai/pipeline_nodes/retriever) that searches all the documents and returns only the most relevant ones 26 | * Retriever used: [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) 27 | * A [Reader](https://haystack.deepset.ai/pipeline_nodes/reader) that uses the documents returned by the Retriever and selects a text span which is likely to contain the matching answer to the query 28 | * Reader used: [deepset/roberta-base-squad2](https://huggingface.co/deepset/roberta-base-squad2) 29 | * [Streamlit](https://streamlit.io/) to set up a frontend 30 | * [Terraform](https://www.terraform.io/) to automate the infrastructure deployment on AWS 31 | 32 | ## How to deploy the solution 33 | 34 | ### Deploy with AWS Cloud9 35 | Follow our [step-by-step deployment instructions](documentation/aws-cloud9-deployment.md) to deploy the semantic search application if you are new to AWS, Terraform, semantic search, or you prefer detailed setp-by-step instructions. 36 | 37 | For more general deployment instructions follow the sections below. 38 | 39 | ### General Deployment Instructions 40 | The backend folder contains a Terraform project that deploys an OpenSearch domain and 2 ECS services: 41 | 42 | * frontend: Streamlit-based UI built by Haystack ([repo](https://github.com/deepset-ai/haystack-demos/tree/main/explore_the_world)) 43 | * search API: REST API built by Haystack 44 | 45 | The main steps to deploy the solution are: 46 | 47 | * Deploy the terraform stack 48 | * Optional: Ingest the AWS documentation 49 | 50 | #### Pre-requisites 51 | 52 | * Terraform v1.0+ ([getting started guide](https://learn.hashicorp.com/collections/terraform/aws-get-started)) 53 | * Docker installed and running ([getting started guide](https://www.docker.com/get-started/)) 54 | * AWS CLI v2 installed and configured ([getting started guide](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html)) 55 | * An [EC2 Service Limit of at least 8 cores for G-instance type](https://aws.amazon.com/premiumsupport/knowledge-center/ec2-instance-limit/) if you want to deploy this solution with GPU acceleration. 56 | Alternatively, you can switch to a CPU instance by changing the `instance_type = "g4dn.2xlarge"` to a CPU instance in the `infrastructure/main.tf` file. 57 | 58 | #### Deploy the application infrastructure terraform stack 59 | 60 | * git clone this repository 61 | * **Configure** 62 | Configure and change the infrastructure region, subnets, availability zones in the `infrastructure/terraform.tfvars` file as needed 63 | * **Initialize** 64 | In this example the Terrform state is stored remotely and managed through a backend using S3 and a dynamodb table to acquire the state lock. This allows collaboration on the same Terraform infrastructure from different machines. 
65 | (If you prefer to use local state instead, just remove the `terraform { backend "s3" { ...}}` block from the `infrastructure/tf-backends.tf` file and run `terraform init` directly.) 66 | * Create an S3 bucket and a DynamoDB table to store the Terraform [state backend](https://www.terraform.io/language/settings/backends/s3) in a region of your choice. 67 | ```shell 68 | STATE_REGION= 69 | ``` 70 | ```shell 71 | S3_BUCKET= 72 | aws s3 mb s3://$S3_BUCKET --region=$STATE_REGION 73 | ``` 74 | ```shell 75 | SYNC_TABLE= 76 | aws dynamodb create-table --table-name $SYNC_TABLE --attribute-definitions AttributeName=LockID,AttributeType=S --key-schema AttributeName=LockID,KeyType=HASH --billing-mode PAY_PER_REQUEST --region=$STATE_REGION 77 | ``` 78 | * Change to the directory containing the application infrastructure's `infrastructure/main.tf` file 79 | ```shell 80 | cd infrastructure 81 | ``` 82 | * Initialize Terraform with the S3 remote state backend by running 83 | ```shell 84 | terraform init \ 85 | -backend-config="bucket=$S3_BUCKET" \ 86 | -backend-config="region=$STATE_REGION" \ 87 | -backend-config="dynamodb_table=$SYNC_TABLE" 88 | ``` 89 | 90 | * **Deploy** 91 | Run `terraform apply` and approve the changes by typing `yes`. 92 | ```shell 93 | terraform apply 94 | ``` 95 | ***Please note:*** _deployment can take a long time to push the container depending on the upload bandwidth of your machine. 96 | For faster deployment you can run the Terraform deployment from a development environment hosted inside the same AWS region, for example by using the [AWS Cloud9](https://aws.amazon.com/cloud9/) IDE._ 97 | * **Use** 98 | Once deployment is completed, browse to the output URL (`loadbalancer_url`) from the Terraform output to see the application. 99 | However, searches won't return any results until you ingest documents. 100 | * **Clean up** 101 | To remove all created resources of the application's infrastructure again, use 102 | ```shell 103 | terraform destroy 104 | ``` 105 | (If you used the ingestion Terraform stack below, make sure to destroy the ingestion resources first to avoid conflicts.) 106 | 107 | #### Ingest the AWS documentation 108 | 109 | This second Terraform stack builds, pushes, and runs a Docker container as an ECS task. 110 | The ingestion container downloads either a single awsdocs repo (e.g. `amazon-ec2-user-guide`) or all awsdocs repos (`full`, currently 256) and converts the .md files into .txt using pandoc. 111 | The .txt documents are then ingested into the application's OpenSearch cluster in the required Haystack format and become available for search. 112 | 113 | ![](semantic-search-arch-ingestion.png?raw=true) 114 | 115 | * Change from the `infrastructure` directory to the directory containing the ingestion's `ingestion/main.tf` 116 | ```shell 117 | cd ../ingestion 118 | ``` 119 | * Initialize Terraform 120 | (here we are using local state instead of a remote S3 backend for simplicity) 121 | ```shell 122 | terraform init 123 | ``` 124 | * Run the ingestion as a Terraform deployment. 125 | The S3 remote state file from the previous infrastructure deployment is needed here as input. 126 | It is used as a data source to read the infrastructure's output variables, such as the OpenSearch endpoint or private subnets.
127 | You can set the S3 bucket and its region either in the `infrastructure/terraform.tfvars` file or by passing the input variables via 128 | ```shell 129 | terraform apply \ 130 | -var="infra_region=$STATE_REGION" \ 131 | -var="infra_tf_state_s3_bucket=$S3_BUCKET" 132 | ``` 133 | ***Please note:*** _deployment can take a long time to push the container depending on the upload bandwidth of your machine. For faster deployment you can build and push the container in AWS, for example by using the [AWS Cloud9](https://aws.amazon.com/cloud9/) IDE._ 134 | * Once the previous step finishes, the ECS ingestion task is started. You can check its progress in the AWS console, for example in Amazon CloudWatch under the log group name `semantic-search` by checking the `ingestion-job` logs. After the task has finished successfully, the ingested documents are searchable via the application. 135 | * After running the ingestion job, you can remove the created ingestion resources (e.g. the ECR repository or task definition) by running 136 | ```shell 137 | terraform destroy \ 138 | -var="infra_region=$STATE_REGION" \ 139 | -var="infra_tf_state_s3_bucket=$S3_BUCKET" 140 | ``` 141 | 142 | #### Ingesting your own documents 143 | 144 | Take a look at `ingestion/awsdocs/src/ingest.py` to see how to adapt the ingestion script for your own documents. In brief, you can ingest local or downloaded files via: 145 | ```python 146 | # Create a wrapper for the existing OpenSearch document store 147 | document_store = OpenSearchDocumentStore(...) 148 | 149 | # Convert local files 150 | dicts_aws = convert_files_to_docs(dir_path=..., ...) 151 | 152 | # Write the documents to the OpenSearch document store 153 | document_store.write_documents(dicts_aws, index=...) 154 | 155 | # Compute and update the embeddings for each document with a transformer ML model. 156 | # An embedding is the vector representation that is learned by the transformer and that 157 | # allows us to capture and compare the semantic meaning of documents via this 158 | # vector representation. 159 | # Be sure to use the same model that you want to use later in the search pipeline. 160 | retriever = EmbeddingRetriever( 161 | document_store=document_store, 162 | model_format = "sentence_transformers", 163 | embedding_model = "all-mpnet-base-v2" 164 | ) 165 | document_store.update_embeddings(retriever) 166 | ``` 167 | 168 | ## Security 169 | 170 | See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information. 171 | 172 | ## Contributing 173 | 174 | If you want to contribute to Haystack, check out their [GitHub repository](https://github.com/deepset-ai/haystack). 175 | 176 | ## License 177 | 178 | This library is licensed under the MIT-0 License. See the LICENSE file. 179 | 180 | ## Disclaimer 181 | 182 | This solution is intended to demonstrate the functionality of using machine learning models for semantic search and question answering. It is not intended for production deployment as is. 183 | 184 | For best practices on modifying this solution for production use cases, please follow the [AWS well-architected guidance](https://aws.amazon.com/architecture/well-architected/).
185 | -------------------------------------------------------------------------------- /appdemo.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/semantic-search-aws-docs/6ae2b3b907cdb93666ec01e73acb8f03be245736/appdemo.gif -------------------------------------------------------------------------------- /application/backend/pipeline_definitions/aws-search-generative.haystack-pipeline.yml: -------------------------------------------------------------------------------- 1 | # To allow your IDE to autocomplete and validate your YAML pipelines, name them as .haystack-pipeline.yml 2 | 3 | version: "1.16.0" 4 | 5 | components: # define all the building-blocks for Pipeline 6 | - name: DocumentStore 7 | type: OpenSearchDocumentStore 8 | params: 9 | host: openSearchDomain.REGION.es.amazonaws.com 10 | index: indexname 11 | password: openSearchPassword 12 | port: 443 13 | return_embedding: false 14 | username: openSearchUsername 15 | similarity: cosine 16 | - name: Retriever 17 | type: EmbeddingRetriever 18 | params: 19 | document_store: DocumentStore # params can reference other components defined in the YAML 20 | top_k: 5 21 | embedding_model: sentence-transformers/all-mpnet-base-v2 22 | model_format: sentence_transformers 23 | - name: BM25Retriever 24 | type: BM25Retriever 25 | params: 26 | document_store: DocumentStore # params can reference other components defined in the YAML 27 | top_k: 5 28 | - name: Reader 29 | type: Seq2SeqGenerator 30 | params: 31 | model_name_or_path: vblagoje/bart_lfqa 32 | - name: TextFileConverter 33 | type: TextConverter 34 | - name: PDFFileConverter 35 | type: PDFToTextConverter 36 | - name: Preprocessor 37 | type: PreProcessor 38 | params: 39 | split_by: word 40 | split_length: 1000 41 | - name: FileTypeClassifier 42 | type: FileTypeClassifier 43 | 44 | pipelines: 45 | - name: query # a sample extractive-qa Pipeline 46 | nodes: 47 | - name: Retriever 48 | inputs: [Query] 49 | - name: Reader 50 | inputs: [Retriever] 51 | - name: indexing 52 | nodes: 53 | - name: FileTypeClassifier 54 | inputs: [File] 55 | - name: TextFileConverter 56 | inputs: [FileTypeClassifier.output_1] 57 | - name: PDFFileConverter 58 | inputs: [FileTypeClassifier.output_2] 59 | - name: Preprocessor 60 | inputs: [PDFFileConverter, TextFileConverter] 61 | - name: Retriever 62 | inputs: [Preprocessor] 63 | - name: DocumentStore 64 | inputs: [Retriever] 65 | 66 | 67 | -------------------------------------------------------------------------------- /application/backend/pipeline_definitions/aws-search.haystack-pipeline.yml: -------------------------------------------------------------------------------- 1 | # To allow your IDE to autocomplete and validate your YAML pipelines, name them as .haystack-pipeline.yml 2 | version: "1.16.0" 3 | 4 | components: # define all the building-blocks for Pipeline 5 | - name: DocumentStore 6 | type: OpenSearchDocumentStore 7 | params: 8 | host: openSearchDomain.REGION.es.amazonaws.com 9 | index: indexname 10 | password: openSearchPassword 11 | port: 443 12 | return_embedding: false 13 | username: openSearchUsername 14 | similarity: cosine 15 | - name: Retriever 16 | type: EmbeddingRetriever 17 | params: 18 | document_store: DocumentStore # params can reference other components defined in the YAML 19 | top_k: 5 20 | embedding_model: sentence-transformers/all-mpnet-base-v2 21 | model_format: sentence_transformers 22 | - name: BM25Retriever 23 | type: BM25Retriever 24 | params: 25 | 
document_store: DocumentStore # params can reference other components defined in the YAML 26 | top_k: 5 27 | - name: Reader # custom-name for the component; helpful for visualization & debugging 28 | type: FARMReader # Haystack Class name for the component 29 | params: 30 | model_name_or_path: deepset/roberta-base-squad2 31 | context_window_size: 500 32 | return_no_answer: true 33 | - name: TextFileConverter 34 | type: TextConverter 35 | - name: PDFFileConverter 36 | type: PDFToTextConverter 37 | - name: Preprocessor 38 | type: PreProcessor 39 | params: 40 | split_by: word 41 | split_length: 1000 42 | - name: FileTypeClassifier 43 | type: FileTypeClassifier 44 | 45 | pipelines: 46 | - name: query # a sample extractive-qa Pipeline 47 | nodes: 48 | - name: Retriever 49 | inputs: [Query] 50 | - name: Reader 51 | inputs: [Retriever] 52 | - name: indexing 53 | nodes: 54 | - name: FileTypeClassifier 55 | inputs: [File] 56 | - name: TextFileConverter 57 | inputs: [FileTypeClassifier.output_1] 58 | - name: PDFFileConverter 59 | inputs: [FileTypeClassifier.output_2] 60 | - name: Preprocessor 61 | inputs: [PDFFileConverter, TextFileConverter] 62 | - name: Retriever 63 | inputs: [Preprocessor] 64 | - name: DocumentStore 65 | inputs: [Retriever] -------------------------------------------------------------------------------- /application/backend/search-api.Dockerfile: -------------------------------------------------------------------------------- 1 | FROM deepset/haystack:gpu-v1.16.0 2 | 3 | COPY pipeline_definitions/* /home/user/rest_api/pipeline/ -------------------------------------------------------------------------------- /application/frontend/search-ui.Dockerfile: -------------------------------------------------------------------------------- 1 | FROM deepset/haystack-streamlit-ui@sha256:3584978ff23c7eb5f19596bde6f3eeca1bab65a8997758634db5563241e2b1cb 2 | 3 | COPY utils.py /home/user/ui 4 | COPY webapp.py /home/user/ui -------------------------------------------------------------------------------- /application/frontend/utils.py: -------------------------------------------------------------------------------- 1 | # Modified from https://github.com/deepset-ai/haystack/blob/main/ui/utils.py 2 | # commit 1a0197839c6ee0a90e0f562af5edf57a891d473a 3 | # under Apache-2.0 license 4 | ########################################################################################################### 5 | 6 | from typing import List, Dict, Any, Tuple, Optional 7 | 8 | import os 9 | import logging 10 | from time import sleep 11 | 12 | import requests 13 | import streamlit as st 14 | 15 | 16 | API_ENDPOINT = os.getenv("API_ENDPOINT", "http://localhost:8000") 17 | API_ENDPOINT_GENERATIVE = os.getenv("API_ENDPOINT_GENERATIVE", "http://localhost:9000") 18 | STATUS = "initialized" 19 | HS_VERSION = "hs_version" 20 | DOC_REQUEST = "query" 21 | DOC_FEEDBACK = "feedback" 22 | DOC_UPLOAD = "file-upload" 23 | 24 | 25 | def haystack_is_ready(answer_style): 26 | """ 27 | Used to show the "Haystack is loading..." 
message 28 | """ 29 | url = f"{API_ENDPOINT}/{STATUS}" 30 | if(answer_style=='Generative'): 31 | url = f"{API_ENDPOINT_GENERATIVE}/{STATUS}" 32 | try: 33 | if requests.get(url).status_code < 400: 34 | return True 35 | except Exception as e: 36 | logging.exception(e) 37 | sleep(1) # To avoid spamming a non-existing endpoint at startup 38 | return False 39 | 40 | 41 | @st.cache 42 | def haystack_version(answer_style): 43 | """ 44 | Get the Haystack version from the REST API 45 | """ 46 | url = f"{API_ENDPOINT}/{HS_VERSION}" 47 | if(answer_style=='Generative'): 48 | url = f"{API_ENDPOINT_GENERATIVE}/{HS_VERSION}" 49 | return requests.get(url, timeout=0.1).json()["hs_version"] 50 | 51 | 52 | def query(query, filters={}, top_k_reader=5, top_k_retriever=5, answer_style='Extractive', debug = False) -> Tuple[List[Dict[str, Any]], Dict[str, str]]: 53 | """ 54 | Send a query to the REST API and parse the answer. 55 | Returns both a ready-to-use representation of the results and the raw JSON. 56 | """ 57 | 58 | url = f"{API_ENDPOINT}/{DOC_REQUEST}" 59 | params = {"filters": filters, "Retriever": {"top_k": top_k_retriever}, "Reader": {"top_k": top_k_reader}, 'Query': {'debug': debug}} 60 | 61 | if(answer_style=='Generative'): 62 | url = f"{API_ENDPOINT_GENERATIVE}/{DOC_REQUEST}" 63 | params = {"filters": filters, "Retriever": {"top_k": top_k_retriever}, 'Query': {'debug': debug}} 64 | req = {"query": query, "params": params, "debug": debug} 65 | logging.info("req") 66 | logging.info(req) 67 | 68 | response_raw = requests.post(url, json=req) 69 | 70 | if response_raw.status_code >= 400 and response_raw.status_code != 503: 71 | raise Exception(f"{vars(response_raw)}") 72 | 73 | response = response_raw.json() 74 | if "errors" in response: 75 | raise Exception(", ".join(response["errors"])) 76 | logging.info("response") 77 | logging.info(response_raw) 78 | logging.info(response) 79 | 80 | # Format response 81 | results = [] 82 | answers = response["answers"] 83 | for answer in answers: 84 | if answer.get("answer", None): 85 | if(answer["type"]=="generative"): 86 | documents = [doc for doc in response["documents"] if doc["id"] in answer["document_ids"]] 87 | document_names = map(lambda doc: doc["meta"]["name"], documents) 88 | results.append( 89 | { 90 | "context": "", 91 | "answer": answer.get("answer", None), 92 | "source": ', '.join(document_names), 93 | "relevance": sum(map(lambda doc: doc["score"], documents))/len(documents), 94 | "document": documents[0], 95 | "offset_start_in_doc": 0, 96 | "_raw": answer, 97 | } 98 | ) 99 | else: 100 | results.append( 101 | { 102 | "context": "..." 
+ answer["context"] + "...", 103 | "answer": answer.get("answer", None), 104 | "source": answer["meta"]["name"], 105 | "relevance": round(answer["score"] * 100, 2), 106 | "document": [doc for doc in response["documents"] if doc["id"] == answer["document_ids"][0]][0] if answer["document_ids"] and len(answer["document_ids"]) > 0 else [], 107 | "offset_start_in_doc": answer["offsets_in_document"][0]["start"], 108 | "_raw": answer, 109 | } 110 | ) 111 | else: 112 | results.append( 113 | { 114 | "context": None, 115 | "answer": None, 116 | "document": None, 117 | "relevance": round(answer["score"] * 100, 2), 118 | "_raw": answer, 119 | } 120 | ) 121 | return results, response 122 | 123 | 124 | def send_feedback(query, answer_obj, is_correct_answer, is_correct_document, document) -> None: 125 | """ 126 | Send a feedback (label) to the REST API 127 | """ 128 | url = f"{API_ENDPOINT}/{DOC_FEEDBACK}" 129 | req = { 130 | "query": query, 131 | "document": document, 132 | "is_correct_answer": is_correct_answer, 133 | "is_correct_document": is_correct_document, 134 | "origin": "user-feedback", 135 | "answer": answer_obj, 136 | } 137 | response_raw = requests.post(url, json=req) 138 | if response_raw.status_code >= 400: 139 | raise ValueError(f"An error was returned [code {response_raw.status_code}]: {response_raw.json()}") 140 | 141 | 142 | def upload_doc(file): 143 | url = f"{API_ENDPOINT}/{DOC_UPLOAD}" 144 | files = [("files", file)] 145 | response = requests.post(url, files=files).json() 146 | return response 147 | 148 | 149 | def get_backlink(result) -> Tuple[Optional[str], Optional[str]]: 150 | if result.get("document", None): 151 | doc = result["document"] 152 | if isinstance(doc, dict): 153 | if doc.get("meta", None): 154 | if isinstance(doc["meta"], dict): 155 | if doc["meta"].get("url", None) and doc["meta"].get("title", None): 156 | return doc["meta"]["url"], doc["meta"]["title"] 157 | return None, None 158 | -------------------------------------------------------------------------------- /application/frontend/webapp.py: -------------------------------------------------------------------------------- 1 | # Modified from https://github.com/deepset-ai/haystack/blob/main/ui/webapp.py 2 | # commit 1a0197839c6ee0a90e0f562af5edf57a891d473a 3 | # under Apache-2.0 license 4 | ########################################################################################################### 5 | 6 | import os 7 | import sys 8 | import logging 9 | import random 10 | from pathlib import Path 11 | from json import JSONDecodeError 12 | 13 | import pandas as pd 14 | import streamlit as st 15 | from annotated_text import annotation 16 | from markdown import markdown 17 | 18 | from ui.utils import haystack_is_ready, query, send_feedback, upload_doc, haystack_version, get_backlink 19 | 20 | 21 | # Adjust to a question that you would like users to see in the search bar when they load the UI: 22 | DEFAULT_QUESTION_AT_STARTUP = os.getenv("DEFAULT_QUESTION_AT_STARTUP", "How to protect against DDoS attacks?") 23 | DEFAULT_ANSWER_AT_STARTUP = os.getenv("DEFAULT_ANSWER_AT_STARTUP", "AWS Shield is a managed Distributed Denial of Service (DDoS) protection service that safeguards applications running on AWS. 
AWS Shield provides always-on detection and automatic inline mitigations that minimize application downtime and latency, so there is no need to engage AWS Support to benefit from DDoS protection") 24 | 25 | # Sliders 26 | DEFAULT_DOCS_FROM_RETRIEVER = int(os.getenv("DEFAULT_DOCS_FROM_RETRIEVER", "3")) 27 | DEFAULT_NUMBER_OF_ANSWERS = int(os.getenv("DEFAULT_NUMBER_OF_ANSWERS", "3")) 28 | 29 | 30 | 31 | def set_state_if_absent(key, value): 32 | if key not in st.session_state: 33 | st.session_state[key] = value 34 | 35 | 36 | def main(): 37 | 38 | st.set_page_config(page_title="Semantic Search - AWS Docs", page_icon="https://a0.awsstatic.com/libra-css/images/site/fav/favicon.ico") 39 | 40 | # Persistent state 41 | set_state_if_absent("question", DEFAULT_QUESTION_AT_STARTUP) 42 | set_state_if_absent("answer", DEFAULT_ANSWER_AT_STARTUP) 43 | set_state_if_absent("results", None) 44 | set_state_if_absent("raw_json", None) 45 | set_state_if_absent("random_question_requested", False) 46 | 47 | # Small callback to reset the interface in case the text of the question changes 48 | def reset_results(*args): 49 | st.session_state.answer = None 50 | st.session_state.results = None 51 | st.session_state.raw_json = None 52 | 53 | # Title 54 | st.write("# Semantic Search on AWS") 55 | st.markdown( 56 | """ 57 | Ask any question on about the documents to see if we can find the correct answer to your query! 58 | *Note: do not use keywords, but full-fledged questions.* The demo is not optimized to deal with keyword queries and might misunderstand you. 59 | """, 60 | unsafe_allow_html=True, 61 | ) 62 | 63 | # Sidebar 64 | st.sidebar.header("Options") 65 | 66 | answer_style = st.sidebar.radio( 67 | "Answer Style:", 68 | ('Extractive', 'Generative'), 69 | index=0, 70 | on_change=reset_results 71 | ) 72 | 73 | 74 | top_k_reader = st.sidebar.slider( 75 | "Max. number of answers", 76 | min_value=1, 77 | max_value=10, 78 | value=DEFAULT_NUMBER_OF_ANSWERS, 79 | step=1, 80 | on_change=reset_results, 81 | ) 82 | top_k_retriever = st.sidebar.slider( 83 | "Max. number of documents from retriever", 84 | min_value=1, 85 | max_value=10, 86 | value=DEFAULT_DOCS_FROM_RETRIEVER, 87 | step=1, 88 | on_change=reset_results, 89 | ) 90 | debug = st.sidebar.checkbox("Show debug info") 91 | 92 | 93 | hs_version = "" 94 | try: 95 | hs_version = f" (v{haystack_version(answer_style=answer_style)})" 96 | except Exception: 97 | pass 98 | 99 | st.sidebar.markdown( 100 | f""" 101 | 116 | 121 | """, 122 | unsafe_allow_html=True, 123 | ) 124 | 125 | # Search bar 126 | question = st.text_input("", value=st.session_state.question, max_chars=100, on_change=reset_results) 127 | col1, col2 = st.columns(2) 128 | col1.markdown("", unsafe_allow_html=True) 129 | col2.markdown("", unsafe_allow_html=True) 130 | 131 | # Run button 132 | run_pressed = col1.button("Run") 133 | 134 | 135 | example_questions = [ 136 | "What is Amazon SageMaker?", 137 | "Why is it a best practice to use multiple availability zones (AZs)?", 138 | "How to protect against DDoS on AWS?" 
139 | ] 140 | 141 | # Get next random question from the CSV 142 | if col2.button("Random question"): 143 | reset_results() 144 | st.session_state.question = random.choice(example_questions) 145 | st.session_state.answer = "" 146 | st.session_state.random_question_requested = True 147 | # Re-runs the script setting the random question as the textbox value 148 | # Unfortunately necessary as the Random Question button is _below_ the textbox 149 | raise st.scriptrunner.script_runner.RerunException(st.scriptrunner.script_requests.RerunData(None)) 150 | st.session_state.random_question_requested = False 151 | 152 | run_query = ( 153 | run_pressed or question != st.session_state.question 154 | ) and not st.session_state.random_question_requested 155 | 156 | # Check the connection 157 | with st.spinner("⌛️    Backend is starting..."): 158 | if not haystack_is_ready(answer_style=answer_style): 159 | st.error("🚫    Connection Error. Is the backend running?") 160 | run_query = False 161 | reset_results() 162 | 163 | # Get results for query 164 | if run_query and question: 165 | reset_results() 166 | st.session_state.question = question 167 | 168 | with st.spinner( 169 | "🧠    Performing neural search on documents... \n " 170 | "Do you want to optimize speed or accuracy? \n" 171 | "Check out the docs: https://haystack.deepset.ai/usage/optimization " 172 | ): 173 | try: 174 | st.session_state.results, st.session_state.raw_json = query( 175 | question, top_k_reader=top_k_reader, top_k_retriever=top_k_retriever, answer_style=answer_style, debug=debug 176 | ) 177 | except JSONDecodeError as je: 178 | st.error("👓    An error occurred reading the results. Is the document store working?") 179 | return 180 | except Exception as e: 181 | logging.exception(e) 182 | if "The server is busy processing requests" in str(e) or "503" in str(e): 183 | st.error("🧑‍🌾    All our workers are busy! Try again later.") 184 | else: 185 | st.error("🐞    An error occurred during the request.") 186 | return 187 | 188 | if st.session_state.results: 189 | 190 | st.write("## Results:") 191 | 192 | for count, result in enumerate(st.session_state.results): 193 | if result["answer"]: 194 | answer, context = result["answer"], result["context"] 195 | start_idx = context.find(answer) 196 | end_idx = start_idx + len(answer) 197 | # Hack due to this bug: https://github.com/streamlit/streamlit/issues/3190 198 | st.write( 199 | markdown(context[:start_idx] + str(annotation(answer, "ANSWER", "#8ef")) + context[end_idx:]), 200 | unsafe_allow_html=True, 201 | ) 202 | source = "" 203 | url, title = get_backlink(result) 204 | if url and title: 205 | source = f"[{result['document']['meta']['title']}]({result['document']['meta']['url']})" 206 | else: 207 | source = f"{result['source']}" 208 | st.markdown(f"**Relevance:** {result['relevance']} - **Source:** {source}") 209 | 210 | else: 211 | st.info( 212 | "🤔    Haystack is unsure whether any of the documents contain an answer to your question. Try to reformulate it!" 
213 | ) 214 | st.write("**Relevance:** ", result["relevance"]) 215 | 216 | st.write("___") 217 | if debug: 218 | st.subheader("REST API JSON response") 219 | st.write(st.session_state.raw_json) 220 | 221 | main() -------------------------------------------------------------------------------- /cloud9/resize.sh: -------------------------------------------------------------------------------- 1 | 2 | #!/bin/bash 3 | 4 | # https://docs.aws.amazon.com/cloud9/latest/user-guide/move-environment.html#move-environment-resize 5 | 6 | # Specify the desired volume size in GiB as a command line argument. If not specified, default to 20 GiB. 7 | SIZE=${1:-20} 8 | 9 | # Get the ID of the environment host Amazon EC2 instance. 10 | TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 60") 11 | INSTANCEID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" -v http://169.254.169.254/latest/meta-data/instance-id 2> /dev/null) 12 | REGION=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" -v http://169.254.169.254/latest/meta-data/placement/region 2> /dev/null) 13 | 14 | # Get the ID of the Amazon EBS volume associated with the instance. 15 | VOLUMEID=$(aws ec2 describe-instances \ 16 | --instance-id $INSTANCEID \ 17 | --query "Reservations[0].Instances[0].BlockDeviceMappings[0].Ebs.VolumeId" \ 18 | --output text \ 19 | --region $REGION) 20 | 21 | # Resize the EBS volume. 22 | aws ec2 modify-volume --volume-id $VOLUMEID --size $SIZE 23 | 24 | # Wait for the resize to finish. 25 | while [ \ 26 | "$(aws ec2 describe-volumes-modifications \ 27 | --volume-id $VOLUMEID \ 28 | --filters Name=modification-state,Values="optimizing","completed" \ 29 | --query "length(VolumesModifications)"\ 30 | --output text)" != "1" ]; do 31 | sleep 1 32 | done 33 | 34 | # Check if we're on an NVMe filesystem 35 | if [[ -e "/dev/xvda" && $(readlink -f /dev/xvda) = "/dev/xvda" ]] 36 | then 37 | # Rewrite the partition table so that the partition takes up all the space that it can. 38 | sudo growpart /dev/xvda 1 39 | # Expand the size of the file system. 40 | # Check if we're on AL2 41 | STR=$(cat /etc/os-release) 42 | SUB="VERSION_ID=\"2\"" 43 | if [[ "$STR" == *"$SUB"* ]] 44 | then 45 | sudo xfs_growfs -d / 46 | else 47 | sudo resize2fs /dev/xvda1 48 | fi 49 | 50 | else 51 | # Rewrite the partition table so that the partition takes up all the space that it can. 52 | sudo growpart /dev/nvme0n1 1 53 | 54 | # Expand the size of the file system. 55 | # Check if we're on AL2 56 | STR=$(cat /etc/os-release) 57 | SUB="VERSION_ID=\"2\"" 58 | if [[ "$STR" == *"$SUB"* ]] 59 | then 60 | sudo xfs_growfs -d / 61 | else 62 | sudo resize2fs /dev/nvme0n1p1 63 | fi 64 | fi 65 | -------------------------------------------------------------------------------- /documentation/aws-cloud9-deployment.md: -------------------------------------------------------------------------------- 1 | # AWS Cloud9 Deployment 2 | This document walks through the deployment of Semantic Search on AWS using AWS Cloud9. Following this guide deploys the semantic search application with default configurations. 3 | 4 | ## Create AWS Cloud9 environment 5 | 1. Open [AWS Cloud9 in AWS](https://console.aws.amazon.com/cloud9control/home) 6 | 2. Use the **Create environment** button to create a new AWS Cloud9 IDE. 7 | 1. Give your AWS Cloud9 environment a name, for example `semantic-search-deployment`. 8 | 2. Select t3.small or m5.large and leave the other configurations as the defaults. 9 | 3. **Create** the AWS Cloud9 environment. 
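If you prefer not to click through the console, you can also create an equivalent AWS Cloud9 environment from a terminal with the AWS CLI. This is only a sketch: the exact flags (in particular `--image-id`) depend on your AWS CLI version and region, so double-check them against the `aws cloud9 create-environment-ec2` documentation before running it.

```shell
# Sketch: creates a Cloud9 environment similar to the console steps above.
# The --image-id value is an assumption; newer CLI versions require it explicitly.
aws cloud9 create-environment-ec2 \
  --name semantic-search-deployment \
  --instance-type m5.large \
  --image-id amazonlinux-2-x86_64 \
  --automatic-stop-time-minutes 60
```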
10 | 11 | ## Open the AWS Cloud9 environment 12 | 1. Go to your [AWS Cloud9 Dashboard in AWS](https://console.aws.amazon.com/cloud9control/home) 13 | 2. In the list of AWS Cloud9 environments **open** your AWS Cloud9 environment that you created before. The **open** button opens your AWS Cloud9 IDE in a new browser tab. Wait for the Cloud9 IDE to connect. This may take a minute. 14 | 15 | ## Semantic Search Infrastructure Code 16 | 1. Execute `git clone https://github.com/aws-samples/semantic-search-aws-docs.git` in the terminal of your AWS Cloud9 IDE. 17 | 18 | ## Modify AWS Cloud9 EC2 Instance 19 | Once your AWS Cloud9 environment creation completes you will need to increase the underlying EC2 instance storage to be able to pull the container images that the semantic search application deploys. 20 | 1. In the Terminal of your AWS Cloud9 IDE make the resize bash script executable `chmod +x ~/environment/semantic-search-aws-docs/cloud9/resize.sh` 21 | 2. Increase the Amazon EBS volume size of your AWS Cloud9 environment to 50 GB `~/environment/semantic-search-aws-docs/cloud9/resize.sh 50` 22 | 23 | ## AWS CLI Credentials 24 | Terraform needs to access AWS resources from within your AWS Cloud9 environment to deploy the semantic search application. The recommended way to access AWS resources from within the AWS Cloud9 environment is to use [AWS managed temporary credentials](https://docs.aws.amazon.com/cloud9/latest/user-guide/security-iam.html#auth-and-access-control-temporary-managed-credentials-supported). For the semantic search deployment the AWS managed temporary credentials do not have sufficient permissions to create an Amazon EC2 instance profile. 25 | 1. In your AWS Cloud9 IDE at the top left click on **AWS Cloud9** and then on **Preferences**. 26 | 2. On the Preferences page navigate to **AWS Settings**, **Credentials**, and disable **AWS managed temporary credentials**. 27 | 28 | You need to authenticate directly with the AWS CLI in your AWS Cloud9 IDE. To do so you need to follow one of the AWS CLI [Authentication and access credentials methods](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-authentication.html). 29 | 30 | One easy way to setup authentication is by [setting environment variables ](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html#:~:text=supported%20environment%20variables-,How%20to%20set%20environment%20variables,-The%20following%20examples) 31 | 32 | e.g. similar to the snippet below, but with your own access key and access key. 33 | ``` 34 | export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE 35 | export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY 36 | export AWS_DEFAULT_REGION=us-east-1 37 | ``` 38 | 39 | Once you authenticated with the AWS CLI you can go on to the next step. 40 | 41 | ## GPU or CPU deployment 42 | The default configuration uses GPU instances for the semantic search application. If you want to deploy this solution with GPU acceleration you will need to increase the _Running On-Demand G and VT instances_ [EC2 Service quota to at least 8](https://aws.amazon.com/premiumsupport/knowledge-center/ec2-instance-limit/). 43 | 44 | To deploy the semantic search application without GPU instance open `semantic-search-aws-docs/infrastructure/main.tf` in your AWS Cloud9 IDE. Search for _Using a CPU instance_ in the Terraform file. Uncomment the CPU `image_id` and `instance_type` and add comments before the GPU `image_id` and `instance_type`. 
The code should now look like the following: 45 | ``` 46 | ### Using a CPU instance ### 47 | image_id = data.aws_ssm_parameter.ami.value #AMI name like amzn2-ami-ecs-hvm-2.0.20220520-x86_64-ebs 48 | instance_type = "c6i.2xlarge" 49 | 50 | ### Using a GPU instance ### 51 | #image_id = data.aws_ssm_parameter.ami_gpu.value #AMI name like amzn2-ami-ecs-gpu-hvm-2.0.20220520-x86_64-ebs 52 | #instance_type = "g4dn.xlarge" 53 | ``` 54 | 55 | ## Deploy Semantic Search Infrastructure 56 | 1. In your AWS Cloud9 environment terminal navigate to `cd ~/environment/semantic-search-aws-docs/infrastructure`. 57 | 2. Set the following environment variables. Change their value if you are using a different region other than `us-east-1` or if you want to give the Terraform state Amazon S3 bucket and state sync Amazon DynamoDB table different names. 58 | 59 | ```bash 60 | export REGION=us-east-1 61 | export S3_BUCKET="terraform-semantic-search-state-$(aws sts get-caller-identity --query Account --output text)" 62 | export SYNC_TABLE="terraform-semantic-search-state-sync" 63 | ``` 64 | 65 | 3. Create the Terraform state bucket in Amazon S3
66 | `aws s3 mb s3://$S3_BUCKET --region=$REGION` 67 | 4. Create the Terraform state sync table in Amazon DynamoDB
68 | `aws dynamodb create-table --table-name $SYNC_TABLE --attribute-definitions AttributeName=LockID,AttributeType=S --key-schema AttributeName=LockID,KeyType=HASH --billing-mode PAY_PER_REQUEST --region=$REGION` 69 | 5. Initialize Terraform for the infrastructure deployment
`terraform init -backend-config="bucket=$S3_BUCKET" -backend-config="region=$REGION" -backend-config="dynamodb_table=$SYNC_TABLE"` 70 | 6. Deploy the Semantic Search infrastructure with Terraform
`terraform apply -var="region=$REGION" -var="index_name=awsdocs"` 71 | * Change the Terraform variable `index_name` if you want to change the name of your [index](https://opensearch.org/docs/latest/dashboards/im-dashboards/index-management/) in the Amazon OpenSearch cluster. The search API uses this variable to search for documents. 72 | * Enter `yes` when Terraform prompts you _"to perform these actions"_. 73 | * The deployment will take 10–20 minutes. Wait for completion before moving on with the document ingestion deployment. 74 | 7. You can retrieve the frontend URL with the following command. Note that it may take some time until the ECS tasks are in a running state:
75 | `terraform output loadbalancer_url` 76 | ## Deploy and Run Semantic Search Ingestion 77 | In your AWS Cloud9 environment's terminal navigate to
78 | `cd ~/environment/semantic-search-aws-docs/ingestion`. 79 | * If you want to ingest the AWS documentation, follow the
**[Ingest AWS Documentation instructions](./ingest-aws-documentation.md)**. 80 | * If you want to ingest your documents from a URL (for example from Amazon S3), follow the
**[Ingest Documents from URL instructions](./ingest-documents-from-url.md)**. 81 | * If you would instead like to make local documents searchable, follow the
**[Local Documents Ingestion instructions](./ingest-custom-local-documents.md)**. 82 | 83 | ### Clean up Ingestion 84 | After ingesting your documents you can remove the ingestion resources. Follow the [Clean up Ingestion Resources instructions](./clean-up-ingestion-resources.md) to clean up the ingestion resources. 85 | 86 | ## Clean up Infrastructure 87 | Destroy the resources that were deployed for the infrastructure of the semantic search application if you are not using the application anymore. 88 | 1. In your AWS Cloud9 IDE navigate to the infrastructure directory `cd ~/environment/semantic-search-aws-docs/infrastructure` 89 | 2. Clean up the semantic search application infrastructure with the `terraform destroy -var="region=$REGION"` command. 90 | 1. Run `eval REGION=$(terraform output region)` if your `REGION` variable is not set anymore. 91 | 3. Enter `yes` when Terraform prompts you _"Do you really want to destroy all resources?"_. 92 | -------------------------------------------------------------------------------- /documentation/clean-up-ingestion-resources.md: -------------------------------------------------------------------------------- 1 | # Clean up Ingestion Resources 2 | You can clean up the ingestion resources immediately after ingesting the documents into your OpenSearch index. The ingestion executes as a one-off Amazon ECS task. For a production scenario with changes to the source documents, you should consider [scheduling Amazon ECS tasks](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/scheduling_tasks.html) to ingest the latest version of the documents on a schedule. 3 | 1. In your terminal navigate to the ingestion directory `cd ~/semantic-search-aws-docs/ingestion` in this repository. 4 | 2. Run `terraform destroy -var-file="awsdocs.tfvars" -var="infra_region=$REGION" -var="infra_tf_state_s3_bucket=$S3_BUCKET" -var="docs_src=$(sed -e 's/^"//' -e 's/"$//' <<< "$(terraform output docs_src)")"` to clean up the ingestion resources. 5 | 3. Enter `yes` when Terraform prompts you _"Do you really want to destroy all resources?"_. -------------------------------------------------------------------------------- /documentation/ingest-aws-documentation.md: -------------------------------------------------------------------------------- 1 | # Ingest AWS Documentation 2 | The next steps guide you through ingesting the [Amazon EC2 User Guide for Linux](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide) into your Amazon OpenSearch index to make the AWS documentation searchable. To learn how to ingest your own documents instead, take a look at [Ingesting your own documents](ingest-custom-local-documents.md). 3 | 1. In your terminal navigate to
4 | `cd ~/semantic-search-aws-docs/ingestion` in this repository. 5 | 2. Initialize Terraform `terraform init` 6 | 3. Run `AWS_DOCS=amazon-ec2-user-guide`.
`amazon-ec2-user-guide` references the name of the [GitHub repository that contains the Amazon EC2 User Guide for Linux](https://github.com/awsdocs/amazon-ec2-user-guide). You can replace `amazon-ec2-user-guide` with any of the AWS documentation repository names from the [AWS Docs GitHub](https://github.com/awsdocs), for example `amazon-eks-user-guide` or `full` to ingest all AWS Docs repos. 7 | 3. Deploy the ingestion resources
8 | ```terraform apply -var-file="awsdocs.tfvars" -var="infra_region=$REGION" -var="infra_tf_state_s3_bucket=$S3_BUCKET" -var="docs_src=$AWS_DOCS"``` 9 | * Enter `yes` when Terraform prompts you _"to perform these actions"_. 10 | 4. After the successful deployment of the ingestion resources, you need to wait until the ingestion task completes. Follow the [Wait for Ingestion to Complete instructions to check for completion from your terminal](ingest-wait-for-completion.md). 11 | 12 | ## Clean up Ingestion Resources 13 | After ingesting your documents you can remove the ingestion resources. Follow the [Clean up Ingestion Resources instructions](./clean-up-ingestion-resources.md) to clean up the ingestion resources. -------------------------------------------------------------------------------- /documentation/ingest-custom-local-documents.md: -------------------------------------------------------------------------------- 1 | # Local Documents Ingestion 2 | This document explains step-by-step how to ingest local documents to make them searchable using this AWS semantic search solution. 3 | 4 | ## Deploy Semantic Search Ingestion for Local Documents 5 | The next steps guide you through ingesting documents from your local storage into the Amazon OpenSearch index. 6 | 1. In your terminal navigate to the ingestion directory
`cd ~/semantic-search-aws-docs/ingestion` 7 | 2. Initialize Terraform
`terraform init` 8 | 2. Set `DOCS_DIR` variable to the path of the directory that contains the documents (e.g. `*.txt` files) that you want to ingest. The path needs to be relative to the [Dockerfile](/ingestion/Dockerfile) directory. For example use `DOCS_DIR=mydocs/data` to ingest all the documents located in the [ingestion/mydocs/data](/ingestion/mydocs/data/) directory. 9 | 3. Deploy the ingestion resources
`terraform apply -var-file="mydocs.tfvars" -var="infra_region=$REGION" -var="infra_tf_state_s3_bucket=$S3_BUCKET" -var="docs_src=$DOCS_DIR"` 10 | 1. Enter `yes` when Terraform prompts you _"to perform these actions"_. 11 | 4. After the successful deployment of the ingestion resources you need to wait until the ingestion task completes. Follow the [Wait for Ingestion to Complete instructions to check for completion from you terminal](ingest-wait-for-completion.md). 12 | 13 | ## Clean up Ingestion Resources 14 | After ingesting your documents you can remove the ingestion resources. Follow the [Clean up Ingestion Resources instructions](./clean-up-ingestion-resources.md) to clean up the ingestion resources. -------------------------------------------------------------------------------- /documentation/ingest-documents-from-url.md: -------------------------------------------------------------------------------- 1 | # Ingest Documents from URL 2 | This document explains step-by-step how to ingest documents from an archive at a URL to make them searchable using this AWS semantic search solution. 3 | 4 | If you are on MacOS make sure that when compressing your documents into the archive that it does not add AppleDouble blobs `*_` files, see also [this StackExchange answer for the question Create tar archive of a directory, except for hidden files?](https://unix.stackexchange.com/a/9865) 5 | 6 | ## Deploy Semantic Search Ingestion for Documents from URL 7 | The next steps guide you through ingesting documents from a URL into the Amazon OpenSearch index. The URL needs to point to an archive (zip, gz or tar.gz) and has to be accessible from the `NLPSearchPrivateSubnet` subnet. The subnet can reach the internet through a NAT gateway. 8 | 1. In your terminal navigate to `cd ~/semantic-search-aws-docs/ingestion` 9 | 2. Initialize Terraform `terraform init` 10 | 2. Set `DOCS_SRC` variable to the URL from which you want to ingest the documents from, for example if your documents are in Amazon S3 then you could create a [presigned URL](https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-presigned-url.html) for the archive in Amazon S3 and assign `DOCS_SRC=https://.s3..amazonaws.com/data.zip?response-content-disposition=inline&X-Amz-Security-Token=[...]`. 11 | 3. Deploy the ingestion resources `terraform apply -var-file="urldocs.tfvars" -var="infra_region=$REGION" -var="infra_tf_state_s3_bucket=$S3_BUCKET" -var="docs_src=$DOCS_SRC"` 12 | 1. Enter `yes` when Terraform prompts you _"to perform these actions"_. 13 | 4. After the successful deployment of the ingestion resources you need to wait until the ingestion task completes. Follow the [Wait for Ingestion to Complete instructions to check for completion from you terminal](ingest-wait-for-completion.md). 14 | 15 | ## Clean up Ingestion Resources 16 | After ingesting your documents you can remove the ingestion resources. Follow the [Clean up Ingestion Resources instructions](./clean-up-ingestion-resources.md) to clean up the ingestion resources. -------------------------------------------------------------------------------- /documentation/ingest-wait-for-completion.md: -------------------------------------------------------------------------------- 1 | # Wait for Ingestion to Complete 2 | After deploying the ingestion Terraform resources you will need to wait for the ingestion to complete before being able to search the documents. 
Check the [Amazon Elastic Container Service console](https://console.aws.amazon.com/ecs) to see when the task completes, or use below AWS CLI commands to wait for the task to complete. 3 | 4 | ## Wait in AWS Console 5 | Go to the Tasks page of your Amazon ECS Cluster that the infrastructure stack deployed. The default name of the cluster is *NLPSearchECSCluster*. In Tasks list look for the task that has `ingestion-job` as the Task definition. Wait until the status of the `ingestion-job` task changes from *Running* to *Stopped*. 6 | 7 | ## Use AWS CLI to wait 8 | 1. Navigate to the infrastructure directory `cd ~/semantic-search-aws-docs/infrastructure`. 9 | 1. Run `ECS_CLUSTER_ARN=$(terraform output --raw ecs_cluster_arn)` to get the ARN of your Amazon ECS cluster. 10 | 2. Run `TASK_ARN=$(aws ecs list-tasks --family ingestion-job --region $REGION --cluster $ECS_CLUSTER_ARN --output text --query 'taskArns[0]')` to get the ARN of the ECS task that is ingesting the documents. 11 | 3. Run `aws ecs wait tasks-stopped --region $REGION --cluster $ECS_CLUSTER_ARN --tasks $TASK_ARN` to wait until the ingestion of the documents completes. When the command exits with _Waiter TasksStopped failed: Max attempts exceeded_ as the message run the command again. Once the command exits without any output then the ingestion job completed. 12 | 4. After the ingestion completes run `terraform output loadbalancer_url` to get the URL for the semantic search frontend. -------------------------------------------------------------------------------- /infrastructure/data.tf: -------------------------------------------------------------------------------- 1 | data "aws_caller_identity" "current" {} 2 | 3 | data "aws_ecr_authorization_token" "token" {} -------------------------------------------------------------------------------- /infrastructure/locals.tf: -------------------------------------------------------------------------------- 1 | locals { 2 | aws_ecr_url = "${data.aws_caller_identity.current.account_id}.dkr.ecr.${var.region}.amazonaws.com" 3 | } -------------------------------------------------------------------------------- /infrastructure/main.tf: -------------------------------------------------------------------------------- 1 | provider "aws" { 2 | region = var.region 3 | default_tags { 4 | tags = { 5 | Project = "semantic-search-aws-docs" 6 | } 7 | } 8 | } 9 | 10 | ### Networking ### 11 | resource "aws_vpc" "aws-vpc" { 12 | cidr_block = var.vpc_cidr 13 | 14 | enable_dns_hostnames = true # Required for DNS-based service discovery 15 | enable_dns_support = true # Required for DNS-based service discovery 16 | tags = { 17 | name = "NLPSearchVPC" 18 | } 19 | } 20 | resource "aws_subnet" "private" { 21 | vpc_id = aws_vpc.aws-vpc.id 22 | count = length(var.private_subnets) 23 | cidr_block = element(var.private_subnets, count.index) 24 | availability_zone = element(var.availability_zones, count.index) 25 | tags = { 26 | Name = "NLPSearchPrivateSubnet" 27 | Tier = "Private" 28 | } 29 | } 30 | resource "aws_subnet" "public" { 31 | vpc_id = aws_vpc.aws-vpc.id 32 | cidr_block = element(var.public_subnets, count.index) 33 | availability_zone = element(var.availability_zones, count.index) 34 | count = length(var.public_subnets) 35 | map_public_ip_on_launch = true 36 | tags = { 37 | Name = "NLPSearchPublicSubnet" 38 | Tier = "Public" 39 | } 40 | } 41 | resource "aws_internet_gateway" "main" { 42 | vpc_id = aws_vpc.aws-vpc.id 43 | } 44 | resource "aws_route_table" "public" { 45 | vpc_id = aws_vpc.aws-vpc.id 46 | 47 | 
route { 48 | cidr_block = "0.0.0.0/0" 49 | gateway_id = aws_internet_gateway.main.id 50 | } 51 | } 52 | resource "aws_route_table_association" "public" { 53 | count = length(var.public_subnets) 54 | subnet_id = element(aws_subnet.public.*.id, count.index) 55 | route_table_id = aws_route_table.public.id 56 | } 57 | resource "aws_alb" "main" { 58 | name = "nlp-search-alb" 59 | internal = false 60 | load_balancer_type = "application" 61 | subnets = aws_subnet.public.*.id 62 | security_groups = [aws_security_group.alb.id] 63 | } 64 | resource "aws_lb_target_group" "search_ui" { 65 | name = "nlp-search-alb-target-group" 66 | port = 8501 67 | protocol = "HTTP" 68 | target_type = "ip" 69 | vpc_id = aws_vpc.aws-vpc.id 70 | 71 | health_check { 72 | protocol = "HTTP" 73 | healthy_threshold = "3" 74 | interval = "30" 75 | matcher = "200" 76 | timeout = "10" 77 | path = "/" 78 | unhealthy_threshold = "2" 79 | } 80 | } 81 | resource "aws_lb_listener" "search_ui" { 82 | load_balancer_arn = aws_alb.main.id 83 | port = "80" 84 | protocol = "HTTP" 85 | 86 | default_action { 87 | type = "forward" 88 | target_group_arn = aws_lb_target_group.search_ui.id 89 | } 90 | } 91 | 92 | resource "aws_eip" "nat_gw" { 93 | domain = "vpc" 94 | depends_on = [aws_internet_gateway.main] 95 | } 96 | 97 | resource "aws_nat_gateway" "main" { 98 | allocation_id = aws_eip.nat_gw.id 99 | subnet_id = element(aws_subnet.public.*.id, 0) 100 | 101 | depends_on = [aws_internet_gateway.main] 102 | } 103 | 104 | resource "aws_route_table" "private" { 105 | vpc_id = aws_vpc.aws-vpc.id 106 | route { 107 | cidr_block = "0.0.0.0/0" 108 | nat_gateway_id = aws_nat_gateway.main.id 109 | } 110 | } 111 | 112 | resource "aws_route_table_association" "private" { 113 | count = length(var.private_subnets) 114 | subnet_id = element(aws_subnet.private.*.id, count.index) 115 | route_table_id = aws_route_table.private.id 116 | } 117 | 118 | ### Logs ### 119 | resource "aws_cloudwatch_log_group" "app" { 120 | name = "/semantic-search" 121 | retention_in_days = 30 122 | } 123 | 124 | 125 | ### IAM ### 126 | resource "aws_iam_role" "search_ui" { 127 | name = "NLPSearchSearchUIECSTaskRole" 128 | 129 | assume_role_policy = <> /etc/ecs/ecs.config 396 | echo ECS_ENABLE_CONTAINER_METADATA=true >> /etc/ecs/ecs.config 397 | echo ECS_ENABLED_GPU_SUPPORT=true >> /etc/ecs/ecs.config 398 | EOF 399 | ) 400 | } 401 | 402 | 403 | ### ECS Tasks and Services ### 404 | 405 | resource "aws_ecs_service" "search_ui" { 406 | name = "search_ui" 407 | cluster = aws_ecs_cluster.main.id 408 | task_definition = aws_ecs_task_definition.search_ui.arn 409 | desired_count = 1 410 | launch_type = "FARGATE" 411 | 412 | network_configuration { 413 | subnets = aws_subnet.private.*.id 414 | security_groups = [aws_security_group.search_ui.id] 415 | } 416 | 417 | load_balancer { 418 | target_group_arn = aws_lb_target_group.search_ui.arn 419 | container_name = "search-ui" 420 | container_port = 8501 421 | } 422 | 423 | depends_on = [aws_lb_listener.search_ui, docker_registry_image.search_ui] 424 | 425 | } 426 | 427 | 428 | resource "aws_ecs_task_definition" "search_ui" { 429 | family = "search-ui" 430 | container_definitions = <=0.8.2 3 | markdown>=3.3.7 4 | beautifulsoup4>=4.11.1 5 | opensearch-py>=2.0.0 6 | farm-haystack[opensearch,preprocessing,file-conversion] 7 | rfc3986-validator 8 | pydantic==1.* -------------------------------------------------------------------------------- /ingestion/awsdocs/scripts/0_setup_env.sh: 
-------------------------------------------------------------------------------- 1 | conda env create --prefix ./conda-env --file=conda-env.yaml ||: 2 | eval "$(conda shell.bash hook)" 3 | conda activate ./conda-env 4 | pip install -r requirements.txt -------------------------------------------------------------------------------- /ingestion/awsdocs/scripts/clone_awsdocs.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | REPO=${1:-amazon-ec2-user-guide} 4 | 5 | echo "REPO-Substring=$REPO" 6 | # get number of repos of awsdocs 7 | curl https://api.github.com/orgs/awsdocs | jq .public_repos 8 | 9 | numRepos=$(curl https://api.github.com/orgs/awsdocs | jq .public_repos) 10 | 11 | echo "numRepos: $numRepos" 12 | 13 | if [ "$numRepos" = "null" ] 14 | then 15 | echo "ERROR: $numRepos is NULL. Github rate limit probably exeeded. Please wait some time before trying again." 1>&2 16 | exit 1 # terminate and indicate error 17 | fi 18 | 19 | numPages=$((numRepos / 100 + 1)) 20 | echo "numRepos: $numRepos" 21 | echo "numPages: $numPages" 22 | 23 | # download 100 repos per page and clone the repo in the current directory 24 | for (( c=1; c<=$numPages; c++ )) 25 | do 26 | echo "Page: $c / $numPages" 27 | curl https://api.github.com/orgs/awsdocs/repos\?per_page\=100\&page\=$c | jq '.[].clone_url' | tr -d \" | while read line || [[ -n $line ]]; 28 | do 29 | if [[ "$line" == *$REPO* ]] || [[ "$line" == full ]] ; 30 | then 31 | echo "Found: $line" 32 | if ! git clone -b main $line; then 33 | if ! git clone -b master $line; then 34 | git clone $line || true 35 | fi 36 | fi 37 | fi 38 | done 39 | done -------------------------------------------------------------------------------- /ingestion/awsdocs/scripts/run-opensearch.sh: -------------------------------------------------------------------------------- 1 | # Recommended: Start Elasticsearch using Docker 2 | #! 
docker run -d -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.6.2 3 | 4 | # docker pull opensearchproject/opensearch:1.0.1 5 | #!docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" opensearchproject/opensearch:1.0.1 -------------------------------------------------------------------------------- /ingestion/awsdocs/scripts/run_ingestion_awsdocs.sh: -------------------------------------------------------------------------------- 1 | SCRIPT_DIR="$( cd -- "$( dirname -- "${BASH_SOURCE[0]:-$0}"; )" &> /dev/null && pwd 2> /dev/null; )"; 2 | 3 | #MAIN=dirname "$0" 4 | 5 | echo "script_dir MAIN: $SCRIPT_DIR" 6 | bash $SCRIPT_DIR/clone_awsdocs.sh $1 7 | bash $SCRIPT_DIR/run_ingestion_local.sh "./" $2 -------------------------------------------------------------------------------- /ingestion/awsdocs/scripts/run_ingestion_local.sh: -------------------------------------------------------------------------------- 1 | SCRIPT_DIR="$( cd -- "$( dirname -- "${BASH_SOURCE[0]:-$0}"; )" &> /dev/null && pwd 2> /dev/null; )"; 2 | 3 | #MAIN=dirname "$0" 4 | 5 | echo "script_dir MAIN: $SCRIPT_DIR" 6 | python3.8 $SCRIPT_DIR/../src/ingest.py --src ./ --index_name $2 -------------------------------------------------------------------------------- /ingestion/awsdocs/scripts/run_ingestion_url.sh: -------------------------------------------------------------------------------- 1 | SCRIPT_DIR="$( cd -- "$( dirname -- "${BASH_SOURCE[0]:-$0}"; )" &> /dev/null && pwd 2> /dev/null; )"; 2 | 3 | #MAIN=dirname "$0" 4 | 5 | echo "script_dir MAIN: $SCRIPT_DIR" 6 | python3.8 $SCRIPT_DIR/../src/ingest.py --src $1 --index_name $2 -------------------------------------------------------------------------------- /ingestion/awsdocs/src/get_faqs.py: -------------------------------------------------------------------------------- 1 | import urllib.request, json 2 | from pathlib import Path 3 | import os 4 | import re 5 | url_services="https://aws.amazon.com/api/dirs/items/search?item.directoryId=aws-products&sort_by=item.additionalFields.productNameLowercase&sort_order=asc&size=500&item.locale=en_US&tags.id=!aws-products" 6 | 7 | services_json = "./data/faqs-aws-services.json" 8 | services_dir = "./data/faqs_aws_services" 9 | 10 | import requests 11 | from bs4 import BeautifulSoup 12 | import pandas as pd 13 | #pd.set_option('display.max_rows', 500) 14 | pd.set_option('display.max_columns', 10) 15 | pd.set_option('display.width', 1000) 16 | import glob 17 | import tqdm 18 | 19 | def get_amazon_faqs(url = "https://aws.amazon.com/alexaforbusiness/faqs/"): 20 | """ 21 | crawls the frequently asked questions and answers for a given amazon services. 22 | Assumes paragraphs in a div with class=lb-rtxt and h2 below the div to define the category of faqs. 23 | :param url: 24 | :return: 25 | """ 26 | headers = { 27 | 'Access-Control-Allow-Origin': '*', 28 | 'Access-Control-Allow-Methods': 'GET', 29 | 'Access-Control-Allow-Headers': 'Content-Type', 30 | 'Access-Control-Max-Age': '3600', 31 | 'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0' 32 | } 33 | 34 | req = requests.get(url, headers) 35 | soup = BeautifulSoup(req.content, 'html.parser') 36 | faqs = [] 37 | faqs_category = [] 38 | for div in soup.find_all(class_="lb-rtxt"): 39 | cat = div.find_previous_sibling('h2') 40 | if cat is None: continue 41 | category = cat.getText() 42 | paragraphs = div.find_all("p") 43 | for p in paragraphs: 44 | if "?" 
in p.getText(): 45 | #new question, split 46 | faqs.append([]) 47 | faqs_category.append(category) 48 | if len(faqs)>0: 49 | faqs[-1].append(p) 50 | 51 | rows = [] 52 | for (cat, faq_par) in zip (faqs_category, faqs): 53 | question = faq_par[0].getText().strip() 54 | answer = "\n\n".join([p.getText() for p in faq_par[1:]]).strip() 55 | row= {"type":cat, "question":question, "answer":answer, "url":url} 56 | rows.append(row) 57 | df = pd.DataFrame(rows) 58 | return df 59 | 60 | #get_amazon_faqs() 61 | 62 | 63 | if not os.path.exists(services_json): 64 | with urllib.request.urlopen(url_services) as url: 65 | data_url = json.loads(url.read().decode()) 66 | with open(services_json, 'w', encoding='utf-8') as f: 67 | json.dump(data_url, f, ensure_ascii=False, indent=4) 68 | os.makedirs(services_dir,exist_ok=True) 69 | 70 | import time 71 | with open(services_json) as json_file: 72 | data = json.load(json_file) 73 | items = [item["item"] for item in data["items"]] 74 | for item in tqdm.tqdm(items): 75 | productUrl = item["additionalFields"]["productUrl"] 76 | 77 | re_all = re.findall(r"(.*aws.amazon.com/(.+)/)", productUrl) 78 | if len(re_all) ==0: 79 | print(f"Could not match url, skipping: {productUrl}") 80 | continue 81 | 82 | service = re_all[0][1] 83 | service_name = service.replace("/","-") 84 | 85 | faq_out = os.path.join(services_dir, service_name + "_faqs.csv") 86 | if os.path.exists(faq_out): 87 | continue 88 | 89 | faq_url = re_all[0][0] 90 | faqs = get_amazon_faqs(faq_url + "faqs") 91 | 92 | if faqs.shape[0]==0: 93 | print(f"Getting Faqs failed for: {faq_url} , {service_name}") 94 | faqs = get_amazon_faqs(faq_url + "faq") 95 | if faqs.shape[0] == 0: 96 | continue 97 | 98 | faqs["service"] = service 99 | faqs.to_csv(faq_out, index=False) 100 | print(faqs) 101 | # time.sleep(1.1) 102 | 103 | 104 | files = glob.glob(services_dir+"/*.csv") 105 | df_list = [] 106 | for f in files: 107 | df = pd.read_csv(f) 108 | df_list.append(df) 109 | 110 | df_all = pd.concat(df_list) 111 | 112 | print(df_all.shape) 113 | df_all.to_csv("aws-services-faqs-dataset.csv", index=False) -------------------------------------------------------------------------------- /ingestion/awsdocs/src/ingest-pagerank.py: -------------------------------------------------------------------------------- 1 | from bs4 import BeautifulSoup 2 | from haystack.utils import clean_wiki_text, convert_files_to_docs, fetch_archive_from_http, print_answers 3 | from haystack.nodes import FARMReader, TransformersReader 4 | import re 5 | # 6 | # # Recommended: Start Elasticsearch using Docker via the Haystack utility function 7 | # from haystack.utils import launch_es 8 | # 9 | # launch_es() 10 | 11 | # Connect to Elasticsearch 12 | from tqdm import tqdm 13 | # from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore 14 | # document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document") 15 | 16 | # # Let's first fetch some documents that we want to query 17 | # # Here: 517 Wikipedia articles for Game of Thrones 18 | # doc_dir = "data/article_txt_got" 19 | # s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip" 20 | # fetch_archive_from_http(url=s3_url, output_dir=doc_dir) 21 | # 22 | # # Convert files to dicts 23 | # # You can optionally supply a cleaning function that is applied to each doc (e.g. to remove footers) 24 | # # It must take a str as input, and return a str. 
25 | # dicts = convert_files_to_docs(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True) 26 | 27 | doc_dir_aws = "data/awsdocs/amazon-ec2-user-guide" 28 | doc_dir_aws = "data/awsdocs" 29 | dicts_aws = convert_files_to_docs(dir_path=doc_dir_aws, clean_func=clean_wiki_text, split_paragraphs=True) 30 | 31 | 32 | from pathlib import Path 33 | import markdown 34 | 35 | path = Path(doc_dir_aws) 36 | 37 | references =[] 38 | 39 | doc_to_node = {} 40 | node_count = 0 41 | 42 | doc_to_link = {} 43 | 44 | for p in tqdm(path.rglob("*.md")): 45 | #print("Document: "+p.name) 46 | with open(p) as f: 47 | contents = f.read() 48 | #print(contents) 49 | html = markdown.markdown(contents) 50 | #print(html) 51 | # create soap object 52 | soup = BeautifulSoup(html, 'html.parser') 53 | 54 | # find all the anchor tags with "href" 55 | # attribute starting with "https://" 56 | for link in soup.find_all('a', 57 | attrs={'href': re.compile("^http")}): 58 | # display the actual urls 59 | #print(link.get('href')) 60 | href = link.get('href').strip("/") 61 | htext = link.text 62 | href_suffix = href.split("/")[-1] 63 | if "#" in href_suffix: 64 | href_suffix = href_suffix.split("#")[0] 65 | href_suffix = href_suffix.replace(".html", "") 66 | source = p.stem 67 | target = href_suffix 68 | ref = {"source_md":source, "link_suffix":target, "path":str(p), "link_text":link.text, "link_href":href } 69 | 70 | if target not in doc_to_link: 71 | doc_to_link[source] = href 72 | 73 | # if source not in doc_to_node: 74 | # node_count +=1 75 | # doc_to_node[source] = node_count 76 | # if target not in doc_to_node: 77 | # node_count +=1 78 | # doc_to_node[target] = node_count 79 | 80 | #print(ref) 81 | references.append(ref) 82 | 83 | import pandas as pd 84 | df = pd.DataFrame(ref) 85 | df.to_csv("links.csv") 86 | 87 | 88 | # If your texts come from a different source (e.g. a DB), you can of course skip convert_files_to_dicts() and create the dictionaries yourself. 89 | # The default format here is: 90 | # { 91 | # 'text': "", 92 | # 'meta': {'name': "", ...} 93 | #} 94 | # (Optionally: you can also add more key-value-pairs here, that will be indexed as fields in Elasticsearch and 95 | # can be accessed later for filtering or shown in the responses of the Finder) 96 | 97 | # TODO: add awsdocs url in meta for each md file, e.g. 98 | # data/awsdocs/amazon-ec2-user-guide/doc_source/AmazonEFS.md 99 | # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonEFS.html 100 | # Challenge: How to know the mapping from the file to the public URL 101 | 102 | # Let's have a look at the first 3 entries: 103 | #print(dicts_aws[:3]) 104 | 105 | # Now, let's write the dicts containing documents to our DB. 
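# Before writing, a minimal sketch of the PageRank idea this script's name points to, assuming
# the `networkx` package is available (it is not listed in requirements.txt): turn the link
# graph collected in `references` above into a per-document score and attach it to the
# document metadata.
import networkx as nx

link_graph = nx.DiGraph()
link_graph.add_edges_from((r["source_md"], r["link_suffix"]) for r in references)
pagerank_scores = nx.pagerank(link_graph)  # maps document stem -> PageRank score

for doc in dicts_aws:
    stem = Path(doc.meta["name"]).stem
    doc.meta["pagerank"] = pagerank_scores.get(stem, 0.0)
# The enriched `dicts_aws` could then be written to the document store as in the line below.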
106 | #document_store.write_documents(dicts_aws) -------------------------------------------------------------------------------- /ingestion/awsdocs/src/ingest.py: -------------------------------------------------------------------------------- 1 | from typing import Callable, Dict, List, Optional, Union, Tuple 2 | from pathlib import Path 3 | from bs4 import BeautifulSoup 4 | from sentence_transformers import SentenceTransformer 5 | from opensearchpy import OpenSearch, RequestsHttpConnection 6 | from haystack.nodes.retriever import EmbeddingRetriever 7 | from haystack.document_stores import OpenSearchDocumentStore 8 | from urllib.parse import urlparse, unquote 9 | from os.path import splitext, basename 10 | 11 | import requests 12 | 13 | import io 14 | import gzip 15 | import tarfile 16 | import zipfile 17 | 18 | import argparse 19 | import json 20 | import sys 21 | import os 22 | 23 | from rfc3986_validator import validate_rfc3986 24 | 25 | from haystack.nodes.file_converter import BaseConverter, DocxToTextConverter, PDFToTextConverter, TextConverter, MarkdownConverter 26 | from haystack.schema import Document 27 | from haystack.utils import clean_wiki_text 28 | 29 | import logging 30 | 31 | DEFAULT_DOCS_DIR = "awsdocs/data" 32 | 33 | logger = logging.getLogger(__name__) 34 | 35 | parser = argparse.ArgumentParser( 36 | prog='AWS Semantic Search Ingestion', 37 | description='Ingests documents into Amazon OpenSearch index.', 38 | epilog='Made with ❤️ at AWS') 39 | 40 | parser.add_argument('--src', type=str, 41 | help='Directory or URL where documents are located', default=DEFAULT_DOCS_DIR) 42 | 43 | parser.add_argument('--index_name', type=str, 44 | help='Amazon OpenSearch index name', default="awsdocs") 45 | 46 | # Add markdown conversion 47 | # Licensed under Apache-2.0 license from deepset-ai haystack 48 | # https://github.com/deepset-ai/haystack/blob/ba30971d8d77827da9d2c81d82f7d02bf1917d8c/haystack/utils/preprocessing.py 49 | def convert_files_to_docs( 50 | dir_path: str, 51 | clean_func: Optional[Callable] = None, 52 | split_paragraphs: bool = False, 53 | encoding: Optional[str] = None, 54 | id_hash_keys: Optional[List[str]] = None, 55 | ) -> List[Document]: 56 | """ 57 | Convert all files(.txt, .pdf, .docx) in the sub-directories of the given path to Documents that can be written to a 58 | Document Store. 59 | 60 | :param dir_path: The path of the directory containing the Files. 61 | :param clean_func: A custom cleaning function that gets applied to each Document (input: str, output: str). 62 | :param split_paragraphs: Whether to split text by paragraph. 63 | :param encoding: Character encoding to use when converting pdf documents. 64 | :param id_hash_keys: A list of Document attribute names from which the Document ID should be hashed from. 65 | Useful for generating unique IDs even if the Document contents are identical. 66 | To ensure you don't have duplicate Documents in your Document Store if texts are 67 | not unique, you can modify the metadata and pass [`"content"`, `"meta"`] to this field. 68 | If you do this, the Document ID will be generated by using the content and the defined metadata. 
69 | """ 70 | file_paths = [p for p in Path(dir_path).glob("**/*")] 71 | allowed_suffixes = [".pdf", ".txt", ".docx", ".md"] 72 | suffix2converter: Dict[str, BaseConverter] = {} 73 | 74 | suffix2paths: Dict[str, List[Path]] = {} 75 | for path in file_paths: 76 | file_suffix = path.suffix.lower() 77 | if file_suffix in allowed_suffixes: 78 | if file_suffix not in suffix2paths: 79 | suffix2paths[file_suffix] = [] 80 | suffix2paths[file_suffix].append(path) 81 | elif not path.is_dir(): 82 | logger.warning( 83 | "Skipped file {0} as type {1} is not supported here. " 84 | "See haystack.file_converter for support of more file types".format(path, file_suffix) 85 | ) 86 | 87 | # No need to initialize converter if file type not present 88 | for file_suffix in suffix2paths.keys(): 89 | if file_suffix == ".pdf": 90 | suffix2converter[file_suffix] = PDFToTextConverter() 91 | if file_suffix == ".txt": 92 | suffix2converter[file_suffix] = TextConverter() 93 | if file_suffix == ".docx": 94 | suffix2converter[file_suffix] = DocxToTextConverter() 95 | if file_suffix == ".md": 96 | suffix2converter[file_suffix] = MarkdownConverter() 97 | 98 | documents = [] 99 | for suffix, paths in suffix2paths.items(): 100 | for path in paths: 101 | logger.info("Converting {}".format(path)) 102 | # PDFToTextConverter, TextConverter, and DocxToTextConverter return a list containing a single Document 103 | document = suffix2converter[suffix].convert( 104 | file_path=path, meta=None, encoding=encoding, id_hash_keys=id_hash_keys 105 | )[0] 106 | text = document.content 107 | 108 | if clean_func: 109 | text = clean_func(text) 110 | 111 | if split_paragraphs: 112 | for para in text.split("\n\n"): 113 | if not para.strip(): # skip empty paragraphs 114 | continue 115 | documents.append(Document(content=para, meta={"name": path.name}, id_hash_keys=id_hash_keys)) 116 | else: 117 | documents.append(Document(content=text, meta={"name": path.name}, id_hash_keys=id_hash_keys)) 118 | 119 | return documents 120 | 121 | # Enable downloading archive from Amazon S3 presigned url and other urls that contain data such as query parameters after the file extension. 122 | # Licensed under Apache-2.0 license from deepset-ai haystack 123 | # https://github.com/deepset-ai/haystack/blob/ba30971d8d77827da9d2c81d82f7d02bf1917d8c/haystack/utils/import_utils.py 124 | def fetch_archive_from_http( 125 | url: str, 126 | output_dir: str, 127 | proxies: Optional[Dict[str, str]] = None, 128 | timeout: Union[float, Tuple[float, float]] = 10.0, 129 | ) -> bool: 130 | """ 131 | Fetch an archive (zip, gz or tar.gz) from a url via http and extract content to an output directory. 132 | 133 | :param url: http address 134 | :param output_dir: local path 135 | :param proxies: proxies details as required by requests library 136 | :param timeout: How many seconds to wait for the server to send data before giving up, 137 | as a float, or a :ref:`(connect timeout, read timeout) ` tuple. 138 | Defaults to 10 seconds. 139 | :return: if anything got fetched 140 | """ 141 | # verify & prepare local directory 142 | path = Path(output_dir) 143 | if not path.exists(): 144 | path.mkdir(parents=True) 145 | 146 | is_not_empty = len(list(Path(path).rglob("*"))) > 0 147 | if is_not_empty: 148 | logger.info("Found data stored in '%s'. 
Delete this first if you really want to fetch new data.", output_dir) 149 | return False 150 | else: 151 | logger.info("Fetching from %s to '%s'", url, output_dir) 152 | 153 | parsed = urlparse(url) 154 | root, extension = splitext(parsed.path) 155 | archive_extension = extension[1:] 156 | 157 | request_data = requests.get(url, proxies=proxies, timeout=timeout) 158 | 159 | if archive_extension == "zip": 160 | zip_archive = zipfile.ZipFile(io.BytesIO(request_data.content)) 161 | zip_archive.extractall(output_dir) 162 | elif archive_extension == "gz" and not "tar.gz" in url: 163 | gzip_archive = gzip.GzipFile(fileobj=io.BytesIO(request_data.content)) 164 | file_content = gzip_archive.read() 165 | file_name = unquote(basename(root[1:])) 166 | with open(f"{output_dir}/{file_name}", "wb") as file: 167 | file.write(file_content) 168 | elif archive_extension in ["gz", "bz2", "xz"]: 169 | tar_archive = tarfile.open(fileobj=io.BytesIO(request_data.content), mode="r|*") 170 | tar_archive.extractall(output_dir) 171 | else: 172 | logger.warning( 173 | "Skipped url %s as file type is not supported here. " 174 | "See haystack documentation for support of more file types", 175 | url, 176 | ) 177 | return True 178 | 179 | host = os.environ['OPENSEARCH_HOST'] 180 | password = os.environ['OPENSEARCH_PASSWORD'] 181 | 182 | args = parser.parse_args() 183 | 184 | 185 | docs_src = args.src 186 | index_name = args.index_name 187 | #if len(sys.argv)>1: 188 | # doc_dir_aws = sys.argv[1] 189 | #try: 190 | # is_url = validators.url(docs_src) 191 | #except validators.ValidationFailure: 192 | # is_url = False 193 | 194 | if validate_rfc3986(docs_src): 195 | fetch_archive_from_http(url=docs_src, output_dir=DEFAULT_DOCS_DIR) 196 | doc_dir_aws = DEFAULT_DOCS_DIR 197 | else: 198 | doc_dir_aws = docs_src 199 | 200 | 201 | print(f"doc_dir_aws {doc_dir_aws}") 202 | print(f"index_name {index_name}") 203 | 204 | 205 | document_store = OpenSearchDocumentStore( 206 | host = host, 207 | port = 443, 208 | username = 'admin', 209 | password = password, 210 | scheme = 'https', 211 | verify_certs = False, 212 | similarity='cosine' 213 | ) 214 | 215 | dicts_aws = convert_files_to_docs(dir_path=doc_dir_aws, clean_func=clean_wiki_text, split_paragraphs=True) 216 | 217 | path = Path(doc_dir_aws) 218 | 219 | # Let's have a look at the first 3 entries: 220 | print("First 3 documents to be ingested") 221 | print(dicts_aws[:3]) 222 | 223 | print(f"Starting Ingestion, Documents: {len(dicts_aws)}") 224 | 225 | # Now, let's write the dicts containing documents to our DB. 
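# (Optional sanity check before the write below -- a minimal sketch using the document store's
# get_document_count() API; on a first run the target index may not exist yet, hence the broad
# try/except.)
try:
    print(f"Documents already in index '{index_name}': {document_store.get_document_count(index=index_name)}")
except Exception:
    print(f"Index '{index_name}' not found yet; it should be created by the write below.")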
226 | document_store.write_documents(dicts_aws, index=index_name) 227 | 228 | print(f"Finished Ingestion, Documents: {len(dicts_aws)}") 229 | 230 | print(f"Started Update Embeddings, Documents: {len(dicts_aws)}") 231 | # Calculate and store a dense embedding for each document 232 | retriever = EmbeddingRetriever( 233 | document_store=document_store, 234 | model_format = "sentence_transformers", 235 | embedding_model = "sentence-transformers/all-mpnet-base-v2" 236 | ) 237 | document_store.update_embeddings( 238 | retriever=retriever, 239 | index=index_name 240 | ) 241 | print(f"Finished Update Embeddings, Documents: {len(dicts_aws)}") -------------------------------------------------------------------------------- /ingestion/conda-env.yaml: -------------------------------------------------------------------------------- 1 | name: ingestion 2 | 3 | dependencies: 4 | - python=3.8 -------------------------------------------------------------------------------- /ingestion/main.tf: -------------------------------------------------------------------------------- 1 | provider "aws" { 2 | region = data.terraform_remote_state.infra.outputs.region 3 | default_tags { 4 | tags = { 5 | Project = "semantic-search-aws-docs" 6 | } 7 | } 8 | } 9 | 10 | provider "docker" { 11 | registry_auth { 12 | address = local.aws_ecr_url 13 | username = data.aws_ecr_authorization_token.token.user_name 14 | password = data.aws_ecr_authorization_token.token.password 15 | } 16 | } 17 | 18 | terraform { 19 | required_providers { 20 | docker = { 21 | source = "kreuzwerker/docker" 22 | version = ">=2.16.0" 23 | } 24 | } 25 | } 26 | 27 | data "aws_caller_identity" "current" {} 28 | data "aws_ecr_authorization_token" "token" {} 29 | 30 | locals { 31 | aws_ecr_url = "${data.aws_caller_identity.current.account_id}.dkr.ecr.${data.terraform_remote_state.infra.outputs.region}.amazonaws.com" 32 | build_environment = length(regexall(".*local.*", var.script_name)) > 0 ? 
"local" : "amazonlinux:2" # if the documents are local then we use the Docker container that can handle local documents 33 | } 34 | 35 | resource "aws_ecr_repository" "ingestion_job" { 36 | name = "ingestion-job" 37 | force_delete = true 38 | } 39 | 40 | resource "docker_registry_image" "ingestion_job" { 41 | name = docker_image.ingestion_job_image.name 42 | } 43 | 44 | resource "docker_image" "ingestion_job_image" { 45 | name = "${local.aws_ecr_url}/${aws_ecr_repository.ingestion_job.name}:latest" 46 | build { 47 | context = "${path.cwd}/" 48 | dockerfile = "Dockerfile" 49 | build_args = { 50 | DOCS_SRC = var.docs_src 51 | SCRIPT_NAME = var.script_name 52 | BUILD_ENV = local.build_environment 53 | } 54 | } 55 | } 56 | 57 | 58 | data "terraform_remote_state" "infra" { 59 | backend = "s3" 60 | config = { 61 | bucket = var.infra_tf_state_s3_bucket 62 | key = var.infra_tf_state_s3_key 63 | region = var.infra_region 64 | } 65 | } 66 | 67 | 68 | resource "aws_ecs_task_definition" "ingestion_job" { 69 | family = "ingestion-job" 70 | container_definitions = <7VxZd9o4FP41PDbHO/DImnYmXc4wbadPOcJWQI2xPLIgkF8/ki1jWxJbwIFkSHNO0dViWfd+d9MNDbs3W94SEE8/4wCGDcsIlg2737As03M89h+nrDJK2xaECUFBRjIKwgg9QzEzp85RABNBy0gU45CiuEr0cRRBn1ZogBD8VB32gMOgQojBBCqEkQ9ClfoTBXSaUVuuUdA/QjSZ5k82DdEzA/lgQUimIMBPJZI9aNg9gjHNPs2WPRjyw6uey3BD73pjBEZ0nwnP0SKxhz/un5et7pfvbjjsjb98EKssQDgXL/w9gURsmK7yU0geIfX56xgNu4vnNEQR7K1PnBMfcER7OMQknWCzf0O+je6EgADBoi/CEeTDURjKw/uMnlCCH6E0OADJFAbiQQtIKGL8uQNjGH7DCaIIR6xvjCnFs9KATogmvIPimFGBaPlsL+wF7e6UzkLWNsXeheSZVt4WL88fCZI4e9EHtOT76MYY8VUGC7ZYIhZh7I35hNlywpFwA54S52aepM9SWZWfO9srXJZIgnW3EM8gJSs2RPS229kMASOr3bxxM8pTIZa5VE5LEpnTgADCZL10ISvsgxCXA0THUkSn83PECL0QzwNFgtIjS7fgdtkv21TPaLisp8dbN5YrEeR2s0ow1RZfo0qQ280qwZSXN6Xnm/IGSwSlVVnekJ5vlDbo9jdhaANWSpL6NEUUjmLg81N9YnK2h/QyzUgBexYRa8jCm40JQxAnaLyeRaA/JwlawL9gki1ubBLyCcHzON3+Jz+Fotp7zz7e+1ww7kFIFZyXFUZZM4gD2I7oED6kK7JTQdHkLm317W06qaJPToBNK9f5q9zqGQo0PUeDTdszjwfnh1X3T0Buv/q/jRYZ90eP5Ef3g2Ur6FQgCQNm6UQTEzrFExyBcFBQu4x1UbA+qWLMHeYcSKXpN6R0JcQPzCmuyqt6uNtUSYLnxIdb3ip3AACZQLptnDhr/oZbGUhgCCgT8qrbcGpmqEa2YXkpDAK0YB8n/GNnBp4ZergRz/rGJO/KKezhpQm6NeI4ZEBJLaJl3GEQHLcewxiIfO4P6Ce8spfg9K1mp/P+vARQsO0+ZEy7H+fnfiiC9tdZjqSybEvjT3ganeWdwJ/QosRRUDLopf7E3ffR34O/FGnTGArFqrgdr9fyysJhbuS9zE9JVo30p0aOmFaVJY5qRGwdQ9bW5+QcaSocGVECwSxElJGHhB0QjISKYSe05NBMNc+QKWhAYWMd+hzGtfVRvwWuNatca6tcWzO2zLU18eRca2lwxAkoSijXKkmVYWQeRcxvYsQRBISpb8vofPv07tFmG5fGN8t7ly6bsafPZrmX5bMZu7lR8Xe0Aa42yNUFutpgVw14K8PSEFTzBJmoozVVoqkOy6NWlaij6UJ0ebapmW1KszcHyPu6j6xv2GwNDKfU10cspBVeYMRxofqXfcPtmU2dZntIfy7RydQ6lARmuMzicRbLJ7rI/EFY6fqUrGW4uVe59jNbip6tK2ulx7XqZl5x/YZw3Xbc/tA6DNduxza67v8G18w+Jmx7SerO3SeQLJBfJ8pd++JiSdPVoFxJb3yKJjChkEcwfexzv/joBEyexPkawyj3p1+8mh/OE1pkX3avIOkxxj9aFUNFRGVJnqEgyPxInvYFRT64qgY5krnrKFLD5qacjCZqqEsKnbYkhW2dFJq6NGxtYmhdjc2bNjZXJ3K3sWF686w5MR2gndoArbMrV0BfAf2eAG3VmXazLywetNQk9wuybiVewSjo8LojzvEQ+4+cFI7Tds74VJYAofk44RmxmUMU5uuo2VXH7XqOkuJjDCKrf8qNX7xx4+bN/rLc2V+VW98gQewc09qAw7OtO5N/ubrcfWHr7Jn8K7tyeQZ374SgWO4b18klG9OSZCt7LTGqEC9loiVPzF5TmZjK5fodjghqdPfIsqgW0uezMDBBflU6q7KjsttUObutbmw3Yw/n6zb4H8lqR0r62+0Xst5xpYVkj+N0otCN/3W/Dj4+fJmO494iiBIv9varKBiJyN8yApT4mGmf7KTp9Mggt1zmZXwG8Z7h6DX1U7vxzkqsZmld2AbTrMH7ZmvdlvO3jtlW7HVdmR2t5FsayT9KB8Iloqn55E5p1vyVm3L2ubCevLEqNfS2U9J0O7Xrlsunc2nJdflartyKfN7BetJsSeLTfFU9uU8V3Ass5m4mn83C2fKBey/knLSQ7bRelXNqoH2tLr5WF7/R6uKjjPD6T2EEEpuGpZpgXRbMalo1GWFdidHllurtOP3N6ucyS/W0+20rHMkut7JK4D/w+M2U6J2eW7ZZ9WFftdRLH77tE8nXX+pVQ+poq2e0zdCXfWL9mYkjKqcc9ANtvTQc61m1qpg3vdd1ZfP3v149XK8e3mn2Qi5Iq8EUuMYZC9K2qKsrrt8qrq9ZySML0k6P8vMWpOlRrqsEuhakXU5B2uml8MwFaXoxvNavvGljc3UiDyp
IO0vmp66CND2g1cSPAugKd7USuE0hh5IYKJr6QDmRWfsExyGe4ORmguh0Pt6LcTvq0d2qi+/lS5ztztI4RbrnjDVGe+R2NFebWwT2VQp7LKMqB01TWmLfuzC3VV2o5djyhWjNyZ/1d5JsdR/7+CkKsy99yC7LMidyg3MHZhyO0ThJc7SEQdYybhH9mGPwXfhtOzRFDkTBWFebGm6qqsI+/GaFNYvv+srkovjGNHvwHw== --------------------------------------------------------------------------------
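Once the ingestion task has stopped (see the `ingest-wait-for-completion` guide above), a quick way to confirm that the index is populated and that dense retrieval works is to run a small query with the same Haystack components that `ingestion/awsdocs/src/ingest.py` uses. The sketch below reuses that script's connection settings; the `OPENSEARCH_HOST` and `OPENSEARCH_PASSWORD` environment variables, the `awsdocs` index name, and the sample query are assumptions you may need to adjust.

```python
import os

from haystack.document_stores import OpenSearchDocumentStore
from haystack.nodes.retriever import EmbeddingRetriever

# Connect with the same settings as ingestion/awsdocs/src/ingest.py.
document_store = OpenSearchDocumentStore(
    host=os.environ["OPENSEARCH_HOST"],
    port=443,
    username="admin",
    password=os.environ["OPENSEARCH_PASSWORD"],
    scheme="https",
    verify_certs=False,
    similarity="cosine",
    index="awsdocs",
)

# Same embedding model as the ingestion job, so query and document vectors match.
retriever = EmbeddingRetriever(
    document_store=document_store,
    model_format="sentence_transformers",
    embedding_model="sentence-transformers/all-mpnet-base-v2",
)

# Retrieve the top passages for a sample question and print where they came from.
for doc in retriever.retrieve(query="How do I resize an Amazon EBS volume?", top_k=3, index="awsdocs"):
    print(doc.meta.get("name"), "->", doc.content[:120])
```

If this returns sensible passages, the search frontend reachable via `terraform output loadbalancer_url` should work against the same index.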