├── .gitignore
├── README.md
├── cloudformation
    └── opensearch-index.yaml
├── config.py
├── data
    ├── opensearch-documentation-2.7.json
    └── opensearch-website.json
├── notebooks
    ├── opensearch_indexing_pipeline.ipynb
    └── rag_pipeline.ipynb
├── opensearch_indexing_pipeline.py
├── rag_pipeline.py
└── requirements.txt


/.gitignore:
--------------------------------------------------------------------------------
1 | __pycache__


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # Haystack Retrieval-Augmented Generative QA Pipelines with SageMaker JumpStart
  2 | This repo is a showcase of how you can use models deployed on [SageMaker JumpStart](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-jumpstart.html) in your Haystack Retrieval Augmented Generative AI pipelines.
  3 | 
  4 | **Instructions:**  
  5 | - [Starting an OpenSearch service](#starting-an-opensearch-service)  
  6 | - [Indexing Documents to OpenSearch](#the-indexing-pipeline-write-documents-to-opensearch)   
  7 | - [The RAG Pipeline](#the-rag-pipeline)   
  8 | 
  9 | **The Repo Structure**  
 10 | This repository contains 2 runnable Python scripts for indexing and the retrieval augmented pipeline respectively,  with instructions on how to run them below:
 11 | 
 12 | `opensearch_indexing_pipeline.py`
 13 | 
 14 | `rag_pipeline.py`
 15 | 
 16 |  We've also included notebooks for them both in `notebooks/` which you can optionally use to create and run each pipeline step by step.
 17 | 
 18 | **Prerequisites**  
 19 | To run the Haystack pipelines and use the models from SageMaker in this repository, you need an AWS account, and we suggest setting up [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) on your machine.
 20 | 
 21 | 
 22 | ## The Data
 23 | This showcase includes some documents we've crawled from the OpenSearch website and documentation pages. 
 24 | You can index these into your own `OpenSearchDocumentStore` using `opensearch_indexing_pipeline.py` or `notebooks/opensearch_indexing_pipeline.ipynb`.
 25 | 
 26 | ## The Model
 27 | For this demo, we deployed the `falcon-40b-instruct` model on SageMaker Jumpstart. Once deployed, you can use your own credentials in the `PromptNode`.
 28 | To deploy a model on JumpStart, simply log in to your AWS account and go to the Studio on SageMaker. 
 29 | Navigate to JumpStart and deploy `falcon-40b-instruct`. This may take a few minutes:
 30 | <img width="949" alt="image" src="https://github.com/deepset-ai/haystack-sagemaker/assets/15802862/b7a1adee-eb9c-4258-b3e0-bf5942f9c960">
 31 | 
 32 | ## Starting an OpenSearch service
 33 | ### Option 1: OpenSearch service on AWS
 34 | **Requirements:** An AWS account and AWS CLI
 35 | 
 36 | You can use the provided CloudFormation template `opsearch-index.yaml` to deploy an OpenSearch service on AWS.
 37 | 
 38 | Set the `--stack-name` and `OSPassword` to your preferred values and run the following.
 39 | You may also change the default `OSDomainName` and `OSUsername` values, set to `opensearch-haystack-domain` and `admin` respectively, in `opensearch-index.yaml`
 40 | 
 41 | ```bash
 42 | aws cloudformation create-stack --stack-name HaystackOpensearch --template-body file://cloudformation/opensearch-index.yaml --parameters ParameterKey=InstanceType,ParameterValue=r5.large.search ParameterKey=InstanceCount,ParameterValue=3 ParameterKey=OSPassword,ParameterValue=Password123!
 43 | ```
 44 | You can then retrieve your OpenSearch host required to [Write documents](#writing-documents) by running:
 45 | ```bash
 46 | aws cloudformation describe-stacks --stack-name HaystackOpensearch --query "Stacks[0].Outputs[?OutputKey=='OpenSearchEndpoint'].OutputValue" --output text
 47 | ```
 48 | ### Option 2: Local OpenSearch service
 49 | **Requirements:** Docker
 50 | 
 51 | Another option is to have a local OpenSearch service. For this, you may simply run:
 52 | ```python
 53 | from haystack.utils import launch_opensearch
 54 | 
 55 | launch_opensearch()
 56 | ```
 57 | This will start an OpenSearch service on `localhost:9200`
 58 | 
 59 | ## The Indexing Pipeline: Write Documents to OpenSearch
 60 | To run the scripts and notebooks provided here, first clone the repo and install the requirements.
 61 | ```bash
 62 | git clone git@github.com:deepset-ai/haystack-sagemaker.git
 63 | cd haystack-sagemaker
 64 | pip install -r requirements.txt
 65 | ```
 66 | 
 67 | ### Writing documents
 68 | You can use a Haystack indexing pipeline to prepare and write documents to an `OpenSearchDocumentStore`.
 69 | 1. Set your environment variables:
 70 | ```bash
 71 | export OPENSEARCH_HOST='your_opensearch_host'
 72 | export OPENSEARCH_PORT='your_opensearch_port'
 73 | export OPENSEARCH_USERNAME='your_opensearch_username'
 74 | export OPENSEARCH_PASSWORD='your_opensearch_password'
 75 | ```
 76 | 2. Use the indexing pipeline to write the preprocessed documents to your OpenSearch index:
 77 | #### Option 1:
 78 | For this demo, we've prepared documents which have been crawled from the OpenSearch documentation and website. As an example of how you may use an S3 bucket, we've also made them available [here](https://haystack-public-demo-files.s3.eu-central-1.amazonaws.com/haystack-sagemaker-demo/opensearch-documentation-2.7.json) and [here](https://haystack-public-demo-files.s3.eu-central-1.amazonaws.com/haystack-sagemaker-demo/opensearch-website.json)
 79 | 
 80 | Run `python opensearch_indexing_pipeline.py --fetch-files` to fetch these 2 files from S3 or modify the source code in `opensearch_indexing_pipeline.py` to fetch your own files from an S3 bucket. This will fetch the specified files from the S3 bucket, and put them in `data/`. The script will then preprocess and prepare `Documents` from these files, and write them to your `OpenSearchDocumentStore`.
 81 | 
 82 | #### Option 2:
 83 | Run `python opensearch_indexing_pipeline.py`
 84 | 
 85 | This will write the same files, already available in `data/`, to your `OpenSearchDocumentStore`
 86 | 
 87 | 
 88 | ## The RAG Pipeline
 89 | 
 90 | An indexing pipeline prepares and writes documents to a `DocumentStore` so that they are in a format which is useable by your choice of NLP pipeline and language models.
 91 | 
 92 | On the other hand, a query pipeline is any combination of Haystack nodes that may consume a user query and result in a response.
 93 | Here, you will find a retrieval augmented question answering pipeine in `rag_pipeline.py`.
 94 | 
 95 | ```bash
 96 | export SAGEMAKER_MODEL_ENDPOINT=your_falcon_40b_instruc_endpoint
 97 | export AWS_PROFILE_NAME=your_aws_profile
 98 | export AWS_REGION_NAME=your_aws_region
 99 | ```
100 | 
101 | Running the following will start a retrieval augmented QA pipeline with the prompt defined in the `PromptTemplate`. Feel free to modify this template or even use one of our prompts from the [PromptHub](https://prompthub.deepset.ai) to experiment with different instructions.
102 | 
103 | ```bash
104 | python rag_pipeline.py
105 | ```
106 | 
107 | Then, ask some questions about OpenSearch 🥳 👇
108 | 
109 | https://github.com/deepset-ai/haystack-sagemaker/assets/15802862/40563962-2d75-415b-bac4-b25eaa5798e5
110 | 
111 | 


--------------------------------------------------------------------------------
/cloudformation/opensearch-index.yaml:
--------------------------------------------------------------------------------
 1 | ---
 2 | AWSTemplateFormatVersion: '2010-09-09'
 3 | Description: AWS CloudFormation Template for OpenSearch Service
 4 | 
 5 | Parameters:
 6 |   InstanceType:
 7 |     Description: Instance Type for OpenSearch Cluster
 8 |     Type: String
 9 |     Default: m5.large.search
10 | 
11 |   InstanceCount:
12 |     Description: Number of Instances for OpenSearch Cluster
13 |     Type: Number
14 |     Default: 2
15 |   
16 |   OSUsername:
17 |     Description: Username of the OS Admin
18 |     Type: String
19 |     Default: admin
20 |   
21 |   OSPassword:
22 |     Description: Password of the OS Admin
23 |     Type: String
24 |   
25 |   OSDomainName:
26 |     Description: Domain name for OpenSearch
27 |     Type: String
28 |     Default: opensearch-haystack-domain 
29 |     
30 | Outputs:
31 |   OpenSearchEndpoint:
32 |     Description: OpenSearch Endpoint URL
33 |     Value: !Sub ${OpenSearchDomain.DomainEndpoint}
34 | 
35 | Resources:
36 |   OpenSearchDomain:
37 |     Type: 'AWS::OpenSearchService::Domain'
38 |     Properties:
39 |       DomainName: !Ref OSDomainName
40 |       EngineVersion: 'OpenSearch_2.5'
41 |       ClusterConfig:
42 |         InstanceType: !Ref InstanceType
43 |         InstanceCount: !Ref InstanceCount
44 |       AdvancedSecurityOptions:
45 |         Enabled: true
46 |         InternalUserDatabaseEnabled: true
47 |         MasterUserOptions: 
48 |           MasterUserName: !Ref OSUsername
49 |           MasterUserPassword: !Ref OSPassword
50 |       EncryptionAtRestOptions: 
51 |           Enabled: true
52 |       NodeToNodeEncryptionOptions:
53 |           Enabled: true
54 |       DomainEndpointOptions:
55 |           EnforceHTTPS: true
56 |       EBSOptions:
57 |         EBSEnabled: true
58 |         VolumeType: gp2
59 |         VolumeSize: 10
60 |       AccessPolicies:
61 |         Version: '2012-10-17'
62 |         Statement:
63 |           - Effect: Allow
64 |             Principal:
65 |               AWS: '*'
66 |             Action: 'es:*'
67 |             Resource: '*'


--------------------------------------------------------------------------------
/config.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | 
 3 | os.environ["HAYSTACK_PROGRESS_BARS"]='0'
 4 | OPENSEARCH_HOST = os.getenv('OPENSEARCH_HOST')
 5 | OPENSEARCH_PORT = os.getenv('OPENSEARCH_PORT')
 6 | OPENSEARCH_USERNAME = os.getenv('OPENSEARCH_USERNAME')
 7 | OPENSEARCH_PASSWORD = os.getenv('OPENSEARCH_PASSWORD')
 8 | SAGEMAKER_MODEL_ENDPOINT = os.getenv('SAGEMAKER_MODEL_ENDPOINT')
 9 | AWS_PROFILE_NAME = os.getenv('AWS_PROFILE_NAME')
10 | AWS_REGION_NAME = os.getenv('AWS_REGION_NAME')


--------------------------------------------------------------------------------
/notebooks/opensearch_indexing_pipeline.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "attachments": {},
  5 |    "cell_type": "markdown",
  6 |    "metadata": {},
  7 |    "source": [
  8 |     "## Launch an OpenSearch DocumentStore\n",
  9 |     "\n",
 10 |     "Below, we create an `OpenSearchDocumentStore` which by default connects to an OpenSearch service.\n",
 11 |     "\n",
 12 |     "Chage the host, port, username and password below to run the following indexing pipeline on your own OpenSearch service in AWS.\n",
 13 |     "\n",
 14 |     "If you would like to use an OpenSearch servide on AWS, first follow the [instructions](/README.md#option-1-opensearch-service-on-aws) on how to deploy an OpenSearch instance with the provided `opensearch-index.yaml` CloudFormation template.\n",
 15 |     "\n",
 16 |     "Another option is to simply launch an OpenSearch instance locally (for which you need docker). If you would prefer to do this, first run:\n",
 17 |     "\n",
 18 |     "```python\n",
 19 |     "from haystack.utils import launch_opensearch\n",
 20 |     "\n",
 21 |     "launch_opensearch()\n",
 22 |     "```"
 23 |    ]
 24 |   },
 25 |   {
 26 |    "cell_type": "code",
 27 |    "execution_count": null,
 28 |    "metadata": {},
 29 |    "outputs": [],
 30 |    "source": [
 31 |     "from haystack.document_stores import OpenSearchDocumentStore\n",
 32 |     "\n",
 33 |     "doc_store = OpenSearchDocumentStore(host='your_opensearch_host', port=443, username= \"admin\", password=\"admin\", embedding_dim=384)"
 34 |    ]
 35 |   },
 36 |   {
 37 |    "attachments": {},
 38 |    "cell_type": "markdown",
 39 |    "metadata": {},
 40 |    "source": [
 41 |     "## Indexing Pipeline to write Documents to OpenSearch\n",
 42 |     "\n",
 43 |     "An indexing pipeline allows you to prepare your files and write them to a database that you would like to use with your NLP application. In this example, we're using OpenSearch as our vector database. We define an indexing pipeline that converts JSON files (that have been crawled from the OpenSearch documentation and website), and creates embeddings for them using the `sentence-transformers/all-MiniLM-L12-v2` model, which is a very small embedding model."
 44 |    ]
 45 |   },
 46 |   {
 47 |    "cell_type": "code",
 48 |    "execution_count": 8,
 49 |    "metadata": {},
 50 |    "outputs": [],
 51 |    "source": [
 52 |     "from haystack.nodes import JsonConverter\n",
 53 |     "\n",
 54 |     "converter = JsonConverter()"
 55 |    ]
 56 |   },
 57 |   {
 58 |    "cell_type": "code",
 59 |    "execution_count": 9,
 60 |    "metadata": {},
 61 |    "outputs": [],
 62 |    "source": [
 63 |     "from haystack.nodes import PreProcessor\n",
 64 |     "\n",
 65 |     "preprocessor = PreProcessor (\n",
 66 |     "        clean_empty_lines=True, \n",
 67 |     "        split_by='word',\n",
 68 |     "        split_respect_sentence_boundary=True,\n",
 69 |     "        split_length=80,\n",
 70 |     "        split_overlap=20\n",
 71 |     "    )"
 72 |    ]
 73 |   },
 74 |   {
 75 |    "cell_type": "code",
 76 |    "execution_count": null,
 77 |    "metadata": {},
 78 |    "outputs": [],
 79 |    "source": [
 80 |     "from haystack.nodes import EmbeddingRetriever\n",
 81 |     "\n",
 82 |     "retriever = EmbeddingRetriever(document_store=doc_store, embedding_model=\"sentence-transformers/all-MiniLM-L12-v2\", devices=[\"mps\"], top_k=5)"
 83 |    ]
 84 |   },
 85 |   {
 86 |    "cell_type": "code",
 87 |    "execution_count": 11,
 88 |    "metadata": {},
 89 |    "outputs": [],
 90 |    "source": [
 91 |     "from haystack import Pipeline\n",
 92 |     "\n",
 93 |     "indexing_pipeline = Pipeline()\n",
 94 |     "indexing_pipeline.add_node(component=converter, name=\"Converter\", inputs=[\"File\"])\n",
 95 |     "indexing_pipeline.add_node(component=preprocessor, name=\"Preprocessor\", inputs=[\"Converter\"])\n",
 96 |     "indexing_pipeline.add_node(component=retriever, name=\"Retriever\", inputs=[\"Preprocessor\"])\n",
 97 |     "indexing_pipeline.add_node(component=doc_store, name=\"DocumentStore\", inputs=[\"Retriever\"])"
 98 |    ]
 99 |   },
100 |   {
101 |    "cell_type": "code",
102 |    "execution_count": null,
103 |    "metadata": {},
104 |    "outputs": [],
105 |    "source": [
106 |     "indexing_pipeline.run(file_paths=[\"data/opensearch-documentation-2.7.json\", \"data/opensearch-website.json\"])"
107 |    ]
108 |   }
109 |  ],
110 |  "metadata": {
111 |   "kernelspec": {
112 |    "display_name": "sagemeker",
113 |    "language": "python",
114 |    "name": "python3"
115 |   },
116 |   "language_info": {
117 |    "codemirror_mode": {
118 |     "name": "ipython",
119 |     "version": 3
120 |    },
121 |    "file_extension": ".py",
122 |    "mimetype": "text/x-python",
123 |    "name": "python",
124 |    "nbconvert_exporter": "python",
125 |    "pygments_lexer": "ipython3",
126 |    "version": "3.11.3"
127 |   },
128 |   "orig_nbformat": 4
129 |  },
130 |  "nbformat": 4,
131 |  "nbformat_minor": 2
132 | }
133 | 


--------------------------------------------------------------------------------
/notebooks/rag_pipeline.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "attachments": {},
  5 |    "cell_type": "markdown",
  6 |    "metadata": {},
  7 |    "source": [
  8 |     "## Initialize your OpenSearchDocumentStore\n",
  9 |     "\n",
 10 |     "You can use `opensearch_indexing_pipeline.py` or the `notebooks/opensearch_indexing_pipeline.ipynb` notebook for a step by step guide to index the example files to your own `OpenSearchDocumentStore`. You may do this locally, or deploy it on AWS. Depending on your setup, once you have a running DocumentStore, connect to it in the cell below by providing the right credentials to `host`, `port`, `username` and `password`."
 11 |    ]
 12 |   },
 13 |   {
 14 |    "cell_type": "code",
 15 |    "execution_count": null,
 16 |    "metadata": {},
 17 |    "outputs": [],
 18 |    "source": [
 19 |     "from haystack.document_stores import OpenSearchDocumentStore\n",
 20 |     "\n",
 21 |     "doc_store = OpenSearchDocumentStore(host='your_opensearch_host', port=443, username= \"admin\", password=\"admin\", embedding_dim=384)"
 22 |    ]
 23 |   },
 24 |   {
 25 |    "attachments": {},
 26 |    "cell_type": "markdown",
 27 |    "metadata": {},
 28 |    "source": [
 29 |     "## Initialize a PromptNode with your SageMaker Endpoint Credentials\n",
 30 |     "\n",
 31 |     "Once you've deployed your model on SageMaker provide your own credentials in `model_name_or_path`, `profile_name` and `region_name`"
 32 |    ]
 33 |   },
 34 |   {
 35 |    "cell_type": "code",
 36 |    "execution_count": 6,
 37 |    "metadata": {},
 38 |    "outputs": [],
 39 |    "source": [
 40 |     "from haystack.nodes import AnswerParser, EmbeddingRetriever, PromptNode, PromptTemplate\n",
 41 |     "\n",
 42 |     "question_answering = PromptTemplate(prompt=\"Given the context please answer the question. If the answer is not contained within the context below, say 'I don't know'.\\n\" \n",
 43 |     "                                            \"Context: {join(documents)};\\n Question: {query};\\n Answer: \", output_parser=AnswerParser(reference_pattern=r\"Document\\[(\\d+)\\]\"))\n",
 44 |     "\n",
 45 |     "gen_qa_with_references = PromptNode(default_prompt_template=question_answering,  model_name_or_path=\"jumpstart-dft-falcon-40b-instruct\", model_kwargs={\"aws_profile_name\": \"default\", \"aws_region_name\": \"us-east-1\"})\n"
 46 |    ]
 47 |   },
 48 |   {
 49 |    "cell_type": "code",
 50 |    "execution_count": null,
 51 |    "metadata": {},
 52 |    "outputs": [],
 53 |    "source": [
 54 |     "retriever = EmbeddingRetriever(document_store=doc_store, embedding_model=\"sentence-transformers/all-MiniLM-L12-v2\", devices=[\"mps\"], top_k=5)"
 55 |    ]
 56 |   },
 57 |   {
 58 |    "attachments": {},
 59 |    "cell_type": "markdown",
 60 |    "metadata": {},
 61 |    "source": [
 62 |     "## Create a retrieval-augmented QA pipeline"
 63 |    ]
 64 |   },
 65 |   {
 66 |    "cell_type": "code",
 67 |    "execution_count": 8,
 68 |    "metadata": {},
 69 |    "outputs": [],
 70 |    "source": [
 71 |     "from haystack import Pipeline\n",
 72 |     "\n",
 73 |     "pipe = Pipeline()\n",
 74 |     "pipe.add_node(component=retriever, name=\"Retriever\", inputs=['Query'])\n",
 75 |     "pipe.add_node(component=gen_qa_with_references, name=\"GenQAWithRefPromptNode\", inputs=[\"Retriever\"])"
 76 |    ]
 77 |   },
 78 |   {
 79 |    "cell_type": "code",
 80 |    "execution_count": null,
 81 |    "metadata": {},
 82 |    "outputs": [],
 83 |    "source": [
 84 |     "from haystack.utils import print_answers\n",
 85 |     "\n",
 86 |     "result = pipe.run(\"How do I install the opensearch cli?\", params={\"Retriever\":{\"top_k\": 3}})"
 87 |    ]
 88 |   },
 89 |   {
 90 |    "cell_type": "code",
 91 |    "execution_count": null,
 92 |    "metadata": {},
 93 |    "outputs": [],
 94 |    "source": [
 95 |     "print_answers(results=result, details=\"minimum\")"
 96 |    ]
 97 |   }
 98 |  ],
 99 |  "metadata": {
100 |   "kernelspec": {
101 |    "display_name": "sagemeker",
102 |    "language": "python",
103 |    "name": "python3"
104 |   },
105 |   "language_info": {
106 |    "codemirror_mode": {
107 |     "name": "ipython",
108 |     "version": 3
109 |    },
110 |    "file_extension": ".py",
111 |    "mimetype": "text/x-python",
112 |    "name": "python",
113 |    "nbconvert_exporter": "python",
114 |    "pygments_lexer": "ipython3",
115 |    "version": "3.11.3"
116 |   },
117 |   "orig_nbformat": 4
118 |  },
119 |  "nbformat": 4,
120 |  "nbformat_minor": 2
121 | }
122 | 


--------------------------------------------------------------------------------
/opensearch_indexing_pipeline.py:
--------------------------------------------------------------------------------
 1 | import boto3
 2 | import os
 3 | import sys
 4 | from haystack import Pipeline
 5 | from haystack.document_stores import OpenSearchDocumentStore
 6 | from haystack.nodes import EmbeddingRetriever, JsonConverter, PreProcessor
 7 | from config import OPENSEARCH_HOST, OPENSEARCH_PORT, OPENSEARCH_USERNAME, OPENSEARCH_PASSWORD
 8 | 
 9 | def fetch_files():
10 |     s3 = boto3.client('s3')
11 |     s3.download_file('haystack-public-demo-files', 'haystack-sagemaker-demo/opensearch-documentation-2.7.json', 'data/opensearch-documentation-2.7.json')
12 |     s3.download_file('haystack-public-demo-files', 'haystack-sagemaker-demo/opensearch-website.json', 'data/opensearch-website.json')
13 | 
14 | def write_documents():
15 |     doc_store = OpenSearchDocumentStore(host=OPENSEARCH_HOST, port=OPENSEARCH_PORT, username=OPENSEARCH_USERNAME, password=OPENSEARCH_PASSWORD, embedding_dim=384)
16 |     
17 |     converter = JsonConverter()
18 |     preprocessor = PreProcessor (
19 |             clean_empty_lines=True, 
20 |             split_by='word',
21 |             split_respect_sentence_boundary=True,
22 |             split_length=80,
23 |             split_overlap=20
24 |         )
25 |     retriever = EmbeddingRetriever(document_store=doc_store, embedding_model="sentence-transformers/all-MiniLM-L12-v2", devices=["mps"], top_k=5)
26 | 
27 |     indexing_pipeline = Pipeline()
28 |     indexing_pipeline.add_node(component=converter, name="Converter", inputs=["File"])
29 |     indexing_pipeline.add_node(component=preprocessor, name="Preprocessor", inputs=["Converter"])
30 |     indexing_pipeline.add_node(component=retriever, name="Retriever", inputs=["Preprocessor"])
31 |     indexing_pipeline.add_node(component=doc_store, name="DocumentStore", inputs=["Retriever"])
32 |     
33 |     files_to_index = ["data/" + f for f in os.listdir("data")]
34 |     indexing_pipeline.run(file_paths=files_to_index)
35 | 
36 | if __name__ == "__main__":
37 |     if len(sys.argv) > 1 and sys.argv[1] == "--fetch-files" :
38 |         fetch_files()
39 |     write_documents()


--------------------------------------------------------------------------------
/rag_pipeline.py:
--------------------------------------------------------------------------------
 1 | import warnings
 2 | 
 3 | from haystack import Pipeline
 4 | from haystack.document_stores import OpenSearchDocumentStore
 5 | from haystack.nodes import AnswerParser, EmbeddingRetriever, PromptNode, PromptTemplate
 6 | from haystack.utils import print_answers
 7 | from config import OPENSEARCH_HOST, OPENSEARCH_PORT, OPENSEARCH_USERNAME, OPENSEARCH_PASSWORD, SAGEMAKER_MODEL_ENDPOINT, AWS_PROFILE_NAME, AWS_REGION_NAME
 8 | 
 9 | warnings.filterwarnings("ignore")
10 | 
11 | def start_query_pipeline():
12 |     doc_store = OpenSearchDocumentStore(host=OPENSEARCH_HOST, port=OPENSEARCH_PORT, username=OPENSEARCH_USERNAME, password=OPENSEARCH_PASSWORD, embedding_dim=384)
13 | 
14 |     question_answering = PromptTemplate(prompt="Given the context please answer the question. If the answer is not contained within the context below, say 'I don't know'.\n" 
15 |                                                 "Context: {join(documents)};\n Question: {query};\n Answer: ", output_parser=AnswerParser(reference_pattern=r"Document\[(\d+)\]"))
16 | 
17 |     gen_qa_prompt = PromptNode(default_prompt_template=question_answering,  max_length=200, model_name_or_path=SAGEMAKER_MODEL_ENDPOINT, model_kwargs={"aws_profile_name": AWS_PROFILE_NAME, 
18 |                                                                                                                                        "aws_region_name": AWS_REGION_NAME})
19 |     retriever = EmbeddingRetriever(document_store=doc_store, embedding_model="sentence-transformers/all-MiniLM-L12-v2", devices=["mps"])
20 | 
21 |     pipe = Pipeline()
22 |     pipe.add_node(component=retriever, name="Retriever", inputs=['Query'])
23 |     pipe.add_node(component=gen_qa_prompt, name="GenQAWPromptNode", inputs=["Retriever"])
24 |     return pipe
25 | 
26 | if __name__ == "__main__":
27 |     print('Starting up the query pipeline...')
28 |     query_pipeline = start_query_pipeline()
29 | 
30 |     while True:
31 |         query = input("Ask a question: ")
32 |         result = query_pipeline.run(query, params={"Retriever":{"top_k": 5}})
33 |         print(result['answers'][0].answer)


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | farm-haystack[aws,preprocessing,opensearch,inference] @ git+https://github.com/deepset-ai/haystack.git
2 | 


--------------------------------------------------------------------------------