├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── Home.py
├── LICENSE
├── README.md
├── images
│   ├── Architecture.png
│   └── Output.png
└── requirements.txt

--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
## Code of Conduct
This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
opensource-codeofconduct@amazon.com with any additional questions or comments.

--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
# Contributing Guidelines

Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
documentation, we greatly value feedback and contributions from our community.

Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
information to effectively respond to your bug report or contribution.


## Reporting Bugs/Feature Requests

We welcome you to use the GitHub issue tracker to report bugs or suggest features.

When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already
reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:

* A reproducible test case or series of steps
* The version of our code being used
* Any modifications you've made relevant to the bug
* Anything unusual about your environment or deployment


## Contributing via Pull Requests
Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:

1. You are working against the latest source on the *main* branch.
2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
3. You open an issue to discuss any significant work - we would hate for your time to be wasted.

To send us a pull request, please:

1. Fork the repository.
2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
3. Ensure local tests pass.
4. Commit to your fork using clear commit messages.
5. Send us a pull request, answering any default questions in the pull request interface.
6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.

GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
[creating a pull request](https://help.github.com/articles/creating-a-pull-request/).


## Finding contributions to work on
Looking at the existing issues is a great way to find something to contribute on.
As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start.


## Code of Conduct
This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
opensource-codeofconduct@amazon.com with any additional questions or comments.


## Security issue notifications
If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue.


## Licensing

See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.

--------------------------------------------------------------------------------
/Home.py:
--------------------------------------------------------------------------------
import streamlit as st
from langchain.vectorstores.pgvector import PGVector
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms.bedrock import Bedrock
from langchain.embeddings import BedrockEmbeddings
from langchain.document_loaders.recursive_url_loader import RecursiveUrlLoader
from langchain.memory import PostgresChatMessageHistory
from fake_useragent import UserAgent
from bs4 import BeautifulSoup as Soup
import os
import boto3
from botocore.exceptions import ClientError
import tempfile
import time
import random
import hashlib
import json
import secrets

# Replace sm_key_name and region_name with the AWS Secrets Manager secret (and its Region) where your credentials are stored
def get_secret():
    sm_key_name = "enter-your-secret-key-name"
    region_name = "us-west-2"
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )
    try:
        get_secret_value_response = client.get_secret_value(
            SecretId=sm_key_name
        )
    except ClientError as e:
        print(e)
        raise  # without the secret we cannot build the database connection string
    secret = get_secret_value_response['SecretString']
    return secret

# Generate a unique session id used to key the chat history
def generate_session_id():
    t = int(time.time() * 1000)
    r = secrets.randbelow(1000000)
    return hashlib.md5(bytes(str(t) + str(r), 'utf-8'), usedforsecurity=False).hexdigest()

# Split the scraped page content into overlapping chunks before embedding
def get_text_chunks(text):
    text_splitter = RecursiveCharacterTextSplitter(
        separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""],
        chunk_size=512,
        chunk_overlap=103,
        length_function=len
    )
    chunks = text_splitter.split_text(text)
    return chunks

# Replace the model_id and region_name if you are trying to call a different Bedrock model
def get_vectorstore(text_chunks):
    embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1", region_name="us-west-2")
    try:
        if text_chunks is None:
            return PGVector(
                connection_string=CONNECTION_STRING,
                embedding_function=embeddings
            )
        return PGVector.from_texts(texts=text_chunks, embedding=embeddings, connection_string=CONNECTION_STRING)
    except Exception as e:
        print(e)
        print(text_chunks)
        raise  # a missing vector store cannot be used to build the retrieval chain


def get_conversation_chain(vectorstore):
    llm = Bedrock(model_id="anthropic.claude-instant-v1", region_name="us-west-2")
    # Persist the chat history in the same Aurora PostgreSQL database, keyed by a fresh session id
    message_history = PostgresChatMessageHistory(
        connection_string="postgresql://" + secret["username"] + ":" + secret["password"] + "@" + secret["host"] + "/genai",
        session_id=generate_session_id())
    memory = ConversationBufferMemory(memory_key="chat_history", chat_memory=message_history, return_source_documents=True, return_messages=True)
    conversation_chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=vectorstore.as_retriever(),
        memory=memory
    )
    return conversation_chain


def handle_userinput(user_question):
    bot_template = "BOT : {0}"
    user_template = "USER : {0}"
    try:
        response = st.session_state.conversation({'question': user_question})
        print(response)
    except ValueError as e:
        st.write(e)
        st.write("Sorry, please ask again in a different way.")
        return

    st.session_state.chat_history = response['chat_history']
    st.write(user_template.replace("{0}", response['question']))
    st.write(bot_template.replace("{0}", response['answer']))
    for i, message in enumerate(st.session_state.chat_history):
        if i % 2 == 0:
            st.write(user_template.replace(
                "{0}", message.content))
        else:
            st.write(bot_template.replace(
                "{0}", message.content))


def main():
    st.title("Build a QnA bot for your website using RAG")
    web_input = st.text_input("Enter a web link and click on 'Process'")
    depth = [1, 2, 3, 4]
    max_depth = st.selectbox("Select the max depth", depth)
    exclude_dir = st.text_input("Enter the subdirectories to exclude (e.g. news, weather, learn)")
    exclude_list = []
    if len(exclude_dir) > 0:
        exclude = exclude_dir.split(",")
        exclude_list = [web_input + item.strip() for item in exclude]
    if st.button("Process"):
        with st.spinner("Processing"):
            header_template = {}
            header_template["User-Agent"] = UserAgent().random
            loader = RecursiveUrlLoader(url=web_input, headers=header_template, exclude_dirs=exclude_list, max_depth=max_depth, extractor=lambda x: Soup(x, "html.parser").text)
            docs = loader.load()
            # Chunk and embed every crawled page; PGVector.from_texts writes the embeddings to Aurora PostgreSQL
            for i in docs:
                text_chunks = get_text_chunks(i.page_content)
                #source = [i.metadata]
                vectorstore = get_vectorstore(text_chunks)
            st.session_state.conversation = get_conversation_chain(vectorstore)

    if "conversation" not in st.session_state:
        st.session_state.conversation = get_conversation_chain(get_vectorstore(None))
    if "chat_history" not in st.session_state:
        st.session_state.chat_history = None

    user_question = st.text_input("Ask a question about your documents:")
    if user_question:
        handle_userinput(user_question)

# Enter the appropriate DB name in the connection string below
if __name__ == '__main__':
    secret = json.loads(get_secret())
    CONNECTION_STRING = PGVector.connection_string_from_db_params(
        driver="psycopg2",
        user=secret["username"],
        password=secret["password"],
        host=secret["host"],
        port=5432,
        database="genai"
    )
    main()

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT No Attribution

Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
## Build a QnA bot for your website using RAG, Amazon Aurora PostgreSQL and Amazon Bedrock

### Use Case

In a digital age inundated with information, website visitors seek instant access to relevant content without the hassle of navigating through extensive web pages. A Q&A bot addresses this need by efficiently retrieving precise information based on natural language queries. This is particularly pertinent for websites with a wealth of data, FAQs, case studies, and other content, streamlining user interaction and information retrieval.

This code gives website owners a comprehensive guide to building a natural language Q&A bot with Retrieval Augmented Generation (RAG) and Amazon Bedrock models, helping them enhance user engagement and streamline information retrieval.

The approach uses the LangChain [RecursiveUrlLoader](https://python.langchain.com/docs/integrations/document_loaders/recursive_url) to recursively fetch all the content of a webpage down to the maximum depth we provide, converts the data into embeddings with the Amazon Bedrock Titan Embeddings model, stores the embeddings in Aurora PostgreSQL using the pgvector extension, and answers the user's question with an Amazon Bedrock third-party model (Anthropic Claude; the sample code uses Claude Instant v1).

### Architecture

The architecture is as follows:

![Architecture Diagram](images/Architecture.png)

1. Companies possess a repository of knowledge resources in webpages, such as FAQ docs. These pages can link to multiple webpages with more content. We can scrape the data using the [RecursiveUrlLoader](https://python.langchain.com/docs/integrations/document_loaders/recursive_url) module in LangChain.

2. Utilizing the Titan Embeddings model from Amazon Bedrock, these resources are transformed into vector representations, ensuring their compatibility for advanced processing.

3. The generated vector embeddings are then stored within Amazon Aurora PostgreSQL, utilizing the specialized pgvector capabilities for efficient storage and retrieval.

4. A user initiates the process by posing a question, for instance, "How can AWS support vector databases?".

5. The user's question is translated into its vector embedding, facilitating subsequent computational comparisons.

6. A semantic search operation is executed on the Amazon Aurora PostgreSQL database, employing the vectorized representations to identify knowledge resources with relevant information.

7. The extracted passages from the search are fed into the Anthropic Claude model provided by Amazon Bedrock.

8. Leveraging the enhanced context and knowledge derived from the semantic search, the model generates a comprehensive response.

9. The generated response is subsequently delivered back to the user, providing them with a meaningful and informed answer to their initial question.

10. To retain context and support future interactions, the chat history is stored in the same Amazon Aurora PostgreSQL database, ensuring a seamless continuation of the conversation with the user.

### Execution

1. Make sure your Amazon Aurora PostgreSQL database is set up with the pgvector extension (see the verification sketch at the end of this README).

2. Make sure you have access to the Amazon Bedrock models you are trying to use.

3. The EC2 instance where you are running the code needs access to the Amazon Aurora PostgreSQL database, Amazon Bedrock and AWS Secrets Manager via an IAM role.

4. The code runs on Python 3.10. Activate a virtual environment and install all the requirements:
```
python3.10 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt
```

5. Modify the code variables in Home.py with the appropriate details for your environment:

sm_key_name -> the AWS Secrets Manager secret that stores your Amazon Aurora PostgreSQL connection details

database -> the database name where the embeddings will be stored

model_id -> the Bedrock model you will be using

6. To run the Streamlit application:
```
streamlit run Home.py
```

### Application
![Application](images/Output.png)

#### Inputs:
1. Web Link: the website you would like to scrape

2. Max Depth: the maximum recursion depth you would like to scrape

3. Exclude: subdirectories, if any, you would like to exclude

#### Process Button:
1. Stores the embeddings in Amazon Aurora PostgreSQL

#### Output:
1. Enter a question to get the response

### Security

See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.

### License

This library is licensed under the MIT-0 License. See the LICENSE file.
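
### Appendix: Verifying prerequisites (optional)

The sketch below is a minimal, hypothetical helper, not part of the sample app, for checking the first two Execution prerequisites before launching Home.py. It assumes the same Secrets Manager secret layout that Home.py expects (`username`, `password`, `host`), the `genai` database, and the us-west-2 Region; the secret name is the same placeholder used in Home.py. Note that `list_foundation_models` only confirms which models are offered in the Region; access to Anthropic models may still need to be requested in the Amazon Bedrock console.

```python
# check_prerequisites.py (hypothetical helper, not included in this repository)
import json

import boto3
import psycopg2

SECRET_NAME = "enter-your-secret-key-name"  # placeholder, same as in Home.py
REGION = "us-west-2"

secret = json.loads(
    boto3.client("secretsmanager", region_name=REGION)
    .get_secret_value(SecretId=SECRET_NAME)["SecretString"]
)

# 1. Enable/verify the pgvector extension in the target database ("genai" in Home.py).
#    The database user needs sufficient privileges (the Aurora master user has them).
conn = psycopg2.connect(
    host=secret["host"], user=secret["username"],
    password=secret["password"], dbname="genai"
)
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute("SELECT extversion FROM pg_extension WHERE extname = 'vector';")
    print("pgvector version:", cur.fetchone()[0])

# 2. Check that the models used by Home.py are offered in this Region.
bedrock = boto3.client("bedrock", region_name=REGION)
model_ids = {m["modelId"] for m in bedrock.list_foundation_models()["modelSummaries"]}
for model_id in ("amazon.titan-embed-text-v1", "anthropic.claude-instant-v1"):
    print(model_id, "listed" if model_id in model_ids else "NOT listed")
```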

--------------------------------------------------------------------------------
/images/Architecture.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/rag-qna-bot-for-your-website-using-langchain-amazon-aurorapg-and-amazon-bedrock/4c805c0f227b7b78cda1fc7eceecef3fe75ea0d6/images/Architecture.png

--------------------------------------------------------------------------------
/images/Output.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/rag-qna-bot-for-your-website-using-langchain-amazon-aurorapg-and-amazon-bedrock/4c805c0f227b7b78cda1fc7eceecef3fe75ea0d6/images/Output.png

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
SQLAlchemy==2.0.19
streamlit==1.12.0
streamlit-chat==0.1.1
langchain
boto3==1.28.61
botocore==1.31.61
altair<5
pydantic==1.10.9
psycopg==3.1.10
psycopg-binary==3.1.10
psycopg2-binary==2.9.6
pgvector==0.1.8
fake-useragent==1.3.0
beautifulsoup4==4.12.2
--------------------------------------------------------------------------------
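
As a closing illustration of the semantic search described in step 6 of the README's architecture, the sketch below shows roughly what the retrieval looks like at the SQL level once Home.py has populated the vector store. It is a hypothetical example rather than part of the sample app: it assumes LangChain's default PGVector table and column names (`langchain_pg_embedding`, `embedding`, `document`), which can differ across LangChain versions, and uses placeholder connection details.

```python
# similarity_search_demo.py (hypothetical, not included in this repository)
import psycopg2
from langchain.embeddings import BedrockEmbeddings

# Embed the question with the same Titan model Home.py uses for the documents.
embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1", region_name="us-west-2")
query_vector = embeddings.embed_query("How can AWS support vector databases?")

conn = psycopg2.connect(host="<aurora-endpoint>", user="<user>", password="<password>", dbname="genai")
with conn, conn.cursor() as cur:
    # "<=>" is pgvector's cosine-distance operator; a smaller distance means more similar.
    cur.execute(
        """
        SELECT document, embedding <=> %s::vector AS distance
        FROM langchain_pg_embedding
        ORDER BY distance
        LIMIT 4;
        """,
        ("[" + ",".join(str(x) for x in query_vector) + "]",),
    )
    for document, distance in cur.fetchall():
        print(round(distance, 4), document[:120])
```

This is essentially the lookup that `vectorstore.as_retriever()` performs for the `ConversationalRetrievalChain` in Home.py before the retrieved chunks are passed to Claude.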