├── requirements.txt
├── example.env
├── LICENSE
├── README.md
├── ingest.py
├── chat_with_website_ollama.py
├── chat_with_website_openai.py
├── main.py
└── .gitignore

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
langchain
langchain-community
langchainhub
streamlit
bs4
chromadb
tiktoken
unstructured
chainlit
libmagic
python-dotenv
openai

--------------------------------------------------------------------------------
/example.env:
--------------------------------------------------------------------------------
# OPENAI_API_KEY=sk-******
LANGCHAIN_TRACING_V2=true
LANGCHAIN_ENDPOINT="https://api.smith.langchain.com"
LANGCHAIN_API_KEY="ls__***"
LANGCHAIN_PROJECT="your_langsmith_project"

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2023 Sudarshan Koirala

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# chat-with-website
Simple Streamlit and Chainlit apps for chatting with the content of a website URL.

### Chat with your documents 🚀
- [OpenAI model](https://platform.openai.com/docs/models) as the large language model
- [Ollama](https://ollama.ai/) with the `mistral` model as the local large language model
- [LangChain](https://python.langchain.com/en/latest/modules/models/llms/integrations/huggingface_hub.html) as the LLM framework
- [Streamlit](https://streamlit.io/) and [Chainlit](https://docs.chainlit.io/) for the chat UI

## System Requirements

You must have Python 3.9 or later installed. Earlier versions of Python may not be compatible.

---

## Steps to Replicate

1. Fork this repository and create a codespace in GitHub as shown in the YouTube video, or clone it locally.
```
git clone https://github.com/sudarshan-koirala/chat-with-website.git
cd chat-with-website
```

2. Copy example.env to .env with `cp example.env .env` and set the OpenAI API key as shown below. Get an OpenAI API key from this [URL](https://platform.openai.com/account/api-keys). You need to create an account on the OpenAI website if you haven't already.
```
OPENAI_API_KEY=your_openai_api_key
```

For LangSmith tracing (optional), take the environment variables from the [LangSmith](https://smith.langchain.com/) website.

3. Create a virtualenv and activate it:
```
python3 -m venv .venv && source .venv/bin/activate
```

4. Run the following command in the terminal to install the necessary Python packages:
```
pip install -r requirements.txt
```

5. Run one of the following commands in your terminal to start the Streamlit chat UI:
```
streamlit run chat_with_website_openai.py
streamlit run chat_with_website_ollama.py
```

6. For Chainlit, use the following commands in your terminal:
```
python3 ingest.py     # ingest the website content into the vector database
chainlit run main.py  # start the chainlit ui
```
--------------------------------------------------------------------------------
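A quick way to confirm step 2 worked is to load the `.env` file and check the key by hand. A minimal sketch, assuming `python-dotenv` from requirements.txt is installed (`check_env.py` is a hypothetical helper, not part of the repository):

```python
# check_env.py -- hypothetical helper, not part of the repository.
import os

from dotenv import load_dotenv

# Read key/value pairs from .env into the process environment.
load_dotenv()

# Fail fast if the OpenAI key was not picked up.
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY missing -- check your .env file"
print("OPENAI_API_KEY loaded.")
```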
/ingest.py:
--------------------------------------------------------------------------------
import os
import warnings

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import UnstructuredURLLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

warnings.simplefilter("ignore")

ABS_PATH: str = os.path.dirname(os.path.abspath(__file__))
DB_DIR: str = os.path.join(ABS_PATH, "dburl")


# Create vector database
def create_vector_database():
    """
    Creates a vector database using document loaders and embeddings.

    This function loads the given URLs, splits the loaded documents into
    chunks, transforms the chunks into embeddings using OllamaEmbeddings,
    and finally persists the embeddings into a Chroma vector database.
    """
    # Initialize the URL loader
    urls = ["https://docs.gpt4all.io/", "https://ollama.com/library/llama2"]

    url_loader = UnstructuredURLLoader(urls=urls, show_progress_bar=True)
    loaded_documents = url_loader.load()

    # Split the loaded documents into overlapping chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunked_documents = text_splitter.split_documents(loaded_documents)

    # Initialize Ollama embeddings (requires a local Ollama server with the mistral model pulled)
    ollama_embeddings = OllamaEmbeddings(model="mistral")

    # Create and persist a Chroma vector database from the chunked documents
    vector_database = Chroma.from_documents(
        documents=chunked_documents,
        embedding=ollama_embeddings,
        persist_directory=DB_DIR,
    )

    vector_database.persist()


if __name__ == "__main__":
    create_vector_database()
--------------------------------------------------------------------------------
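After `ingest.py` has populated the `dburl` directory, the persisted database can be queried on its own. A minimal sketch, assuming a running Ollama server with the `mistral` model pulled (`query_db.py` is a hypothetical helper, not part of the repository):

```python
# query_db.py -- hypothetical example, not part of the repository.
import os

from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

DB_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "dburl")

# Reopen the persisted Chroma database with the same embedding function used to build it.
vector_database = Chroma(
    persist_directory=DB_DIR,
    embedding_function=OllamaEmbeddings(model="mistral"),
)

# Return the 3 chunks most similar to the query.
docs = vector_database.similarity_search("What is GPT4All?", k=3)
for doc in docs:
    print(doc.metadata.get("source"), doc.page_content[:100])
```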
/chat_with_website_ollama.py:
--------------------------------------------------------------------------------
import os

import streamlit as st
from dotenv import load_dotenv
from langchain.chains import RetrievalQA
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import Chroma

# Load environment variables from .env file (optional, only needed for LangSmith tracing)
load_dotenv()

LANGCHAIN_TRACING_V2 = os.getenv("LANGCHAIN_TRACING_V2")
LANGCHAIN_ENDPOINT = os.getenv("LANGCHAIN_ENDPOINT")
LANGCHAIN_API_KEY = os.getenv("LANGCHAIN_API_KEY")
LANGCHAIN_PROJECT = os.getenv("LANGCHAIN_PROJECT")


def main():
    # Set the title and subtitle of the app
    st.title("🦜🔗 Chat With Website")
    st.subheader("Input your website URL, ask questions, and receive answers directly from the website.")

    url = st.text_input("Insert the website URL")

    prompt = st.text_input("Ask a question (query/prompt)")
    if st.button("Submit Query", type="primary"):
        ABS_PATH: str = os.path.dirname(os.path.abspath(__file__))
        DB_DIR: str = os.path.join(ABS_PATH, "db")

        # Load data from the specified URL
        loader = WebBaseLoader(url)
        data = loader.load()

        # Split the loaded data into overlapping chunks
        text_splitter = CharacterTextSplitter(separator="\n", chunk_size=1000, chunk_overlap=40)
        docs = text_splitter.split_documents(data)

        # Create Ollama embeddings (requires a local Ollama server with the mistral model pulled)
        ollama_embeddings = OllamaEmbeddings(model="mistral")

        # Create a Chroma vector database from the documents
        vectordb = Chroma.from_documents(
            documents=docs,
            embedding=ollama_embeddings,
            persist_directory=DB_DIR,
        )
        vectordb.persist()

        # Create a retriever that returns the 3 most relevant chunks
        retriever = vectordb.as_retriever(search_kwargs={"k": 3})

        # Use the mistral LLM served by Ollama
        llm = Ollama(model="mistral")

        # Create a RetrievalQA chain from the model and retriever
        qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)

        # Run the prompt and render the response
        response = qa(prompt)
        st.write(response)


if __name__ == "__main__":
    main()
--------------------------------------------------------------------------------
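The `RetrievalQA` chain above returns a dict (hence `st.write(response)` shows both the query and the result). A rough sketch of the same retrieval step outside Streamlit, assuming the `db` directory was populated by a previous run of the app:

```python
# Hypothetical standalone run of the same pipeline, outside Streamlit.
from langchain.chains import RetrievalQA
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import Chroma

vectordb = Chroma(persist_directory="db", embedding_function=OllamaEmbeddings(model="mistral"))
retriever = vectordb.as_retriever(search_kwargs={"k": 3})

# Inspect what the retriever hands to the LLM for a given question.
for doc in retriever.get_relevant_documents("What is this site about?"):
    print(doc.page_content[:80])

qa = RetrievalQA.from_chain_type(
    llm=Ollama(model="mistral"), chain_type="stuff", retriever=retriever
)
print(qa("What is this site about?")["result"])  # the chain returns a dict with a "result" key
```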
/chat_with_website_openai.py:
--------------------------------------------------------------------------------
import os

import streamlit as st
from dotenv import load_dotenv
from langchain.chains import RetrievalQA
from langchain.prompts.chat import (ChatPromptTemplate,
                                    HumanMessagePromptTemplate,
                                    SystemMessagePromptTemplate)
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.chat_models import ChatOpenAI
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Load environment variables from .env file
load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# The "stuff" chain fills {context} with the retrieved chunks and {question} with the user query.
system_template = """Use the following pieces of context to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}
"""

messages = [
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template("{question}"),
]
prompt = ChatPromptTemplate.from_messages(messages)
chain_type_kwargs = {"prompt": prompt}


def main():
    # Set the title and subtitle of the app
    st.title("🦜🔗 Chat With Website")
    st.subheader("Input your website URL, ask questions, and receive answers directly from the website.")

    url = st.text_input("Insert the website URL")

    user_query = st.text_input("Ask a question (query/prompt)")
    if st.button("Submit Query", type="primary"):
        ABS_PATH: str = os.path.dirname(os.path.abspath(__file__))
        DB_DIR: str = os.path.join(ABS_PATH, "db")

        # Load data from the specified URL
        loader = WebBaseLoader(url)
        data = loader.load()

        # Split the loaded data into overlapping chunks
        text_splitter = CharacterTextSplitter(separator="\n", chunk_size=500, chunk_overlap=40)
        docs = text_splitter.split_documents(data)

        # Create OpenAI embeddings (reads OPENAI_API_KEY from the environment)
        openai_embeddings = OpenAIEmbeddings()

        # Create a Chroma vector database from the documents
        vectordb = Chroma.from_documents(
            documents=docs,
            embedding=openai_embeddings,
            persist_directory=DB_DIR,
        )
        vectordb.persist()

        # Create a retriever that returns the 3 most relevant chunks
        retriever = vectordb.as_retriever(search_kwargs={"k": 3})

        # Use a ChatOpenAI model
        llm = ChatOpenAI(model_name="gpt-3.5-turbo")

        # Create a RetrievalQA chain from the model and retriever, using the custom prompt
        qa = RetrievalQA.from_chain_type(
            llm=llm,
            chain_type="stuff",
            retriever=retriever,
            chain_type_kwargs=chain_type_kwargs,
        )

        # Run the query and render the response
        response = qa(user_query)
        st.write(response)


if __name__ == "__main__":
    main()
--------------------------------------------------------------------------------
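To see what the model actually receives, the prompt template above can be formatted by hand. A small illustration with made-up context and question values:

```python
# Hypothetical illustration of how the "stuff" chain fills the template above.
from langchain.prompts.chat import (ChatPromptTemplate,
                                    HumanMessagePromptTemplate,
                                    SystemMessagePromptTemplate)

system_template = """Use the following pieces of context to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}
"""

prompt = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template("{question}"),
])

# The chain concatenates the retrieved chunks into {context} and the user query into {question}.
messages = prompt.format_messages(
    context="Chunk 1: ...\n\nChunk 2: ...",
    question="What does this website offer?",
)
for m in messages:
    print(f"[{m.type}] {m.content}")
```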
/main.py:
--------------------------------------------------------------------------------
# import required dependencies
# https://docs.chainlit.io/integrations/langchain
import os
from typing import List

import chainlit as cl
from langchain import hub
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains import RetrievalQA
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import Chroma

ABS_PATH: str = os.path.dirname(os.path.abspath(__file__))
DB_DIR: str = os.path.join(ABS_PATH, "dburl")


# Pull the RAG prompt tuned for mistral from the LangChain hub (requires network access)
rag_prompt_mistral = hub.pull("rlm/rag-prompt-mistral")


def load_model():
    # Ollama LLM that streams tokens to stdout as they are generated
    llm = Ollama(
        model="mistral",
        verbose=True,
        callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
    )
    return llm


def retrieval_qa_chain(llm, vectorstore):
    qa_chain = RetrievalQA.from_chain_type(
        llm,
        retriever=vectorstore.as_retriever(),
        chain_type_kwargs={"prompt": rag_prompt_mistral},
        return_source_documents=True,
    )
    return qa_chain


def qa_bot():
    llm = load_model()
    # Reopen the database persisted by ingest.py with the same embedding function
    vectorstore = Chroma(
        persist_directory=DB_DIR, embedding_function=OllamaEmbeddings(model="mistral")
    )

    qa = retrieval_qa_chain(llm, vectorstore)
    return qa


@cl.on_chat_start
async def start():
    """
    Initializes the bot when a new chat starts.

    This asynchronous function creates a new instance of the retrieval QA bot,
    sends a welcome message, and stores the bot instance in the user's session.
    """
    chain = qa_bot()
    welcome_message = cl.Message(content="Starting the bot...")
    await welcome_message.send()
    welcome_message.content = (
        "Hi, welcome to Chat With Documents using Ollama (mistral model) and LangChain."
    )
    await welcome_message.update()
    cl.user_session.set("chain", chain)


@cl.on_message
async def main(message):
    """
    Processes incoming chat messages.

    This asynchronous function retrieves the QA bot instance from the user's session,
    sets up a callback handler for the bot's response, and executes the bot's
    call method with the given message and callback. The bot's answer and source
    documents are then extracted from the response.
    """
    chain = cl.user_session.get("chain")
    cb = cl.AsyncLangchainCallbackHandler()
    cb.answer_reached = True
    res = await chain.acall(message.content, callbacks=[cb])
    answer = res["result"]
    source_documents = res["source_documents"]

    text_elements: List[cl.Text] = []

    if source_documents:
        for source_idx, source_doc in enumerate(source_documents):
            source_name = f"source_{source_idx}"
            # Create the text element referenced in the message
            text_elements.append(
                cl.Text(content=source_doc.page_content, name=source_name)
            )
        source_names = [text_el.name for text_el in text_elements]

        if source_names:
            answer += f"\nSources: {', '.join(source_names)}"
        else:
            answer += "\nNo sources found"

    await cl.Message(content=answer, elements=text_elements).send()
--------------------------------------------------------------------------------
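Note that `hub.pull("rlm/rag-prompt-mistral")` needs network access to the LangChain hub. If that is unavailable, a locally defined prompt with the same `context`/`question` input variables can stand in; the template text below is only an approximation of the hub prompt, not its exact wording:

```python
# Hypothetical offline fallback for hub.pull("rlm/rag-prompt-mistral").
from langchain.prompts import PromptTemplate

rag_prompt_mistral = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "[INST] You are an assistant for question-answering tasks. "
        "Use the following pieces of retrieved context to answer the question. "
        "If you don't know the answer, just say that you don't know.\n"
        "Question: {question}\nContext: {context}\nAnswer: [/INST]"
    ),
)
```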
/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
.chainlit/
chainlit.md

--------------------------------------------------------------------------------