├── README.md
├── constants.py
├── db
│   ├── chroma-collections.parquet
│   ├── chroma-embeddings.parquet
│   └── index
│       ├── id_to_uuid_b9718edd-fd19-45ce-8e90-58831cc9eefc.pkl
│       ├── index_b9718edd-fd19-45ce-8e90-58831cc9eefc.bin
│       ├── index_metadata_b9718edd-fd19-45ce-8e90-58831cc9eefc.pkl
│       └── uuid_to_id_b9718edd-fd19-45ce-8e90-58831cc9eefc.pkl
├── ingest.py
├── own private gpt.PNG
├── owngpt.py
├── requirements.txt
└── source_documents
    └── chatGpt.txt


/README.md:
--------------------------------------------------------------------------------
 1 | # OwnGPT
 2 | Create your own ChatGPT over your documents with a Streamlit UI, running GPT models on your own device. No data leaves your device; it is 100% private.
 3 | 
 4 | 
 5 | This project was inspired by the original privateGPT (https://github.com/imartinez/privateGPT). Most of the description here is inspired by the original privateGPT. 
 6 | 
 7 | Ask questions to your documents without an internet connection, using the power of LLMs. 100% private, no data leaves your execution environment at any point. You can ingest documents and ask questions without an internet connection!
 8 | 
 9 | Built with LLM:[ggml-gpt4all-j-v1.3-groovy.bin](https://gpt4all.io/models/ggml-gpt4all-j-v1.3-groovy.bin). 
10 | 
11 | 
12 | # Environment Setup
13 | In order to set your environment up to run the code here, first install all requirements:
14 | 
15 | ```shell
16 | pip install -r requirements.txt
17 | ```
18 | 
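The scripts read their settings from environment variables, which `python-dotenv` loads from a `.env` file: `PERSIST_DIRECTORY`, `EMBEDDINGS_MODEL_NAME`, `MODEL_TYPE`, `MODEL_PATH`, and `MODEL_N_CTX` (`SOURCE_DIRECTORY` falls back to `source_documents`). The repository listing does not include a `.env` file, so a pre-flight check along the following lines may help. This is only a sketch: the helper name `missing_settings` is hypothetical and not part of the project.

```python
import os
from typing import List, Mapping

# Variables that constants.py, ingest.py and owngpt.py read with
# os.environ.get() and no fallback value.
REQUIRED_SETTINGS = (
    "PERSIST_DIRECTORY",
    "EMBEDDINGS_MODEL_NAME",
    "MODEL_TYPE",
    "MODEL_PATH",
    "MODEL_N_CTX",
)


def missing_settings(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("All settings present.")
```

`ingest.py` only needs the first two variables; `owngpt.py` reads all five.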
19 | 
20 | ## Instructions for ingesting your own dataset
21 | 
22 | Put any and all your files into the `source_documents` directory
23 | 
24 | The supported extensions are:
25 | 
26 |    - `.csv`: CSV,
27 |    - `.docx`: Word Document,
28 |    - `.enex`: EverNote,
29 |    - `.eml`: Email,
30 |    - `.epub`: EPub,
31 |    - `.html`: HTML File,
32 |    - `.md`: Markdown,
33 |    - `.msg`: Outlook Message,
34 |    - `.odt`: Open Document Text,
35 |    - `.pdf`: Portable Document Format (PDF),
36 |    - `.pptx` : PowerPoint Document,
37 |    - `.txt`: Text file (UTF-8),
38 | 
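`ingest.py` only looks for files whose extension appears in its `LOADER_MAPPING`, so anything else in `source_documents` is left out of the index without any message. The sketch below lists such files before you ingest. The extension set is copied from `LOADER_MAPPING` in `ingest.py`, and the helper name `skipped_files` is hypothetical, not part of the project.

```python
from pathlib import Path
from typing import List

# Extensions copied from LOADER_MAPPING in ingest.py.
SUPPORTED = {
    ".csv", ".doc", ".docx", ".enex", ".eml", ".epub", ".html",
    ".md", ".odt", ".pdf", ".ppt", ".pptx", ".txt",
}


def skipped_files(source_dir: str) -> List[Path]:
    """Return files under source_dir whose extension is not in SUPPORTED."""
    return sorted(
        path
        for path in Path(source_dir).rglob("*")
        if path.is_file() and path.suffix not in SUPPORTED
    )


if __name__ == "__main__":
    for path in skipped_files("source_documents"):
        print(f"Not ingested: {path}")
```
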
39 | Run the following command to ingest all the data.
40 | 
41 | ```shell
42 | python ingest.py
43 | ```
44 | 
45 | It will create an index containing the local vectorstore. This can take a while, depending on the size of your documents.
46 | You can ingest as many documents as you want, and all will be accumulated in the local embeddings database. 
47 | If you want to start from an empty database, delete the persisted database folder (`db` in this repository, which contains the `index`).
48 | 
49 | Note: The first run takes longer because it has to download the embedding model. On subsequent runs, no data leaves your local environment and the script can run without an internet connection.
50 | 
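Before embedding, `ingest.py` splits each document into chunks of at most 500 characters with a 50-character overlap. The sketch below illustrates the overlap idea with a plain sliding window. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, which this hypothetical `split_with_overlap` helper does not.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> List[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "".join(chr(65 + i % 26) for i in range(1000))
    for chunk in split_with_overlap(sample):
        print(len(chunk), chunk[:10])
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.
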
51 | ## Ask questions to your documents, locally, using the Streamlit UI
52 | To ask a question, start the Streamlit app.
53 | 
54 | Use the following command:
55 | ```
56 | streamlit run owngpt.py --server.address localhost
57 | ```
58 | This command launches the Streamlit app and binds its web server to `localhost`, so the UI is reachable only from your own machine.
59 | 
60 | Then wait until the web app is running at "http://localhost:8501/" on your local system.
61 | 
62 | Enter your query in the text box and press Enter. Wait while the LLM consumes the prompt and prepares the answer. Your query and the answer are then shown below the text box, as in the figure below.
63 | 
64 | 
65 | ![own private gpt](https://github.com/aviggithub/owngpt/assets/46967951/938e588e-5f7d-48e5-b63f-0db925071886)
66 | 
67 | 
68 | # How does it work?
69 | By selecting the right local models and using the power of `LangChain`, you can run the entire pipeline locally, without any data leaving your environment, and with reasonable performance.
70 | 
71 | - `ingest.py` uses `LangChain` tools to parse the documents and create embeddings locally using `HuggingFaceEmbeddings`. It then stores the result in a local vector database using the `Chroma` vector store. 
72 | - `streamlit run owngpt.py` uses a local LLM (ggml-gpt4all-j-v1.3-groovy.bin) to understand questions and create answers. The context for the answers is extracted from the local vector store using a similarity search to locate the right piece of context from the docs.
73 | - You can replace this local LLM with another model by setting `MODEL_TYPE` (`GPT4All` or `LlamaCpp`) and `MODEL_PATH`. Make sure the model file you select is in a format that the chosen backend can load.
74 | 
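The similarity search in the second bullet can be pictured as ranking stored chunk vectors by how close they are to the query vector. The sketch below uses cosine similarity over toy two-dimensional vectors. Both are illustrative assumptions: real embeddings have hundreds of dimensions, the metric `Chroma` actually applies depends on its configuration, and `top_k` is a hypothetical helper rather than a project or library function.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float],
          chunks: Sequence[Tuple[str, Sequence[float]]],
          k: int = 2) -> List[str]:
    """Return the texts of the k chunks whose vectors are most similar to query."""
    ranked = sorted(chunks, key=lambda item: cosine_similarity(query, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


if __name__ == "__main__":
    # Made-up vectors standing in for embedded chunks.
    store = [("pricing", [1.0, 0.0]), ("training", [0.0, 1.0]), ("mixed", [1.0, 1.0])]
    print(top_k([1.0, 0.0], store))
```

The selected chunks are what the `stuff` chain in `owngpt.py` places into the prompt as context for the LLM.
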
75 | # System Requirements
76 | 
77 | ## Python Version
78 | To use this software, you must have Python 3.10.0 or later installed. Earlier versions of Python will not run it, because `owngpt.py` uses the `match` statement introduced in Python 3.10.
79 | 
80 | ## C++ Compiler
81 | If you encounter an error while building a wheel during the `pip install` process, you may need to install a C++ compiler on your computer.
82 | 
83 | ### For Windows 10/11
84 | To install a C++ compiler on Windows 10/11, follow these steps:
85 | 
86 | 1. Install Visual Studio 2022.
87 | 2. Make sure the following components are selected:
88 |    * Universal Windows Platform development
89 |    * C++ CMake tools for Windows
90 | 3. Download the MinGW installer from the [MinGW website](https://sourceforge.net/projects/mingw/).
91 | 4. Run the installer and select the "gcc" component.
92 | 
93 | 
94 | 
95 | # Disclaimer
96 | This is a test project to validate the feasibility of a fully local, private solution for question answering using LLMs and vector embeddings. It is not production ready, and it is not meant to be used in production. ggml-gpt4all-j-v1.3-groovy.bin is based on the GPT4All model, so the original GPT4All license applies to it. 
97 | 
98 | 


--------------------------------------------------------------------------------
/constants.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | from dotenv import load_dotenv
 3 | from chromadb.config import Settings
 4 | 
 5 | load_dotenv()
 6 | 
 7 | # Define the folder for storing database
 8 | PERSIST_DIRECTORY = os.environ.get('PERSIST_DIRECTORY')
 9 | 
10 | # Define the Chroma settings
11 | CHROMA_SETTINGS = Settings(
12 |         chroma_db_impl='duckdb+parquet',
13 |         persist_directory=PERSIST_DIRECTORY,
14 |         anonymized_telemetry=False
15 | )


--------------------------------------------------------------------------------
/db/chroma-collections.parquet:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aviggithub/OwnGPT/05a4e089ce2e82774b47f8c3a55830083c13dfa1/db/chroma-collections.parquet


--------------------------------------------------------------------------------
/db/chroma-embeddings.parquet:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aviggithub/OwnGPT/05a4e089ce2e82774b47f8c3a55830083c13dfa1/db/chroma-embeddings.parquet


--------------------------------------------------------------------------------
/db/index/id_to_uuid_b9718edd-fd19-45ce-8e90-58831cc9eefc.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aviggithub/OwnGPT/05a4e089ce2e82774b47f8c3a55830083c13dfa1/db/index/id_to_uuid_b9718edd-fd19-45ce-8e90-58831cc9eefc.pkl


--------------------------------------------------------------------------------
/db/index/index_b9718edd-fd19-45ce-8e90-58831cc9eefc.bin:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aviggithub/OwnGPT/05a4e089ce2e82774b47f8c3a55830083c13dfa1/db/index/index_b9718edd-fd19-45ce-8e90-58831cc9eefc.bin


--------------------------------------------------------------------------------
/db/index/index_metadata_b9718edd-fd19-45ce-8e90-58831cc9eefc.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aviggithub/OwnGPT/05a4e089ce2e82774b47f8c3a55830083c13dfa1/db/index/index_metadata_b9718edd-fd19-45ce-8e90-58831cc9eefc.pkl


--------------------------------------------------------------------------------
/db/index/uuid_to_id_b9718edd-fd19-45ce-8e90-58831cc9eefc.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aviggithub/OwnGPT/05a4e089ce2e82774b47f8c3a55830083c13dfa1/db/index/uuid_to_id_b9718edd-fd19-45ce-8e90-58831cc9eefc.pkl


--------------------------------------------------------------------------------
/ingest.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | import glob
 3 | from typing import List
 4 | from dotenv import load_dotenv
 5 | 
 6 | from langchain.document_loaders import (
 7 |     CSVLoader,
 8 |     EverNoteLoader,
 9 |     PDFMinerLoader,
10 |     TextLoader,
11 |     UnstructuredEmailLoader,
12 |     UnstructuredEPubLoader,
13 |     UnstructuredHTMLLoader,
14 |     UnstructuredMarkdownLoader,
15 |     UnstructuredODTLoader,
16 |     UnstructuredPowerPointLoader,
17 |     UnstructuredWordDocumentLoader,
18 | )
19 | 
20 | from langchain.text_splitter import RecursiveCharacterTextSplitter
21 | from langchain.vectorstores import Chroma
22 | from langchain.embeddings import HuggingFaceEmbeddings
23 | from langchain.docstore.document import Document
24 | from constants import CHROMA_SETTINGS
25 | 
26 | 
27 | # Map file extensions to document loaders and their arguments
28 | LOADER_MAPPING = {
29 |     ".csv": (CSVLoader, {}),
30 |     # ".docx": (Docx2txtLoader, {}),
31 |     ".doc": (UnstructuredWordDocumentLoader, {}),
32 |     ".docx": (UnstructuredWordDocumentLoader, {}),
33 |     ".enex": (EverNoteLoader, {}),
34 |     ".eml": (UnstructuredEmailLoader, {}),
35 |     ".epub": (UnstructuredEPubLoader, {}),
36 |     ".html": (UnstructuredHTMLLoader, {}),
37 |     ".md": (UnstructuredMarkdownLoader, {}),
38 |     ".odt": (UnstructuredODTLoader, {}),
39 |     ".pdf": (PDFMinerLoader, {}),
40 |     ".ppt": (UnstructuredPowerPointLoader, {}),
41 |     ".pptx": (UnstructuredPowerPointLoader, {}),
42 |     ".txt": (TextLoader, {"encoding": "utf8"}),
43 |     # Add more mappings for other file extensions and loaders as needed
44 | }
45 | 
46 | 
47 | load_dotenv()
48 | 
49 | 
50 | def load_single_document(file_path: str) -> List[Document]:
51 |     ext = "." + file_path.rsplit(".", 1)[-1]
52 |     if ext in LOADER_MAPPING:
53 |         loader_class, loader_args = LOADER_MAPPING[ext]
54 |         loader = loader_class(file_path, **loader_args)
55 |         return loader.load()  # keep every document the loader yields, not only the first
56 | 
57 |     raise ValueError(f"Unsupported file extension '{ext}'")
58 | 
59 | 
60 | def load_documents(source_dir: str) -> List[Document]:
61 |     # Loads all documents from source documents directory
62 |     all_files = []
63 |     for ext in LOADER_MAPPING:
64 |         all_files.extend(
65 |             glob.glob(os.path.join(source_dir, f"**/*{ext}"), recursive=True)
66 |         )
67 |     return [doc for file_path in all_files for doc in load_single_document(file_path)]
68 | 
69 | 
70 | def main():
71 |     # Load environment variables
72 |     persist_directory = os.environ.get('PERSIST_DIRECTORY')
73 |     source_directory = os.environ.get('SOURCE_DIRECTORY', 'source_documents')
74 |     embeddings_model_name = os.environ.get('EMBEDDINGS_MODEL_NAME')
75 | 
76 |     # Load documents and split in chunks
77 |     print(f"Loading documents from {source_directory}")
78 |     chunk_size = 500
79 |     chunk_overlap = 50
80 |     documents = load_documents(source_directory)
81 |     text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
82 |     texts = text_splitter.split_documents(documents)
83 |     print(f"Loaded {len(documents)} documents from {source_directory}")
84 |     print(f"Split into {len(texts)} chunks of text (max. {chunk_size} characters each)")
85 | 
86 |     # Create embeddings
87 |     embeddings = HuggingFaceEmbeddings(model_name=embeddings_model_name)
88 |     
89 |     # Create and store locally vectorstore
90 |     db = Chroma.from_documents(texts, embeddings, persist_directory=persist_directory, client_settings=CHROMA_SETTINGS)
91 |     db.persist()
92 |     db = None
93 | 
94 | 
95 | if __name__ == "__main__":
96 |     main()
97 | 


--------------------------------------------------------------------------------
/own private gpt.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aviggithub/OwnGPT/05a4e089ce2e82774b47f8c3a55830083c13dfa1/own private gpt.PNG


--------------------------------------------------------------------------------
/owngpt.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | """
 3 | OwnChat web application built with Streamlit and privateGPT
 4 | @author: Avinash G
 5 | """
 6 | import os
 7 | 
 8 | import streamlit as st
 9 | from dotenv import load_dotenv
10 | from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
11 | from langchain.chains import RetrievalQA
12 | from langchain.embeddings import HuggingFaceEmbeddings
13 | from langchain.llms import GPT4All, LlamaCpp
14 | from langchain.vectorstores import Chroma
15 | 
16 | 
17 | 
18 | 
19 | load_dotenv()
20 | 
21 | embeddings_model_name = os.environ.get("EMBEDDINGS_MODEL_NAME")
22 | persist_directory = os.environ.get('PERSIST_DIRECTORY')
23 | 
24 | model_type = os.environ.get('MODEL_TYPE')
25 | model_path = os.environ.get('MODEL_PATH')
26 | model_n_ctx = os.environ.get('MODEL_N_CTX')
27 | source_directory = os.environ.get('SOURCE_DIRECTORY', 'source_documents')
28 | 
29 | from constants import CHROMA_SETTINGS
30 | 
31 | secret = ''
32 | st.set_page_config(
33 |     page_title="Own ChatGPT App",
34 |     page_icon=":robot:"
35 | )
36 | 
37 | 
38 | def private_gpt_generate_msg(human_msg):
39 |     embeddings = HuggingFaceEmbeddings(model_name=embeddings_model_name)
40 |     db = Chroma(persist_directory=persist_directory, embedding_function=embeddings, client_settings=CHROMA_SETTINGS)
41 |     retriever = db.as_retriever()
42 |     # Prepare the LLM
43 |     callbacks = [StreamingStdOutCallbackHandler()]
44 |     match model_type:
45 |         case "LlamaCpp":
46 |             llm = LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, callbacks=callbacks, verbose=False)
47 |         case "GPT4All":
48 |             llm = GPT4All(model=model_path, n_ctx=model_n_ctx, backend='gptj', callbacks=callbacks, verbose=False)
49 |         case _:
50 |             # Fail loudly instead of continuing with an undefined llm
51 |             raise ValueError(f"Model {model_type} not supported!")
52 |     qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever, return_source_documents=True)
53 |     
54 |     # Get the answer from the chain
55 |     res = qa(human_msg)
56 |     print(res)   
57 |     answer, docs = res['result'], res['source_documents']
58 |     return answer
59 | 	
60 | 
61 | 
62 | st.header("Own ChatGPT App Private")
63 | 
64 | if 'Bot_msg' not in st.session_state:
65 |     st.session_state['Bot_msg'] = []
66 | 
67 | if 'History_msg' not in st.session_state:
68 |     st.session_state['History_msg'] = []
69 | 
70 | 
71 | def get_text():
72 |     input_text = st.text_input("Enter Your Text", key="input")
73 |     return input_text 
74 | 
75 | 
76 | user_input = get_text()
77 | 
78 | if user_input:
79 |     st.session_state.History_msg.append(user_input)
80 |     st.session_state.Bot_msg.append(private_gpt_generate_msg(user_input))
81 | 
82 | if st.session_state['Bot_msg']:
83 |     for i in range(len(st.session_state['Bot_msg'])-1, -1, -1):
84 |         st.markdown("BOT :- "+" "+st.session_state["Bot_msg"][i])
85 |         st.markdown("HUMAN :- "+"\n"+st.session_state['History_msg'][i])
86 |  
87 |         


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
 1 | langchain==0.0.171
 2 | pygpt4all==1.1.0
 3 | chromadb==0.3.23
 4 | urllib3==2.0.2
 5 | pdfminer.six==20221105
 6 | python-dotenv==1.0.0
 7 | unstructured==0.6.6
 8 | extract-msg==0.41.1
 9 | tabulate==0.9.0
10 | pandoc==2.3
11 | pypandoc==1.11
12 | llama-cpp-python==0.1.50
13 | streamlit


--------------------------------------------------------------------------------
/source_documents/chatGpt.txt:
--------------------------------------------------------------------------------
 1 | ChatGPT
 2 | Examples
 3 | 	Explain quantum computing in simple terms
 4 | 	Got any creative ideas for a 10 year old’s birthday?
 5 | 	How do I make an HTTP request in Javascript?
 6 | Capabilities
 7 | 	Remembers what user said earlier in the conversation
 8 | 	Allows user to provide follow-up corrections
 9 | 	Trained to decline inappropriate requests
10 | Limitations
11 | 	May occasionally generate incorrect information
12 | 	May occasionally produce harmful instructions or biased content
13 | 	Limited knowledge of world and events after 2021
14 | 	
15 | 
16 | ChatGPT FAQ
17 | Commonly asked questions about ChatGPT
18 | Natalie avatar
19 | Written by Natalie
20 | Updated today
21 | How much does it cost to use ChatGPT?
22 | 
23 | 	During the initial research preview, ChatGPT is free to use.
24 | 
25 | How does ChatGPT work?
26 | 
27 | 	ChatGPT is fine-tuned from GPT-3.5, a language model trained to produce text. ChatGPT was optimized for dialogue by using Reinforcement Learning with Human Feedback (RLHF) – a method that uses human demonstrations to guide the model toward desired behavior.
28 | 
29 | Why does the AI seem so real and lifelike? 
30 | 
31 | 	These models were trained on vast amounts of data from the internet written by humans, including conversations, so the responses it provides may sound human-like. It is important to keep in mind that this is a direct result of the system's design (i.e. maximizing the similarity between outputs and the dataset the models were trained on) and that such outputs may be inaccurate, untruthful, and otherwise misleading at times.
32 | 
33 | Can I trust that the AI is telling me the truth?
34 | 
35 | 	ChatGPT is not connected to the internet, and it can occasionally produce incorrect answers. It has limited knowledge of world and events after 2021 and may also occasionally produce harmful instructions or biased content.
36 | 
37 | 	We'd recommend checking whether responses from the model are accurate or not. If you find an answer is incorrect, please provide that feedback by using the "Thumbs Down" button.
38 | 
39 | Who can view my conversations?
40 | 
41 | 	As part of our commitment to safe and responsible AI, we review conversations to improve our systems and to ensure the content complies with our policies and safety requirements. 
42 | 
43 | Will you use my conversations for training?
44 | 
45 | 	Yes. Your conversations may be reviewed by our AI trainers to improve our systems.
46 | 
47 | Can you delete my data?
48 | 
49 | 	Yes, please follow the data deletion process here: https://help.openai.com/en/articles/6378407-how-can-i-delete-my-account
50 | 
51 | Can you delete specific prompts?
52 | 
53 | 	No, we are not able to delete specific prompts from your history. Please don't share any sensitive information in your conversations.
54 | 
55 | Can I see my history of threads? How can I save a conversation I’ve had?
56 | 
57 | 	No, a view of your conversation history is not possible at this time, but this is a feature we are looking into.
58 | 
59 | Where do you save my personal and conversation data?
60 | 
61 | 	For more information on how we handle data, please see our Privacy Policy and Terms of Use.
62 | 
63 | How can I implement this? Is there any implementation guide for this?
64 | 
65 | 	ChatGPT is being made available as a research preview so we can learn about its strengths and weaknesses. It is not available in the API.
66 | 
67 | Do I need a new account if I already have a Labs or Playground account?
68 | 
69 | 	If you have an existing account at labs.openai.com or beta.openai.com, then you can login directly at chat.openai.com using the same login information. If you don't have an account, you'll need to sign-up for a new account at chat.openai.com.
70 | 	
71 | 	


--------------------------------------------------------------------------------