├── README.md
├── constants.py
├── db
│   ├── chroma-collections.parquet
│   ├── chroma-embeddings.parquet
│   └── index
│       ├── id_to_uuid_b9718edd-fd19-45ce-8e90-58831cc9eefc.pkl
│       ├── index_b9718edd-fd19-45ce-8e90-58831cc9eefc.bin
│       ├── index_metadata_b9718edd-fd19-45ce-8e90-58831cc9eefc.pkl
│       └── uuid_to_id_b9718edd-fd19-45ce-8e90-58831cc9eefc.pkl
├── ingest.py
├── own private gpt.PNG
├── owngpt.py
├── requirements.txt
└── source_documents
    └── chatGpt.txt


/README.md:
--------------------------------------------------------------------------------
# OwnGPT
Create your own ChatGPT over your documents, with a Streamlit UI, on your own device using GPT models. No data leaves your device, and it is 100% private.

This project was inspired by the original privateGPT (https://github.com/imartinez/privateGPT). Most of the description here is inspired by the original privateGPT.

Ask questions to your documents without an internet connection, using the power of LLMs. 100% private, no data leaves your execution environment at any point. You can ingest documents and ask questions without an internet connection!

Built with LLM: [ggml-gpt4all-j-v1.3-groovy.bin](https://gpt4all.io/models/ggml-gpt4all-j-v1.3-groovy.bin).

# Environment Setup
To set up your environment to run the code here, first install all requirements:

```shell
pip install -r requirements.txt
```

## Instructions for ingesting your own dataset

Put all of your files into the `source_documents` directory.

The extensions supported by `ingest.py` are:

- `.csv`: CSV,
- `.doc`, `.docx`: Word Document,
- `.enex`: EverNote,
- `.eml`: Email,
- `.epub`: EPub,
- `.html`: HTML File,
- `.md`: Markdown,
- `.odt`: Open Document Text,
- `.pdf`: Portable Document Format (PDF),
- `.ppt`, `.pptx`: PowerPoint Document,
- `.txt`: Text file (UTF-8),

Run the following command to ingest all the data.

```shell
python ingest.py
```

It will create an index containing the local vectorstore. This takes time, depending on the size of your documents.
You can ingest as many documents as you want, and all will be accumulated in the local embeddings database.
If you want to start from an empty database, delete the `db` directory, which contains the `index`.

Note: When you run this for the first time, it will take longer because it has to download the embedding model. In subsequent runs, no data leaves your local environment, and ingestion can run without an internet connection.

## Ask questions to your documents locally, using the Streamlit UI
To ask a question, start the Streamlit app with the following command:
```
streamlit run owngpt.py --server.address localhost
```
This command launches the Streamlit app and binds its server to `localhost`.

Wait until the web app is running at "http://localhost:8501/" on your local system.

Enter your query in the text box and hit Enter. Wait while the LLM consumes the prompt and prepares the answer. The app then shows your query and the answer below the text box, as in the figure below.


![own private gpt](https://github.com/aviggithub/owngpt/assets/46967951/938e588e-5f7d-48e5-b63f-0db925071886)


# How does it work?
By selecting the right local models and using the power of `LangChain`, you can run the entire pipeline locally, without any data leaving your environment, and with reasonable performance.

- `ingest.py` uses `LangChain` tools to parse the documents and create embeddings locally using `HuggingFaceEmbeddings`. It then stores the result in a local vector database using the `Chroma` vector store.
- `streamlit run owngpt.py` uses a local LLM (ggml-gpt4all-j-v1.3-groovy.bin) to understand questions and create answers. The context for the answers is extracted from the local vector store using a similarity search to locate the right piece of context from the docs.
- You can replace this local LLM with another model. `owngpt.py` picks the model class from the `MODEL_TYPE` environment variable (`GPT4All` or `LlamaCpp`) and loads the file given in `MODEL_PATH`, so make sure the model you select is in a format that class can load.

# System Requirements

## Python Version
To use this software, you must have Python 3.10.0 or later installed. Earlier versions of Python will not run the code, because `owngpt.py` uses the `match` statement, which was added in Python 3.10.

## C++ Compiler
If you encounter an error while building a wheel during the `pip install` process, you may need to install a C++ compiler on your computer.

### For Windows 10/11
To install a C++ compiler on Windows 10/11, follow these steps:

1. Install Visual Studio 2022.
2. Make sure the following components are selected:
   * Universal Windows Platform development
   * C++ CMake tools for Windows
3. Download the MinGW installer from the [MinGW website](https://sourceforge.net/projects/mingw/).
4. Run the installer and select the "gcc" component.
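
The Python version requirement above can be checked before running `pip install`. The following is a minimal sketch and not part of the repository: the helper name `python_is_supported` and the constant `MINIMUM_PYTHON` are hypothetical, and the `(3, 10)` threshold is simply taken from the requirement stated above.

```python
import sys

# Assumption: (3, 10) mirrors the "Python 3.10.0 or later" requirement above.
MINIMUM_PYTHON = (3, 10)


def python_is_supported(version_info=sys.version_info, minimum=MINIMUM_PYTHON):
    """Return True when the (major, minor) version meets the minimum."""
    # Compare only major and minor; tuples compare element by element.
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    if python_is_supported():
        print("Python version is supported.")
    else:
        found = ".".join(str(part) for part in sys.version_info[:3])
        print(f"Python 3.10 or later is required, found {found}.")
```

Saving this under a name of your choice and running it with the interpreter you plan to use shows whether that interpreter is new enough before any dependencies are installed.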

# Disclaimer
This is a test project to validate the feasibility of a fully local private solution for question answering using LLMs and vector embeddings. It is not production ready, and it is not meant to be used in production. ggml-gpt4all-j-v1.3-groovy.bin is based on the GPT4All model, so it is covered by the original GPT4All license.

--------------------------------------------------------------------------------
/constants.py:
--------------------------------------------------------------------------------
import os
from dotenv import load_dotenv
from chromadb.config import Settings

load_dotenv()

# Define the folder for storing database
PERSIST_DIRECTORY = os.environ.get('PERSIST_DIRECTORY')

# Define the Chroma settings
CHROMA_SETTINGS = Settings(
    chroma_db_impl='duckdb+parquet',
    persist_directory=PERSIST_DIRECTORY,
    anonymized_telemetry=False
)

--------------------------------------------------------------------------------
/db/chroma-collections.parquet:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aviggithub/OwnGPT/05a4e089ce2e82774b47f8c3a55830083c13dfa1/db/chroma-collections.parquet

--------------------------------------------------------------------------------
/db/chroma-embeddings.parquet:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aviggithub/OwnGPT/05a4e089ce2e82774b47f8c3a55830083c13dfa1/db/chroma-embeddings.parquet

--------------------------------------------------------------------------------
/db/index/id_to_uuid_b9718edd-fd19-45ce-8e90-58831cc9eefc.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aviggithub/OwnGPT/05a4e089ce2e82774b47f8c3a55830083c13dfa1/db/index/id_to_uuid_b9718edd-fd19-45ce-8e90-58831cc9eefc.pkl

--------------------------------------------------------------------------------
/db/index/index_b9718edd-fd19-45ce-8e90-58831cc9eefc.bin:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aviggithub/OwnGPT/05a4e089ce2e82774b47f8c3a55830083c13dfa1/db/index/index_b9718edd-fd19-45ce-8e90-58831cc9eefc.bin

--------------------------------------------------------------------------------
/db/index/index_metadata_b9718edd-fd19-45ce-8e90-58831cc9eefc.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aviggithub/OwnGPT/05a4e089ce2e82774b47f8c3a55830083c13dfa1/db/index/index_metadata_b9718edd-fd19-45ce-8e90-58831cc9eefc.pkl

--------------------------------------------------------------------------------
/db/index/uuid_to_id_b9718edd-fd19-45ce-8e90-58831cc9eefc.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aviggithub/OwnGPT/05a4e089ce2e82774b47f8c3a55830083c13dfa1/db/index/uuid_to_id_b9718edd-fd19-45ce-8e90-58831cc9eefc.pkl

--------------------------------------------------------------------------------
/ingest.py:
--------------------------------------------------------------------------------
import os
import glob
from typing import List
from dotenv import load_dotenv

from langchain.document_loaders import (
    CSVLoader,
    EverNoteLoader,
    PDFMinerLoader,
    TextLoader,
    UnstructuredEmailLoader,
    UnstructuredEPubLoader,
    UnstructuredHTMLLoader,
    UnstructuredMarkdownLoader,
    UnstructuredODTLoader,
    UnstructuredPowerPointLoader,
    UnstructuredWordDocumentLoader,
)

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.docstore.document import Document
from constants import CHROMA_SETTINGS


# Map file extensions to document loaders and their arguments
LOADER_MAPPING = {
    ".csv": (CSVLoader, {}),
    # ".docx": (Docx2txtLoader, {}),
    ".doc": (UnstructuredWordDocumentLoader, {}),
    ".docx": (UnstructuredWordDocumentLoader, {}),
    ".enex": (EverNoteLoader, {}),
    ".eml": (UnstructuredEmailLoader, {}),
    ".epub": (UnstructuredEPubLoader, {}),
    ".html": (UnstructuredHTMLLoader, {}),
    ".md": (UnstructuredMarkdownLoader, {}),
    ".odt": (UnstructuredODTLoader, {}),
    ".pdf": (PDFMinerLoader, {}),
    ".ppt": (UnstructuredPowerPointLoader, {}),
    ".pptx": (UnstructuredPowerPointLoader, {}),
    ".txt": (TextLoader, {"encoding": "utf8"}),
    # Add more mappings for other file extensions and loaders as needed
}


load_dotenv()


def load_single_document(file_path: str) -> List[Document]:
    # Pick the loader from the file extension (including the leading dot)
    ext = "." + file_path.rsplit(".", 1)[-1]
    if ext in LOADER_MAPPING:
        loader_class, loader_args = LOADER_MAPPING[ext]
        loader = loader_class(file_path, **loader_args)
        # Return every document the loader yields. CSVLoader, for example,
        # produces one document per row, so taking only [0] would drop data.
        return loader.load()

    raise ValueError(f"Unsupported file extension '{ext}'")


def load_documents(source_dir: str) -> List[Document]:
    # Loads all documents from source documents directory
    all_files = []
    for ext in LOADER_MAPPING:
        all_files.extend(
            glob.glob(os.path.join(source_dir, f"**/*{ext}"), recursive=True)
        )
    # Flatten the per-file document lists into a single list
    documents = []
    for file_path in all_files:
        documents.extend(load_single_document(file_path))
    return documents


def main():
    # Load environment variables
    persist_directory = os.environ.get('PERSIST_DIRECTORY')
    source_directory = os.environ.get('SOURCE_DIRECTORY', 'source_documents')
    embeddings_model_name = os.environ.get('EMBEDDINGS_MODEL_NAME')

    # Load documents and split in chunks
    print(f"Loading documents from {source_directory}")
    chunk_size = 500
    chunk_overlap = 50
    documents = load_documents(source_directory)
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    texts = text_splitter.split_documents(documents)
    print(f"Loaded {len(documents)} documents from {source_directory}")
    print(f"Split into {len(texts)} chunks of text (max. {chunk_size} characters each)")

    # Create embeddings
    embeddings = HuggingFaceEmbeddings(model_name=embeddings_model_name)

    # Create the vector store and persist it locally
    db = Chroma.from_documents(texts, embeddings, persist_directory=persist_directory, client_settings=CHROMA_SETTINGS)
    db.persist()
    db = None


if __name__ == "__main__":
    main()

--------------------------------------------------------------------------------
/own private gpt.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aviggithub/OwnGPT/05a4e089ce2e82774b47f8c3a55830083c13dfa1/own private gpt.PNG

--------------------------------------------------------------------------------
/owngpt.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
Own ChatGPT web application built with Streamlit and a private GPT pipeline
@author: Avinash G
"""
import os

import streamlit as st
from dotenv import load_dotenv
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.vectorstores import Chroma
from langchain.llms import GPT4All, LlamaCpp

from constants import CHROMA_SETTINGS

load_dotenv()

# Runtime configuration, read from the environment (or a .env file)
embeddings_model_name = os.environ.get("EMBEDDINGS_MODEL_NAME")
persist_directory = os.environ.get('PERSIST_DIRECTORY')

model_type = os.environ.get('MODEL_TYPE')
model_path = os.environ.get('MODEL_PATH')
model_n_ctx = os.environ.get('MODEL_N_CTX')
source_directory = os.environ.get('SOURCE_DIRECTORY', 'source_documents')

secret = ''
st.set_page_config(
    page_title="Own ChatGPT App",
    page_icon=":robot:"
)


def private_gpt_generate_msg(human_msg):
    embeddings = HuggingFaceEmbeddings(model_name=embeddings_model_name)
    # No collection_name is passed, so Chroma opens the same default
    # collection that ingest.py wrote to
    db = Chroma(persist_directory=persist_directory, embedding_function=embeddings, client_settings=CHROMA_SETTINGS)
    retriever = db.as_retriever()
    # Prepare the LLM
    callbacks = [StreamingStdOutCallbackHandler()]
    match model_type:
        case "LlamaCpp":
            llm = LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, callbacks=callbacks, verbose=False)
        case "GPT4All":
            llm = GPT4All(model=model_path, n_ctx=model_n_ctx, backend='gptj', callbacks=callbacks, verbose=False)
        case _:
            # Fail loudly; otherwise llm would be undefined below
            raise ValueError(f"Model {model_type} not supported!")
    qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever, return_source_documents=True)

    # Get the answer from the chain
    res = qa(human_msg)
    print(res)
    answer, docs = res['result'], res['source_documents']
    return answer


st.header("Own ChatGPT App Private")

# Keep the conversation across Streamlit reruns
if 'Bot_msg' not in st.session_state:
    st.session_state['Bot_msg'] = []

if 'History_msg' not in st.session_state:
    st.session_state['History_msg'] = []


def get_text():
    input_text = st.text_input("Enter Your Text", key="input")
    return input_text


user_input = get_text()

if user_input:
    st.session_state.History_msg.append(user_input)
    st.session_state.Bot_msg.append(private_gpt_generate_msg(user_input))

# Show the conversation, newest exchange first
if st.session_state['Bot_msg']:
    for i in range(len(st.session_state['Bot_msg'])-1, -1, -1):
        st.markdown("BOT :- "+" "+st.session_state["Bot_msg"][i])
        st.markdown("HUMAN :- "+"\n"+st.session_state['History_msg'][i])

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
langchain==0.0.171
pygpt4all==1.1.0
chromadb==0.3.23
urllib3==2.0.2
pdfminer.six==20221105
python-dotenv==1.0.0
unstructured==0.6.6
extract-msg==0.41.1
tabulate==0.9.0
pandoc==2.3
pypandoc==1.11
llama-cpp-python==0.1.50
streamlit

--------------------------------------------------------------------------------
/source_documents/chatGpt.txt:
--------------------------------------------------------------------------------
ChatGPT
Examples
Explain quantum computing in simple terms
Got any creative ideas for a 10 year old’s birthday?
How do I make an HTTP request in Javascript?
Capabilities
Remembers what user said earlier in the conversation
Allows user to provide follow-up corrections
Trained to decline inappropriate requests
Limitations
May occasionally generate incorrect information
May occasionally produce harmful instructions or biased content
Limited knowledge of world and events after 2021


ChatGPT FAQ
Commonly asked questions about ChatGPT
Natalie avatar
Written by Natalie
Updated today
How much does it cost to use ChatGPT?

During the initial research preview, ChatGPT is free to use.

How does ChatGPT work?

ChatGPT is fine-tuned from GPT-3.5, a language model trained to produce text. ChatGPT was optimized for dialogue by using Reinforcement Learning with Human Feedback (RLHF) – a method that uses human demonstrations to guide the model toward desired behavior.

Why does the AI seem so real and lifelike?

These models were trained on vast amounts of data from the internet written by humans, including conversations, so the responses it provides may sound human-like. It is important to keep in mind that this is a direct result of the system's design (i.e. maximizing the similarity between outputs and the dataset the models were trained on) and that such outputs may be inaccurate, untruthful, and otherwise misleading at times.

Can I trust that the AI is telling me the truth?

ChatGPT is not connected to the internet, and it can occasionally produce incorrect answers. It has limited knowledge of world and events after 2021 and may also occasionally produce harmful instructions or biased content.

We'd recommend checking whether responses from the model are accurate or not. If you find an answer is incorrect, please provide that feedback by using the "Thumbs Down" button.

Who can view my conversations?

As part of our commitment to safe and responsible AI, we review conversations to improve our systems and to ensure the content complies with our policies and safety requirements.

Will you use my conversations for training?

Yes. Your conversations may be reviewed by our AI trainers to improve our systems.

Can you delete my data?

Yes, please follow the data deletion process here: https://help.openai.com/en/articles/6378407-how-can-i-delete-my-account

Can you delete specific prompts?

No, we are not able to delete specific prompts from your history. Please don't share any sensitive information in your conversations.

Can I see my history of threads? How can I save a conversation I’ve had?

No, a view of your conversation history is not possible at this time, but this is a feature we are looking into.

Where do you save my personal and conversation data?

For more information on how we handle data, please see our Privacy Policy and Terms of Use.

How can I implement this? Is there any implementation guide for this?

ChatGPT is being made available as a research preview so we can learn about its strengths and weaknesses. It is not available in the API.

Do I need a new account if I already have a Labs or Playground account?

If you have an existing account at labs.openai.com or beta.openai.com, then you can login directly at chat.openai.com using the same login information. If you don't have an account, you'll need to sign-up for a new account at chat.openai.com.

--------------------------------------------------------------------------------