├── requirements.txt
├── .DS_Store
├── dump
│   ├── txt.pkl
│   └── .DS_Store
├── data
│   └── .DS_Store
├── store
│   └── .DS_Store
├── README.md
├── docGPT_core.py
└── RVS.py

/requirements.txt:
--------------------------------------------------------------------------------
langchain
gradio
numpy
scikit-learn
--------------------------------------------------------------------------------
/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ssm123ssm/docGPT-pharm/HEAD/.DS_Store
--------------------------------------------------------------------------------
/dump/txt.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ssm123ssm/docGPT-pharm/HEAD/dump/txt.pkl
--------------------------------------------------------------------------------
/data/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ssm123ssm/docGPT-pharm/HEAD/data/.DS_Store
--------------------------------------------------------------------------------
/dump/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ssm123ssm/docGPT-pharm/HEAD/dump/.DS_Store
--------------------------------------------------------------------------------
/store/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ssm123ssm/docGPT-pharm/HEAD/store/.DS_Store
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# docGPT - Document Intelligence

Welcome to **docGPT**, an easy-to-use document intelligence program written as an abstraction over LangChain. Simply put all your unstructured data (e.g. PDF files) in the `data` folder and start asking questions about it!

Useful for retrieval and summarization tasks.

## Overview

Place your files in the `data` folder, install the dependencies and run:

```
from docGPT_core import *
from RVS import *

set_tokens(OPENAI_TOKEN='OPENAI TOKEN HERE', HF_TOKEN="HF TOKEN HERE")
llm = Llm(model_type='gpt-3.5-turbo')
openai_embeddings = Embedding(model_type='openai')
persona = Persona(personality_type='explainer')

vs = VectorStore(embedding_model=openai_embeddings)
chain = Chain(retriever=vs, llm=llm, persona=persona)
```

You can start chatting with your data now:

`chain.qa(inputs={"query": "Your question"})`

## Installation

Clone this repo and install the dependencies (langchain, gradio, numpy and scikit-learn) with `pip install -r requirements.txt`.

## Preparation

Import the docGPT_core and RVS modules.

```
from docGPT_core import *
from RVS import *
```

You have to set the OpenAI token and choose the LLM and embedding model to be used. The default LLM is OpenAI `gpt-3.5-turbo` and the default embedding model is `text-embedding-ada-002`. If you plan to use Hugging Face models, the Hugging Face token also needs to be set.

```
# setting tokens
set_tokens(OPENAI_TOKEN='OPENAI TOKEN HERE', HF_TOKEN="HF TOKEN HERE")

# setting models
llm = Llm(model_type='gpt-3.5-turbo')
openai_embeddings = Embedding(model_type='openai')

# setting the personality of the model
persona = Persona(personality_type='explainer')
```

## Vectorstore creation

Place all the files you want to index in the `data` folder and call the VectorStore constructor. You can pass the chunk size and overlap if necessary; the defaults are `chunk_size=1000` and `chunk_overlap=200`.

```
vs = VectorStore(embedding_model=openai_embeddings)
```

This might take some time depending on the size of the documents. If the files in the `data` folder need OCR, additional packages, including Tesseract OCR, need to be installed.

You can save the created vectorstore to disk with `vs.save('my_vectorstore.pkl')` to avoid re-creating it, and reload it later with `vs = load_vectorstore('my_vectorstore.pkl')`.

## Chain for prompting

The QA chain can be initiated with one line of code:

`chain = Chain(retriever=vs, llm=llm, persona=persona)`

The `retriever` used in the chain creation above is the default retriever attached to the vectorstore, which returns the 2 most similar chunks. If you want to retrieve a different number of documents, use the Retriever class.

Eg: `chain = Chain(retriever=Retriever(vs, k=4), llm=llm, persona=persona)`

You can then query the vectorstore with

`chain.qa(inputs={"query": "Your question"})`

## Summarization

The RVS module provides functions for summarization.

The `summarize()` function returns the summary of the vectorstore as a passage or as key points.

Eg: `summarize(vectorstore=vs, llm=llm, max_tokens=2500, summary=False, keypoints=True)`

Toggle the `summary` and `keypoints` booleans to return either kind of summary. The method uses the Representative Vector Summarization method explained in [https://arxiv.org/abs/2308.00479v1](https://arxiv.org/abs/2308.00479v1).
Tune the `max_tokens` parameter depending on the number of chunks to be included and the size of each chunk. A good starting point is a value around 2000; fine-tune it from there depending on the answer.
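`summarize()` returns a dict with `summary`, `keypoints` and `questions` keys; any entry you did not request is `None`. A minimal sketch, assuming `vs` and `llm` were created as in the Preparation section:

```
# a minimal sketch, assuming vs and llm were built as shown above
result = summarize(vectorstore=vs, llm=llm, max_tokens=2000, summary=True, keypoints=True)

print(result['summary'])    # paragraph-style overview of the document
print(result['keypoints'])  # bullet-point summary
# result['questions'] is None here, because questions=False by default
```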
### Further control

The above is the highest-level API for creating the QA engine. For finer control, each component can be configured individually, as in the sketch below.
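The following is a sketch based on the classes defined in `docGPT_core.py`; the particular model and retriever choices are only illustrations:

```
from docGPT_core import *

# Hugging Face embeddings (default model: sentence-transformers/all-MiniLM-L6-v2)
# instead of the OpenAI embeddings used above
hf_embeddings = Embedding(model_type='hf')

# a Hugging Face hosted LLM (tiiuae/falcon-7b-instruct) instead of OpenAI
hf_llm = Llm(model_type='hf')

# reload a previously saved vectorstore instead of re-ingesting the data folder
vs = load_vectorstore('my_vectorstore.pkl')

# a retriever that returns the 5 most similar chunks instead of the default 2
retriever = Retriever(vs, search_type='similarity', k=5)

# available personas: 'truthful' (default), 'explainer', 'mcq', 'rapper' and 'idiot'
persona = Persona(personality_type='truthful')

chain = Chain(retriever=retriever, llm=hf_llm, persona=persona)
chain.qa(inputs={"query": "Your question"})
```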

--------------------------------------------------------------------------------
/docGPT_core.py:
--------------------------------------------------------------------------------
# Imports
from langchain.chains import RetrievalQA
from langchain.document_loaders import DirectoryLoader, UnstructuredFileLoader
from langchain.memory import ConversationBufferMemory
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain import PromptTemplate, FAISS, HuggingFaceHub
from langchain.llms import GPT4All
from langchain.chat_models import ChatOpenAI
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.embeddings import HuggingFaceEmbeddings, OpenAIEmbeddings
import os
import pickle
import logging

logging.basicConfig(level=logging.WARNING)


def set_tokens(OPENAI_TOKEN, HF_TOKEN):
    os.environ['HUGGINGFACEHUB_API_TOKEN'] = HF_TOKEN
    os.environ['OPENAI_API_KEY'] = OPENAI_TOKEN


def load_vectorstore(store_name='vectorstore'):
    with open(store_name, 'rb') as f:
        logging.warning("Vectorstore loaded from disk")
        return pickle.load(f)


def load_txt(store_name='dump/txt.pkl'):
    with open(store_name, 'rb') as f:
        logging.warning("Txt splits loaded from disk")
        return pickle.load(f)


class VectorStore:
    def __init__(self, embedding_model, doc_path='data', chunk_size=1000, chunk_overlap=200, file=False, file_path=None,
                 from_large_embeddings=False, vectorstore=None):
        logging.warning("Loading input files...")
        if not from_large_embeddings:
            my_loader = DirectoryLoader(doc_path)
            if file:
                my_loader = UnstructuredFileLoader(file_path=file_path)
            docs = my_loader.load()
            text_split = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
            logging.warning("Starting ingestion...")
            text = text_split.split_documents(docs)
            os.makedirs('dump', exist_ok=True)  # make sure the dump folder exists before writing the splits
            with open('dump/txt.pkl', 'wb') as f:
                pickle.dump(text, f)
            logging.warning("Text splits dumped...")

            # text = load_txt()
            self.store = FAISS.from_documents(text, embedding_model.model)
            self.retriever = self.store.as_retriever(search_type="similarity", search_kwargs={"k": 2})
        else:
            self.store = vectorstore
            self.retriever = self.store.as_retriever(search_type="similarity", search_kwargs={"k": 2})

        logging.warning("Vector store created in memory. Use the save method to write the store to disk.")
    def save(self, store_name='vectorstore'):
        path = store_name

        # Save the object to disk
        with open(path, 'wb') as f:
            pickle.dump(self, f)
            logging.warning("VectorStore on disk now...")


class Embedding:
    # model_type="hf" loads a HuggingFace sentence-transformers model; any other value uses OpenAI embeddings
    def __init__(self, model_name='sentence-transformers/all-MiniLM-L6-v2', model_type="hf"):
        md = HuggingFaceEmbeddings(model_name=model_name) if model_type == "hf" else OpenAIEmbeddings()
        self.model = md


class Retriever:
    def __init__(self, vectorstore, search_type='similarity', k=2):
        self.retriever = vectorstore.store.as_retriever(search_type=search_type, search_kwargs={"k": k})


class Llm:
    # model_type: 'gpt4all' for a local GPT4All model, 'hf' for a HuggingFaceHub model,
    # or the name of an OpenAI chat model such as 'gpt-3.5-turbo'
    def __init__(self, model_type='gpt4all', model_path='bin/nous-hermes-13b.ggmlv3.q4_0.bin'):
        repo_id = "tiiuae/falcon-7b-instruct"
        if model_type == 'gpt4all':
            callbacks = [StreamingStdOutCallbackHandler()]
            self.model = GPT4All(model=model_path, backend="gptj", callbacks=callbacks, verbose=True)
        elif model_type == 'hf':
            self.model = HuggingFaceHub(repo_id=repo_id,
                                        model_kwargs={"temperature": 0.2, "max_new_tokens": 2000})
        else:
            self.model = ChatOpenAI(model=model_type, temperature=0)


class Persona:
    def __init__(self, personality_type='truthful'):

        prompt_template_mcq = """Use the following pieces of context to answer the question at the end. If the answer
        is not in the context, guess the most probable answer. If the context does not provide the answer, say The context
        does not provide the exact answer, but the most probable answer is...

        {context}

        Question: {question}
        You must give an answer.
        """

        prompt_template_idiot = """Say I don't know
        {context}
        {question}
        You should not give the answer. Instead pretend that you don't know the answer.
        """

        prompt_template_truthful = """Use the following pieces of context to answer the question at the end.

        {context}

        Question: {question}
        Give a detailed answer. If the context does not provide the answer, say the answer cannot be found from the
        given context. Always start with OK....
        """

        prompt_template_explainer = """Use the following pieces of context to answer the question at the end.

        {context}

        Question: {question}
        Give a detailed answer with the help of the provided piece of information.
        """

        prompt_template_rapper = """Use the following pieces of context to answer the question at the end.

        {context}

        Question: {question}
        Give a detailed answer with the help of the provided piece of information. You are a rapper and answer in a rap.
        """
        if personality_type == 'mcq':
            prompt_template = prompt_template_mcq
        elif personality_type == 'idiot':
            prompt_template = prompt_template_idiot
        elif personality_type == 'explainer':
            prompt_template = prompt_template_explainer
        elif personality_type == 'rapper':
            prompt_template = prompt_template_rapper
        else:
            prompt_template = prompt_template_truthful

        prompt = PromptTemplate(
            template=prompt_template, input_variables=["context", "question"]
        )

        pr = {"prompt": prompt}
        self.persona = pr


class Chain:
    def __init__(self, retriever, llm, persona, chain_type="stuff", source_nodes=True):
        self.qa = RetrievalQA.from_chain_type(llm=llm.model, chain_type=chain_type, retriever=retriever.retriever,
                                              # return_source_documents=source_nodes,  # enable to also return the retrieved chunks
                                              chain_type_kwargs=persona.persona, verbose=True)
        self.con_qa = RetrievalQA.from_chain_type(llm=llm.model, chain_type=chain_type, retriever=retriever.retriever,
                                                  chain_type_kwargs=persona.persona, verbose=True,
                                                  memory=ConversationBufferMemory(), )
--------------------------------------------------------------------------------
/RVS.py:
--------------------------------------------------------------------------------
import numpy as np
import statistics
import logging
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
from langchain import PromptTemplate
from sklearn.cluster import KMeans
from tqdm import tqdm


def summarize(vectorstore, llm, embedding_dim=1536, max_tokens=10000, summary=True, keypoints=False, questions=False):
    # embedding_dim is currently unused; the vector dimensionality comes from the FAISS index itself
    index = vectorstore.store.index
    num_items = len(vectorstore.store.index_to_docstore_id)
    vectors = []

    # reconstruct the embedding of every chunk from the FAISS index
    for i in range(num_items):
        vectors.append(index.reconstruct(i))

    embedding_matrix = np.array(vectors)
    doc_index = (vectorstore.store.docstore.__dict__['_dict'])
    chunk_tokens = []

    for key, value in doc_index.items():
        chunk_tokens.append(llm.model.get_num_tokens(value.page_content))

    mean_chunk_size = statistics.mean(chunk_tokens)
    target = max_tokens

    # choose as many clusters as the token budget allows, capped at the number of chunks
    if target // mean_chunk_size <= len(chunk_tokens):
        num_clusters = int(target // mean_chunk_size)
    else:
        num_clusters = len(chunk_tokens)

    logging.warning(f"Number of chunks chosen: {num_clusters}")
    print(f"Can afford {num_clusters} clusters, with a mean chunk size of {mean_chunk_size} tokens, out of {len(chunk_tokens)} total chunks")

    kmeans = KMeans(n_clusters=num_clusters, random_state=42, n_init=10).fit(embedding_matrix)

    closest_indices = []

    # for each cluster, pick the chunk closest to the centroid as its representative
    for i in range(num_clusters):
        distances = np.linalg.norm(embedding_matrix - kmeans.cluster_centers_[i], axis=1)
        closest_index = np.argmin(distances)
        closest_indices.append(closest_index)

    selected_indices = sorted(closest_indices)
    doc_ids = list(map(vectorstore.store.index_to_docstore_id.get, selected_indices))
    contents = list(map(vectorstore.store.docstore.__dict__['_dict'].get, doc_ids))

    map_prompt = """
    You will be given a single passage of a document. This section will be enclosed in triple backticks (```)
    Your goal is to identify what the passage tries to describe and give the general idea the passage is discussing, as a summary.
    Do not focus on specific details and try to understand the general context. Start with This section is mainly about,
    ```{text}```
    GENERAL IDEA:
    """
    map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])
    map_chain = load_summarize_chain(llm=llm.model,
                                     chain_type="stuff",
                                     prompt=map_prompt_template)

    # Map step: summarize each representative chunk
    print('Mapping summaries')

    results = []
    with tqdm(total=len(contents), desc="Processing documents") as pbar:
        for i in contents:
            res_2 = map_chain({"input_documents": [i]})['output_text']
            results.append(res_2)
            pbar.update(1)

    summary_map = ''.join(['\n\nSummary: ' + s for s in results])
    summary_doc = Document(page_content=summary_map)

    summary_prompt = """
    You will be given a set of summaries of randomly selected passages from a document.
    Your goal is to write a paragraph on what the document is likely to be about.

    ```{text}```

    The document is:
    """

    insights_prompt = """
    You will be given a set of summaries of passages from a document.
    Your goal is to generate an overall general summary of the document using the summaries provided below within triple backticks.

    ```{text}```

    OVERALL CONTENT: Provide a list of bullet points.
    """

    questions_prompt = """
    You will be given a set of summaries of passages from a document.
    Your goal is to generate questions that cover the overall content of the document, using the summaries provided below within triple backticks.

    ```{text}```

    QUESTIONS: Provide a list of questions.
    """

    summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"])
    insights_prompt_template = PromptTemplate(template=insights_prompt, input_variables=["text"])
    questions_prompt_template = PromptTemplate(template=questions_prompt, input_variables=["text"])

    summary_chain = load_summarize_chain(llm=llm.model,
                                         chain_type="stuff",
                                         prompt=summary_prompt_template)
    insights_chain = load_summarize_chain(llm=llm.model,
                                          chain_type="stuff",
                                          prompt=insights_prompt_template)
    questions_chain = load_summarize_chain(llm=llm.model,
                                           chain_type="stuff",
                                           prompt=questions_prompt_template)

    final_summary = None
    insights = None
    questions_out = None  # renamed so the output does not shadow the `questions` flag argument

    if summary:
        final_summary = summary_chain({"input_documents": [summary_doc]})['output_text']
    if keypoints:
        insights = insights_chain({"input_documents": [summary_doc]})['output_text']
    if questions:
        questions_out = questions_chain({"input_documents": [summary_doc]})['output_text']

    out = {'summary': final_summary, 'keypoints': insights, 'questions': questions_out}
    return out
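
# Usage sketch, assuming a VectorStore `vs` and an Llm `llm` built via docGPT_core:
#   out = summarize(vectorstore=vs, llm=llm, max_tokens=2000, summary=True, keypoints=True)
# returns a dict whose 'summary', 'keypoints' and 'questions' entries are strings or None.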

def keywords(vectorstore, llm, embedding_dim=1536, max_tokens=10000):
    # embedding_dim is currently unused; the vector dimensionality comes from the FAISS index itself
    index = vectorstore.store.index
    num_items = len(vectorstore.store.index_to_docstore_id)
    vectors = []

    for i in range(num_items):
        vectors.append(index.reconstruct(i))

    embedding_matrix = np.array(vectors)
    doc_index = (vectorstore.store.docstore.__dict__['_dict'])
    chunk_tokens = []

    for key, value in doc_index.items():
        chunk_tokens.append(llm.model.get_num_tokens(value.page_content))

    mean_chunk_size = statistics.mean(chunk_tokens)
    target = max_tokens

    if target // mean_chunk_size <= len(chunk_tokens):
        num_clusters = int(target // mean_chunk_size)
    else:
        num_clusters = len(chunk_tokens)

    print(f"Can afford {num_clusters} clusters, with a mean chunk size of {mean_chunk_size} tokens, out of {len(chunk_tokens)} total chunks")

    kmeans = KMeans(n_clusters=num_clusters, random_state=42, n_init=10).fit(embedding_matrix)

    closest_indices = []

    for i in range(num_clusters):
        distances = np.linalg.norm(embedding_matrix - kmeans.cluster_centers_[i], axis=1)
        closest_index = np.argmin(distances)
        closest_indices.append(closest_index)

    selected_indices = sorted(closest_indices)
    doc_ids = list(map(vectorstore.store.index_to_docstore_id.get, selected_indices))
    contents = list(map(vectorstore.store.docstore.__dict__['_dict'].get, doc_ids))

    keyword_prompt = """
    You will be given a single passage of a document. This section will be enclosed in triple backticks (```)
    Your goal is to identify what the passage tries to describe and give five comma-separated, un-numbered keywords from the passage.

    ```{text}```
    keywords:
    """
    keyword_prompt_template = PromptTemplate(template=keyword_prompt, input_variables=["text"])
    keyword_chain = load_summarize_chain(llm=llm.model,
                                         chain_type="stuff",
                                         prompt=keyword_prompt_template)

    # Map step: extract keywords from each representative chunk
    print('Mapping keywords')

    res_2_key = []
    with tqdm(total=len(contents), desc="Processing documents") as pbar:
        for i in contents:
            res_2_key_t = keyword_chain({"input_documents": [i]})['output_text']
            res_2_key.append(res_2_key_t)
            pbar.update(1)

    # map each chunk to the keywords of its cluster's representative (computed but not returned)
    labels = kmeans.labels_
    label_to_string = {labels[idx]: kw for idx, kw in zip(selected_indices, res_2_key)}
    mapped_strings = [label_to_string[label] for label in labels]

    # keywords are returned in the order of the selected representative chunks
    return res_2_key
--------------------------------------------------------------------------------