├── requirements.txt
├── .DS_Store
├── dump
│   ├── txt.pkl
│   └── .DS_Store
├── data
│   └── .DS_Store
├── store
│   └── .DS_Store
├── README.md
├── docGPT_core.py
└── RVS.py

/requirements.txt:
--------------------------------------------------------------------------------
langchain
gradio
numpy
scikit-learn
--------------------------------------------------------------------------------
/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ssm123ssm/docGPT-pharm/HEAD/.DS_Store
--------------------------------------------------------------------------------
/dump/txt.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ssm123ssm/docGPT-pharm/HEAD/dump/txt.pkl
--------------------------------------------------------------------------------
/data/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ssm123ssm/docGPT-pharm/HEAD/data/.DS_Store
--------------------------------------------------------------------------------
/dump/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ssm123ssm/docGPT-pharm/HEAD/dump/.DS_Store
--------------------------------------------------------------------------------
/store/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ssm123ssm/docGPT-pharm/HEAD/store/.DS_Store
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# docGPT - Document Intelligence

Welcome to **docGPT**, an easy-to-use document intelligence program written as an abstraction over LangChain. Simply put all your unstructured data (e.g. PDF files) in the `data` folder and start asking questions about it!

Useful for retrieval and summarization tasks.

## Overview

Place your files in the `data` folder, install the dependencies and run:

```
from docGPT_core import *
from RVS import *

set_tokens(OPENAI_TOKEN='OPENAI TOKEN HERE', HF_TOKEN="HF TOKEN HERE")
llm = Llm(model_type='gpt-3.5-turbo')
openai_embeddings = Embedding(model_type='openai')
persona = Persona(personality_type='explainer')

vs = VectorStore(embedding_model=openai_embeddings)
chain = Chain(retriever=vs, llm=llm, persona=persona)
```

You can start chatting with your data now:

`chain.qa(inputs={"query": "Your question"})`

## Installation

Clone this repo and install the dependencies (langchain, gradio, numpy and scikit-learn) with `pip install -r requirements.txt`.

## Preparation

Import the docGPT_core and RVS modules.

```
from docGPT_core import *
from RVS import *
```

You have to set the OpenAI token and choose the LLM and embedding model to be used. The default LLM is OpenAI `gpt-3.5-turbo` and the default embedding model is `text-embedding-ada-002`. If you plan to use Hugging Face models, the Hugging Face token also needs to be set.

```
# setting tokens
set_tokens(OPENAI_TOKEN='OPENAI TOKEN HERE', HF_TOKEN="HF TOKEN HERE")

# setting models
llm = Llm(model_type='gpt-3.5-turbo')
openai_embeddings = Embedding(model_type='openai')

# setting the personality of the model
persona = Persona(personality_type='explainer')
```

## Vectorstore creation

Place all the files you want to index in the `data` folder and call the VectorStore constructor. You can pass the chunk size and overlap if necessary; the defaults are `chunk_size=1000` and `chunk_overlap=200`.

```
vs = VectorStore(embedding_model=openai_embeddings)
```

This might take some time depending on the size of the documents. If the files in the `data` folder need OCR, additional packages, including Tesseract OCR, need to be installed.

You can save the created vectorstore to disk with `vs.save('my_vectorstore.pkl')` to avoid re-creating it, and reload it later with `vs = load_vectorstore('my_vectorstore.pkl')`.

## Chain for prompting

The QA chain can be initiated with one line of code:

`chain = Chain(retriever=vs, llm=llm, persona=persona)`

The `retriever` used in the chain creation above is the default retriever attached to the vectorstore, which returns the 2 most similar chunks. If you want to retrieve a different number of documents, use the Retriever class.

Eg: `chain = Chain(retriever=Retriever(vs, k=4), llm=llm, persona=persona)`

You can then query the vectorstore with

`chain.qa(inputs={"query": "Your question"})`

## Summarization

The RVS module provides functions for summarization.

The `summarize()` function returns the summary of the vectorstore as a passage or as key points.

Eg: `summarize(vectorstore=vs, llm=llm, max_tokens=2500, summary=False, keypoints=True)`

Toggle the `summary` and `keypoints` booleans to return either kind of summary. The method uses the Representative Vector Summarization method explained in [https://arxiv.org/abs/2308.00479v1](https://arxiv.org/abs/2308.00479v1).
Tune the `max_tokens` parameter depending on the number of chunks to be included and the size of each chunk. A good starting point is a value around 2000; fine-tune it from there depending on the answer.
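`summarize()` returns a dict with `summary`, `keypoints` and `questions` keys; any entry you did not request is `None`. A minimal sketch, assuming `vs` and `llm` were created as in the Preparation section:

```
# a minimal sketch, assuming vs and llm were built as shown above
result = summarize(vectorstore=vs, llm=llm, max_tokens=2000, summary=True, keypoints=True)

print(result['summary'])    # paragraph-style overview of the document
print(result['keypoints'])  # bullet-point summary
# result['questions'] is None here, because questions=False by default
```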
### Further control

The above is the highest-level API for creating the QA engine. For finer control, each component can be configured individually, as in the sketch below.
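The following is a sketch based on the classes defined in `docGPT_core.py`; the particular model and retriever choices are only illustrations:

```
from docGPT_core import *

# Hugging Face embeddings (default model: sentence-transformers/all-MiniLM-L6-v2)
# instead of the OpenAI embeddings used above
hf_embeddings = Embedding(model_type='hf')

# a Hugging Face hosted LLM (tiiuae/falcon-7b-instruct) instead of OpenAI
hf_llm = Llm(model_type='hf')

# reload a previously saved vectorstore instead of re-ingesting the data folder
vs = load_vectorstore('my_vectorstore.pkl')

# a retriever that returns the 5 most similar chunks instead of the default 2
retriever = Retriever(vs, search_type='similarity', k=5)

# available personas: 'truthful' (default), 'explainer', 'mcq', 'rapper' and 'idiot'
persona = Persona(personality_type='truthful')

chain = Chain(retriever=retriever, llm=hf_llm, persona=persona)
chain.qa(inputs={"query": "Your question"})
```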

--------------------------------------------------------------------------------
/docGPT_core.py:
--------------------------------------------------------------------------------
# Imports
from langchain.chains import RetrievalQA
from langchain.document_loaders import DirectoryLoader, UnstructuredFileLoader
from langchain.memory import ConversationBufferMemory
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain import PromptTemplate, FAISS, HuggingFaceHub
from langchain.llms import GPT4All
from langchain.chat_models import ChatOpenAI
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.embeddings import HuggingFaceEmbeddings, OpenAIEmbeddings
import os
import pickle
import logging

logging.basicConfig(level=logging.WARNING)


def set_tokens(OPENAI_TOKEN, HF_TOKEN):
    os.environ['HUGGINGFACEHUB_API_TOKEN'] = HF_TOKEN
    os.environ['OPENAI_API_KEY'] = OPENAI_TOKEN


def load_vectorstore(store_name='vectorstore'):
    with open(store_name, 'rb') as f:
        logging.warning("Vectorstore loaded from disk")
        return pickle.load(f)


def load_txt(store_name='dump/txt.pkl'):
    with open(store_name, 'rb') as f:
        logging.warning("Txt splits loaded from disk")
        return pickle.load(f)


class VectorStore:
    def __init__(self, embedding_model, doc_path='data', chunk_size=1000, chunk_overlap=200, file=False, file_path=None,
                 from_large_embeddings=False, vectorstore=None):
        logging.warning("Loading input files...")
        if not from_large_embeddings:
            my_loader = DirectoryLoader(doc_path)
            if file:
                my_loader = UnstructuredFileLoader(file_path=file_path)
            docs = my_loader.load()
            text_split = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
            logging.warning("Starting ingestion...")
            text = text_split.split_documents(docs)
            os.makedirs('dump', exist_ok=True)  # make sure the dump folder exists before writing the splits
            with open('dump/txt.pkl', 'wb') as f:
                pickle.dump(text, f)
            logging.warning("Text splits dumped...")

            # text = load_txt()
            self.store = FAISS.from_documents(text, embedding_model.model)
            self.retriever = self.store.as_retriever(search_type="similarity", search_kwargs={"k": 2})
        else:
            self.store = vectorstore
            self.retriever = self.store.as_retriever(search_type="similarity", search_kwargs={"k": 2})

        logging.warning("Vector store created in memory. Use the save method to write the store to disk.")
    def save(self, store_name='vectorstore'):
        path = store_name

        # Save the object to disk
        with open(path, 'wb') as f:
            pickle.dump(self, f)
            logging.warning("VectorStore on disk now...")


class Embedding:
    # model_type="hf" loads a HuggingFace sentence-transformers model; any other value uses OpenAI embeddings
    def __init__(self, model_name='sentence-transformers/all-MiniLM-L6-v2', model_type="hf"):
        md = HuggingFaceEmbeddings(model_name=model_name) if model_type == "hf" else OpenAIEmbeddings()
        self.model = md


class Retriever:
    def __init__(self, vectorstore, search_type='similarity', k=2):
        self.retriever = vectorstore.store.as_retriever(search_type=search_type, search_kwargs={"k": k})


class Llm:
    # model_type: 'gpt4all' for a local GPT4All model, 'hf' for a HuggingFaceHub model,
    # or the name of an OpenAI chat model such as 'gpt-3.5-turbo'
    def __init__(self, model_type='gpt4all', model_path='bin/nous-hermes-13b.ggmlv3.q4_0.bin'):
        repo_id = "tiiuae/falcon-7b-instruct"
        if model_type == 'gpt4all':
            callbacks = [StreamingStdOutCallbackHandler()]
            self.model = GPT4All(model=model_path, backend="gptj", callbacks=callbacks, verbose=True)
        elif model_type == 'hf':
            self.model = HuggingFaceHub(repo_id=repo_id,
                                        model_kwargs={"temperature": 0.2, "max_new_tokens": 2000})
        else:
            self.model = ChatOpenAI(model=model_type, temperature=0)


class Persona:
    def __init__(self, personality_type='truthful'):

        prompt_template_mcq = """Use the following pieces of context to answer the question at the end. If the answer
        is not in the context, guess the most probable answer. If the context does not provide the answer, say The context
        does not provide the exact answer, but the most probable answer is...

        {context}

        Question: {question}
        You must give an answer.
        """

        prompt_template_idiot = """Say I don't know
        {context}
        {question}
        You should not give the answer. Instead pretend that you don't know the answer.
        """

        prompt_template_truthful = """Use the following pieces of context to answer the question at the end.

        {context}

        Question: {question}
        Give a detailed answer. If the context does not provide the answer, say the answer cannot be found from the
        given context. Always start with OK....
        """

        prompt_template_explainer = """Use the following pieces of context to answer the question at the end.

        {context}

        Question: {question}
        Give a detailed answer with the help of the provided piece of information.
        """

        prompt_template_rapper = """Use the following pieces of context to answer the question at the end.

        {context}

        Question: {question}
        Give a detailed answer with the help of the provided piece of information. You are a rapper and answer in a rap.
        """
        if personality_type == 'mcq':
            prompt_template = prompt_template_mcq
        elif personality_type == 'idiot':
            prompt_template = prompt_template_idiot
        elif personality_type == 'explainer':
            prompt_template = prompt_template_explainer
        elif personality_type == 'rapper':
            prompt_template = prompt_template_rapper
        else:
            prompt_template = prompt_template_truthful

        prompt = PromptTemplate(
            template=prompt_template, input_variables=["context", "question"]
        )

        pr = {"prompt": prompt}
        self.persona = pr


class Chain:
    def __init__(self, retriever, llm, persona, chain_type="stuff", source_nodes=True):
        self.qa = RetrievalQA.from_chain_type(llm=llm.model, chain_type=chain_type, retriever=retriever.retriever,
                                              # return_source_documents=source_nodes,  # enable to also return the retrieved chunks
                                              chain_type_kwargs=persona.persona, verbose=True)
        self.con_qa = RetrievalQA.from_chain_type(llm=llm.model, chain_type=chain_type, retriever=retriever.retriever,
                                                  chain_type_kwargs=persona.persona, verbose=True,
                                                  memory=ConversationBufferMemory(), )
--------------------------------------------------------------------------------
/RVS.py:
--------------------------------------------------------------------------------
import numpy as np
import statistics
import logging
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
from langchain import PromptTemplate
from sklearn.cluster import KMeans
from tqdm import tqdm


def summarize(vectorstore, llm, embedding_dim=1536, max_tokens=10000, summary=True, keypoints=False, questions=False):
    # embedding_dim is currently unused; the vector dimensionality comes from the FAISS index itself
    index = vectorstore.store.index
    num_items = len(vectorstore.store.index_to_docstore_id)
    vectors = []

    # reconstruct the embedding of every chunk from the FAISS index
    for i in range(num_items):
        vectors.append(index.reconstruct(i))

    embedding_matrix = np.array(vectors)
    doc_index = (vectorstore.store.docstore.__dict__['_dict'])
    chunk_tokens = []

    for key, value in doc_index.items():
        chunk_tokens.append(llm.model.get_num_tokens(value.page_content))

    mean_chunk_size = statistics.mean(chunk_tokens)
    target = max_tokens

    # choose as many clusters as the token budget allows, capped at the number of chunks
    if target // mean_chunk_size <= len(chunk_tokens):
        num_clusters = int(target // mean_chunk_size)
    else:
        num_clusters = len(chunk_tokens)

    logging.warning(f"Number of chunks chosen: {num_clusters}")
    print(f"Can afford {num_clusters} clusters, with a mean chunk size of {mean_chunk_size} tokens, out of {len(chunk_tokens)} total chunks")

    kmeans = KMeans(n_clusters=num_clusters, random_state=42, n_init=10).fit(embedding_matrix)

    closest_indices = []

    # for each cluster, pick the chunk closest to the centroid as its representative
    for i in range(num_clusters):
        distances = np.linalg.norm(embedding_matrix - kmeans.cluster_centers_[i], axis=1)
        closest_index = np.argmin(distances)
        closest_indices.append(closest_index)

    selected_indices = sorted(closest_indices)
    doc_ids = list(map(vectorstore.store.index_to_docstore_id.get, selected_indices))
    contents = list(map(vectorstore.store.docstore.__dict__['_dict'].get, doc_ids))

    map_prompt = """
    You will be given a single passage of a document. This section will be enclosed in triple backticks (```)
    Your goal is to identify what the passage tries to describe and give the general idea the passage is discussing, as a summary.
    Do not focus on specific details and try to understand the general context. Start with This section is mainly about,
    ```{text}```
    GENERAL IDEA:
    """
    map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])
    map_chain = load_summarize_chain(llm=llm.model,
                                     chain_type="stuff",
                                     prompt=map_prompt_template)

    # Map step: summarize each representative chunk
    print('Mapping summaries')

    results = []
    with tqdm(total=len(contents), desc="Processing documents") as pbar:
        for i in contents:
            res_2 = map_chain({"input_documents": [i]})['output_text']
            results.append(res_2)
            pbar.update(1)

    summary_map = ''.join(['\n\nSummary: ' + s for s in results])
    summary_doc = Document(page_content=summary_map)

    summary_prompt = """
    You will be given a set of summaries of randomly selected passages from a document.
    Your goal is to write a paragraph on what the document is likely to be about.

    ```{text}```

    The document is:
    """

    insights_prompt = """
    You will be given a set of summaries of passages from a document.
    Your goal is to generate an overall general summary of the document using the summaries provided below within triple backticks.

    ```{text}```

    OVERALL CONTENT: Provide a list of bullet points.
    """

    questions_prompt = """
    You will be given a set of summaries of passages from a document.
    Your goal is to generate questions that cover the overall content of the document, using the summaries provided below within triple backticks.

    ```{text}```

    QUESTIONS: Provide a list of questions.
    """

    summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"])
    insights_prompt_template = PromptTemplate(template=insights_prompt, input_variables=["text"])
    questions_prompt_template = PromptTemplate(template=questions_prompt, input_variables=["text"])

    summary_chain = load_summarize_chain(llm=llm.model,
                                         chain_type="stuff",
                                         prompt=summary_prompt_template)
    insights_chain = load_summarize_chain(llm=llm.model,
                                          chain_type="stuff",
                                          prompt=insights_prompt_template)
    questions_chain = load_summarize_chain(llm=llm.model,
                                           chain_type="stuff",
                                           prompt=questions_prompt_template)

    final_summary = None
    insights = None
    questions_out = None  # renamed so the output does not shadow the `questions` flag argument

    if summary:
        final_summary = summary_chain({"input_documents": [summary_doc]})['output_text']
    if keypoints:
        insights = insights_chain({"input_documents": [summary_doc]})['output_text']
    if questions:
        questions_out = questions_chain({"input_documents": [summary_doc]})['output_text']

    out = {'summary': final_summary, 'keypoints': insights, 'questions': questions_out}
    return out
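
# Usage sketch, assuming a VectorStore `vs` and an Llm `llm` built via docGPT_core:
#   out = summarize(vectorstore=vs, llm=llm, max_tokens=2000, summary=True, keypoints=True)
# returns a dict whose 'summary', 'keypoints' and 'questions' entries are strings or None.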

def keywords(vectorstore, llm, embedding_dim=1536, max_tokens=10000):
    # embedding_dim is currently unused; the vector dimensionality comes from the FAISS index itself
    index = vectorstore.store.index
    num_items = len(vectorstore.store.index_to_docstore_id)
    vectors = []

    for i in range(num_items):
        vectors.append(index.reconstruct(i))

    embedding_matrix = np.array(vectors)
    doc_index = (vectorstore.store.docstore.__dict__['_dict'])
    chunk_tokens = []

    for key, value in doc_index.items():
        chunk_tokens.append(llm.model.get_num_tokens(value.page_content))

    mean_chunk_size = statistics.mean(chunk_tokens)
    target = max_tokens

    if target // mean_chunk_size <= len(chunk_tokens):
        num_clusters = int(target // mean_chunk_size)
    else:
        num_clusters = len(chunk_tokens)

    print(f"Can afford {num_clusters} clusters, with a mean chunk size of {mean_chunk_size} tokens, out of {len(chunk_tokens)} total chunks")

    kmeans = KMeans(n_clusters=num_clusters, random_state=42, n_init=10).fit(embedding_matrix)

    closest_indices = []

    for i in range(num_clusters):
        distances = np.linalg.norm(embedding_matrix - kmeans.cluster_centers_[i], axis=1)
        closest_index = np.argmin(distances)
        closest_indices.append(closest_index)

    selected_indices = sorted(closest_indices)
    doc_ids = list(map(vectorstore.store.index_to_docstore_id.get, selected_indices))
    contents = list(map(vectorstore.store.docstore.__dict__['_dict'].get, doc_ids))

    keyword_prompt = """
    You will be given a single passage of a document. This section will be enclosed in triple backticks (```)
    Your goal is to identify what the passage tries to describe and give five comma-separated, un-numbered keywords from the passage.

    ```{text}```
    keywords:
    """
    keyword_prompt_template = PromptTemplate(template=keyword_prompt, input_variables=["text"])
    keyword_chain = load_summarize_chain(llm=llm.model,
                                         chain_type="stuff",
                                         prompt=keyword_prompt_template)

    # Map step: extract keywords from each representative chunk
    print('Mapping keywords')

    res_2_key = []
    with tqdm(total=len(contents), desc="Processing documents") as pbar:
        for i in contents:
            res_2_key_t = keyword_chain({"input_documents": [i]})['output_text']
            res_2_key.append(res_2_key_t)
            pbar.update(1)

    # map each chunk to the keywords of its cluster's representative (computed but not returned)
    labels = kmeans.labels_
    label_to_string = {labels[idx]: kw for idx, kw in zip(selected_indices, res_2_key)}
    mapped_strings = [label_to_string[label] for label in labels]

    # keywords are returned in the order of the selected representative chunks
    return res_2_key
--------------------------------------------------------------------------------