├── .env.sample
├── requirements.txt
├── README.md
├── templates
│   └── index.html
├── .gitignore
└── main.py

/.env.sample:
--------------------------------------------------------------------------------
1 | OPENAI_API_KEY=''
2 | PINECONE_API_KEY=''
3 | PINECONE_ENV=''
4 | EMBEDDING_ID=''
5 | SECRET_KEY=''
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | flask
2 | python-dotenv
3 | openai
4 | pinecone-client
5 | langchain
6 | git+https://github.com/embedstore/embedstore.git
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # langchain-pinecone-chat-bot
 2 | 
 3 | This repo is a fully functional Flask app that can be used to create a chatbot app like BibleGPT or KrishnaGPT, or a chat app over any other data source.
 4 | 
 5 | It uses the following:
 6 | 
 7 | * Ready-made embeddings from [embedstore.ai](https://embedstore.ai) ([python package](https://github.com/embedstore/embedstore)). The text is chunked using LangChain's RecursiveCharacterTextSplitter with chunk_size as 1000, chunk_overlap as 100 and length_function as len. OpenAI embeddings (dimension 1536) are then computed for each chunk.
 8 | * It loads the embeddings and then indexes them into a Pinecone index.
 9 | * Whenever a user query is received, it first creates an embedding for it using OpenAI embeddings. Then it searches for the 3 nearest neighbours using cosine similarity in the Pinecone index.
10 | * These documents are then passed as context to the ChatGPT API with the prompt below, with temperature as 0 and max_tokens as 800:
11 | 
12 | ```
13 | You are given a paragraph and a query. You need to answer the query on the basis of paragraph. If the answer is not contained within the text below, say "Sorry, I don't know. Please try again."
14 | 
15 | P: {documents}
16 | Q: {query}
17 | A: 
18 | ```
19 | * The answer retrieved from the ChatGPT API is shown to the user.
20 | 
21 | # Setup Instructions
22 | 
23 | * First make sure that you have python3, python-virtualenv and setuptools installed on the system.
24 | * Next, clone the repo:
25 | 
26 | ```
27 | git clone https://github.com/embedstore/langchain-pinecone-chat-bot.git
28 | ```
29 | 
30 | * Now create a virtual environment and install all the dependencies:
31 | 
32 | ```
33 | cd langchain-pinecone-chat-bot
34 | virtualenv -p $(which python3) pyenv
35 | source pyenv/bin/activate
36 | 
37 | pip install -r requirements.txt
38 | ```
39 | 
40 | * Now copy the sample env file and fill in your variables:
41 | 
42 | ```
43 | cp .env.sample .env
44 | ```
45 | 
46 | * The `EMBEDDING_ID` variable is the ID of the embedding, which you can get from [embedstore.ai](https://embedstore.ai).
47 | 
48 | * Now run the app:
49 | 
50 | ```
51 | python main.py
52 | ```
53 | 
54 | * You can also create a Repl, import the repo into it and run the app there.
55 | 
56 | # License
57 | 
58 | * MIT License
--------------------------------------------------------------------------------
/templates/index.html:
--------------------------------------------------------------------------------
 1 | <!DOCTYPE html>
 2 | <html lang="en">
 3 | <head>
 4 |   <meta charset="utf-8">
 5 |   <meta name="viewport" content="width=device-width, initial-scale=1">
 6 |   <title>Naval Almanack Book Search</title>
 7 | </head>
 8 | <body>
 9 |   <div class="container">
10 |     <div class="row">
11 |       <h1>Naval Almanack Book Search</h1>
12 |     </div>
13 |     <div class="row">
14 |       <form method="post">
15 |         <input type="text" name="query" value="{{ query or '' }}">
16 |         <button type="submit">Search</button>
17 |       </form>
18 |     </div>
19 |     {% if result %}
20 |     <div class="row">
21 |       <h2>Result:</h2>
22 |       <p>{{ result.response }}</p>
23 |       <h2>Documents:</h2>
24 |       <table>
25 |         <thead>
26 |           <tr>
27 |             <th>#</th>
28 |             <th>Document</th>
29 |           </tr>
30 |         </thead>
31 |         <tbody>
32 |           {% for document in result.documents %}
33 |           <tr>
34 |             <td>{{ loop.index }}</td>
35 |             <td>{{ document }}</td>
36 |           </tr>
37 |           {% endfor %}
38 |         </tbody>
39 |       </table>
40 |     </div>
41 |     {% endif %}
42 |   </div>
43 | </body>
44 | </html>
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 | 
6 | # C extensions
7 | *.so
8 | 
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | share/python-wheels/
24 | *.egg-info/
25 | .installed.cfg
26 | *.egg
27 | MANIFEST
28 | 
29 | # PyInstaller
30 | # Usually these files are written by a python script from a template
31 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
32 | *.manifest
33 | *.spec
34 | 
35 | # Installer logs
36 | pip-log.txt
37 | pip-delete-this-directory.txt
38 | 
39 | # Unit test / coverage reports
40 | htmlcov/
41 | .tox/
42 | .nox/
43 | .coverage
44 | .coverage.*
45 | .cache
46 | nosetests.xml
47 | coverage.xml
48 | *.cover
49 | *.py,cover
50 | .hypothesis/
51 | .pytest_cache/
52 | cover/
53 | 
54 | # Translations
55 | *.mo
56 | *.pot
57 | 
58 | # Django stuff:
59 | *.log
60 | local_settings.py
61 | db.sqlite3
62 | db.sqlite3-journal
63 | 
64 | # Flask stuff:
65 | instance/
66 | .webassets-cache
67 | 
68 | # Scrapy stuff:
69 | .scrapy
70 | 
71 | # Sphinx documentation
72 | docs/_build/
73 | 
74 | # PyBuilder
75 | .pybuilder/
76 | target/
77 | 
78 | # Jupyter Notebook
79 | .ipynb_checkpoints
80 | 
81 | # IPython
82 | profile_default/
83 | ipython_config.py
84 | 
85 | # pyenv
86 | # For a library or package, you might want to ignore these files since the code is
87 | # intended to run in multiple environments; otherwise, check them in:
88 | # .python-version
89 | 
90 | # pipenv
91 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
92 | # However, in case of collaboration, if having platform-specific dependencies or dependencies
93 | # having no cross-platform support, pipenv may install dependencies that don't work, or not
94 | # install all needed dependencies.
95 | #Pipfile.lock
96 | 
97 | # poetry
98 | # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
99 | # This is especially recommended for binary packages to ensure reproducibility, and is more
100 | # commonly ignored for libraries.
101 | # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
102 | #poetry.lock
103 | 
104 | # pdm
105 | # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
106 | #pdm.lock
107 | # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
108 | # in version control.
109 | # https://pdm.fming.dev/#use-with-ide
110 | .pdm.toml
111 | 
112 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
113 | __pypackages__/
114 | 
115 | # Celery stuff
116 | celerybeat-schedule
117 | celerybeat.pid
118 | 
119 | # SageMath parsed files
120 | *.sage.py
121 | 
122 | # Environments
123 | .env
124 | .venv
125 | env/
126 | venv/
127 | ENV/
128 | env.bak/
129 | venv.bak/
130 | 
131 | # Spyder project settings
132 | .spyderproject
133 | .spyproject
134 | 
135 | # Rope project settings
136 | .ropeproject
137 | 
138 | # mkdocs documentation
139 | /site
140 | 
141 | # mypy
142 | .mypy_cache/
143 | .dmypy.json
144 | dmypy.json
145 | 
146 | # Pyre type checker
147 | .pyre/
148 | 
149 | # pytype static type analyzer
150 | .pytype/
151 | 
152 | # Cython debug symbols
153 | cython_debug/
154 | 
155 | # PyCharm
156 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can
157 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
158 | # and can be added to the global gitignore or merged into this file. For a more nuclear
159 | # option (not recommended) you can uncomment the following to ignore the entire idea folder.
160 | #.idea/
161 | 
162 | pyenv/*
163 | venv/*
164 | 
165 | .ideas.md
--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
1 | import os
2 | import openai
3 | import pinecone
4 | import itertools
5 | 
6 | from dotenv import load_dotenv
7 | from flask import Flask, request, render_template, redirect, url_for, session
8 | from embedstore import load_embedding
9 | 
10 | load_dotenv()
11 | 
12 | app = Flask(__name__)
13 | app.secret_key = os.getenv("SECRET_KEY")
14 | openai.api_key = os.getenv("OPENAI_API_KEY")
15 | 
16 | EMBEDDING_DIMENSION = 1536
17 | INDEX_NAME = "naval-almanack-book"
18 | 
19 | pinecone.init(api_key=os.getenv("PINECONE_API_KEY"), environment=os.getenv("PINECONE_ENV"))
20 | 
21 | 
22 | def get_embedding(chunk):
23 |     """Get embedding using OpenAI"""
24 |     response = openai.Embedding.create(
25 |         input=chunk,
26 |         model="text-embedding-ada-002",
27 |     )
28 |     embedding = response['data'][0]['embedding']
29 |     return embedding
30 | 
31 | 
32 | def get_response_from_openai(query, documents):
33 |     """Get ChatGPT api response"""
34 |     prompt = get_prompt_for_query(query, documents)
35 |     messages = [{"role": "user", "content": prompt}]
36 |     response = openai.ChatCompletion.create(
37 |         model='gpt-3.5-turbo',
38 |         messages=messages,
39 |         temperature=0,
40 |         max_tokens=800,
41 |         top_p=1,
42 |     )
43 |     return response["choices"][0]["message"]["content"]
44 | 
45 | 
46 | def create_pinecone_index(index_name):
47 |     """Create Pinecone index if it doesn't exist"""
48 |     existing_indexes = pinecone.list_indexes()
49 |     if index_name not in existing_indexes:
50 |         print(f"{index_name} index not found in pinecone. Creating it...")
51 |         pinecone.create_index(index_name, dimension=EMBEDDING_DIMENSION, metric="cosine")
52 |     return pinecone.Index(index_name)
53 | 
54 | 
55 | def chunks(iterable, batch_size=100):
56 |     """A helper function to break an iterable into chunks of size batch_size."""
57 |     it = iter(iterable)
58 |     chunk = tuple(itertools.islice(it, batch_size))
59 |     while chunk:
60 |         yield chunk
61 |         chunk = tuple(itertools.islice(it, batch_size))
62 | 
63 | 
64 | def get_prompt_for_query(query, documents):
65 |     """Build prompt for question answering"""
66 |     template = """
67 | You are given a paragraph and a query. You need to answer the query on the basis of paragraph. If the answer is not contained within the text below, say \"Sorry, I don't know. Please try again.\"\n\nP:{documents}\nQ: {query}\nA: 
68 |     """
69 |     final_prompt = template.format(
70 |         documents=documents,
71 |         query=query
72 |     )
73 |     return final_prompt
74 | 
75 | 
76 | def search_for_query(query):
77 |     """Main function to search for an answer to the query"""
78 |     output = {}
79 |     query_embedding = get_embedding(query)
80 |     print(f"Embedding generated for {query}")
81 |     results = index.query(
82 |         vector=query_embedding,
83 |         top_k=3,
84 |         include_values=False,
85 |         include_metadata=True,
86 |     )
87 |     documents = [
88 |         match['metadata']['document'] for match in results['matches']
89 |     ]
90 |     documents_as_str = "\n".join(documents)
91 |     response = get_response_from_openai(query, documents_as_str)
92 |     print("Final response received from openai.")
93 |     output["response"] = response
94 |     output["documents"] = documents
95 |     return output
96 | 
97 | 
98 | index = create_pinecone_index(INDEX_NAME)
99 | result = load_embedding(os.getenv("EMBEDDING_ID"), embed_for="chroma")
100 | doc_ids = result["ids"]
101 | embeddings = result["embeddings"]
102 | documents = result["documents"]
103 | 
104 | final_data = []
105 | for idx, doc_id in enumerate(doc_ids):
106 |     final_data.append((doc_id, embeddings[idx], {"document": documents[idx]}))
107 | 
108 | for ids_vectors_chunk in chunks(final_data, batch_size=50):
109 |     index.upsert(vectors=ids_vectors_chunk)
110 | 
111 | print(pinecone.describe_index(INDEX_NAME))
112 | 
113 | 
114 | @app.route('/', methods=['GET', 'POST'])
115 | def run_bot():
116 |     if request.method == 'POST':
117 |         query = request.form.get('query')
118 |         result = search_for_query(query)
119 |         session['result'] = result
120 |         session['query'] = query
121 |         return redirect(url_for('run_bot'))
122 |     return render_template('index.html', query=session.get("query"), result=session.get("result"))
123 | 
124 | 
125 | if __name__ == '__main__':
126 |     app.run()
--------------------------------------------------------------------------------
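
Addendum: the chunking scheme the README describes (chunk_size 1000, chunk_overlap 100, length_function len) can be approximated in plain Python. The sketch below is a simplified sliding-window splitter, not LangChain's actual RecursiveCharacterTextSplitter, which additionally prefers to break on paragraph and sentence boundaries before falling back to a hard character cut:

```python
def split_text(text, chunk_size=1000, chunk_overlap=100, length_function=len):
    """Greedy sliding-window splitter: emit chunk_size-character pieces,
    stepping forward by (chunk_size - chunk_overlap) characters each time,
    so consecutive chunks share chunk_overlap characters of context."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < length_function(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks


pieces = split_text("x" * 2500)
print(len(pieces))                          # 3 chunks for a 2500-character document
print(pieces[0][-100:] == pieces[1][:100])  # True: 100 characters of shared overlap
```

The overlap exists so that a sentence falling on a chunk boundary still appears intact in at least one chunk, which keeps the nearest-neighbour retrieval step from missing context that straddles a cut.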