├── delete_pinecone_index.py
├── query_only.py
├── README.zh.md
├── main.py
└── README.md

--------------------------------------------------------------------------------
/delete_pinecone_index.py:
--------------------------------------------------------------------------------
import pinecone
import os
import sys

# usage: python3 delete_pinecone_index.py NAME_OF_INDEX
if len(sys.argv) < 2:
    sys.exit("usage: python3 delete_pinecone_index.py NAME_OF_INDEX")

pinecone.init(
    api_key=os.environ["PINECONE_API_KEY"],
    environment="us-east1-gcp"
)

index_name = sys.argv[1]

if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)
    print(f"index '{index_name}' successfully deleted")
else:
    print(f"index '{index_name}' not found in pinecone")
--------------------------------------------------------------------------------
/query_only.py:
--------------------------------------------------------------------------------
import pinecone
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import PromptTemplate
from langchain.llms import OpenAI
from langchain.chains import LLMChain
import os, sys, json

# This file is a trimmed and slightly-altered version of main.py

def pinecone_init(index_name: str = 'notion-database'):
    '''initialize the connection to pinecone (get an API key at app.pinecone.io)'''
    pinecone.init(
        api_key=os.environ["PINECONE_API_KEY"],
        environment="us-east1-gcp"
    )

    # the index should already exist, created by a previous run of main.py
    if index_name in pinecone.list_indexes():
        index = pinecone.Index(index_name)
        return index
    else:
        sys.exit(f"index {index_name} not found")

def get_docs(path: str = 'docs.json'):
    '''load the indexed docs from docs.json, the memory file written by main.py'''
    with open(path, 'r') as f:
        docs = json.load(f)
    return docs


def pinecone_query(query: str = "who are you", docs=None, index=None):
    '''embed the query and retrieve the three closest chunks from Pinecone'''
    # resolve the defaults lazily; calling get_docs()/pinecone_init() as default
    # values would run them once at import time, which is not what we want
    if docs is None:
        docs = get_docs()
    if index is None:
        index = pinecone_init()

    query_coord = OpenAIEmbeddings().embed_query(query)
    # retrieve from Pinecone
    query_res = index.query(query_coord, top_k=3, include_metadata=True)

    content_ids = [
        int(x['id']) for x in query_res['matches']
    ]
    contents = [docs[i] for i in content_ids]
    contents_str = "\n\n".join(contents)

    return contents_str


def ask_gpt3(query: str = "who are you", contents_str=None):
    '''feed the question and the retrieved contents to GPT-3 and return its answer'''
    if contents_str is None:
        contents_str = pinecone_query(query)

    prompt = PromptTemplate(
        input_variables=["question", "contents"],
        template=''' Answer this question: "{question}" using the contents below
    Contents:
    {contents}
    Answer:
    ''',
    )

    chain = LLMChain(
        llm=OpenAI(temperature=0),
        prompt=prompt,
        # verbose=True,
    )
    # print(prompt.format(question=query, contents=contents_str))  # for debugging purposes
    answer = chain.run(
        question=query,
        contents=contents_str,
    )
    return answer

def ans_cont_to_file(answer, contents_str):
    '''write the answer and the retrieved contents to text files'''
    with open("answer.txt", "w") as f:
        f.write(answer)
    with open("contents.txt", "w") as h:
        h.write(contents_str)

def main():
    try:
        query = sys.argv[1]
    except IndexError:
        query = input("ask a question: ")

    print("connecting to pinecone index...")
    index = pinecone_init("notion-database")
    print("getting docs")
    docs = get_docs()

    print("querying pinecone...")
    contents_str = pinecone_query(query, docs, index)
    print("querying gpt...")
    answer = ask_gpt3(query=query, contents_str=contents_str)

    # optional: write the answer and contents to text files
    ans_cont_to_file(answer, contents_str)

    print(f"done! the answer to '{query}' is: '{answer}'")


if __name__ == "__main__":
    main()
--------------------------------------------------------------------------------
/README.zh.md:
--------------------------------------------------------------------------------
[![en](https://img.shields.io/badge/lang-en-red.svg)](https://github.com/madeyexz/markdown-file-query/blob/main/README.md)
[![zh](https://img.shields.io/badge/lang-zh-blue.svg)](https://github.com/madeyexz/markdown-file-query/blob/main/README.zh.md)

## Overview
This project
- uses the [Pinecone](https://www.pinecone.io/) vector database and OpenAI's embedding model to turn text into vectors.
- works with any `.md` file, so it pairs perfectly with Notion and Obsidian (if you use Notion, you have to export your pages to `.md` manually).
- is a case study of the author applying the [Feynman technique](https://en.wikipedia.org/wiki/Learning_by_teaching).
- is probably a weaker clone of [llama_index](https://github.com/jerryjliu/llama_index#-dependencies); if you want a more polished document-query program, use llama_index instead.

### How It Works
1. Each `.md` file is split into many small chunks by `langchain.textsplitter`.
2. Each chunk is converted into a vector by OpenAI's embedding model (`langchain.embeddings.OpenAIEmbeddings`).
3. The vectors are then uploaded to the `Pinecone` vector database.
4. The question is also converted into a vector and sent to Pinecone.
5. The question vector is compared against the vectors in the database (by cosine similarity) to retrieve results.
6. The three most similar results are fed into GPT-3, which generates a natural-language answer.
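Below is a condensed sketch of steps 1-3, adapted from `main.py`. The file name `note.md` is made up for illustration, and the batched `embed_documents` call is a simplification (the real script embeds one chunk at a time and creates the index if it is missing):
``` python
import os
import pinecone
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings

pinecone.init(api_key=os.environ["PINECONE_API_KEY"], environment="us-east1-gcp")
index = pinecone.Index("notion-database")  # assumes the index was already created

# step 1: split one markdown file into chunks of at most 1000 characters
splitter = CharacterTextSplitter(chunk_size=1000, separator="\n")
with open("note.md") as f:  # illustrative file name
    chunks = splitter.split_text(f.read())

# step 2: embed every chunk (OpenAI's embedding model returns 1536-dimensional vectors)
vectors = OpenAIEmbeddings().embed_documents(chunks)

# step 3: upsert (id, vector) pairs; the string ids double as positions into the chunk list
index.upsert(list(zip([str(i) for i in range(len(chunks))], vectors)))
```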
### TODO
- [ ] add a `--help` option
- [ ] deploy to Streamlit
## Getting Started

### Prerequisites
1. Prepare your Pinecone and OpenAI API keys
   - the Pinecone API key can be obtained [here](https://app.pinecone.io/).
   - the OpenAI API key can be obtained [here](https://platform.openai.com/account/api-keys).
2. Export the Pinecone and OpenAI API keys to your environment
``` bash
export PINECONE_API_KEY="your_pinecone_api_key"
export OPENAI_API_KEY="your_openai_api_key"
```
Then, in Python, use
``` python
import os
os.environ["PINECONE_API_KEY"]
os.environ["OPENAI_API_KEY"]
```
to check that they are exported; if you get a `KeyError`, restart your terminal (and your IDE, if you are using one).

### Installation
1. Clone this repository to your local machine
```bash
git clone https://github.com/madeyexz/markdown-file-query.git
```
2. Install the dependencies
``` bash
pip install pinecone-client langchain openai tqdm
```

### Usage
1. Put your `.md` files in a folder; you will pass its path to `main.py` as the first argument. Note that the folder should be in the same directory as `main.py`.
2. If this is the first time you query a given set of documents, run `main.py`
``` bash
python3 main.py "PATH_OF_FOLDER" "QUESTION"
```
3. The answer and the reference contents GPT used to generate it are saved to `answer.txt` and `contents.txt` respectively.
4. To query the same batch of documents again, run `query_only.py` to avoid re-embedding them.
``` bash
python3 query_only.py "QUESTION"
```

### Example
1. I have a folder called `markdown_database` that contains a bunch of `.md` files, and I want to query this database with the question "what's the strange situation".
``` bash
❯ python3 main.py "markdown_database" "what's the strange situation"
```
```text
initiating pinecone index...
digesting docs...
uploading data to pinecone...
 92%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████          | 60/65 [00:29<00:02,  1.87it/s]
let's wait for 60 seconds to avoid RateLimitError... (since I'm not a paid user)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [01:00<00:00,  1.00s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 65/65 [01:32<00:00,  1.42s/it]
querying pinecone...
querying gpt...
writing results to answer.txt and contents.txt
done! the answer to 'what's the strange situation' is: '
The Strange Situation is a standardized procedure devised by Mary Ainsworth in the 1970s to observe attachment security in children within the context of caregiver relationships. It applies to infants between the age of nine and 18 months and involves a series of eight episodes lasting approximately 3 minutes each, whereby a mother, child and stranger are introduced, separated and reunited. The procedure is used to observe the quality of a young child’s attachment to his or her mother, and can also be applied to other attachment figures, such as God, through the use of Emotionally Focused Therapy (EFT) and religious beliefs, such as the saying “there are no atheists in foxholes”.'
```
2. To query the same batch of documents again, I can use `query_only.py` to avoid re-embedding them.
``` bash
❯ python3 query_only.py "Who is Mary Ainsworth?"
```
``` text
connecting to pinecone index...
getting docs
querying pinecone...
querying gpt...
done! the answer to 'Who is Mary Ainsworth?' is: '
Mary Ainsworth was a developmental psychologist who devised the Strange Situation in the 1970s to observe attachment security in children within the context of caregiver relationships. The Strange Situation involves a series of eight episodes lasting approximately 3 minutes each, whereby a mother, child and stranger are introduced, separated and reunited. Ainsworth is also known for her observation that if you want to see the quality of a young child’s attachment to his or her mother, watch what the child does, not when Mother leaves, but when she returns. She is also known for her research on anxious babies and their inability to use their mothers as a secure base.'
```
## Known Issues
1. With Pinecone, whenever you want to query a new set of documents (i.e. create a new database), you should create a new Pinecone index (since you don't want answers drawn from the old documents) or delete the old index. This is because Pinecone does not support updating an index (yet).

To delete the old index:
``` bash
python3 delete_pinecone_index.py NAME_OF_INDEX
```
## Acknowledgements
Huge thanks to the open-source community for providing straightforward examples and comprehensive tutorials!
- [openai-cookbook: using vector database for embeddings search](https://github.com/openai/openai-cookbook/blob/main/examples/vector_databases/Using_vector_databases_for_embeddings_search.ipynb)
- [Build a Personal Search Engine Web App using Open AI Text Embeddings - Avra](https://medium.com/@avra42/build-a-personal-search-engine-web-app-using-open-ai-text-embeddings-d6541f32892d)
- this project is heavily inspired by [hwchase17/notion-qa](https://github.com/hwchase17/notion-qa)
- [Langchain](https://python.langchain.com/en/latest), a Python library for manipulating LLMs elegantly.
--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
import os  # to get system environment variables
import sys  # to parse system arguments
import json  # to dump and load data (lists) elegantly
import pinecone  # vector database service, core of the VDB query
from pathlib import Path  # used to manipulate file paths elegantly
from tqdm.auto import tqdm  # to show progress bars
import time  # to avoid RateLimitError
from langchain.text_splitter import CharacterTextSplitter  # to split texts
from langchain.prompts import PromptTemplate  # makes querying easier
from langchain.llms import OpenAI  # to query LLMs
from langchain.chains import LLMChain  # makes querying easier
from langchain.embeddings import OpenAIEmbeddings  # to turn texts into vectors

def pinecone_init(index_name: str = 'notion-database'):
    '''initialize the connection to pinecone (get an API key at app.pinecone.io)'''
    pinecone.init(
        api_key=os.environ["PINECONE_API_KEY"],
        environment="us-east1-gcp"
    )

    # check if the index already exists (it shouldn't if this is the first run)
    if index_name not in pinecone.list_indexes():
        # if it does not exist, create it
        pinecone.create_index(
            index_name,
            dimension=1536,  # the output dimension of OpenAI's embedding model
            metric='cosine',
            # metric='euclidean',
            metadata_config={'indexed': ['channel_id', 'published']}  # leftover config; not used by this project
        )
    # connect to the index
    index = pinecone.Index(index_name)
    # view the index status with index.describe_index_stats()
    return index

# an earlier error was solved by retrying and upgrading jupyter notebook with `pip install notebook --upgrade`

def md_digest(ps: list = None):
    '''This is the logic for ingesting Notion data into LangChain.'''
    # the default argument is resolved lazily so the glob doesn't run at import time
    if ps is None:
        ps = list(Path("Notion_DB/").glob("**/*.md"))

    # Here we load in the data in the format that Notion exports it in.
    data = []
    sources = []
    for p in ps:
        with open(p) as f:
            data.append(f.read())
        sources.append(p)

    # We split the texts due to the context limits of the LLMs:
    # each chunk will be at most 1000 characters long, split on newlines.
    text_splitter = CharacterTextSplitter(chunk_size=1000, separator="\n")
    docs = []
    metadatas = []
    for i, d in enumerate(data):
        # where i, d are the index and content of each .md file respectively
        splits = text_splitter.split_text(d)
        docs.extend(splits)
        metadatas.extend([{"source": sources[i]}] * len(splits))
    # note: metadatas is collected here but not currently uploaded to Pinecone

    # after digestion, save the docs to a local json file so later queries can avoid re-encoding
    with open('docs.json', 'w') as f:
        json.dump(docs, f)

    return docs
    # question: will the data be too big/unspecific for each chunk?
    # len(docs) is the number of vectors this is going to create

def pinecone_upload(docs: list = None, index=None):
    '''This is the logic for uploading the data into Pinecone.'''
    # resolve the defaults lazily; calling md_digest()/pinecone_init() as default
    # values would run a full ingestion at import time
    if docs is None:
        docs = md_digest()
    if index is None:
        index = pinecone_init()

    id_batch = [str(x) for x in range(0, len(docs))]
    coord_list = []

    for i in tqdm(range(0, len(docs))):
        # pause every 60 embeddings to avoid RateLimitError; 60 seconds is an
        # arbitrary but conservative number. a stupid approach by me :D
        rest = 60
        if i != 0 and i % 60 == 0:
            print(f"let's wait for {rest} seconds to avoid RateLimitError... (since I'm not a paid user)")
            for _ in tqdm(range(0, rest)):  # use a throwaway variable so the outer i isn't clobbered
                time.sleep(1)

        # get the text to encode
        texts = docs[i]
        coord = OpenAIEmbeddings().embed_query(texts)
        coord_list.append(coord)

    # prepare and upload the vectors to Pinecone
    vectors = list(zip(id_batch, coord_list))
    index.upsert(vectors)


def pinecone_query(query: str = "who are you", docs=None, index=None):
    '''embed the query and retrieve the three closest chunks from Pinecone'''
    if docs is None:
        docs = md_digest()
    if index is None:
        index = pinecone_init()

    query_coord = OpenAIEmbeddings().embed_query(query)
    # retrieve the relevant contexts from Pinecone
    query_res = index.query(query_coord, top_k=3, include_metadata=True)

    content_ids = [
        int(x['id']) for x in query_res['matches']
    ]
    contents = [docs[i] for i in content_ids]
    contents_str = "\n\n".join(contents)

    return contents_str


def ask_gpt3(query: str = "who are you", contents_str=None):
    '''feed the question and the retrieved contents to GPT-3 and return its answer'''
    if contents_str is None:
        contents_str = pinecone_query(query)

    prompt = PromptTemplate(
        input_variables=["question", "contents"],
        template=''' Answer this question: "{question}" using the contents below
    Contents:
    {contents}
    Answer:
    ''',
    )

    chain = LLMChain(
        llm=OpenAI(temperature=0),
        prompt=prompt,
        # verbose=True,
    )

    answer = chain.run(question=query, contents=contents_str)
    return answer

def ans_cont_to_file(answer, contents_str):
    '''write the answer and the retrieved contents to text files'''
    with open("answer.txt", "w") as f:
        f.write(answer)
    with open("contents.txt", "w") as h:
        h.write(contents_str)

def main():
    if len(sys.argv) < 3:
        sys.exit('usage: python3 main.py "PATH_OF_FOLDER" "QUESTION"')
    directory, query = sys.argv[1], sys.argv[2]

    print("initiating pinecone index...")
    index = pinecone_init("notion-database")

    print("digesting docs...")
    docs = md_digest(list(Path(directory).glob("**/*.md")))

    print("uploading data to pinecone...")
    pinecone_upload(docs, index)

    print("querying pinecone...")
    contents_str = pinecone_query(query, docs, index)

    print("querying gpt...")
    answer = ask_gpt3(query=query, contents_str=contents_str)

    # optional: write the answer and contents to text files
    print("writing results to answer.txt and contents.txt")
    ans_cont_to_file(answer, contents_str)

    print(f"done! the answer to '{query}' is: '{answer}'")

if __name__ == "__main__":
    main()
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
[![en](https://img.shields.io/badge/lang-en-red.svg)](https://github.com/madeyexz/markdown-file-query/blob/main/README.md)
[![zh](https://img.shields.io/badge/lang-zh-blue.svg)](https://github.com/madeyexz/markdown-file-query/blob/main/README.zh.md)

> *This project currently works best with English documents.*

## About This Project
This project
- utilizes the [Pinecone](https://www.pinecone.io/) vector database (VDB) and OpenAI's embedding model to turn text into vectors.
- works with any `.md` file, so it works perfectly with Notion & Obsidian (though for Notion you have to export your pages to `.md` manually first).
- is the author's practice of the [Feynman technique](https://en.wikipedia.org/wiki/Learning_by_teaching).
- is probably a weaker duplicate of [privateGPT](https://github.com/imartinez/privateGPT) and [llama_index](https://github.com/jerryjliu/llama_index#-dependencies); if you want a beautifully crafted document-query program, use llama_index instead of this toy.

### Walkthrough of this Program
1. Each markdown file in the target directory is cut into lots of small chunks using `langchain.textsplitter`.
2. Each chunk is turned into a vector via OpenAI's embedding model (`langchain.embeddings.OpenAIEmbeddings`).
3. The vectors are then uploaded to the `Pinecone` vector database.
4. Queries are also converted to vectors using the same embedding model and sent to Pinecone.
5. To retrieve search results, Pinecone compares the query vector with the vectors in the database (by cosine similarity).
6. The three closest results are retrieved and fed into GPT-3 along with the question, and GPT-3 generates an answer in natural language.
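Below is a condensed sketch of steps 4-6, adapted from `query_only.py`. It assumes a prior run of `main.py` has already populated the index and written `docs.json`:
``` python
import os, json
import pinecone
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

pinecone.init(api_key=os.environ["PINECONE_API_KEY"], environment="us-east1-gcp")
index = pinecone.Index("notion-database")

# step 4: embed the question with the same model used for the documents
question = "what's the strange situation"
query_vector = OpenAIEmbeddings().embed_query(question)

# step 5: cosine-similarity search, keeping the three closest chunks;
# the vector ids are positions into the chunk list saved in docs.json
matches = index.query(query_vector, top_k=3)["matches"]
with open("docs.json") as f:
    docs = json.load(f)
contents = "\n\n".join(docs[int(m["id"])] for m in matches)

# step 6: stitch the chunks into a prompt and let GPT-3 answer
prompt = PromptTemplate(
    input_variables=["question", "contents"],
    template='Answer this question: "{question}" using the contents below\n{contents}\nAnswer:',
)
print(LLMChain(llm=OpenAI(temperature=0), prompt=prompt).run(question=question, contents=contents))
```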
### TODO
- [ ] add a `--help` option
- [ ] deploy to Streamlit
## Getting Started

### Prerequisites
1. Prepare your Pinecone and OpenAI API keys:
   - the Pinecone API key can be obtained [here](https://app.pinecone.io/).
   - the OpenAI API key can be obtained [here](https://platform.openai.com/account/api-keys).
2. Export the Pinecone and OpenAI API keys to your environment
``` bash
export PINECONE_API_KEY="your_pinecone_api_key"
export OPENAI_API_KEY="your_openai_api_key"
```
Now, in Python, use
``` python
import os
os.environ["PINECONE_API_KEY"]
os.environ["OPENAI_API_KEY"]
```
to check that they are exported; if you get a `KeyError`, restart your terminal (and your IDE, if you are using one).
### Installation
1. Clone this repo to your local machine
```bash
git clone https://github.com/madeyexz/markdown-file-query.git
```
2. Install the dependencies
``` bash
pip install pinecone-client langchain openai tqdm
```
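Note that the code in this repo targets the pre-1.0 interfaces of these libraries (`pinecone.init`, `langchain.embeddings`, the pre-1.0 `openai` client), so a plain install may pull newer, incompatible releases. Pinning older versions should help; the exact ranges below are an assumption rather than tested bounds:
``` bash
# assumed version pins matching the old-style APIs used by main.py and query_only.py
pip install "pinecone-client<3" "langchain<0.1" "openai<1" tqdm
```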
### Usage
1. Prepare your markdown file(s) and put them in a folder; you will pass its path to `main.py` as the first argument. Note that the folder should be in the same directory as `main.py`.
2. If this is your first time querying a certain set of documents, run the `main.py` program
``` bash
python3 main.py "PATH_OF_FOLDER" "QUESTION"
```
3. The answer and the reference contents GPT used to generate it will be saved in `answer.txt` and `contents.txt` respectively.
4. If you want to query the same batch of documents again, run `query_only.py` to avoid re-embedding the documents.
``` bash
python3 query_only.py "QUESTION"
```

### Example
1. I have a folder called `markdown_database` which contains a bunch of `.md` files, and I want to query this database with the question "what's the strange situation".
``` bash
❯ python3 main.py "markdown_database" "what's the strange situation"
```
```text
initiating pinecone index...
digesting docs...
uploading data to pinecone...
 92%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████          | 60/65 [00:29<00:02,  1.87it/s]
let's wait for 60 seconds to avoid RateLimitError... (since I'm not a paid user)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [01:00<00:00,  1.00s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 65/65 [01:32<00:00,  1.42s/it]
querying pinecone...
querying gpt...
writing results to answer.txt and contents.txt
done! the answer to 'what's the strange situation' is: '
The Strange Situation is a standardized procedure devised by Mary Ainsworth in the 1970s to observe attachment security in children within the context of caregiver relationships. It applies to infants between the age of nine and 18 months and involves a series of eight episodes lasting approximately 3 minutes each, whereby a mother, child and stranger are introduced, separated and reunited. The procedure is used to observe the quality of a young child’s attachment to his or her mother, and can also be applied to other attachment figures, such as God, through the use of Emotionally Focused Therapy (EFT) and religious beliefs, such as the saying “there are no atheists in foxholes”.'
```
2. If I want to query the same database again, I can use `query_only.py` to avoid re-embedding the documents.
``` bash
❯ python3 query_only.py "Who is Mary Ainsworth?"
```
``` text
connecting to pinecone index...
getting docs
querying pinecone...
querying gpt...
done! the answer to 'Who is Mary Ainsworth?' is: '
Mary Ainsworth was a developmental psychologist who devised the Strange Situation in the 1970s to observe attachment security in children within the context of caregiver relationships. The Strange Situation involves a series of eight episodes lasting approximately 3 minutes each, whereby a mother, child and stranger are introduced, separated and reunited. Ainsworth is also known for her observation that if you want to see the quality of a young child’s attachment to his or her mother, watch what the child does, not when Mother leaves, but when she returns. She is also known for her research on anxious babies and their inability to use their mothers as a secure base.'
```
## Known Limitations
1. If you use Pinecone, then whenever you want to query a new set of documents (i.e. create a new database), you should create a new Pinecone index (since you don't want answers drawn from the old documents) or delete the old index. This is because Pinecone does not support updating the index (yet).

To delete the old index:
``` bash
python3 delete_pinecone_index.py NAME_OF_INDEX
```
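If you'd rather reset an index from Python, here is a minimal sketch using the same client calls as `main.py`; the helper name `reset_index` is made up for illustration:
``` python
import os
import pinecone

def reset_index(index_name: str = "notion-database"):
    """Delete the index if it exists, then recreate it empty."""
    pinecone.init(api_key=os.environ["PINECONE_API_KEY"], environment="us-east1-gcp")
    if index_name in pinecone.list_indexes():
        pinecone.delete_index(index_name)
    # 1536 dimensions to match the OpenAI embedding vectors used by main.py
    pinecone.create_index(index_name, dimension=1536, metric="cosine")
```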
## Acknowledgements
Huge shout-out to the open-source community for providing straightforward examples and comprehensive tutorials!
- [openai-cookbook: using vector database for embeddings search](https://github.com/openai/openai-cookbook/blob/main/examples/vector_databases/Using_vector_databases_for_embeddings_search.ipynb)
- [Build a Personal Search Engine Web App using Open AI Text Embeddings - Avra](https://medium.com/@avra42/build-a-personal-search-engine-web-app-using-open-ai-text-embeddings-d6541f32892d)
- this project is heavily inspired by [hwchase17/notion-qa](https://github.com/hwchase17/notion-qa)
- [Langchain](https://python.langchain.com/en/latest), a Python library for manipulating LLMs elegantly.
--------------------------------------------------------------------------------