├── .gitignore ├── README.md ├── chat ├── .env ├── Bens_Bites_Logo.jpg ├── logo.png └── main.py ├── data_sample ├── d0.mpst_1k_raw.csv └── other │ └── wiki_movie_plots_small.csv ├── notes.txt ├── poetry.lock ├── pyproject.toml ├── requirements.txt └── src ├── check_results.py ├── other └── generate_index_wikipedia_movies.py ├── p1.generate_index_mpst.py ├── p2.make_jsonl_for_requests_mpst.py ├── p3.api_request_parallel_processor.py ├── p4.convert_jsonl_with_embeddings_to_csv.py └── p5.upload_to_pinecone.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Created by https://www.toptal.com/developers/gitignore/api/macos 2 | # Edit at https://www.toptal.com/developers/gitignore?templates=macos 3 | 4 | 5 | chat/*.env 6 | .env 7 | 8 | settings.json 9 | 10 | data_sample/d* 11 | 12 | ### macOS ### 13 | # General 14 | .DS_Store 15 | .AppleDouble 16 | .LSOverride 17 | data/ 18 | chat/main_local.py 19 | 20 | # Icon must end with two \r 21 | Icon 22 | 23 | 24 | # Thumbnails 25 | ._* 26 | 27 | # Files that might appear in the root of a volume 28 | .DocumentRevisions-V100 29 | .fseventsd 30 | .Spotlight-V100 31 | .TemporaryItems 32 | .Trashes 33 | .VolumeIcon.icns 34 | .com.apple.timemachine.donotpresent 35 | 36 | # Directories potentially created on remote AFP share 37 | .AppleDB 38 | .AppleDesktop 39 | Network Trash Folder 40 | Temporary Items 41 | .apdisk 42 | 43 | ### macOS Patch ### 44 | # iCloud generated files 45 | *.icloud 46 | 47 | # End of https://www.toptal.com/developers/gitignore/api/macos 48 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # GPTflix source code for deployment on Streamlit 2 | 3 | ## What are we going to build? 4 | 5 | 6 | This is the source code of www.gptflix.ai 7 | 8 | We will build a GPTflix QA bot with OpenAI, Pinecone DB and Streamlit. You will learn how to prepare text to send to an embedding model. You will capture the embeddings and text returned from the model for upload to Pinecone DB. Afterwards you will set up a Pinecone DB index and upload the OpenAI embeddings to the DB so the bot can search over them. 9 | 10 | Finally, we will set up a QA bot frontend chat app with Streamlit. When the user asks the bot a question, the bot will search over the movie text in your Pinecone DB. It will answer your question about a movie based on text from the DB. 11 | 12 | <br>
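At a high level, answering a question is plain retrieval-augmented QA. Here is a simplified sketch of the loop (the `ask_chat_model` helper and `header` prompt are illustrative placeholders; the real implementation lives in `chat/main.py`):

[//]: #

    # 1. embed the user's question
    xq = get_embedding(question, "text-embedding-ada-002")
    # 2. fetch the most similar movie texts from Pinecone
    res = pineconeindex.query([xq], top_k=30, include_metadata=True, namespace="movies")
    context = "".join("\n" + m["metadata"]["text"] for m in res["matches"])
    # 3. prepend the context to the question and ask the chat model
    answer = ask_chat_model(header + context + "\n\n Q: " + question + "\n A:")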
13 | 14 | ## What is the point? 15 | 16 | This is meant as basic scaffolding to build your own knowledge-retrieval systems. It's super basic for now! 17 | 18 | This repo contains the GPTflix source code and a Streamlit deployment guide. 19 | 20 | <br>
21 | 22 | ## Setup prerequisites 23 | 24 | This repo is set up for deployment on Streamlit, so you will want to set your environment variables in Streamlit like this: 25 | 26 | 1. Fork the [GPTflix](https://github.com/stephansturges/GPTflix/fork) repo to your GitHub account. 27 | 28 | 2. Set up an account on [Pinecone.io](https://app.pinecone.io/) 29 | 30 | 3. Set up an account on [Streamlit Cloud](https://share.streamlit.io/signup) 31 | 32 | 4. Create a new app on Streamlit. Link it to your fork of the repo on GitHub, then point the app to `/chat/main.py` as the main executable. 33 | 34 | 5. Go to your app settings, and navigate to Secrets. Set up the secret like this: 36 | 37 | [//]: # 38 | 39 | [API_KEYS] 40 | pinecone = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx" 41 | openai = "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" 42 | 6. Make a `.env` file in the root of the project with your Pinecone and OpenAI API keys on your local machine. 44 | 45 | [//]: # 46 | 47 | PINECONE_API_KEY=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx 48 | OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 49 | 50 | 51 | Those need to be your Pinecone and OpenAI API keys of course ;) 52 | 53 | <br>
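For reference, the Streamlit app reads the keys from `st.secrets`, while the local `src/` scripts load them from the `.env` file via `python-dotenv`. A minimal sketch of both, matching what the code in this repo does (assuming the usual `streamlit`, `openai` and `os` imports):

[//]: #

    # chat/main.py, on Streamlit Cloud
    pinecone_api_key = st.secrets["API_KEYS"]["pinecone"]
    openai.api_key = st.secrets["API_KEYS"]["openai"]

    # src/ scripts, locally
    from dotenv import load_dotenv
    load_dotenv()
    pinecone_api_key = os.environ.get("PINECONE_API_KEY")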
54 | 55 | ## How to add data? 56 | This repo is set up to walk through a demo using the MPST data in `/data_sample`. 57 | These are the steps: 58 | 59 | 1. Run `p1.generate_index_mpst.py` to prepare the text from `./data_sample/d0.mpst_1k_raw.csv` into a format we can inject into a model and get its embedding. 60 | 61 | [//]: # 62 | 63 | python src/p1.generate_index_mpst.py 64 | 65 | 2. Run `p2.make_jsonl_for_requests_mpst.py` to convert your new `d1.mpst_1k_converted.csv` file to a jsonl file with instructions to run the embeddings requests against the OpenAI API. 66 | 67 | [//]: # 68 | 69 | python src/p2.make_jsonl_for_requests_mpst.py 70 | 71 | 3. Run `p3.api_request_parallel_processor.py` on the JSONL file from (2) to get embeddings. 72 | 73 | [//]: # 74 | 75 | python src/p3.api_request_parallel_processor.py \ 76 | --requests_filepath data_sample/d2.embeddings_maker.jsonl \ 77 | --save_filepath data_sample/d3.embeddings_maker_results.jsonl \ 78 | --request_url https://api.openai.com/v1/embeddings \ 79 | --max_requests_per_minute 1500 \ 80 | --max_tokens_per_minute 6250000 \ 81 | --token_encoding_name cl100k_base \ 82 | --max_attempts 5 \ 83 | --logging_level 20 84 | 85 | 4. Run `p4.convert_jsonl_with_embeddings_to_csv.py` with the new jsonl file to make a pretty CSV with the text and embeddings. 86 | ~~This is cosmetic and a bit of a waste of time in the process, feel free to clean it up.~~ -> actually that's not quite true: you can skip the CSV **if you are only going to upload data to the index once**, because then you don't need to keep track of the index of each embedding. If you are going to update the index and add more data later, or need an offline / readable format to keep track of things, then making the CSV kinda makes sense :) 87 | 88 | [//]: # 89 | 90 | python src/p4.convert_jsonl_with_embeddings_to_csv.py 91 | 92 | 5. Run `p5.upload_to_pinecone.py` with your API key and database settings to upload all that text data and embeddings. 93 | 94 | [//]: # 95 | 96 | python src/p5.upload_to_pinecone.py 97 | 98 | You can run the app locally, but you'll need to remove the images (the paths are different on Streamlit Cloud). 99 | 100 | <br>
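To sanity-check the intermediate files: each line of `d2.embeddings_maker.jsonl` is one embedding request, and each line of `d3.embeddings_maker_results.jsonl` is a `[request, response]` pair in which the embedding is a list of 1536 floats. Trimmed, hypothetical examples of one line of each (real lines are much longer):

[//]: #

    {"model": "text-embedding-ada-002", "input": "Title: ... tags: ... Plot / story / about: ..."}
    [{"model": "text-embedding-ada-002", "input": "Title: ..."}, {"object": "list", "data": [{"embedding": [0.0123, -0.0456, ...]}], "model": "text-embedding-ada-002"}]

You can also run `src/check_results.py` after step (3) to count how many requests errored out before you convert to CSV.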
101 | 102 | ## What is included? 103 | 104 | At the moment there is some sample data in `data_sample`, all taken from Kaggle as examples. 105 | 106 | <br>
107 | 108 | ## To do 109 | 110 | - [ ] Add memory: summarize previous questions / answers and prepend to prompt
111 | - [ ] Add different modes: wider search in database
112 | - [ ] Add different modes: AI tones / characters for responses
113 | - [ ] Better docs
114 | 115 | 116 | BETTER DOCS COMING SOON! Feel free to contribute them :) 117 | 118 | # LICENSE 119 | 120 | MIT License 121 | 122 | Copyright (c) 2023 Stephan Sturges 123 | 124 | Permission is hereby granted, free of charge, to any person obtaining a copy 125 | of this software and associated documentation files (the "Software"), to deal 126 | in the Software without restriction, including without limitation the rights 127 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 128 | copies of the Software, and to permit persons to whom the Software is 129 | furnished to do so, subject to the following conditions: 130 | 131 | The above copyright notice and this permission notice shall be included in all 132 | copies or substantial portions of the Software. 133 | 134 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 135 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 136 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 137 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 138 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 139 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 140 | SOFTWARE. 141 | -------------------------------------------------------------------------------- /chat/.env: -------------------------------------------------------------------------------- 1 | OPENAI_API_KEY= 2 | -------------------------------------------------------------------------------- /chat/Bens_Bites_Logo.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/stephansturges/GPTflix/6f2db9a42acf1fb025eff3e7fdfea435686f620b/chat/Bens_Bites_Logo.jpg -------------------------------------------------------------------------------- /chat/logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/stephansturges/GPTflix/6f2db9a42acf1fb025eff3e7fdfea435686f620b/chat/logo.png -------------------------------------------------------------------------------- /chat/main.py: -------------------------------------------------------------------------------- 1 | import openai 2 | import tiktoken 3 | import numpy as np 4 | import os 5 | import streamlit as st 6 | import json 7 | from streamlit_chat import message 8 | import pinecone 9 | import random 10 | 11 | from PIL import Image 12 | 13 | pinecone_api_key = st.secrets["API_KEYS"]["pinecone"] 14 | 15 | pinecone.init(api_key=pinecone_api_key, environment="us-east1-gcp") 16 | 17 | openai.api_key = st.secrets["API_KEYS"]["openai"] 18 | 19 | 20 | #gptflix_logo = Image.open('./chat/logo.png') 21 | 22 | bens_bites_logo = Image.open('./chat/Bens_Bites_Logo.jpg') 23 | 24 | # random user avatar picture 25 | user_av = random.randint(0, 100) 26 | 27 | # random bot avatar picture 28 | bott_av = random.randint(0, 100) 29 | 30 | def randomize_array(arr): 31 | sampled_arr = [] 32 | while arr: 33 | elem = random.choice(arr) 34 | sampled_arr.append(elem) 35 | arr.remove(elem) 36 | return sampled_arr 37 | 38 | st.set_page_config(page_title="GPTflix", page_icon="🍿", layout="wide") 39 | 40 | st.header("GPTflix is like chatGPT for movie reviews!🍿\n") 41 | 42 | 43 | # st.header("Thanks for visiting GPTflix! It's been a fun experiment, with over 4000 unique users over four weeks and an average of 10 questions per user while the site was online!
Perhaps we will be back some time...🍿\n") 44 | 45 | # Define the name of the index and the dimensionality of the embeddings 46 | index_name = "400kmovies" 47 | dimension = 1536 48 | 49 | pineconeindex = pinecone.Index(index_name) 50 | 51 | 52 | ###################################### 53 | ####### 54 | ####### OPEN AI SETTINGS !!! 55 | ####### 56 | ####### 57 | ###################################### 58 | 59 | 60 | #COMPLETIONS_MODEL = "text-davinci-003" 61 | COMPLETIONS_MODEL = "gpt-3.5-turbo" 62 | EMBEDDING_MODEL = "text-embedding-ada-002" 63 | 64 | COMPLETIONS_API_PARAMS = { 65 | # We use temperature of 0.0 because it gives the most predictable, factual answer. 66 | "temperature": 0.0, 67 | "max_tokens": 400, 68 | "model": COMPLETIONS_MODEL, 69 | } 70 | 71 | 72 | feedback_url = "https://forms.gle/YMTtGK1zXdCRzRaj6" 73 | bb_url = "https://www.bensbites.co/?utm_source=gptflix" 74 | tech_url = "https://news.ycombinator.com/item?id=34802625" 75 | github_url = "https://github.com/stephansturges/GPTflix" 76 | 77 | with st.sidebar: 78 | st.markdown("# About 🙌") 79 | st.markdown( 80 | "GPTflix allows you to talk to a version of chatGPT \n" 81 | "that has access to reviews of about 10 000 movies! 🎬 \n" 82 | "Holy smokes, chatGPT and 10x cheaper??! We are BACK! 😝\n" 83 | ) 84 | st.markdown( 85 | "Unlike chatGPT, GPTflix can't make stuff up\n" 86 | "and will only answer from injected knowledge 👩‍🏫 \n" 87 | ) 88 | st.markdown("---") 89 | st.markdown("A side project by Stephan Sturges") 90 | st.markdown("Kept online by [Ben's Bites](%s)!" %bb_url) 91 | st.image(bens_bites_logo, width=60) 92 | 93 | st.markdown("---") 94 | st.markdown("Tech [info](%s) for you nerds out there!" %tech_url) 95 | st.markdown("Give feedback [here](%s)" %feedback_url) 96 | st.markdown("---") 97 | st.markdown("Code open-sourced [here](%s)" %github_url) 98 | st.markdown("---") 99 | 100 | 101 | # MAIN FUNCTIONS 102 | 103 | 104 | 105 | 106 | def num_tokens_from_string(string, encoding_name): 107 | """Returns the number of tokens in a text string.""" 108 | encoding = tiktoken.get_encoding(encoding_name) 109 | num_tokens = len(encoding.encode(string)) 110 | return num_tokens 111 | 112 | 113 | 114 | def get_embedding(text, model): 115 | result = openai.Embedding.create( 116 | model=model, 117 | input=text 118 | ) 119 | return result["data"][0]["embedding"] 120 | 121 | 122 | 123 | MAX_SECTION_LEN = 2500 # in tokens 124 | SEPARATOR = "\n" 125 | ENCODING = "cl100k_base" # encoding for text-embedding-ada-002 126 | 127 | encoding = tiktoken.get_encoding(ENCODING) 128 | separator_len = len(encoding.encode(SEPARATOR)) 129 | 130 | 131 | 132 | def construct_prompt_pinecone(question): 133 | """ 134 | Fetch relevant information from pinecone DB 135 | """ 136 | xq = get_embedding(question, EMBEDDING_MODEL) 137 | 138 | #print(xq) 139 | 140 | res = pineconeindex.query([xq], top_k=30, include_metadata=True, namespace="movies") 141 | 142 | #print(res) 143 | # print(most_relevant_document_sections[:2]) 144 | 145 | chosen_sections = [] 146 | chosen_sections_length = 0 147 | 148 | for match in res['matches'][:12]: 149 | #print(f"{match['score']:.2f}: {match['metadata']['text']}") 150 | if chosen_sections_length <= MAX_SECTION_LEN: 151 | document_section = match['metadata']['text'] 152 | 153 | # document_section = str(_[0] + _[1]) 154 | chosen_sections.append(SEPARATOR + document_section) 155 | 156 | chosen_sections_length += num_tokens_from_string(str(document_section), ENCODING) 157 | 158 | for match in randomize_array(res['matches'][-18:]): 159 |
#print(f"{match['score']:.2f}: {match['metadata']['text']}") 160 | if chosen_sections_length <= MAX_SECTION_LEN: 161 | document_section = match['metadata']['text'] 162 | 163 | # document_section = str(_[0] + _[1]) 164 | chosen_sections.append(SEPARATOR + document_section) 165 | 166 | chosen_sections_length += num_tokens_from_string(str(document_section), ENCODING) 167 | 168 | 169 | # Useful diagnostic information 170 | #print(f"Selected {len(chosen_sections)} document sections:") 171 | 172 | header = """Answer the question as truthfully as possible using the provided context, 173 | and if the answer is not contained within the text below, say "I don't know." 174 | Answer in a very sarcastic tone and make it fun! Surprise the user with your answers. You can give long answers tangentially related to the movie.\n 175 | You are GPTflix, an AI movie-buff that loves talking about movies!\n 176 | Context:\n 177 | """ 178 | return header + "".join(chosen_sections) 179 | 180 | 181 | 182 | #TO BE ADDED: memory with summary of past discussions 183 | 184 | def summarize_past_conversation(content): 185 | 186 | APPEND_COMPLETION_PARAMS = { 187 | "temperature": 0.0, 188 | "max_tokens": 300, 189 | "model": COMPLETIONS_MODEL, 190 | } 191 | 192 | prompt = "Summarize this discussion into a single paragraph keeping the titles of any movies mentioned: \n" + content 193 | 194 | try: 195 | response = openai.Completion.create( 196 | prompt=prompt, 197 | **APPEND_COMPLETION_PARAMS 198 | ) 199 | except Exception as e: 200 | print("I'm afraid your question failed! This is the error: ") 201 | print(e) 202 | return None 203 | 204 | choices = response.get("choices", []) 205 | if len(choices) > 0: 206 | return choices[0]["text"].strip(" \n") 207 | else: 208 | return None 209 | 210 | 211 | 212 | 213 | 214 | COMPLETIONS_API_PARAMS = { 215 | "temperature": 0.0, 216 | "max_tokens": 500, 217 | "model": COMPLETIONS_MODEL, 218 | } 219 | 220 | 221 | def answer_query_with_context_pinecone(query): 222 | prompt = construct_prompt_pinecone(query) + "\n\n Q: " + query + "\n A:" 223 | 224 | print("---------------------------------------------") 225 | print("prompt:") 226 | print(prompt) 227 | print("---------------------------------------------") 228 | try: 229 | response = openai.ChatCompletion.create( 230 | messages=[{"role": "system", "content": "You are a helpful AI who loves movies."}, 231 | {"role": "user", "content": str(prompt)}], 232 | # {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."}, 233 | # {"role": "user", "content": "Where was it played?"} 234 | # ] 235 | **COMPLETIONS_API_PARAMS 236 | ) 237 | except Exception as e: 238 | print("I'm afraid your question failed! This is the error: ") 239 | print(e) 240 | return None 241 | 242 | choices = response.get("choices", []) 243 | if len(choices) > 0: 244 | return choices[0]["message"]["content"].strip(" \n") 245 | else: 246 | return None 247 | 248 | 249 | 250 | # Storing the chat 251 | if 'generated' not in st.session_state: 252 | st.session_state['generated'] = [] 253 | 254 | if 'past' not in st.session_state: 255 | st.session_state['past'] = [] 256 | 257 | def clear_text(): 258 | st.session_state["input"] = "" 259 | 260 | # We will get the user's input by calling the get_text function 261 | def get_text(): 262 | input_text = st.text_input("Input a question here! For example: \"Is X movie good?\". \n It works best if your question contains the title of a movie!
You might want to be really specific, like talking about Pixar's Brave rather than just Brave. Also, I have no memory of previous questions!😅😊","Who are you?", key="input") 263 | return input_text 264 | 265 | 266 | 267 | user_input = get_text() 268 | 269 | 270 | if user_input: 271 | output = answer_query_with_context_pinecone(user_input) 272 | 273 | # store the output 274 | st.session_state.past.append(user_input) 275 | st.session_state.generated.append(output) 276 | 277 | 278 | if st.session_state['generated']: 279 | for i in range(len(st.session_state['generated'])-1, -1, -1): 280 | message(st.session_state["generated"][i],seed=bott_av , key=str(i)) 281 | message(st.session_state['past'][i], is_user=True,avatar_style="adventurer",seed=user_av, key=str(i) + '_user') 282 | 283 | 284 | -------------------------------------------------------------------------------- /notes.txt: -------------------------------------------------------------------------------- 1 | 2 | # How to curl your index and fetch a record for index id 1 3 | curl --request GET \ 4 | --url 'https://YOUR-PINECONE-INDEX-NAME.svc.us-east1-gcp.pinecone.io/vectors/fetch?ids=1&namespace=movies' \ 5 | --header 'Api-Key: YOUR-PINECONE-API-KEY' \ 6 | --header 'accept: application/json' -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [tool.poetry] 2 | name = "DKVgpt" 3 | version = "0.1.0" 4 | description = "Accurate answers about DKV" 5 | authors = ["stephan sturges"] 6 | license = "MIT" 7 | readme = "README.md" 8 | 9 | [tool.poetry.dependencies] 10 | python = ">3.9.7,<4.0.0" 11 | streamlit = "^1.17.0" 12 | openai = "^0.27.0" 13 | tiktoken = "0.2.0" 14 | streamlit-chat = "0.0.2.1" 15 | pinecone-client = "2.2.1" 16 | 17 | 18 | 19 | [tool.poetry.group.dev.dependencies] 20 | black = {version = "^23.1a1", allow-prereleases = true} 21 | python-dotenv = "^0.21.1" 22 | pytest = "^7.2.1" 23 | 24 | [build-system] 25 | requires = ["poetry-core"] 26 | build-backend = "poetry.core.masonry.api" 27 | 28 | 29 | # Secrets to be set from streamlit : Pinecone API key and OpenAI API Key 30 | [API_KEYS] 31 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | openai==0.27.0 2 | aiohttp==3.8.4 3 | aiosignal==1.3.1 4 | altair==4.2.2 5 | async-timeout==4.0.2 6 | attrs==22.2.0 7 | blinker==1.5 8 | blobfile==2.0.1 9 | cachetools==5.3.0 10 | certifi==2022.12.7 11 | charset-normalizer==3.0.1 12 | click==8.1.3 13 | decorator==5.1.1 14 | dnspython==2.3.0 15 | entrypoints==0.4 16 | filelock==3.9.0 17 | frozenlist==1.3.3 18 | gitdb==4.0.10 19 | GitPython==3.1.31 20 | idna==3.4 21 | importlib-metadata==6.0.0 22 | packaging==23.0 23 | pinecone-client==2.2.1 24 | protobuf==3.20.3 25 | pyarrow==11.0.0 26 | pycryptodomex==3.17 27 | pydeck==0.8.0 28 | regex==2022.10.31 29 | requests==2.28.2 30 | rich==13.3.1 31 | semver==2.13.0 32 | six==1.16.0 33 | smmap==5.0.0 34 | streamlit==1.19.0 35 | streamlit-chat==0.0.2.1 36 | tiktoken==0.2.0 37 | toml==0.10.2 38 | toolz==0.12.0 39 | typing_extensions==4.5.0 40 | tzdata==2022.7 41 | tzlocal==4.2 42 | urllib3==1.26.14 43 | validators==0.20.0 44 | yarl==1.8.2 45 | zipp==3.15.0 46 | -------------------------------------------------------------------------------- /src/check_results.py: -------------------------------------------------------------------------------- 1 | 
import os 2 | 3 | import json 4 | from dotenv import load_dotenv 5 | import pinecone 6 | 7 | 8 | def check_p3_results(): 9 | """Counts how many embeddings are in the .jsonl file.""" 10 | 11 | total_errors = 0 12 | list_of_indices_with_errors = list() 13 | with open("data_sample/d3.embeddings_maker_results.jsonl", encoding="utf8") as f: 14 | lines = f.readlines() 15 | total_lines = len(lines) 16 | for i in range(0, total_lines): 17 | data = json.loads(lines[i]) 18 | if isinstance(data[1], list): 19 | list_of_indices_with_errors.append(i + 1) 20 | total_errors += 1 21 | 22 | complete_embeddings = total_lines - total_errors 23 | success_rate = (complete_embeddings / total_lines) * 100 24 | 25 | print( 26 | f"\nIndices with error: {list_of_indices_with_errors}\n" 27 | f"\nTotal elements with embedding error: {total_errors}" 28 | f"\nTotal embeddings made from elements: {complete_embeddings}" 29 | f"\nTotal percentage of elements successfully embedded from corpus: {success_rate:.2f}%" 30 | f"\nTotal elements processed by OpenAI: {total_lines}\n" 31 | ) 32 | 33 | 34 | def check_p5_results_query_pinecone( 35 | ids: list, 36 | index_name: str, 37 | namespace: str = "movies", 38 | ): 39 | """This function will return specific vector ids back from the index.""" 40 | dotenv_path = os.path.join(os.path.dirname(__file__), ".env") 41 | load_dotenv(dotenv_path) 42 | 43 | pinecone.init( 44 | api_key=os.environ.get("PINECONE_API_KEY"), environment="us-east1-gcp" 45 | ) 46 | index = pinecone.Index(index_name) 47 | 48 | fetch_response = index.fetch(ids=ids, namespace=namespace) 49 | for vec_id in ids: 50 | print( 51 | f'Vector Id: {vec_id}\n{fetch_response["vectors"][vec_id]["metadata"]["text"]}\n\n' 52 | ) 53 | 54 | 55 | if __name__ == "__main__": 56 | 57 | # check_p3_results() 58 | 59 | # index starts at 0 60 | # check data from d4.csv matches the index in pinecone 61 | check_p5_results_query_pinecone(ids=["9"], index_name="1kmovies") 62 | -------------------------------------------------------------------------------- /src/other/generate_index_wikipedia_movies.py: -------------------------------------------------------------------------------- 1 | import os 2 | import warnings 3 | 4 | import pandas as pd 5 | import openai 6 | import tiktoken 7 | from dotenv import load_dotenv 8 | 9 | encoding = tiktoken.get_encoding("cl100k_base") 10 | 11 | 12 | # This is a sample converter that takes CSV data from a table 13 | # (in this case the Kaggle dataset here https://www.kaggle.com/datasets/jrobischon/wikipedia-movie-plots ) 14 | # and converts this data into a CSV that contains a single column 15 | # with a block of text that we want to make accessible on our Pinecone database 16 | 17 | 18 | def num_tokens_from_string(string, encoding_name): 19 | """Returns the number of tokens in a text string.""" 20 | encoding = tiktoken.get_encoding(encoding_name) 21 | num_tokens = len(encoding.encode(string)) 22 | return num_tokens 23 | 24 | warnings.filterwarnings("ignore", category=DeprecationWarning) 25 | warnings.filterwarnings("ignore", category=FutureWarning) 26 | 27 | 28 | load_dotenv() 29 | openai.api_key = os.environ.get("OPENAI_API_KEY") # SET YOUR API KEY HERE (or in .env) 30 | 31 | 32 | 33 | COMPLETIONS_MODEL = "text-davinci-003" 34 | EMBEDDING_MODEL = "text-embedding-ada-002" 35 | 36 | 37 | df = pd.read_csv(filepath_or_buffer='data_sample/other/wiki_movie_plots_small.csv', sep=",", header=0, dtype="string", encoding="utf-8") 38 | 39 | df["gpttext"] =
"Title: " + df["Title"].astype(str) + \ 44 | " Year: " + df["Release Year"].astype(str) + \ 45 | " Cast: " + df["Cast"].astype(str).replace("", "unknown") + \ 46 | " Director: " + df["Director"].astype(str) + \ 47 | " Country of production: " + df["Origin/Ethnicity"].astype(str) +\ 48 | " Genre: " + df["Genre"].astype(str) +\ 49 | " wiki: " + df["Wiki Page"].astype(str) +\ 50 | " Plot / story / about: " + df["Plot"].astype(str).replace("[1]", "FUCK") 51 | 52 | df = df.drop(df.columns[[0, 1, 2, 3, 4, 5, 6, 7]], axis=1) 53 | df.to_csv('data/wiki_plots_small.csv') 54 | 55 | 56 | -------------------------------------------------------------------------------- /src/p1.generate_index_mpst.py: -------------------------------------------------------------------------------- 1 | # Python Standard Library 2 | import warnings 3 | 4 | # Third Party Libraries 5 | import pandas as pd 6 | import tiktoken 7 | 8 | warnings.filterwarnings("ignore", category=DeprecationWarning) 9 | warnings.filterwarnings("ignore", category=FutureWarning) 10 | 11 | 12 | def num_tokens_from_string(string, encoding_name: str = "cl100k_base"): 13 | """Returns the number of tokens in a text string.""" 14 | encoding = tiktoken.get_encoding(encoding_name) 15 | num_tokens = len(encoding.encode(string)) 16 | return num_tokens 17 | 18 | 19 | def combine_text_to_one_column(df): 20 | df["gpttext"] = ( 21 | "Title: " 22 | + df["title"].astype(str) 23 | + " tags: " 24 | + df["tags"].astype(str) 25 | + " Plot / story / about: " 26 | + df["plot_synopsis"].astype(str) 27 | ) 28 | 29 | df = df.drop(df.columns[[0, 1, 2, 3, 4, 5]], axis=1) 30 | 31 | df.to_csv(f"data_sample/d1.mpst_1k_converted.csv") 32 | 33 | 34 | if __name__ == "__main__": 35 | # This is a sample converter that takes CSV data from a table 36 | # (in this case the Kaggle dataset here https://www.kaggle.com/datasets/cryptexcode/mpst-movie-plot-synopses-with-tags ) 37 | # and converts this data into a CSV that contains a single column 38 | # with a block of text that we want to make accessible on our Pinecone database 39 | 40 | 41 | # read sample data 42 | df = pd.read_csv( 43 | filepath_or_buffer="data_sample/d0.mpst_1k_raw.csv", 44 | sep=",", 45 | header=0, 46 | dtype="string", 47 | encoding="utf-8", 48 | ) 49 | 50 | combine_text_to_one_column(df=df) 51 | -------------------------------------------------------------------------------- /src/p2.make_jsonl_for_requests_mpst.py: -------------------------------------------------------------------------------- 1 | import json 2 | import pandas as pd 3 | 4 | 5 | # This is a sample converter that takes CSV data from a CSV table 6 | # (in this case the Kaggle dataset here https://www.kaggle.com/datasets/cryptexcode/mpst-movie-plot-synopses-with-tags ) 7 | # and iterates through the rows of this data to make a JSONL file 8 | # which can be loaded and processed by api_request_parallel_processor.py 9 | # to generate the embeddings which we will use for vector search! 
10 | 11 | df2 = pd.read_csv("data_sample/d1.mpst_1k_converted.csv") 12 | 13 | 14 | filename = "data_sample/d2.embeddings_maker.jsonl" 15 | jobs = [ 16 | {"model": "text-embedding-ada-002", "input": str(row["gpttext"])} 17 | for _, row in df2.iterrows() 18 | ] 19 | with open(filename, "w") as f: 20 | for job in jobs: 21 | json_string = json.dumps(job) 22 | f.write(json_string + "\n") 23 | -------------------------------------------------------------------------------- /src/p3.api_request_parallel_processor.py: -------------------------------------------------------------------------------- 1 | """ 2 | API REQUEST PARALLEL PROCESSOR 3 | 4 | Using the OpenAI API to process lots of text quickly takes some care. 5 | If you trickle in a million API requests one by one, they'll take days to complete. 6 | If you flood a million API requests in parallel, they'll exceed the rate limits and fail with errors. 7 | To maximize throughput, parallel requests need to be throttled to stay under rate limits. 8 | 9 | This script parallelizes requests to the OpenAI API while throttling to stay under rate limits. 10 | 11 | Features: 12 | - Streams requests from file, to avoid running out of memory for giant jobs 13 | - Makes requests concurrently, to maximize throughput 14 | - Throttles request and token usage, to stay under rate limits 15 | - Retries failed requests up to {max_attempts} times, to avoid missing data 16 | - Logs errors, to diagnose problems with requests 17 | 18 | Example command to call script: 19 | ``` 20 | python examples/api_request_parallel_processor.py \ 21 | --requests_filepath examples/data/example_requests_to_parallel_process.jsonl \ 22 | --save_filepath examples/data/example_requests_to_parallel_process_results.jsonl \ 23 | --request_url https://api.openai.com/v1/embeddings \ 24 | --max_requests_per_minute 1500 \ 25 | --max_tokens_per_minute 6250000 \ 26 | --token_encoding_name cl100k_base \ 27 | --max_attempts 5 \ 28 | --logging_level 20 29 | ``` 30 | 31 | Inputs: 32 | - requests_filepath : str 33 | - path to the file containing the requests to be processed 34 | - file should be a jsonl file, where each line is a json object with API parameters 35 | - e.g., {"model": "text-embedding-ada-002", "input": "embed me"} 36 | - as with all jsonl files, take care that newlines in the content are properly escaped (json.dumps does this automatically) 37 | - an example file is provided at examples/data/example_requests_to_parallel_process.jsonl 38 | - the code to generate the example file is appended to the bottom of this script 39 | - save_filepath : str, optional 40 | - path to the file where the results will be saved 41 | - file will be a jsonl file, where each line is an array with the original request plus the API response 42 | - e.g., [{"model": "text-embedding-ada-002", "input": "embed me"}, {...}] 43 | - if omitted, results will be saved to {requests_filename}_results.jsonl 44 | - request_url : str, optional 45 | - URL of the API endpoint to call 46 | - if omitted, will default to "https://api.openai.com/v1/embeddings" 47 | - api_key : str, optional 48 | - API key to use 49 | - if omitted, the script will attempt to read it from an environment variable {os.getenv("OPENAI_API_KEY")} 50 | - max_requests_per_minute : float, optional 51 | - target number of requests to make per minute (will make fewer if limited by tokens) 52 | - leave headroom by setting this to 50% or 75% of your limit 53 | - if requests are limiting you, try batching multiple embeddings or completions into one request 54 | - if
omitted, will default to 1,500 55 | - max_tokens_per_minute : float, optional 56 | - target number of tokens to use per minute (will use fewer if limited by requests) 57 | - leave headroom by setting this to 50% or 75% of your limit 58 | - if omitted, will default to 125,000 59 | - token_encoding_name : str, optional 60 | - name of the token encoding used, as defined in the `tiktoken` package 61 | - if omitted, will default to "cl100k_base" (used by `text-embedding-ada-002`) 62 | - max_attempts : int, optional 63 | - number of times to retry a failed request before giving up 64 | - if omitted, will default to 5 65 | - logging_level : int, optional 66 | - level of logging to use; higher numbers will log fewer messages 67 | - 40 = ERROR; will log only when requests fail after all retries 68 | - 30 = WARNING; will log when requests hit rate limits or other errors 69 | - 20 = INFO; will log when requests start and the status at finish 70 | - 10 = DEBUG; will log various things as the loop runs to see when they occur 71 | - if omitted, will default to 20 (INFO). 72 | 73 | The script is structured as follows: 74 | - Imports 75 | - Define main() 76 | - Initialize things 77 | - In main loop: 78 | - Get next request if one is not already waiting for capacity 79 | - Update available token & request capacity 80 | - If enough capacity available, call API 81 | - The loop pauses if a rate limit error is hit 82 | - The loop breaks when no tasks remain 83 | - Define dataclasses 84 | - StatusTracker (stores script metadata counters; only one instance is created) 85 | - APIRequest (stores API inputs, outputs, metadata; one method to call API) 86 | - Define functions 87 | - api_endpoint_from_url (extracts API endpoint from request URL) 88 | - append_to_jsonl (writes to results file) 89 | - num_tokens_consumed_from_request (bigger function to infer token usage from request) 90 | - task_id_generator_function (yields 0, 1, 2, ...)
91 | - Run main() 92 | """ 93 | 94 | # imports 95 | import os # for reading API key 96 | from os.path import join, dirname 97 | from dotenv import load_dotenv 98 | from datetime import datetime 99 | import aiohttp # for making API calls concurrently 100 | import argparse # for running script from command line 101 | import asyncio # for running API calls concurrently 102 | import json # for saving results to a jsonl file 103 | import logging # for logging rate limit warnings and other messages 104 | 105 | import tiktoken # for counting tokens 106 | import time # for sleeping after rate limit is hit 107 | from dataclasses import dataclass, field # for storing API inputs, outputs, and metadata 108 | import socket 109 | 110 | 111 | 112 | socket.gethostbyname("") 113 | 114 | async def process_api_requests_from_file( 115 | requests_filepath: str, 116 | save_filepath: str, 117 | request_url: str, 118 | api_key: str, 119 | max_requests_per_minute: float, 120 | max_tokens_per_minute: float, 121 | token_encoding_name: str, 122 | max_attempts: int, 123 | logging_level: int, 124 | ): 125 | """Processes API requests in parallel, throttling to stay under rate limits.""" 126 | # constants 127 | seconds_to_pause_after_rate_limit_error = 15 128 | seconds_to_sleep_each_loop = 0.001 # 1 ms limits max throughput to 1,000 requests per second 129 | 130 | # initialize logging 131 | logging.basicConfig(level=logging_level) 132 | logging.debug(f"Logging initialized at level {logging_level}") 133 | 134 | # infer API endpoint and construct request header 135 | api_endpoint = api_endpoint_from_url(request_url) 136 | request_header = {"Authorization": f"Bearer {api_key}"} 137 | 138 | # initialize trackers 139 | queue_of_requests_to_retry = asyncio.Queue() 140 | task_id_generator = task_id_generator_function() # generates integer IDs of 0, 1, 2, ... 141 | status_tracker = StatusTracker() # single instance to track a collection of variables 142 | next_request = None # variable to hold the next request to call 143 | 144 | # initialize available capacity counts 145 | available_request_capacity = max_requests_per_minute 146 | available_token_capacity = max_tokens_per_minute 147 | last_update_time = time.time() 148 | 149 | # initialize flags 150 | file_not_finished = True # after file is empty, we'll skip reading it 151 | logging.debug("Initialization complete.") 152 | 153 | # initialize file reading 154 | with open(requests_filepath) as file: 155 | # `requests` will provide requests one at a time 156 | requests = file.__iter__() 157 | logging.debug("File opened.
Entering main loop") 158 | 159 | while True: 160 | # get next request (if one is not already waiting for capacity) 161 | if next_request is None: 162 | if queue_of_requests_to_retry.empty() is False: 163 | next_request = queue_of_requests_to_retry.get_nowait() 164 | logging.debug(f"Retrying request {next_request.task_id}: {next_request}") 165 | elif file_not_finished: 166 | try: 167 | # get new request (each line is a JSON object; json.loads is safer than eval) 168 | request_json = json.loads(next(requests)) 169 | next_request = APIRequest( 170 | task_id=next(task_id_generator), 171 | request_json=request_json, 172 | token_consumption=num_tokens_consumed_from_request(request_json, api_endpoint, token_encoding_name), 173 | attempts_left=max_attempts, 174 | ) 175 | status_tracker.num_tasks_started += 1 176 | status_tracker.num_tasks_in_progress += 1 177 | logging.debug(f"Reading request {next_request.task_id}: {next_request}") 178 | except StopIteration: 179 | # if file runs out, set flag to stop reading it 180 | logging.debug("Read file exhausted") 181 | file_not_finished = False 182 | 183 | # update available capacity 184 | current_time = time.time() 185 | seconds_since_update = current_time - last_update_time 186 | available_request_capacity = min( 187 | available_request_capacity + max_requests_per_minute * seconds_since_update / 60.0, 188 | max_requests_per_minute, 189 | ) 190 | available_token_capacity = min( 191 | available_token_capacity + max_tokens_per_minute * seconds_since_update / 60.0, 192 | max_tokens_per_minute, 193 | ) 194 | last_update_time = current_time 195 | 196 | # if enough capacity available, call API 197 | if next_request: 198 | next_request_tokens = next_request.token_consumption 199 | if ( 200 | available_request_capacity >= 1 201 | and available_token_capacity >= next_request_tokens 202 | ): 203 | # update counters 204 | available_request_capacity -= 1 205 | available_token_capacity -= next_request_tokens 206 | next_request.attempts_left -= 1 207 | 208 | # call API 209 | asyncio.create_task( 210 | next_request.call_API( 211 | request_url=request_url, 212 | request_header=request_header, 213 | retry_queue=queue_of_requests_to_retry, 214 | save_filepath=save_filepath, 215 | status_tracker=status_tracker, 216 | ) 217 | ) 218 | next_request = None # reset next_request to empty 219 | 220 | # if all tasks are finished, break 221 | if status_tracker.num_tasks_in_progress == 0: 222 | break 223 | 224 | # main loop sleeps briefly so concurrent tasks can run 225 | await asyncio.sleep(seconds_to_sleep_each_loop) 226 | 227 | # if a rate limit error was hit recently, pause to cool down 228 | seconds_since_rate_limit_error = (time.time() - status_tracker.time_of_last_rate_limit_error) 229 | if seconds_since_rate_limit_error < seconds_to_pause_after_rate_limit_error: 230 | remaining_seconds_to_pause = (seconds_to_pause_after_rate_limit_error - seconds_since_rate_limit_error) 231 | await asyncio.sleep(remaining_seconds_to_pause) 232 | # ^e.g., if pause is 15 seconds and final limit was hit 5 seconds ago 233 | logging.warning(f"Pausing to cool down until {time.ctime(status_tracker.time_of_last_rate_limit_error + seconds_to_pause_after_rate_limit_error)}") 234 | 235 | # after finishing, log final status 236 | logging.info(f"""Parallel processing complete. Results saved to {save_filepath}""") 237 | if status_tracker.num_tasks_failed > 0: 238 | logging.warning(f"{status_tracker.num_tasks_failed} / {status_tracker.num_tasks_started} requests failed.
Errors logged to {save_filepath}.") 239 | if status_tracker.num_rate_limit_errors > 0: 240 | logging.warning(f"{status_tracker.num_rate_limit_errors} rate limit errors received. Consider running at a lower rate.") 241 | 242 | 243 | # dataclasses 244 | 245 | 246 | @dataclass 247 | class StatusTracker: 248 | """Stores metadata about the script's progress. Only one instance is created.""" 249 | 250 | num_tasks_started: int = 0 251 | num_tasks_in_progress: int = 0 # script ends when this reaches 0 252 | num_tasks_succeeded: int = 0 253 | num_tasks_failed: int = 0 254 | num_rate_limit_errors: int = 0 255 | num_api_errors: int = 0 # excluding rate limit errors, counted above 256 | num_other_errors: int = 0 257 | time_of_last_rate_limit_error: int = 0 # used to cool off after hitting rate limits 258 | 259 | 260 | @dataclass 261 | class APIRequest: 262 | """Stores an API request's inputs, outputs, and other metadata. Contains a method to make an API call.""" 263 | 264 | task_id: int 265 | request_json: dict 266 | token_consumption: int 267 | attempts_left: int 268 | result: list = field(default_factory=list) # per-instance list; a bare class-level `result = []` would be shared by every request 269 | 270 | async def call_API( 271 | self, 272 | request_url: str, 273 | request_header: dict, 274 | retry_queue: asyncio.Queue, 275 | save_filepath: str, 276 | status_tracker: StatusTracker, 277 | ): 278 | """Calls the OpenAI API and saves results.""" 279 | logging.info(f"Starting request #{self.task_id}") 280 | error = None 281 | try: 282 | async with aiohttp.ClientSession() as session: 283 | async with session.post( 284 | url=request_url, headers=request_header, json=self.request_json 285 | ) as response: 286 | response = await response.json() 287 | if "error" in response: 288 | logging.warning( 289 | f"Request {self.task_id} failed with error {response['error']}" 290 | ) 291 | status_tracker.num_api_errors += 1 292 | error = response 293 | if "Rate limit" in response["error"].get("message", ""): 294 | status_tracker.time_of_last_rate_limit_error = time.time() 295 | status_tracker.num_rate_limit_errors += 1 296 | status_tracker.num_api_errors -= 1 # rate limit errors are counted separately 297 | 298 | except Exception as e: # catching naked exceptions is bad practice, but in this case we'll log & save them 299 | logging.warning(f"Request {self.task_id} failed with Exception {e}") 300 | status_tracker.num_other_errors += 1 301 | error = e 302 | if error: 303 | self.result.append(error) 304 | if self.attempts_left: 305 | retry_queue.put_nowait(self) 306 | else: 307 | logging.error(f"Request {self.request_json} failed after all attempts.
Saving errors: {self.result}") 308 | append_to_jsonl([self.request_json, self.result], save_filepath) 309 | status_tracker.num_tasks_in_progress -= 1 310 | status_tracker.num_tasks_failed += 1 311 | else: 312 | append_to_jsonl([self.request_json, response], save_filepath) 313 | status_tracker.num_tasks_in_progress -= 1 314 | status_tracker.num_tasks_succeeded += 1 315 | logging.debug(f"Request {self.task_id} saved to {save_filepath}") 316 | 317 | 318 | # functions 319 | 320 | 321 | def api_endpoint_from_url(request_url): 322 | """Extract the API endpoint from the request URL.""" 323 | return request_url.split("/")[-1] 324 | 325 | 326 | def append_to_jsonl(data, filename: str) -> None: 327 | """Append a json payload to the end of a jsonl file.""" 328 | json_string = json.dumps(data) 329 | with open(filename, "a") as f: 330 | f.write(json_string + "\n") 331 | 332 | 333 | def num_tokens_consumed_from_request( 334 | request_json: dict, 335 | api_endpoint: str, 336 | token_encoding_name: str, 337 | ): 338 | """Count the number of tokens in the request. Only supports completion and embedding requests.""" 339 | encoding = tiktoken.get_encoding(token_encoding_name) 340 | # if completions request, tokens = prompt + n * max_tokens 341 | if api_endpoint == "completions": 342 | prompt = request_json["prompt"] 343 | max_tokens = request_json.get("max_tokens", 15) 344 | n = request_json.get("n", 1) 345 | completion_tokens = n * max_tokens 346 | if isinstance(prompt, str): # single prompt 347 | prompt_tokens = len(encoding.encode(prompt)) 348 | num_tokens = prompt_tokens + completion_tokens 349 | return num_tokens 350 | elif isinstance(prompt, list): # multiple prompts 351 | prompt_tokens = sum([len(encoding.encode(p)) for p in prompt]) 352 | num_tokens = prompt_tokens + completion_tokens 353 | return num_tokens 354 | else: 355 | raise TypeError('Expecting either string or list of strings for "prompt" field in completion request') 356 | # if embeddings request, tokens = input tokens 357 | elif api_endpoint == "embeddings": 358 | input = request_json["input"] 359 | if isinstance(input, str): # single input 360 | try: 361 | num_tokens = len(encoding.encode(input)) # hack to put this in a "try" clause because non-UTF-8 characters can cause the tokenizer to fail 362 | except Exception: 363 | num_tokens = 0 364 | return num_tokens 365 | elif isinstance(input, list): # multiple inputs 366 | num_tokens = sum([len(encoding.encode(i)) for i in input]) 367 | return num_tokens 368 | else: 369 | raise TypeError('Expecting either string or list of strings for "input" field in embedding request') 370 | # more logic needed to support other API calls (e.g., edits, inserts, DALL-E) 371 | else: 372 | raise NotImplementedError(f'API endpoint "{api_endpoint}" not implemented in this script') 373 | 374 | 375 | def task_id_generator_function(): 376 | """Generate integers 0, 1, 2, and so on.""" 377 | task_id = 0 378 | while True: 379 | yield task_id 380 | task_id += 1 381 | 382 | 383 | # run script 384 | 385 | 386 | if __name__ == "__main__": 387 | dotenv_path = join(dirname(__file__), '.env') 388 | load_dotenv(dotenv_path) 389 | 390 | # parse command line arguments 391 | parser = argparse.ArgumentParser() 392 | parser.add_argument("--requests_filepath") 393 | parser.add_argument("--save_filepath", default=None) 394 | parser.add_argument("--request_url", default="https://api.openai.com/v1/embeddings") 395 | parser.add_argument("--api_key", default=os.environ.get("OPENAI_API_KEY")) 396 | parser.add_argument("--max_requests_per_minute",
type=float, default=3_000 * 0.5) # 50% of a 3,000 requests/min limit 397 | parser.add_argument("--max_tokens_per_minute", type=float, default=250_000 * 0.5) # 50% of a 250,000 tokens/min limit 398 | parser.add_argument("--token_encoding_name", default="cl100k_base") 399 | parser.add_argument("--max_attempts", type=int, default=5) 400 | parser.add_argument("--logging_level", default=logging.INFO) 401 | args = parser.parse_args() 402 | 403 | if args.save_filepath is None: 404 | args.save_filepath = args.requests_filepath.replace(".jsonl", "_results.jsonl") 405 | 406 | start_time = datetime.now() 407 | # run script 408 | asyncio.run( 409 | process_api_requests_from_file( 410 | requests_filepath=args.requests_filepath, 411 | save_filepath=args.save_filepath, 412 | request_url=args.request_url, 413 | api_key=args.api_key, 414 | max_requests_per_minute=float(args.max_requests_per_minute), 415 | max_tokens_per_minute=float(args.max_tokens_per_minute), 416 | token_encoding_name=args.token_encoding_name, 417 | max_attempts=int(args.max_attempts), 418 | logging_level=int(args.logging_level), 419 | ) 420 | ) 421 | print(f"\nTook {datetime.now() - start_time} (h:mm:ss.ms) to process data to OpenAI\n") 422 | 423 | 424 | """ 425 | APPENDIX 426 | 427 | The example requests file at openai-cookbook/examples/data/example_requests_to_parallel_process.jsonl contains 10,000 requests to text-embedding-ada-002. 428 | 429 | It was generated with the following code: 430 | 431 | ```python 432 | import json 433 | 434 | filename = "data/example_requests_to_parallel_process.jsonl" 435 | n_requests = 10_000 436 | jobs = [{"model": "text-embedding-ada-002", "input": str(x) + "\n"} for x in range(n_requests)] 437 | with open(filename, "w") as f: 438 | for job in jobs: 439 | json_string = json.dumps(job) 440 | f.write(json_string + "\n") 441 | ``` 442 | 443 | As with all jsonl files, take care that newlines in the content are properly escaped (json.dumps does this automatically). 444 | """ -------------------------------------------------------------------------------- /src/p4.convert_jsonl_with_embeddings_to_csv.py: -------------------------------------------------------------------------------- 1 | import json 2 | import pandas as pd 3 | import numpy as np 4 | import os 5 | 6 | 7 | filename = "data_sample/d3.embeddings_maker_results.jsonl" 8 | 9 | with open(os.path.abspath(filename), "r", encoding="utf-8") as f: 10 | data = [ 11 | json.loads(line) 12 | for line in f # read from the already-open handle instead of opening the file a second time 13 | ] 14 | 15 | 16 | print("OPENED JSONL FILE WITH EMBEDDINGS") 17 | 18 | 19 | def flattenizer(a): 20 | return (a[0],) + tuple(a[1]) 21 | 22 | 23 | dataframe_with_text_and_embeddings = pd.DataFrame() 24 | 25 | processed_count = 0 26 | mydata_expanded_flat = [] 27 | for line in data: 28 | # if the data had an error when trying to embed the text from OpenAI 29 | # it returns a list instance instead of a dict.
30 | # The error count reported from p3 plus processed_count should equal 31 | # the total amount of documents you sent to OpenAI for processing 32 | if isinstance(line[1], list): 33 | continue 34 | else: 35 | info = flattenizer( 36 | [ 37 | json.loads(json.dumps(line))[0]["input"], 38 | json.loads(json.dumps(line))[1]["data"][0]["embedding"], 39 | ] 40 | ) 41 | mydata_expanded_flat.append(info) 42 | processed_count += 1 43 | 44 | print(f"\nTotal embeddings converted to csv: {processed_count}\n") 45 | 46 | # TODO Drop any bad lines if an embedding was not successful 47 | # mydata_expanded_flat = [ 48 | # flattenizer( 49 | # [ 50 | # json.loads(json.dumps(line))[0]["input"], 51 | # json.loads(json.dumps(line))[1]["data"][0]["embedding"], 52 | # ] 53 | # ) 54 | # for line in data 55 | # ] 56 | 57 | print("CONVERTED JSONL FLAT ARRAY") 58 | 59 | 60 | def columns_index_maker(): 61 | column_names = [] 62 | column_names.append("gpttext") 63 | for _ in range(1536): 64 | column_names.append(str(_)) 65 | 66 | return column_names 67 | 68 | 69 | all_the_columns = columns_index_maker() 70 | 71 | df = pd.DataFrame(mydata_expanded_flat, columns=all_the_columns) 72 | 73 | print("CONVERTED BIG ARRAY TO DATAFRAME") 74 | 75 | 76 | def chunker(seq, size): 77 | return (seq[pos : pos + size] for pos in range(0, len(seq), size)) 78 | 79 | def chonk_dataframe_and_make_csv_with_embeds(pddf, outputfile, chunks): 80 | """ 81 | If you are working on very large files, for example uploading all of Wikipedia, 82 | these indexes can get very, very chonky with the embeddings appended (like >400 GB). 83 | 84 | This is why we chunk through the dataframe and append pieces to the CSV, to avoid 85 | running out of memory. 86 | 87 | Args: 88 | pddf (pd.DataFrame): The dataframe of text and embeddings to write out 89 | outputfile (str): Path of the output .csv file 90 | chunks (int): Number of rows to write per chunk 91 | """ 92 | for i, chunk in enumerate(chunker(pddf, chunks)): 93 | print("CHONKING TO CSV No: " + str(i)) 94 | document_embeddings_i = pd.DataFrame(chunk) 95 | document_embeddings_i.to_csv( 96 | outputfile, mode="a", index=False, header=(i == 0) 97 | ) 98 | 99 | 100 | if __name__ == "__main__": 101 | 102 | chonk_dataframe_and_make_csv_with_embeds( 103 | df, "data_sample/d4.embeddings_maker_results.csv", 1000 104 | ) 105 | -------------------------------------------------------------------------------- /src/p5.upload_to_pinecone.py: -------------------------------------------------------------------------------- 1 | import pinecone 2 | import csv 3 | import numpy as np 4 | import os 5 | from os.path import join, dirname 6 | from dotenv import load_dotenv 7 | 8 | 9 | class PineconeUpload: 10 | def __init__( 11 | self, 12 | pinecone_api_key, 13 | index_name, 14 | embeddings_csv, 15 | embedding_dims: int = 1536, 16 | create_index: bool = False, 17 | ) -> None: 18 | self.pinecone_api_key = pinecone_api_key 19 | self.index_name = index_name 20 | self.embeddings_csv = embeddings_csv 21 | self.embedding_dims = embedding_dims 22 | self.create_index = create_index 23 | self.pinecone_index = self.make_pinecone_index() 24 | 25 | def get_first_4000_chars(self, s): 26 | """ 27 | We are using a function to limit the metadata character length to 4000 here. 28 | The reason is that there is a limit on the size of metadata that you can append 29 | to a vector on pinecone, currently I believe it's 10Kb... You could go with more 30 | than 4000 and give it a try.
31 | """ 32 | if len(s) > 4000: 33 | return s[:4000] 34 | else: 35 | return s 36 | 37 | def make_pinecone_index(self): 38 | """Create the pinecone index.""" 39 | 40 | pinecone.init(api_key=self.pinecone_api_key, environment="us-east1-gcp") 41 | 42 | if self.create_index: 43 | # Create an empty index if required 44 | pinecone.create_index(name=self.index_name, dimension=self.embedding_dims) 45 | 46 | index = pinecone.Index(self.index_name) 47 | 48 | # Print info about the current Pinecone project. 49 | print(f"Pinecone index info: {pinecone.whoami()} \n") 50 | return index 51 | 52 | def upsert_embeddings_batch(self, starting_index, data_batch, index_offset): 53 | """Upsert one batch of embeddings.""" 54 | 55 | # Convert the data to a list of Pinecone upsert requests 56 | upsert_requests = [ 57 | ( 58 | str(starting_index + i + index_offset), 59 | embedding, 60 | {"text": self.get_first_4000_chars(row[0])}, 61 | ) # taking the first 4000 characters because of the metadata size limit 62 | for i, row in enumerate(data_batch) 63 | for embedding in [np.array([float(x) for x in row[1:]]).tolist()] 64 | ] 65 | 66 | # Upsert the embeddings in batch 67 | upsert_response = self.pinecone_index.upsert(vectors=upsert_requests, namespace="movies") 68 | 69 | return upsert_response 70 | 71 | def upsert_embeddings_to_index(self): 72 | # Load the data from the CSV file 73 | with open(self.embeddings_csv) as f: 74 | reader = csv.reader(f) 75 | next(reader) # skip header row 76 | data = list(reader) 77 | 78 | # Upsert the embeddings in batches 79 | batch_size = 100 80 | index_offset = 0 81 | while index_offset < len(data): 82 | batch = data[index_offset : index_offset + batch_size] 83 | 84 | ## APPEND VECTORS TO INDEX AFTER LAST ENTRIES 85 | # self.upsert_embeddings_batch(int(self.pinecone_index.describe_index_stats()['total_vector_count'] + 1), batch, index_offset) 86 | 87 | ## REPLACE VECTORS STARTING AT 0 88 | self.upsert_embeddings_batch(0, batch, index_offset) 89 | print("batch " + str(index_offset)) 90 | index_offset += batch_size 91 | print(f"Total vectors in the index: {self.pinecone_index.describe_index_stats()['total_vector_count']}") 92 | 93 | 94 | if __name__ == "__main__": 95 | dotenv_path = join(dirname(__file__), '.env') 96 | info = load_dotenv(dotenv_path) 97 | 98 | # Define the name of the index and the dimensionality of the embeddings 99 | index_name = "1kmovies" 100 | embeddings_csv = "data_sample/d4.embeddings_maker_results.csv" 101 | embedding_dims = 1536 102 | create_index = True 103 | 104 | uploader = PineconeUpload( # named "uploader" so it doesn't shadow the pinecone module 105 | pinecone_api_key=os.environ.get("PINECONE_API_KEY"), 106 | index_name=index_name, 107 | embeddings_csv=embeddings_csv, 108 | embedding_dims=embedding_dims, 109 | create_index=create_index, 110 | ) 111 | 112 | uploader.upsert_embeddings_to_index() 113 | --------------------------------------------------------------------------------