├── .gitignore ├── README.md ├── assets └── Schema Temporal Augmented Retrieval.png ├── huggingface_space_app ├── app.py ├── naive_rag.py ├── rag.py ├── rag_benchmark.py └── temporal_augmented_retrival.py └── temporal_augmented_retrival.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | share/python-wheels/ 24 | *.egg-info/ 25 | .installed.cfg 26 | *.egg 27 | MANIFEST 28 | 29 | # PyInstaller 30 | # Usually these files are written by a python script from a template 31 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 32 | *.manifest 33 | *.spec 34 | 35 | # Installer logs 36 | pip-log.txt 37 | pip-delete-this-directory.txt 38 | 39 | # Unit test / coverage reports 40 | htmlcov/ 41 | .tox/ 42 | .nox/ 43 | .coverage 44 | .coverage.* 45 | .cache 46 | nosetests.xml 47 | coverage.xml 48 | *.cover 49 | *.py,cover 50 | .hypothesis/ 51 | .pytest_cache/ 52 | cover/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | .pybuilder/ 76 | target/ 77 | 78 | # Jupyter Notebook 79 | .ipynb_checkpoints 80 | 81 | # IPython 82 | profile_default/ 83 | ipython_config.py 84 | 85 | # pyenv 86 | # For a library or package, you might want to ignore these files since the code is 87 | # intended to run in multiple environments; otherwise, check them in: 88 | # .python-version 89 | 90 | # pipenv 91 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 92 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 93 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 94 | # install all needed dependencies. 95 | #Pipfile.lock 96 | 97 | # poetry 98 | # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. 99 | # This is especially recommended for binary packages to ensure reproducibility, and is more 100 | # commonly ignored for libraries. 101 | # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control 102 | #poetry.lock 103 | 104 | # pdm 105 | # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. 106 | #pdm.lock 107 | # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it 108 | # in version control. 109 | # https://pdm.fming.dev/#use-with-ide 110 | .pdm.toml 111 | 112 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow and github.com/pdm-project/pdm 113 | __pypackages__/ 114 | 115 | # Celery stuff 116 | celerybeat-schedule 117 | celerybeat.pid 118 | 119 | # SageMath parsed files 120 | *.sage.py 121 | 122 | # Environments 123 | .env 124 | .venv 125 | env/ 126 | venv/ 127 | ENV/ 128 | env.bak/ 129 | venv.bak/ 130 | 131 | # Spyder project settings 132 | .spyderproject 133 | .spyproject 134 | 135 | # Rope project settings 136 | .ropeproject 137 | 138 | # mkdocs documentation 139 | /site 140 | 141 | # mypy 142 | .mypy_cache/ 143 | .dmypy.json 144 | dmypy.json 145 | 146 | # Pyre type checker 147 | .pyre/ 148 | 149 | # pytype static type analyzer 150 | .pytype/ 151 | 152 | # Cython debug symbols 153 | cython_debug/ 154 | 155 | # PyCharm 156 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can 157 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore 158 | # and can be added to the global gitignore or merged into this file. For a more nuclear 159 | # option (not recommended) you can uncomment the following to ignore the entire idea folder. 160 | #.idea/ 161 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Temporal Augmented Retrieval 2 | 3 | This repository contains the code for a temporal augmented retrieval approach as well as a gradio app that is available through a [hugging face space here](https://huggingface.co/spaces/Adr740/Temporal-RAG-Benchmark). Link to the [medium article](https://adam-rida.medium.com/temporal-augmented-retrieval-tar-dynamic-rag-ad737506dfcc) 4 | 5 | This work has been conducted in the context of buildspace's nights and weekends program: https://buildspace.so/ 6 | ![TAR Schema](https://github.com/adrida/temporal_RAG/blob/master/assets/Schema%20Temporal%20Augmented%20Retrieval.png?raw=true) 7 | ## Why the need for temporal augmented retrieval 8 | 9 | The main advantage of having a temporal aspect is the ability to factor in temporal dynamics and changes of topics in the underlying data. We apply it on financial tweets in our example but the main use case would be client sales meeting notes. 10 | The use cases are as follows: 11 | ### Detect and anticipate market trends and movements 12 | Detecting emerging trends is one of the main reasons why we emphasize the need to include a dynamic aspect of RAG. Let's take a simple example. Say you have a high volume of temporal textual data (social media posts, client meeting notes,...). Let's consider that this data is particularly relevant because it contains market insights. 13 | 14 | Following this example, say we want to know the feeling around the topic "Metaverse". The query could look something like "Tell me how people are feeling about the metaverse". A traditional RAG approach will try to find documents to address directly the question. Meaning that they will look for documents individually talking about people's thoughts around the metaverse and not evaluate what each document is saying. 
More sophisticated RAG approaches, such as the ones implemented in langchain, go beyond matching topics to the query: they first use an LLM to augment the query with metadata and with examples of documents that might be of interest. For instance, the augmented query could look like this:
15 | 
16 | ```
17 | augmented query:
18 | - Original query: Tell me how people are feeling about the metaverse
19 | - Metadata: "date: past 5 days", "only documents tagged 'innovation'"
20 | - Examples: "Metaverse is a great opportunity", "Is Metaverse really about to change everything?"
21 | ```
22 | 
23 | This augmented query is then used to find relevant documents, which are combined into a context fed to an LLM that generates the final answer.
24 | 
25 | The main limitation of this approach is that it extracts knowledge from the data in a static manner, even when the data has been filtered by date through the metadata. The output tells you what people are saying about the metaverse rather than giving insights into how the topic is evolving.
26 | 
27 | In our case, what is of interest is not the fact that people have been talking about the Metaverse over the past 5 days, but rather whether they are talking about it more today than they were last month. The latter gives direct insight into the evolution of the topic and can help identify emerging trends or controversial topics. This can be used either to anticipate new trends, by designing products or marketing operations that make the most of them, or simply to mitigate incoming bad communication (bad PR).
28 | 
29 | Temporal Augmented Retrieval is the first proposal to try to address this issue and include a dynamic aspect.
30 | 
31 | ### Identify cross-selling opportunities
32 | By understanding how client discussions evolve through time, we can identify cross-selling opportunities. This is particularly relevant for use cases involving a vast number of clients and products. In large companies (especially in financial services), it is often impossible for a sales rep to know every product and partnership offered by the company, especially outside their scope of expertise. Digging through client meeting notes can help uncover clients who might benefit from a product offered by another business unit.
33 | An example query could look like this: "Do we have any clients interested in eco-friendly CRMs?"
34 | 
35 | The traditional RAG method already performs this important task quite well, but we want to make sure that Temporal Augmented Retrieval doesn't lose this ability when studying the data's dynamics.
36 | 
37 | 
38 | ## How does it work?
39 | This is a breakdown of how temporal augmented retrieval works. It follows the same global RAG structure (query augmentation, metadata, combining into context). The main differences lie in the metadata step and in the prompts used for the intermediate LLM calls. No RAG libraries (langchain, ...) and no vector database are used; everything is implemented with pandas and numpy. LLM calls currently go through the OpenAI API, but they are isolated and easy to swap out. The same goes for the semantic search operations: we use OpenAI embeddings, and the function is easily adaptable (a minimal sketch of this search is shown right below). A good future improvement would be to add more flexibility in which LLMs and embedding engines can be plugged in.
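The snippet below is a minimal, self-contained sketch of this pandas/numpy semantic search. The `cosine_similarity` helper and the `embedding` column mirror the repository's code; the `rank_by_similarity` name, the toy 3-dimensional vectors, and the sample tweets are purely illustrative (the actual code first embeds the query with OpenAI's `text-embedding-ada-002`).

```
import numpy as np
import pandas as pd

def cosine_similarity(a, b):
    # Same formula as the repository's helper.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def rank_by_similarity(df: pd.DataFrame, query_embedding, top_k: int = 15) -> pd.DataFrame:
    """Return the top_k rows of df most similar to the query embedding."""
    scored = df.copy()
    scored["similarity"] = scored["embedding"].apply(
        lambda emb: cosine_similarity(np.asarray(emb), np.asarray(query_embedding))
    )
    return scored.sort_values("similarity", ascending=False).head(top_k)

# Toy example with 3-dimensional vectors standing in for real OpenAI embeddings.
tweets = pd.DataFrame({
    "text": ["$META pushes new VR headset", "Oil prices dip again", "Metaverse land sales slow down"],
    "embedding": [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.8, 0.2, 0.1]],
})
print(rank_by_similarity(tweets, query_embedding=[1.0, 0.0, 0.0], top_k=2)[["text", "similarity"]])
```

In the repository, the same ranking is done per timestamp by filtering the DataFrame on its `timestamp` column before scoring.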
40 | 
41 | ### 1) Query Augmentation
42 | 
43 | As in traditional RAG, we augment the initial query to determine two main things: the relevant timestamps for a dynamic study, and examples on which to perform a semantic search. We emphasize the temporal aspect by specifying in the prompt that the timestamps should be chosen for a temporal study. We also provide in the prompt the list of unique timestamps available; this will later be parametrized in the function.
44 | 
45 | System prompt for query augmentation (models: gpt-3.5-turbo-16k/gpt-4):
46 | ```
47 | On [current date: 19 July], you'll receive a finance-related question from a sales manager, without direct interaction. Generate a JSON response with the following structure, considering the temporal aspect:
48 | 
49 | {
50 | "timestamps": # Relevant timestamps to study corresponding tweets for a temporal dynamic aspect (e.g., topic drift). USE THE MINIMAL NUMBER OF TIMESTAMP POSSIBLE ALWAYS ALWAYS!,
51 | "query": # Repeat the user's query,
52 | "similarity_boilerplate": # Boilerplate of relevant documents for cosine similarity search after embedding (it could look like an example of tweets that might help answer the query),
53 | }
54 | 
55 | Allowed historical timestamps:
56 | ['2018-07-18', '2018-07-19', '2018-07-08', '2018-07-09', '2018-07-10', '2018-07-11', '2018-07-12', '2018-07-13', '2018-07-14', '2018-07-15', '2018-07-16', '2018-07-17']
57 | 
58 | Ensure the output is always in JSON format and never provide any other response.
59 | ```
60 | 
61 | 
62 | ### 2) Meta-temporal data
63 | For the metadata, we use several subfunctions to generate insights on the dynamic aspect of the data. For each timestamp, this step outputs the number of directly relevant documents and the list of corresponding tweets (capped at 10, which is also a parameter).
64 | The idea is to let the final LLM combination step see both how the volume of tweets evolved and a sample of what those tweets actually look like.
65 | 
66 | - Number of relevant tweets
67 | The first step is to estimate, for each timestamp, how many documents directly address the query. This has two main advantages: first, it allows the semantic search retrieval to stop at the right number of similar documents and not include unrelated data; second, it provides a direct estimate of how many tweets are relevant, which makes time-wise volume comparisons possible.
68 | To find the number of relevant tweets, we first perform a semantic search restricted to the considered timestamp. The resulting list is sorted by similarity. We then run a dichotomic (binary) search over this list to find the last relevant tweet; the relevance check used at each step of this search is the `condition_check` function in the code (see the sketch after this list).
69 | 
70 | To do this dichotomic search we use the following prompts (models: gpt-3.5-turbo-16k/gpt-4):
71 | 
72 | System prompt: `Only answer with True or False no matter what`
73 | User chat prompt:
74 | ```
75 | Consider this tweet: [TWEET TO CHECK RELEVANCE]
76 | 
77 | Is it relevant to the following query: [AUGMENTED USER QUERY]
78 | ```
79 | 
80 | - List of relevant tweets
81 | To find the relevant tweets, we simply perform a traditional semantic search as in the previous step, but we only return the minimum between the number of relevant tweets and the parameter `number_of_tweets_per_timestamp`.
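As a rough sketch of the counting step above: the binary search below mirrors the loop in `get_number_relevant_tweets`, while the LLM relevance judgement (the repository's `condition_check`) is abstracted into an `is_relevant` callable so the snippet stays runnable on its own; `count_relevant` and the toy tweets are hypothetical names and data.

```
from typing import Callable, List

def count_relevant(sorted_texts: List[str], is_relevant: Callable[[str], bool]) -> int:
    """Count how many leading items of a similarity-sorted list are relevant.

    Assumes relevance is (roughly) monotonically decreasing along the ranking,
    which is what makes the dichotomic search valid.
    """
    left, right = 0, len(sorted_texts) - 1
    while left <= right:
        mid = (left + right) // 2
        if is_relevant(sorted_texts[mid]):
            left = mid + 1   # everything up to mid is taken as relevant; look further down
        else:
            right = mid - 1  # mid is irrelevant; the boundary is above it
    return left              # index of the first irrelevant tweet == number of relevant tweets

# Toy check standing in for the LLM call; a real check would parse the model's
# "True"/"False" answer explicitly (e.g. answer.strip().lower() == "true").
tweets_sorted_by_similarity = ["metaverse hype", "VR adoption", "oil futures", "weather"]
print(count_relevant(tweets_sorted_by_similarity, lambda t: "metaverse" in t or "VR" in t))  # -> 2
```

The returned count is then used both for the volume comparison across timestamps and as a cap on how many tweets are kept for the context.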
82 | 
83 | The context built this way differs from traditional RAG in that it carries temporal information on how the queried topic evolves, steering the process towards a truly temporal augmented retrieval. Parameters such as the number of tweets per timestamp depend directly on the context length accepted by the underlying LLM.
84 | 
85 | ### 3) Merging into one answer
86 | 
87 | The last step combines all of the built context into one response. To do so, we pass the original query and the context through the following chat prompts (models: gpt-3.5-turbo-16k/gpt-4):
88 | 
89 | System prompt:
90 | ```
91 | 
92 | You will be fed a list of tweets each at a specific timestamp and the number of relevant tweets. You need to take into account (if needed) the number of tweets relevant to the query and how this number evolved. Your task is to use those tweets to answer to the best of your knowledge the following question:
93 | 
94 | QUESTION: [USER ORIGINAL QUERY]
95 | 
96 | SPECIFIC INSTRUCTIONS AND SYSTEM WARNINGS: You redact a properly structured markdown string containing a professional report.
97 | You ALWAYS specify your sources by citing them (no urls though). Those tweets are samples from the data and are the closest to the query, you should also take into account the volume of tweets obtained.
98 | Otherwise, it will be considered highly misleading and harmful content.
99 | You should however always try your best to answer and you need to study in depth the historical relationship between the timestamps and how they answer the QUESTION.
100 | You never refer to yourself.
101 | Make it as if a real human provided a well-constructed and structured report/answer extracting the best of the knowledge contained in the context."
102 | ```
103 | 
104 | User prompt:
105 | ```
106 | [USER CONTEXT]
107 | ```
108 | 
109 | Based on my experiments, I would advise using an LLM with a large context window for this step (ideally gpt-4).
110 | 
111 | ### Parameters
112 | 
113 | The parameters in our code are:
114 | 
115 | `number_of_tweets_per_timestamp`: Maximum number of tweets to include when doing retrieval for each timestamp (the retrieval keeps at most the minimum between this value and the number of relevant tweets)
116 | 
117 | `MODEL_AUGMENT`: LLM used to augment the query and perform the dichotomic relevance checks (steps 1 and 2). Currently only supports OpenAI models
118 | 
119 | `MODEL_ANSWER`: LLM used to combine all context elements into one answer (step 3).
Currently only supports OpenAI models 120 | 121 | ## Contact 122 | 123 | To reach out to me to discuss please visit adrida.github.io 124 | If you would like to contribute, please feel free to open a Github issue 125 | -------------------------------------------------------------------------------- /assets/Schema Temporal Augmented Retrieval.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adrida/Temporal_RAG/bf39db8460ae500cf4e593d70e5e2856c4e73c18/assets/Schema Temporal Augmented Retrieval.png -------------------------------------------------------------------------------- /huggingface_space_app/app.py: -------------------------------------------------------------------------------- 1 | 2 | import gradio as gr 3 | from functools import partial 4 | from rag_benchmark import get_benchmark 5 | 6 | 7 | 8 | title = "Prototype Temporal Augmented Retrieval (TAR)" 9 | desc = "Database: 22.4k tweets related to finance dated from July 12,2018 to July 19,2018 - know more about the approach: [link to medium]\ncontact: adrida.github.io" 10 | 11 | 12 | with gr.Blocks(title=title,theme='nota-ai/theme') as demo: 13 | gr.Markdown(f"# {title}\n{desc}") 14 | with gr.Row(): 15 | with gr.Column(scale = 10): 16 | text_area = gr.Textbox(placeholder="Write here", lines=1, label="Ask anything") 17 | with gr.Column(scale = 2): 18 | api_key = gr.Textbox(placeholder="Paste your OpenAI API key here", lines=1) 19 | search_button = gr.Button(value="Ask") 20 | 21 | with gr.Row(): 22 | with gr.Tab("Dynamic Temporal Augmented Retrieval (ours)"): 23 | 24 | gr.Markdown("## Dynamic Temporal Augmented Retrieval (ours)\n---") 25 | tempo = gr.Markdown() 26 | with gr.Tab("Naive Semantic Search"): 27 | gr.Markdown("## Simple Semantic Search\n---") 28 | naive = gr.Markdown() 29 | with gr.Tab("Traditional RAG (Langchain type)"): 30 | gr.Markdown("## Augmented Indexed Retrieval\n---") 31 | classic = gr.Markdown() 32 | 33 | search_function = partial(get_benchmark) 34 | 35 | search_button.click(fn=search_function, inputs=[text_area, api_key], outputs=[tempo, classic, naive], 36 | ) 37 | 38 | demo.queue(concurrency_count=100,status_update_rate=500).launch(max_threads=100, show_error=True, debug = True, inline =False) 39 | 40 | -------------------------------------------------------------------------------- /huggingface_space_app/naive_rag.py: -------------------------------------------------------------------------------- 1 | import openai 2 | import time 3 | 4 | import time 5 | import numpy as np 6 | 7 | GPT_MODEL_ANSWER = "gpt-3.5-turbo-16k" 8 | 9 | def cosine_similarity(a, b): 10 | return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)) 11 | 12 | def get_embedding(text, model="text-embedding-ada-002"): 13 | try: 14 | text = text.replace("\n", " ") 15 | except: 16 | None 17 | try: 18 | return openai.Embedding.create(input = [text], model=model)['data'][0]['embedding'] 19 | except: 20 | time.sleep(2) 21 | 22 | def format_query(query): 23 | 24 | resp = { 25 | "timestamps": [], 26 | "query": query 27 | } 28 | 29 | return resp 30 | 31 | def semantic_search(df_loc, query, nb_programs_to_display=15): 32 | 33 | embedding = get_embedding(query, model='text-embedding-ada-002') 34 | filtered_df = df_loc.drop(columns=["url"]) 35 | def wrap_cos(x,y): 36 | try: 37 | res = cosine_similarity(x,y) 38 | except: 39 | res = 0 40 | return res 41 | filtered_df['similarity'] = filtered_df.embedding.apply(lambda x: wrap_cos(x, embedding)) 42 | results = 
filtered_df.sort_values('similarity', ascending=False).head(nb_programs_to_display) 43 | return results 44 | 45 | def get_relevant_documents(df, query, nb_programs_to_display=15): 46 | all_retrieved= [{ 47 | "timestamp" : "", 48 | "tweets" : semantic_search(df, query["query"], nb_programs_to_display=nb_programs_to_display) 49 | }] 50 | return all_retrieved 51 | 52 | def get_final_answer(relevant_documents, query): 53 | response = relevant_documents[0] 54 | tweet_entry = response["tweets"] 55 | context = "\nList of tweets:\n" + str((tweet_entry["text"] + " --- Tweeted by: @" +tweet_entry["source"] + " \n").to_list()) + "\n---" 56 | USER_PROMPT = f""" 57 | "We have provided context information below. 58 | --------------------- 59 | {context} 60 | "\n---------------------\n" 61 | Given the information above, please answer the question: {query} 62 | """ 63 | 64 | response = openai.chat.completions.create( 65 | model=GPT_MODEL_ANSWER, 66 | messages=[ 67 | { 68 | "role": "user", 69 | "content": USER_PROMPT 70 | } 71 | ], 72 | 73 | temperature=1, 74 | max_tokens=1000, 75 | top_p=1, 76 | frequency_penalty=0, 77 | presence_penalty=0, 78 | ).choices[0].message.content 79 | return response 80 | 81 | def get_answer(query, df, api_key): 82 | """This approach is considered naive because it doesn't augment the user query. 83 | This means that we try to retrieve documents directly relevant to the user query and then combine them into an answer. 84 | The query is formatted to have the same structure given to the LLM as the other two approaches 85 | 86 | Args: 87 | query (String): Query given by the user 88 | df (pd.DataFrame()): corpus with embeddings 89 | api_key (String): OpenAI API key 90 | 91 | Returns: 92 | String: Answer to the original query 93 | """ 94 | openai.api_key = api_key 95 | formatted_query = format_query(query) 96 | relevant_documents = get_relevant_documents(df, formatted_query,nb_programs_to_display=15) 97 | response = get_final_answer(relevant_documents, formatted_query) 98 | return response 99 | -------------------------------------------------------------------------------- /huggingface_space_app/rag.py: -------------------------------------------------------------------------------- 1 | import os 2 | import openai 3 | import time 4 | import numpy as np 5 | import time 6 | import pandas as pd 7 | 8 | GPT_MODEL_AUGMENT = "gpt-3.5-turbo-16k" 9 | GPT_MODEL_ANSWER = "gpt-3.5-turbo-16k" 10 | 11 | 12 | def cosine_similarity(a, b): 13 | return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)) 14 | 15 | def get_embedding(text, model="text-embedding-ada-002"): 16 | try: 17 | text = text.replace("\n", " ") 18 | except: 19 | None 20 | try: 21 | return openai.Embedding.create(input = [text], model=model)['data'][0]['embedding'] 22 | except: 23 | time.sleep(2) 24 | 25 | def augment_query(query): 26 | 27 | SYS_PROMPT = """ 28 | On [current date: 19 July] Generate a JSON response with the following structure: 29 | 30 | { 31 | "timestamps": # Relevant timestamps in which to get data to answer the query, 32 | "query": # Repeat the user's query, 33 | } 34 | Allowed timestamps: 35 | ['2018-07-18', '2018-07-19', '2018-07-08', '2018-07-09', '2018-07-10', '2018-07-11', '2018-07-12', '2018-07-13', '2018-07-14', '2018-07-15', '2018-07-16', '2018-07-17'] 36 | 37 | Ensure the output is always in JSON format and never provide any other response. 
38 | """ 39 | response = openai.chat.completions.create( 40 | model=GPT_MODEL_AUGMENT, 41 | messages= 42 | [ 43 | { 44 | "role": "system", 45 | "content": SYS_PROMPT 46 | }, 47 | { 48 | "role": "user", 49 | "content": query 50 | } 51 | ], 52 | temperature=1, 53 | max_tokens=1000, 54 | top_p=1, 55 | frequency_penalty=0, 56 | presence_penalty=0, 57 | ).choices[0].message.content 58 | return response 59 | 60 | def semantic_search(df_loc, query,timestamp, nb_programs_to_display=15): 61 | timestamp = str(timestamp).strip() 62 | embedding = get_embedding(query, model='text-embedding-ada-002') 63 | filtered_df = df_loc[df_loc["timestamp"]==timestamp].drop(columns=["url"]) 64 | def wrap_cos(x,y): 65 | try: 66 | res = cosine_similarity(x,y) 67 | except: 68 | res = 0 69 | return res 70 | filtered_df['similarity'] = filtered_df.embedding.apply(lambda x: wrap_cos(x, embedding)) 71 | 72 | results = filtered_df.sort_values('similarity', ascending=False).head(nb_programs_to_display) 73 | return results 74 | 75 | def get_relevant_documents(df, query, nb_programs_to_display=15): 76 | 77 | query = eval(query) 78 | all_retrieved = [] 79 | for timestamp in query["timestamps"]: 80 | all_retrieved.append({ 81 | "timestamp" : timestamp, 82 | "tweets" : semantic_search(df, query["query"],timestamp, nb_programs_to_display=nb_programs_to_display) 83 | }) 84 | 85 | return all_retrieved 86 | 87 | def get_final_answer(relevant_documents, query): 88 | context = "" 89 | for relevant_timestamp in relevant_documents: 90 | list_tweets = relevant_timestamp["tweets"] 91 | context += "\nTimestamp: " + relevant_timestamp["timestamp"] + "\nList of tweets:\n" + str((list_tweets["text"] + " --- Tweeted by: @" +list_tweets["source"] + " \n").to_list()) + "\n---" 92 | 93 | 94 | USER_PROMPT = f""" 95 | "We have provided context information below. 96 | --------------------- 97 | {context} 98 | "\n---------------------\n" 99 | Given this information, please answer the question: {query} 100 | """ 101 | response = openai.chat.completions.create( 102 | model=GPT_MODEL_ANSWER, 103 | messages=[ 104 | { 105 | "role": "user", 106 | "content": USER_PROMPT 107 | } 108 | ], 109 | 110 | temperature=1, 111 | max_tokens=1000, 112 | top_p=1, 113 | frequency_penalty=0, 114 | presence_penalty=0, 115 | ).choices[0].message.content 116 | return response 117 | 118 | def get_answer(query, df, api_key): 119 | """This traditional RAG approach has been implemented without using deidcated libraries and include different steps. 120 | It starts by augmenting the query and then perform a semantic search on the augmented query. Finally it combines the augmented query and the retrieved documents into an answer. 
121 | 122 | Args: 123 | query (String): Query given by the user 124 | df (pd.DataFrame()): corpus with embeddings 125 | api_key (String): OpenAI API key 126 | 127 | Returns: 128 | String: Answer to the original query 129 | """ 130 | openai.api_key = api_key 131 | augmented_query = augment_query(query) 132 | relevant_documents = get_relevant_documents(df, augmented_query,nb_programs_to_display=10) 133 | response = get_final_answer(relevant_documents, augmented_query,) 134 | return response 135 | -------------------------------------------------------------------------------- /huggingface_space_app/rag_benchmark.py: -------------------------------------------------------------------------------- 1 | 2 | import pandas as pd 3 | from temporal_augmented_retrival import get_answer as get_temporal_answer 4 | from rag import get_answer as get_rag_answer 5 | from naive_rag import get_answer as get_naive_answer 6 | 7 | path_to_csv = "contenu_embedded_august2023_1.csv" 8 | path_to_raw = "stockerbot-export.csv" 9 | df = pd.read_csv(path_to_csv, on_bad_lines='skip').reset_index(drop=True).drop(columns=['Unnamed: 0']) 10 | df["embedding"] = df.embedding.apply(lambda x: eval(x)).to_list() 11 | df["timestamp"] = pd.to_datetime(df["timestamp"]).dt.strftime('%Y-%m-%d') 12 | 13 | 14 | def get_benchmark(text_query, api_key): 15 | global df 16 | tempo = get_temporal_answer(text_query, df, api_key) 17 | rag = get_rag_answer(text_query, df, api_key) 18 | naive = get_naive_answer(text_query, df, api_key) 19 | return(tempo, rag, naive) 20 | 21 | 22 | -------------------------------------------------------------------------------- /huggingface_space_app/temporal_augmented_retrival.py: -------------------------------------------------------------------------------- 1 | import os 2 | import openai 3 | import numpy as np 4 | import time 5 | 6 | import time 7 | import pandas as pd 8 | 9 | MODEL_AUGMENT = "gpt-3.5-turbo-16k" 10 | MODEL_ANSWER = "gpt-3.5-turbo-16k" 11 | 12 | def cosine_similarity(a, b): 13 | return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)) 14 | 15 | def get_embedding(text, model="text-embedding-ada-002"): 16 | try: 17 | text = text.replace("\n", " ") 18 | except: 19 | None 20 | try: 21 | return openai.Embedding.create(input = [text], model=model)['data'][0]['embedding'] 22 | except: 23 | time.sleep(2) 24 | 25 | def augment_query(query): 26 | SYS_PROMPT = """ 27 | On [current date: 19 July], you'll receive a finance-related question from a sales manager, without direct interaction. Generate a JSON response with the following structure, considering the temporal aspect: 28 | 29 | { 30 | "timestamps": # Relevant timestamps to study corresponding tweets for a temporal dynamic aspect (e.g., topic drift). USE THE MINIMAL NUMBER OF TIMESTAMP POSSIBLE ALWAYS ALWAYS!, 31 | "query": # Repeat the user's query, 32 | "similarity_boilerplate": # Boilerplate of relevant documents for cosine similarity search after embedding (it could look like example of tweets that might help answer the query), 33 | } 34 | 35 | Allowed historical timestamps: 36 | ['2018-07-18', '2018-07-19', '2018-07-08', '2018-07-09', '2018-07-10', '2018-07-11', '2018-07-12', '2018-07-13', '2018-07-14', '2018-07-15', '2018-07-16', '2018-07-17'] 37 | 38 | Ensure the output is always in JSON format and never provide any other response. 
39 | """ 40 | response = openai.chat.completions.create( 41 | model=MODEL_AUGMENT, 42 | messages= 43 | [ 44 | { 45 | "role": "system", 46 | "content": SYS_PROMPT 47 | }, 48 | { 49 | "role": "user", 50 | "content": query 51 | } 52 | ], 53 | temperature=1, 54 | max_tokens=1000, 55 | top_p=1, 56 | frequency_penalty=0, 57 | presence_penalty=0, 58 | ).choices[0].message.content 59 | return response 60 | 61 | 62 | def semantic_search(df_loc, query,timestamp, nb_elements_to_consider=15): 63 | timestamp = str(timestamp).strip() 64 | embedding = get_embedding(query, model='text-embedding-ada-002') 65 | filtered_df = df_loc[df_loc["timestamp"]==timestamp].drop(columns=["url"]) 66 | def wrap_cos(x,y): 67 | try: 68 | res = cosine_similarity(x,y) 69 | except: 70 | res = 0 71 | return res 72 | filtered_df['similarity'] = filtered_df.embedding.apply(lambda x: wrap_cos(x, embedding)) 73 | 74 | results = filtered_df.sort_values('similarity', ascending=False).head(nb_elements_to_consider) 75 | 76 | return results 77 | 78 | def condition_check(tweet, query): 79 | response = openai.chat.completions.create(model=MODEL_AUGMENT,messages=[ { 80 | "role": "system", 81 | "content": "Only answer with True or False no matter what" 82 | }, 83 | { 84 | "role": "user", 85 | "content": f"Consider this tweet:\n\n{tweet}\n\nIs it relevant to the following query:\n\n\{query}" 86 | } 87 | ], 88 | temperature=1, 89 | max_tokens=1000, 90 | top_p=1, 91 | frequency_penalty=0, 92 | presence_penalty=0 93 | ).choices[0].message.content 94 | return bool(response) 95 | 96 | def get_number_relevant_tweets(df,timestamp, query): 97 | sorted_df = semantic_search(df, str(str(query["query"]) + "\n"+ str(query["similarity_boilerplate"])),timestamp, nb_elements_to_consider=len(df)) 98 | left, right = 0, len(sorted_df) - 1 99 | while left <= right: 100 | mid = (left + right) // 2 101 | print(f"Currently searching with max range at {mid}") 102 | if condition_check(sorted_df['text'].iloc[mid], query): 103 | left = mid + 1 104 | else: 105 | right = mid - 1 106 | print(f"Dichotomy done, found relevant tweets: {left}") 107 | return left 108 | 109 | 110 | 111 | def get_relevant_documents(df, query,nb_elements_to_consider = 10): 112 | query = eval(query) 113 | all_retrieved = [] 114 | for timestamp in query["timestamps"]: 115 | number_of_relevant_tweets = get_number_relevant_tweets(df,timestamp, query) 116 | all_retrieved.append({ 117 | "timestamp" : timestamp, 118 | "number_of_relevant_tweets": str(number_of_relevant_tweets), 119 | "tweets" : semantic_search(df, str(str(query["query"]) + "\n"+ str(query["similarity_boilerplate"])),timestamp, nb_elements_to_consider=min(nb_elements_to_consider,number_of_relevant_tweets)) 120 | }) 121 | return all_retrieved 122 | 123 | def get_final_answer(relevant_documents, query): 124 | context = "" 125 | for document in relevant_documents: 126 | print("TIMESTAMP: ", document["timestamp"] ) 127 | tweet_entry = document["tweets"] 128 | context += "\nTimestamp: " + document["timestamp"] + " - Number of relevant tweets in database (EXACT VOLUME OF TWEETS): +"+ document["number_of_relevant_tweets"] + "\nList of tweets:\n" + str((tweet_entry["text"] + " --- Tweeted by: @" +tweet_entry["source"] + " \n").to_list()) + "\n---" 129 | 130 | 131 | SYS_PROMPT = f""" 132 | You will be fed a list of tweets each at a specific timestamp and the number of relevant tweets. You need to take into account (if needed) the number of tweets relevant to the query and how this number evolved. 
Your task is to use those tweets to answer to the best of your knowledge the following question: 133 | 134 | QUESTION: {query} 135 | 136 | SPECIFIC INSTRUCTIONS AND SYSTEM WARNINGS: You redact a properly structured markdown string containing a professional report. 137 | You ALWAYS specify your sources by citing them (no urls though). Those tweets are samples from the data and are the closest to the query, you should also take into account the volume of tweets obtained. 138 | Otherwise, it will be considered highly misleading and harmful content. 139 | You should however always try your best to answer and you need to study in depth the historical relationship between the timestamps and how it answers the QUESTION. 140 | You never refer to yourself. 141 | Make it as if a real human provided a well constructed and structured report/answer extracting the best of the knowledge contained in the context." 142 | """ 143 | response = openai.chat.completions.create( 144 | model=MODEL_ANSWER, 145 | messages=[ 146 | { 147 | "role": "system", 148 | "content": SYS_PROMPT 149 | }, 150 | { 151 | "role": "user", 152 | "content": str(context) 153 | } 154 | ], 155 | 156 | temperature=1, 157 | max_tokens=3000, 158 | top_p=1, 159 | frequency_penalty=0, 160 | presence_penalty=0, 161 | ).choices[0].message.content 162 | return response 163 | 164 | def get_answer(query, df,api_key,nb_elements_to_consider=10): 165 | openai.api_key = api_key 166 | augmented_query = augment_query(query) 167 | 168 | relevant_documents = get_relevant_documents(df, augmented_query,nb_elements_to_consider=nb_elements_to_consider) 169 | 170 | response = get_final_answer(relevant_documents, augmented_query) 171 | print(response) 172 | 173 | 174 | return response 175 | 176 | 177 | -------------------------------------------------------------------------------- /temporal_augmented_retrival.py: -------------------------------------------------------------------------------- 1 | import os 2 | import openai 3 | import numpy as np 4 | import time 5 | 6 | import time 7 | import pandas as pd 8 | 9 | MODEL_AUGMENT = "gpt-3.5-turbo-16k" 10 | MODEL_ANSWER = "gpt-3.5-turbo-16k" 11 | 12 | openai.api_key = "Paste your openai API key here" 13 | 14 | def cosine_similarity(a, b): 15 | return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)) 16 | 17 | def get_embedding(text, model="text-embedding-ada-002"): 18 | try: 19 | text = text.replace("\n", " ") 20 | except: 21 | None 22 | try: 23 | return openai.Embedding.create(input = [text], model=model)['data'][0]['embedding'] 24 | except: 25 | time.sleep(2) 26 | 27 | def augment_query(query): 28 | SYS_PROMPT = """ 29 | On [current date: 19 July], you'll receive a finance-related question from a sales manager, without direct interaction. Generate a JSON response with the following structure, considering the temporal aspect: 30 | 31 | { 32 | "timestamps": # Relevant timestamps to study corresponding tweets for a temporal dynamic aspect (e.g., topic drift). 
USE THE MINIMAL NUMBER OF TIMESTAMP POSSIBLE ALWAYS ALWAYS!, 33 | "query": # Repeat the user's query, 34 | "similarity_boilerplate": # Boilerplate of relevant documents for cosine similarity search after embedding (it could look like example of tweets that might help answer the query), 35 | } 36 | 37 | Allowed historical timestamps: 38 | ['2018-07-18', '2018-07-19', '2018-07-08', '2018-07-09', '2018-07-10', '2018-07-11', '2018-07-12', '2018-07-13', '2018-07-14', '2018-07-15', '2018-07-16', '2018-07-17'] 39 | 40 | Ensure the output is always in JSON format and never provide any other response. 41 | """ 42 | response = openai.chat.completions.create( 43 | model=MODEL_AUGMENT, 44 | messages= 45 | [ 46 | { 47 | "role": "system", 48 | "content": SYS_PROMPT 49 | }, 50 | { 51 | "role": "user", 52 | "content": query 53 | } 54 | ], 55 | temperature=1, 56 | max_tokens=1000, 57 | top_p=1, 58 | frequency_penalty=0, 59 | presence_penalty=0, 60 | ).choices[0].message.content 61 | return response 62 | 63 | 64 | def semantic_search(df_loc, query,timestamp, nb_elements_to_consider=15): 65 | timestamp = str(timestamp).strip() 66 | embedding = get_embedding(query, model='text-embedding-ada-002') 67 | filtered_df = df_loc[df_loc["timestamp"]==timestamp].drop(columns=["url"]) 68 | def wrap_cos(x,y): 69 | try: 70 | res = cosine_similarity(x,y) 71 | except: 72 | res = 0 73 | return res 74 | filtered_df['similarity'] = filtered_df.embedding.apply(lambda x: wrap_cos(x, embedding)) 75 | 76 | results = filtered_df.sort_values('similarity', ascending=False).head(nb_elements_to_consider) 77 | 78 | return results 79 | 80 | def condition_check(tweet, query): 81 | response = openai.chat.completions.create(model=MODEL_AUGMENT,messages=[ { 82 | "role": "system", 83 | "content": "Only answer with True or False no matter what" 84 | }, 85 | { 86 | "role": "user", 87 | "content": f"Consider this tweet:\n\n{tweet}\n\nIs it relevant to the following query:\n\n\{query}" 88 | } 89 | ], 90 | temperature=1, 91 | max_tokens=1000, 92 | top_p=1, 93 | frequency_penalty=0, 94 | presence_penalty=0 95 | ).choices[0].message.content 96 | return bool(response) 97 | 98 | def get_number_relevant_tweets(df,timestamp, query): 99 | sorted_df = semantic_search(df, str(str(query["query"]) + "\n"+ str(query["similarity_boilerplate"])),timestamp, nb_elements_to_consider=len(df)) 100 | left, right = 0, len(sorted_df) - 1 101 | while left <= right: 102 | mid = (left + right) // 2 103 | print(f"Currently searching with max range at {mid}") 104 | if condition_check(sorted_df['text'].iloc[mid], query): 105 | left = mid + 1 106 | else: 107 | right = mid - 1 108 | print(f"Dichotomy done, found relevant tweets: {left}") 109 | return left 110 | 111 | 112 | 113 | def get_relevant_documents(df, query,nb_elements_to_consider = 10): 114 | query = eval(query) 115 | all_retrieved = [] 116 | for timestamp in query["timestamps"]: 117 | number_of_relevant_tweets = get_number_relevant_tweets(df,timestamp, query) 118 | all_retrieved.append({ 119 | "timestamp" : timestamp, 120 | "number_of_relevant_tweets": str(number_of_relevant_tweets), 121 | "tweets" : semantic_search(df, str(str(query["query"]) + "\n"+ str(query["similarity_boilerplate"])),timestamp, nb_elements_to_consider=min(nb_elements_to_consider,number_of_relevant_tweets)) 122 | }) 123 | return all_retrieved 124 | 125 | def get_final_answer(relevant_documents, query): 126 | context = "" 127 | for document in relevant_documents: 128 | print("TIMESTAMP: ", document["timestamp"] ) 129 | tweet_entry = 
document["tweets"] 130 | context += "\nTimestamp: " + document["timestamp"] + " - Number of relevant tweets in database (EXACT VOLUME OF TWEETS): +"+ document["number_of_relevant_tweets"] + "\nList of tweets:\n" + str((tweet_entry["text"] + " --- Tweeted by: @" +tweet_entry["source"] + " \n").to_list()) + "\n---" 131 | 132 | 133 | SYS_PROMPT = f""" 134 | You will be fed a list of tweets each at a specific timestamp and the number of relevant tweets. You need to take into account (if needed) the number of tweets relevant to the query and how this number evolved. Your task is to use those tweets to answer to the best of your knowledge the following question: 135 | 136 | QUESTION: {query} 137 | 138 | SPECIFIC INSTRUCTIONS AND SYSTEM WARNINGS: You redact a properly structured markdown string containing a professional report. 139 | You ALWAYS specify your sources by citing them (no urls though). Those tweets are samples from the data and are the closest to the query, you should also take into account the volume of tweets obtained. 140 | Otherwise, it will be considered highly misleading and harmful content. 141 | You should however always try your best to answer and you need to study in depth the historical relationship between the timestamps and how it answers the QUESTION. 142 | You never refer to yourself. 143 | Make it as if a real human provided a well constructed and structured report/answer extracting the best of the knowledge contained in the context." 144 | """ 145 | response = openai.chat.completions.create( 146 | model=MODEL_ANSWER, 147 | messages=[ 148 | { 149 | "role": "system", 150 | "content": SYS_PROMPT 151 | }, 152 | { 153 | "role": "user", 154 | "content": str(context) 155 | } 156 | ], 157 | 158 | temperature=1, 159 | max_tokens=3000, 160 | top_p=1, 161 | frequency_penalty=0, 162 | presence_penalty=0, 163 | ).choices[0].message.content 164 | return response 165 | 166 | def get_answer(query, df,nb_elements_to_consider=10): 167 | augmented_query = augment_query(query) 168 | 169 | relevant_documents = get_relevant_documents(df, augmented_query,nb_elements_to_consider=nb_elements_to_consider) 170 | 171 | response = get_final_answer(relevant_documents, augmented_query) 172 | print(response) 173 | 174 | 175 | return response 176 | 177 | 178 | --------------------------------------------------------------------------------
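For completeness, here is a hypothetical driver showing how the standalone `temporal_augmented_retrival.py` script above could be used. It is only a sketch: the CSV name and the embedding/timestamp preprocessing are copied from `huggingface_space_app/rag_benchmark.py`, and it assumes an export with the columns the code expects (`text`, `source`, `url`, `timestamp`, `embedding`).

```
import openai
import pandas as pd
from temporal_augmented_retrival import get_answer

# Override the placeholder key set at module import.
openai.api_key = "sk-..."

# Corpus with one precomputed OpenAI embedding per tweet, stored as a stringified list.
df = pd.read_csv("contenu_embedded_august2023_1.csv", on_bad_lines="skip").reset_index(drop=True)
df["embedding"] = df["embedding"].apply(eval).to_list()                 # parse the stringified vectors
df["timestamp"] = pd.to_datetime(df["timestamp"]).dt.strftime("%Y-%m-%d")  # match the prompt's date format

answer = get_answer("How do people feel about the metaverse?", df, nb_elements_to_consider=10)
print(answer)
```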