├── .gitignore ├── README.md ├── assets └── Schema Temporal Augmented Retrieval.png ├── huggingface_space_app ├── app.py ├── naive_rag.py ├── rag.py ├── rag_benchmark.py └── temporal_augmented_retrival.py └── temporal_augmented_retrival.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | share/python-wheels/ 24 | *.egg-info/ 25 | .installed.cfg 26 | *.egg 27 | MANIFEST 28 | 29 | # PyInstaller 30 | # Usually these files are written by a python script from a template 31 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 32 | *.manifest 33 | *.spec 34 | 35 | # Installer logs 36 | pip-log.txt 37 | pip-delete-this-directory.txt 38 | 39 | # Unit test / coverage reports 40 | htmlcov/ 41 | .tox/ 42 | .nox/ 43 | .coverage 44 | .coverage.* 45 | .cache 46 | nosetests.xml 47 | coverage.xml 48 | *.cover 49 | *.py,cover 50 | .hypothesis/ 51 | .pytest_cache/ 52 | cover/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | .pybuilder/ 76 | target/ 77 | 78 | # Jupyter Notebook 79 | .ipynb_checkpoints 80 | 81 | # IPython 82 | profile_default/ 83 | ipython_config.py 84 | 85 | # pyenv 86 | # For a library or package, you might want to ignore these files since the code is 87 | # intended to run in multiple environments; otherwise, check them in: 88 | # .python-version 89 | 90 | # pipenv 91 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 92 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 93 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 94 | # install all needed dependencies. 95 | #Pipfile.lock 96 | 97 | # poetry 98 | # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. 99 | # This is especially recommended for binary packages to ensure reproducibility, and is more 100 | # commonly ignored for libraries. 101 | # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control 102 | #poetry.lock 103 | 104 | # pdm 105 | # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. 106 | #pdm.lock 107 | # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it 108 | # in version control. 109 | # https://pdm.fming.dev/#use-with-ide 110 | .pdm.toml 111 | 112 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow and github.com/pdm-project/pdm 113 | __pypackages__/ 114 | 115 | # Celery stuff 116 | celerybeat-schedule 117 | celerybeat.pid 118 | 119 | # SageMath parsed files 120 | *.sage.py 121 | 122 | # Environments 123 | .env 124 | .venv 125 | env/ 126 | venv/ 127 | ENV/ 128 | env.bak/ 129 | venv.bak/ 130 | 131 | # Spyder project settings 132 | .spyderproject 133 | .spyproject 134 | 135 | # Rope project settings 136 | .ropeproject 137 | 138 | # mkdocs documentation 139 | /site 140 | 141 | # mypy 142 | .mypy_cache/ 143 | .dmypy.json 144 | dmypy.json 145 | 146 | # Pyre type checker 147 | .pyre/ 148 | 149 | # pytype static type analyzer 150 | .pytype/ 151 | 152 | # Cython debug symbols 153 | cython_debug/ 154 | 155 | # PyCharm 156 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can 157 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore 158 | # and can be added to the global gitignore or merged into this file. For a more nuclear 159 | # option (not recommended) you can uncomment the following to ignore the entire idea folder. 160 | #.idea/ 161 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Temporal Augmented Retrieval 2 | 3 | This repository contains the code for a temporal augmented retrieval approach as well as a gradio app that is available through a [hugging face space here](https://huggingface.co/spaces/Adr740/Temporal-RAG-Benchmark). Link to the [medium article](https://adam-rida.medium.com/temporal-augmented-retrieval-tar-dynamic-rag-ad737506dfcc) 4 | 5 | This work has been conducted in the context of buildspace's nights and weekends program: https://buildspace.so/ 6 | ![TAR Schema](https://github.com/adrida/temporal_RAG/blob/master/assets/Schema%20Temporal%20Augmented%20Retrieval.png?raw=true) 7 | ## Why the need for temporal augmented retrieval 8 | 9 | The main advantage of having a temporal aspect is the ability to factor in temporal dynamics and changes of topics in the underlying data. We apply it on financial tweets in our example but the main use case would be client sales meeting notes. 10 | The use cases are as follows: 11 | ### Detect and anticipate market trends and movements 12 | Detecting emerging trends is one of the main reasons why we emphasize the need to include a dynamic aspect of RAG. Let's take a simple example. Say you have a high volume of temporal textual data (social media posts, client meeting notes,...). Let's consider that this data is particularly relevant because it contains market insights. 13 | 14 | Following this example, say we want to know the feeling around the topic "Metaverse". The query could look something like "Tell me how people are feeling about the metaverse". A traditional RAG approach will try to find documents to address directly the question. Meaning that they will look for documents individually talking about people's thoughts around the metaverse and not evaluate what each document is saying. 
More sophisticated RAG approaches, such as the ones implemented in langchain, go beyond matching topics to the query: they first use an LLM to augment the query with metadata and with examples of documents that might be of interest. For instance, the augmented query could look like this:
15 | 
16 | ```
17 | augmented query:
18 | - Original query: Tell me how people are feeling about the metaverse
19 | - Metadata: "date: past 5 days", "only documents tagged 'innovation'"
20 | - Examples: "Metaverse is a great opportunity", "Is Metaverse really about to change everything?"
21 | ```
22 | 
23 | This augmented query is then used to find relevant documents, which are combined into a context fed to an LLM that generates the final answer.
24 | 
25 | The main limitation of this approach is that it extracts knowledge from the data in a static manner, even when the data has been filtered by date through the metadata. The output tells you what people are saying about the metaverse rather than giving insights into how the topic is evolving.
26 | 
27 | In our case, what is of interest is not the fact that people have been talking about the Metaverse over the past 5 days, but rather whether they are talking about it more today than they were last month. The latter gives direct insight into the evolution of the topic and can help identify emerging trends or controversial topics. This can be used either to anticipate new trends, by designing products or marketing operations that make the most of them, or simply to mitigate incoming bad communication (bad PR).
28 | 
29 | Temporal Augmented Retrieval is the first proposal to try to address this issue and include a dynamic aspect.
30 | 
31 | ### Identify cross-selling opportunities
32 | By understanding how client discussions evolve through time, we can identify cross-selling opportunities. This is particularly relevant for use cases involving a vast number of clients and products. In large companies (especially in financial services), it is often impossible for a sales rep to know every product and partnership offered by the company, especially outside their scope of expertise. Digging through client meeting notes can help uncover clients who might benefit from a product offered by another business unit.
33 | An example query could look like this: "Do we have any clients interested in eco-friendly CRMs?"
34 | 
35 | The traditional RAG method already performs this important task quite well, but we want to make sure that Temporal Augmented Retrieval doesn't lose this ability when studying the data's dynamics.
36 | 
37 | 
38 | ## How does it work?
39 | This is a breakdown of how temporal augmented retrieval works. It follows the same global RAG structure (query augmentation, metadata, combining into context). The main differences lie in the metadata step and in the prompts used for the intermediate LLM calls. No RAG libraries (langchain, ...) and no vector database are used; everything is implemented with pandas and numpy. LLM calls currently go through the OpenAI API, but they are isolated and easy to swap out. The same goes for the semantic search operations: we use OpenAI embeddings, and the function is easily adaptable (a minimal sketch of this search is shown right below). A good future improvement would be to add more flexibility in which LLMs and embedding engines can be plugged in.
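The snippet below is a minimal, self-contained sketch of this pandas/numpy semantic search. The `cosine_similarity` helper and the `embedding` column mirror the repository's code; the `rank_by_similarity` name, the toy 3-dimensional vectors, and the sample tweets are purely illustrative (the actual code first embeds the query with OpenAI's `text-embedding-ada-002`).

```
import numpy as np
import pandas as pd

def cosine_similarity(a, b):
    # Same formula as the repository's helper.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def rank_by_similarity(df: pd.DataFrame, query_embedding, top_k: int = 15) -> pd.DataFrame:
    """Return the top_k rows of df most similar to the query embedding."""
    scored = df.copy()
    scored["similarity"] = scored["embedding"].apply(
        lambda emb: cosine_similarity(np.asarray(emb), np.asarray(query_embedding))
    )
    return scored.sort_values("similarity", ascending=False).head(top_k)

# Toy example with 3-dimensional vectors standing in for real OpenAI embeddings.
tweets = pd.DataFrame({
    "text": ["$META pushes new VR headset", "Oil prices dip again", "Metaverse land sales slow down"],
    "embedding": [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.8, 0.2, 0.1]],
})
print(rank_by_similarity(tweets, query_embedding=[1.0, 0.0, 0.0], top_k=2)[["text", "similarity"]])
```

In the repository, the same ranking is done per timestamp by filtering the DataFrame on its `timestamp` column before scoring.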
40 | 
41 | ### 1) Query Augmentation
42 | 
43 | As in traditional RAG, we augment the initial query to determine two main things: the relevant timestamps for a dynamic study, and examples on which to perform a semantic search. We emphasize the temporal aspect by specifying in the prompt that the timestamps should be chosen for a temporal study. We also provide in the prompt the list of unique timestamps available; this will later be parametrized in the function.
44 | 
45 | System prompt for query augmentation (models: gpt-3.5-turbo-16k/gpt-4):
46 | ```
47 | On [current date: 19 July], you'll receive a finance-related question from a sales manager, without direct interaction. Generate a JSON response with the following structure, considering the temporal aspect:
48 | 
49 | {
50 | "timestamps": # Relevant timestamps to study corresponding tweets for a temporal dynamic aspect (e.g., topic drift). USE THE MINIMAL NUMBER OF TIMESTAMP POSSIBLE ALWAYS ALWAYS!,
51 | "query": # Repeat the user's query,
52 | "similarity_boilerplate": # Boilerplate of relevant documents for cosine similarity search after embedding (it could look like an example of tweets that might help answer the query),
53 | }
54 | 
55 | Allowed historical timestamps:
56 | ['2018-07-18', '2018-07-19', '2018-07-08', '2018-07-09', '2018-07-10', '2018-07-11', '2018-07-12', '2018-07-13', '2018-07-14', '2018-07-15', '2018-07-16', '2018-07-17']
57 | 
58 | Ensure the output is always in JSON format and never provide any other response.
59 | ```
60 | 
61 | 
62 | ### 2) Meta-temporal data
63 | For the metadata, we use several subfunctions to generate insights on the dynamic aspect of the data. For each timestamp, this step outputs the number of directly relevant documents and the list of corresponding tweets (capped at 10, which is also a parameter).
64 | The idea is to let the final LLM combination step see both how the volume of tweets evolved and a sample of what those tweets actually look like.
65 | 
66 | - Number of relevant tweets
67 | The first step is to estimate, for each timestamp, how many documents directly address the query. This has two main advantages: first, it allows the semantic search retrieval to stop at the right number of similar documents and not include unrelated data; second, it provides a direct estimate of how many tweets are relevant, which makes time-wise volume comparisons possible.
68 | To find the number of relevant tweets, we first perform a semantic search restricted to the considered timestamp. The resulting list is sorted by similarity. We then run a dichotomic (binary) search over this list to find the last relevant tweet; the relevance check used at each step of this search is the `condition_check` function in the code (see the sketch after this list).
69 | 
70 | To do this dichotomic search we use the following prompts (models: gpt-3.5-turbo-16k/gpt-4):
71 | 
72 | System prompt: `Only answer with True or False no matter what`
73 | User chat prompt:
74 | ```
75 | Consider this tweet: [TWEET TO CHECK RELEVANCE]
76 | 
77 | Is it relevant to the following query: [AUGMENTED USER QUERY]
78 | ```
79 | 
80 | - List of relevant tweets
81 | To find the relevant tweets, we simply perform a traditional semantic search as in the previous step, but we only return the minimum between the number of relevant tweets and the parameter `number_of_tweets_per_timestamp`.
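As a rough sketch of the counting step above: the binary search below mirrors the loop in `get_number_relevant_tweets`, while the LLM relevance judgement (the repository's `condition_check`) is abstracted into an `is_relevant` callable so the snippet stays runnable on its own; `count_relevant` and the toy tweets are hypothetical names and data.

```
from typing import Callable, List

def count_relevant(sorted_texts: List[str], is_relevant: Callable[[str], bool]) -> int:
    """Count how many leading items of a similarity-sorted list are relevant.

    Assumes relevance is (roughly) monotonically decreasing along the ranking,
    which is what makes the dichotomic search valid.
    """
    left, right = 0, len(sorted_texts) - 1
    while left <= right:
        mid = (left + right) // 2
        if is_relevant(sorted_texts[mid]):
            left = mid + 1   # everything up to mid is taken as relevant; look further down
        else:
            right = mid - 1  # mid is irrelevant; the boundary is above it
    return left              # index of the first irrelevant tweet == number of relevant tweets

# Toy check standing in for the LLM call; a real check would parse the model's
# "True"/"False" answer explicitly (e.g. answer.strip().lower() == "true").
tweets_sorted_by_similarity = ["metaverse hype", "VR adoption", "oil futures", "weather"]
print(count_relevant(tweets_sorted_by_similarity, lambda t: "metaverse" in t or "VR" in t))  # -> 2
```

The returned count is then used both for the volume comparison across timestamps and as a cap on how many tweets are kept for the context.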
82 | 
83 | The context built this way differs from traditional RAG in that it carries temporal information on how the queried topic evolves, steering the process towards a truly temporal augmented retrieval. Parameters such as the number of tweets per timestamp depend directly on the context length accepted by the underlying LLM.
84 | 
85 | ### 3) Merging into one answer
86 | 
87 | The last step combines all of the built context into one response. To do so, we pass the original query and the context through the following chat prompts (models: gpt-3.5-turbo-16k/gpt-4):
88 | 
89 | System prompt:
90 | ```
91 | 
92 | You will be fed a list of tweets each at a specific timestamp and the number of relevant tweets. You need to take into account (if needed) the number of tweets relevant to the query and how this number evolved. Your task is to use those tweets to answer to the best of your knowledge the following question:
93 | 
94 | QUESTION: [USER ORIGINAL QUERY]
95 | 
96 | SPECIFIC INSTRUCTIONS AND SYSTEM WARNINGS: You redact a properly structured markdown string containing a professional report.
97 | You ALWAYS specify your sources by citing them (no urls though). Those tweets are samples from the data and are the closest to the query, you should also take into account the volume of tweets obtained.
98 | Otherwise, it will be considered highly misleading and harmful content.
99 | You should however always try your best to answer and you need to study in depth the historical relationship between the timestamps and how they answer the QUESTION.
100 | You never refer to yourself.
101 | Make it as if a real human provided a well-constructed and structured report/answer extracting the best of the knowledge contained in the context."
102 | ```
103 | 
104 | User prompt:
105 | ```
106 | [USER CONTEXT]
107 | ```
108 | 
109 | Based on my experiments, I would advise using an LLM with a large context window for this step (ideally gpt-4).
110 | 
111 | ### Parameters
112 | 
113 | The parameters in our code are:
114 | 
115 | `number_of_tweets_per_timestamp`: Maximum number of tweets to include when doing retrieval for each timestamp (the retrieval keeps at most the minimum between this value and the number of relevant tweets)
116 | 
117 | `MODEL_AUGMENT`: LLM used to augment the query and perform the dichotomic relevance checks (steps 1 and 2). Currently only supports OpenAI models
118 | 
119 | `MODEL_ANSWER`: LLM used to combine all context elements into one answer (step 3).
Currently only supports OpenAI models 120 | 121 | ## Contact 122 | 123 | To reach out to me to discuss please visit adrida.github.io 124 | If you would like to contribute, please feel free to open a Github issue 125 | -------------------------------------------------------------------------------- /assets/Schema Temporal Augmented Retrieval.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adrida/Temporal_RAG/bf39db8460ae500cf4e593d70e5e2856c4e73c18/assets/Schema Temporal Augmented Retrieval.png -------------------------------------------------------------------------------- /huggingface_space_app/app.py: -------------------------------------------------------------------------------- 1 | 2 | import gradio as gr 3 | from functools import partial 4 | from rag_benchmark import get_benchmark 5 | 6 | 7 | 8 | title = "Prototype Temporal Augmented Retrieval (TAR)" 9 | desc = "Database: 22.4k tweets related to finance dated from July 12,2018 to July 19,2018 - know more about the approach: [link to medium]\ncontact: adrida.github.io" 10 | 11 | 12 | with gr.Blocks(title=title,theme='nota-ai/theme') as demo: 13 | gr.Markdown(f"# {title}\n{desc}") 14 | with gr.Row(): 15 | with gr.Column(scale = 10): 16 | text_area = gr.Textbox(placeholder="Write here", lines=1, label="Ask anything") 17 | with gr.Column(scale = 2): 18 | api_key = gr.Textbox(placeholder="Paste your OpenAI API key here", lines=1) 19 | search_button = gr.Button(value="Ask") 20 | 21 | with gr.Row(): 22 | with gr.Tab("Dynamic Temporal Augmented Retrieval (ours)"): 23 | 24 | gr.Markdown("## Dynamic Temporal Augmented Retrieval (ours)\n---") 25 | tempo = gr.Markdown() 26 | with gr.Tab("Naive Semantic Search"): 27 | gr.Markdown("## Simple Semantic Search\n---") 28 | naive = gr.Markdown() 29 | with gr.Tab("Traditional RAG (Langchain type)"): 30 | gr.Markdown("## Augmented Indexed Retrieval\n---") 31 | classic = gr.Markdown() 32 | 33 | search_function = partial(get_benchmark) 34 | 35 | search_button.click(fn=search_function, inputs=[text_area, api_key], outputs=[tempo, classic, naive], 36 | ) 37 | 38 | demo.queue(concurrency_count=100,status_update_rate=500).launch(max_threads=100, show_error=True, debug = True, inline =False) 39 | 40 | -------------------------------------------------------------------------------- /huggingface_space_app/naive_rag.py: -------------------------------------------------------------------------------- 1 | import openai 2 | import time 3 | 4 | import time 5 | import numpy as np 6 | 7 | GPT_MODEL_ANSWER = "gpt-3.5-turbo-16k" 8 | 9 | def cosine_similarity(a, b): 10 | return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)) 11 | 12 | def get_embedding(text, model="text-embedding-ada-002"): 13 | try: 14 | text = text.replace("\n", " ") 15 | except: 16 | None 17 | try: 18 | return openai.Embedding.create(input = [text], model=model)['data'][0]['embedding'] 19 | except: 20 | time.sleep(2) 21 | 22 | def format_query(query): 23 | 24 | resp = { 25 | "timestamps": [], 26 | "query": query 27 | } 28 | 29 | return resp 30 | 31 | def semantic_search(df_loc, query, nb_programs_to_display=15): 32 | 33 | embedding = get_embedding(query, model='text-embedding-ada-002') 34 | filtered_df = df_loc.drop(columns=["url"]) 35 | def wrap_cos(x,y): 36 | try: 37 | res = cosine_similarity(x,y) 38 | except: 39 | res = 0 40 | return res 41 | filtered_df['similarity'] = filtered_df.embedding.apply(lambda x: wrap_cos(x, embedding)) 42 | results = 
filtered_df.sort_values('similarity', ascending=False).head(nb_programs_to_display) 43 | return results 44 | 45 | def get_relevant_documents(df, query, nb_programs_to_display=15): 46 | all_retrieved= [{ 47 | "timestamp" : "", 48 | "tweets" : semantic_search(df, query["query"], nb_programs_to_display=nb_programs_to_display) 49 | }] 50 | return all_retrieved 51 | 52 | def get_final_answer(relevant_documents, query): 53 | response = relevant_documents[0] 54 | tweet_entry = response["tweets"] 55 | context = "\nList of tweets:\n" + str((tweet_entry["text"] + " --- Tweeted by: @" +tweet_entry["source"] + " \n").to_list()) + "\n---" 56 | USER_PROMPT = f""" 57 | "We have provided context information below. 58 | --------------------- 59 | {context} 60 | "\n---------------------\n" 61 | Given the information above, please answer the question: {query} 62 | """ 63 | 64 | response = openai.chat.completions.create( 65 | model=GPT_MODEL_ANSWER, 66 | messages=[ 67 | { 68 | "role": "user", 69 | "content": USER_PROMPT 70 | } 71 | ], 72 | 73 | temperature=1, 74 | max_tokens=1000, 75 | top_p=1, 76 | frequency_penalty=0, 77 | presence_penalty=0, 78 | ).choices[0].message.content 79 | return response 80 | 81 | def get_answer(query, df, api_key): 82 | """This approach is considered naive because it doesn't augment the user query. 83 | This means that we try to retrieve documents directly relevant to the user query and then combine them into an answer. 84 | The query is formatted to have the same structure given to the LLM as the other two approaches 85 | 86 | Args: 87 | query (String): Query given by the user 88 | df (pd.DataFrame()): corpus with embeddings 89 | api_key (String): OpenAI API key 90 | 91 | Returns: 92 | String: Answer to the original query 93 | """ 94 | openai.api_key = api_key 95 | formatted_query = format_query(query) 96 | relevant_documents = get_relevant_documents(df, formatted_query,nb_programs_to_display=15) 97 | response = get_final_answer(relevant_documents, formatted_query) 98 | return response 99 | -------------------------------------------------------------------------------- /huggingface_space_app/rag.py: -------------------------------------------------------------------------------- 1 | import os 2 | import openai 3 | import time 4 | import numpy as np 5 | import time 6 | import pandas as pd 7 | 8 | GPT_MODEL_AUGMENT = "gpt-3.5-turbo-16k" 9 | GPT_MODEL_ANSWER = "gpt-3.5-turbo-16k" 10 | 11 | 12 | def cosine_similarity(a, b): 13 | return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)) 14 | 15 | def get_embedding(text, model="text-embedding-ada-002"): 16 | try: 17 | text = text.replace("\n", " ") 18 | except: 19 | None 20 | try: 21 | return openai.Embedding.create(input = [text], model=model)['data'][0]['embedding'] 22 | except: 23 | time.sleep(2) 24 | 25 | def augment_query(query): 26 | 27 | SYS_PROMPT = """ 28 | On [current date: 19 July] Generate a JSON response with the following structure: 29 | 30 | { 31 | "timestamps": # Relevant timestamps in which to get data to answer the query, 32 | "query": # Repeat the user's query, 33 | } 34 | Allowed timestamps: 35 | ['2018-07-18', '2018-07-19', '2018-07-08', '2018-07-09', '2018-07-10', '2018-07-11', '2018-07-12', '2018-07-13', '2018-07-14', '2018-07-15', '2018-07-16', '2018-07-17'] 36 | 37 | Ensure the output is always in JSON format and never provide any other response. 
38 | """ 39 | response = openai.chat.completions.create( 40 | model=GPT_MODEL_AUGMENT, 41 | messages= 42 | [ 43 | { 44 | "role": "system", 45 | "content": SYS_PROMPT 46 | }, 47 | { 48 | "role": "user", 49 | "content": query 50 | } 51 | ], 52 | temperature=1, 53 | max_tokens=1000, 54 | top_p=1, 55 | frequency_penalty=0, 56 | presence_penalty=0, 57 | ).choices[0].message.content 58 | return response 59 | 60 | def semantic_search(df_loc, query,timestamp, nb_programs_to_display=15): 61 | timestamp = str(timestamp).strip() 62 | embedding = get_embedding(query, model='text-embedding-ada-002') 63 | filtered_df = df_loc[df_loc["timestamp"]==timestamp].drop(columns=["url"]) 64 | def wrap_cos(x,y): 65 | try: 66 | res = cosine_similarity(x,y) 67 | except: 68 | res = 0 69 | return res 70 | filtered_df['similarity'] = filtered_df.embedding.apply(lambda x: wrap_cos(x, embedding)) 71 | 72 | results = filtered_df.sort_values('similarity', ascending=False).head(nb_programs_to_display) 73 | return results 74 | 75 | def get_relevant_documents(df, query, nb_programs_to_display=15): 76 | 77 | query = eval(query) 78 | all_retrieved = [] 79 | for timestamp in query["timestamps"]: 80 | all_retrieved.append({ 81 | "timestamp" : timestamp, 82 | "tweets" : semantic_search(df, query["query"],timestamp, nb_programs_to_display=nb_programs_to_display) 83 | }) 84 | 85 | return all_retrieved 86 | 87 | def get_final_answer(relevant_documents, query): 88 | context = "" 89 | for relevant_timestamp in relevant_documents: 90 | list_tweets = relevant_timestamp["tweets"] 91 | context += "\nTimestamp: " + relevant_timestamp["timestamp"] + "\nList of tweets:\n" + str((list_tweets["text"] + " --- Tweeted by: @" +list_tweets["source"] + " \n").to_list()) + "\n---" 92 | 93 | 94 | USER_PROMPT = f""" 95 | "We have provided context information below. 96 | --------------------- 97 | {context} 98 | "\n---------------------\n" 99 | Given this information, please answer the question: {query} 100 | """ 101 | response = openai.chat.completions.create( 102 | model=GPT_MODEL_ANSWER, 103 | messages=[ 104 | { 105 | "role": "user", 106 | "content": USER_PROMPT 107 | } 108 | ], 109 | 110 | temperature=1, 111 | max_tokens=1000, 112 | top_p=1, 113 | frequency_penalty=0, 114 | presence_penalty=0, 115 | ).choices[0].message.content 116 | return response 117 | 118 | def get_answer(query, df, api_key): 119 | """This traditional RAG approach has been implemented without using deidcated libraries and include different steps. 120 | It starts by augmenting the query and then perform a semantic search on the augmented query. Finally it combines the augmented query and the retrieved documents into an answer. 
121 | 122 | Args: 123 | query (String): Query given by the user 124 | df (pd.DataFrame()): corpus with embeddings 125 | api_key (String): OpenAI API key 126 | 127 | Returns: 128 | String: Answer to the original query 129 | """ 130 | openai.api_key = api_key 131 | augmented_query = augment_query(query) 132 | relevant_documents = get_relevant_documents(df, augmented_query,nb_programs_to_display=10) 133 | response = get_final_answer(relevant_documents, augmented_query,) 134 | return response 135 | -------------------------------------------------------------------------------- /huggingface_space_app/rag_benchmark.py: -------------------------------------------------------------------------------- 1 | 2 | import pandas as pd 3 | from temporal_augmented_retrival import get_answer as get_temporal_answer 4 | from rag import get_answer as get_rag_answer 5 | from naive_rag import get_answer as get_naive_answer 6 | 7 | path_to_csv = "contenu_embedded_august2023_1.csv" 8 | path_to_raw = "stockerbot-export.csv" 9 | df = pd.read_csv(path_to_csv, on_bad_lines='skip').reset_index(drop=True).drop(columns=['Unnamed: 0']) 10 | df["embedding"] = df.embedding.apply(lambda x: eval(x)).to_list() 11 | df["timestamp"] = pd.to_datetime(df["timestamp"]).dt.strftime('%Y-%m-%d') 12 | 13 | 14 | def get_benchmark(text_query, api_key): 15 | global df 16 | tempo = get_temporal_answer(text_query, df, api_key) 17 | rag = get_rag_answer(text_query, df, api_key) 18 | naive = get_naive_answer(text_query, df, api_key) 19 | return(tempo, rag, naive) 20 | 21 | 22 | -------------------------------------------------------------------------------- /huggingface_space_app/temporal_augmented_retrival.py: -------------------------------------------------------------------------------- 1 | import os 2 | import openai 3 | import numpy as np 4 | import time 5 | 6 | import time 7 | import pandas as pd 8 | 9 | MODEL_AUGMENT = "gpt-3.5-turbo-16k" 10 | MODEL_ANSWER = "gpt-3.5-turbo-16k" 11 | 12 | def cosine_similarity(a, b): 13 | return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)) 14 | 15 | def get_embedding(text, model="text-embedding-ada-002"): 16 | try: 17 | text = text.replace("\n", " ") 18 | except: 19 | None 20 | try: 21 | return openai.Embedding.create(input = [text], model=model)['data'][0]['embedding'] 22 | except: 23 | time.sleep(2) 24 | 25 | def augment_query(query): 26 | SYS_PROMPT = """ 27 | On [current date: 19 July], you'll receive a finance-related question from a sales manager, without direct interaction. Generate a JSON response with the following structure, considering the temporal aspect: 28 | 29 | { 30 | "timestamps": # Relevant timestamps to study corresponding tweets for a temporal dynamic aspect (e.g., topic drift). USE THE MINIMAL NUMBER OF TIMESTAMP POSSIBLE ALWAYS ALWAYS!, 31 | "query": # Repeat the user's query, 32 | "similarity_boilerplate": # Boilerplate of relevant documents for cosine similarity search after embedding (it could look like example of tweets that might help answer the query), 33 | } 34 | 35 | Allowed historical timestamps: 36 | ['2018-07-18', '2018-07-19', '2018-07-08', '2018-07-09', '2018-07-10', '2018-07-11', '2018-07-12', '2018-07-13', '2018-07-14', '2018-07-15', '2018-07-16', '2018-07-17'] 37 | 38 | Ensure the output is always in JSON format and never provide any other response. 
39 | """ 40 | response = openai.chat.completions.create( 41 | model=MODEL_AUGMENT, 42 | messages= 43 | [ 44 | { 45 | "role": "system", 46 | "content": SYS_PROMPT 47 | }, 48 | { 49 | "role": "user", 50 | "content": query 51 | } 52 | ], 53 | temperature=1, 54 | max_tokens=1000, 55 | top_p=1, 56 | frequency_penalty=0, 57 | presence_penalty=0, 58 | ).choices[0].message.content 59 | return response 60 | 61 | 62 | def semantic_search(df_loc, query,timestamp, nb_elements_to_consider=15): 63 | timestamp = str(timestamp).strip() 64 | embedding = get_embedding(query, model='text-embedding-ada-002') 65 | filtered_df = df_loc[df_loc["timestamp"]==timestamp].drop(columns=["url"]) 66 | def wrap_cos(x,y): 67 | try: 68 | res = cosine_similarity(x,y) 69 | except: 70 | res = 0 71 | return res 72 | filtered_df['similarity'] = filtered_df.embedding.apply(lambda x: wrap_cos(x, embedding)) 73 | 74 | results = filtered_df.sort_values('similarity', ascending=False).head(nb_elements_to_consider) 75 | 76 | return results 77 | 78 | def condition_check(tweet, query): 79 | response = openai.chat.completions.create(model=MODEL_AUGMENT,messages=[ { 80 | "role": "system", 81 | "content": "Only answer with True or False no matter what" 82 | }, 83 | { 84 | "role": "user", 85 | "content": f"Consider this tweet:\n\n{tweet}\n\nIs it relevant to the following query:\n\n\{query}" 86 | } 87 | ], 88 | temperature=1, 89 | max_tokens=1000, 90 | top_p=1, 91 | frequency_penalty=0, 92 | presence_penalty=0 93 | ).choices[0].message.content 94 | return bool(response) 95 | 96 | def get_number_relevant_tweets(df,timestamp, query): 97 | sorted_df = semantic_search(df, str(str(query["query"]) + "\n"+ str(query["similarity_boilerplate"])),timestamp, nb_elements_to_consider=len(df)) 98 | left, right = 0, len(sorted_df) - 1 99 | while left <= right: 100 | mid = (left + right) // 2 101 | print(f"Currently searching with max range at {mid}") 102 | if condition_check(sorted_df['text'].iloc[mid], query): 103 | left = mid + 1 104 | else: 105 | right = mid - 1 106 | print(f"Dichotomy done, found relevant tweets: {left}") 107 | return left 108 | 109 | 110 | 111 | def get_relevant_documents(df, query,nb_elements_to_consider = 10): 112 | query = eval(query) 113 | all_retrieved = [] 114 | for timestamp in query["timestamps"]: 115 | number_of_relevant_tweets = get_number_relevant_tweets(df,timestamp, query) 116 | all_retrieved.append({ 117 | "timestamp" : timestamp, 118 | "number_of_relevant_tweets": str(number_of_relevant_tweets), 119 | "tweets" : semantic_search(df, str(str(query["query"]) + "\n"+ str(query["similarity_boilerplate"])),timestamp, nb_elements_to_consider=min(nb_elements_to_consider,number_of_relevant_tweets)) 120 | }) 121 | return all_retrieved 122 | 123 | def get_final_answer(relevant_documents, query): 124 | context = "" 125 | for document in relevant_documents: 126 | print("TIMESTAMP: ", document["timestamp"] ) 127 | tweet_entry = document["tweets"] 128 | context += "\nTimestamp: " + document["timestamp"] + " - Number of relevant tweets in database (EXACT VOLUME OF TWEETS): +"+ document["number_of_relevant_tweets"] + "\nList of tweets:\n" + str((tweet_entry["text"] + " --- Tweeted by: @" +tweet_entry["source"] + " \n").to_list()) + "\n---" 129 | 130 | 131 | SYS_PROMPT = f""" 132 | You will be fed a list of tweets each at a specific timestamp and the number of relevant tweets. You need to take into account (if needed) the number of tweets relevant to the query and how this number evolved. 
Your task is to use those tweets to answer to the best of your knowledge the following question: 133 | 134 | QUESTION: {query} 135 | 136 | SPECIFIC INSTRUCTIONS AND SYSTEM WARNINGS: You redact a properly structured markdown string containing a professional report. 137 | You ALWAYS specify your sources by citing them (no urls though). Those tweets are samples from the data and are the closest to the query, you should also take into account the volume of tweets obtained. 138 | Otherwise, it will be considered highly misleading and harmful content. 139 | You should however always try your best to answer and you need to study in depth the historical relationship between the timestamps and how it answers the QUESTION. 140 | You never refer to yourself. 141 | Make it as if a real human provided a well constructed and structured report/answer extracting the best of the knowledge contained in the context." 142 | """ 143 | response = openai.chat.completions.create( 144 | model=MODEL_ANSWER, 145 | messages=[ 146 | { 147 | "role": "system", 148 | "content": SYS_PROMPT 149 | }, 150 | { 151 | "role": "user", 152 | "content": str(context) 153 | } 154 | ], 155 | 156 | temperature=1, 157 | max_tokens=3000, 158 | top_p=1, 159 | frequency_penalty=0, 160 | presence_penalty=0, 161 | ).choices[0].message.content 162 | return response 163 | 164 | def get_answer(query, df,api_key,nb_elements_to_consider=10): 165 | openai.api_key = api_key 166 | augmented_query = augment_query(query) 167 | 168 | relevant_documents = get_relevant_documents(df, augmented_query,nb_elements_to_consider=nb_elements_to_consider) 169 | 170 | response = get_final_answer(relevant_documents, augmented_query) 171 | print(response) 172 | 173 | 174 | return response 175 | 176 | 177 | -------------------------------------------------------------------------------- /temporal_augmented_retrival.py: -------------------------------------------------------------------------------- 1 | import os 2 | import openai 3 | import numpy as np 4 | import time 5 | 6 | import time 7 | import pandas as pd 8 | 9 | MODEL_AUGMENT = "gpt-3.5-turbo-16k" 10 | MODEL_ANSWER = "gpt-3.5-turbo-16k" 11 | 12 | openai.api_key = "Paste your openai API key here" 13 | 14 | def cosine_similarity(a, b): 15 | return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)) 16 | 17 | def get_embedding(text, model="text-embedding-ada-002"): 18 | try: 19 | text = text.replace("\n", " ") 20 | except: 21 | None 22 | try: 23 | return openai.Embedding.create(input = [text], model=model)['data'][0]['embedding'] 24 | except: 25 | time.sleep(2) 26 | 27 | def augment_query(query): 28 | SYS_PROMPT = """ 29 | On [current date: 19 July], you'll receive a finance-related question from a sales manager, without direct interaction. Generate a JSON response with the following structure, considering the temporal aspect: 30 | 31 | { 32 | "timestamps": # Relevant timestamps to study corresponding tweets for a temporal dynamic aspect (e.g., topic drift). 
USE THE MINIMAL NUMBER OF TIMESTAMP POSSIBLE ALWAYS ALWAYS!, 33 | "query": # Repeat the user's query, 34 | "similarity_boilerplate": # Boilerplate of relevant documents for cosine similarity search after embedding (it could look like example of tweets that might help answer the query), 35 | } 36 | 37 | Allowed historical timestamps: 38 | ['2018-07-18', '2018-07-19', '2018-07-08', '2018-07-09', '2018-07-10', '2018-07-11', '2018-07-12', '2018-07-13', '2018-07-14', '2018-07-15', '2018-07-16', '2018-07-17'] 39 | 40 | Ensure the output is always in JSON format and never provide any other response. 41 | """ 42 | response = openai.chat.completions.create( 43 | model=MODEL_AUGMENT, 44 | messages= 45 | [ 46 | { 47 | "role": "system", 48 | "content": SYS_PROMPT 49 | }, 50 | { 51 | "role": "user", 52 | "content": query 53 | } 54 | ], 55 | temperature=1, 56 | max_tokens=1000, 57 | top_p=1, 58 | frequency_penalty=0, 59 | presence_penalty=0, 60 | ).choices[0].message.content 61 | return response 62 | 63 | 64 | def semantic_search(df_loc, query,timestamp, nb_elements_to_consider=15): 65 | timestamp = str(timestamp).strip() 66 | embedding = get_embedding(query, model='text-embedding-ada-002') 67 | filtered_df = df_loc[df_loc["timestamp"]==timestamp].drop(columns=["url"]) 68 | def wrap_cos(x,y): 69 | try: 70 | res = cosine_similarity(x,y) 71 | except: 72 | res = 0 73 | return res 74 | filtered_df['similarity'] = filtered_df.embedding.apply(lambda x: wrap_cos(x, embedding)) 75 | 76 | results = filtered_df.sort_values('similarity', ascending=False).head(nb_elements_to_consider) 77 | 78 | return results 79 | 80 | def condition_check(tweet, query): 81 | response = openai.chat.completions.create(model=MODEL_AUGMENT,messages=[ { 82 | "role": "system", 83 | "content": "Only answer with True or False no matter what" 84 | }, 85 | { 86 | "role": "user", 87 | "content": f"Consider this tweet:\n\n{tweet}\n\nIs it relevant to the following query:\n\n\{query}" 88 | } 89 | ], 90 | temperature=1, 91 | max_tokens=1000, 92 | top_p=1, 93 | frequency_penalty=0, 94 | presence_penalty=0 95 | ).choices[0].message.content 96 | return bool(response) 97 | 98 | def get_number_relevant_tweets(df,timestamp, query): 99 | sorted_df = semantic_search(df, str(str(query["query"]) + "\n"+ str(query["similarity_boilerplate"])),timestamp, nb_elements_to_consider=len(df)) 100 | left, right = 0, len(sorted_df) - 1 101 | while left <= right: 102 | mid = (left + right) // 2 103 | print(f"Currently searching with max range at {mid}") 104 | if condition_check(sorted_df['text'].iloc[mid], query): 105 | left = mid + 1 106 | else: 107 | right = mid - 1 108 | print(f"Dichotomy done, found relevant tweets: {left}") 109 | return left 110 | 111 | 112 | 113 | def get_relevant_documents(df, query,nb_elements_to_consider = 10): 114 | query = eval(query) 115 | all_retrieved = [] 116 | for timestamp in query["timestamps"]: 117 | number_of_relevant_tweets = get_number_relevant_tweets(df,timestamp, query) 118 | all_retrieved.append({ 119 | "timestamp" : timestamp, 120 | "number_of_relevant_tweets": str(number_of_relevant_tweets), 121 | "tweets" : semantic_search(df, str(str(query["query"]) + "\n"+ str(query["similarity_boilerplate"])),timestamp, nb_elements_to_consider=min(nb_elements_to_consider,number_of_relevant_tweets)) 122 | }) 123 | return all_retrieved 124 | 125 | def get_final_answer(relevant_documents, query): 126 | context = "" 127 | for document in relevant_documents: 128 | print("TIMESTAMP: ", document["timestamp"] ) 129 | tweet_entry = 
document["tweets"] 130 | context += "\nTimestamp: " + document["timestamp"] + " - Number of relevant tweets in database (EXACT VOLUME OF TWEETS): +"+ document["number_of_relevant_tweets"] + "\nList of tweets:\n" + str((tweet_entry["text"] + " --- Tweeted by: @" +tweet_entry["source"] + " \n").to_list()) + "\n---" 131 | 132 | 133 | SYS_PROMPT = f""" 134 | You will be fed a list of tweets each at a specific timestamp and the number of relevant tweets. You need to take into account (if needed) the number of tweets relevant to the query and how this number evolved. Your task is to use those tweets to answer to the best of your knowledge the following question: 135 | 136 | QUESTION: {query} 137 | 138 | SPECIFIC INSTRUCTIONS AND SYSTEM WARNINGS: You redact a properly structured markdown string containing a professional report. 139 | You ALWAYS specify your sources by citing them (no urls though). Those tweets are samples from the data and are the closest to the query, you should also take into account the volume of tweets obtained. 140 | Otherwise, it will be considered highly misleading and harmful content. 141 | You should however always try your best to answer and you need to study in depth the historical relationship between the timestamps and how it answers the QUESTION. 142 | You never refer to yourself. 143 | Make it as if a real human provided a well constructed and structured report/answer extracting the best of the knowledge contained in the context." 144 | """ 145 | response = openai.chat.completions.create( 146 | model=MODEL_ANSWER, 147 | messages=[ 148 | { 149 | "role": "system", 150 | "content": SYS_PROMPT 151 | }, 152 | { 153 | "role": "user", 154 | "content": str(context) 155 | } 156 | ], 157 | 158 | temperature=1, 159 | max_tokens=3000, 160 | top_p=1, 161 | frequency_penalty=0, 162 | presence_penalty=0, 163 | ).choices[0].message.content 164 | return response 165 | 166 | def get_answer(query, df,nb_elements_to_consider=10): 167 | augmented_query = augment_query(query) 168 | 169 | relevant_documents = get_relevant_documents(df, augmented_query,nb_elements_to_consider=nb_elements_to_consider) 170 | 171 | response = get_final_answer(relevant_documents, augmented_query) 172 | print(response) 173 | 174 | 175 | return response 176 | 177 | 178 | --------------------------------------------------------------------------------
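For completeness, here is a hypothetical driver showing how the standalone `temporal_augmented_retrival.py` script above could be used. It is only a sketch: the CSV name and the embedding/timestamp preprocessing are copied from `huggingface_space_app/rag_benchmark.py`, and it assumes an export with the columns the code expects (`text`, `source`, `url`, `timestamp`, `embedding`).

```
import openai
import pandas as pd
from temporal_augmented_retrival import get_answer

# Override the placeholder key set at module import.
openai.api_key = "sk-..."

# Corpus with one precomputed OpenAI embedding per tweet, stored as a stringified list.
df = pd.read_csv("contenu_embedded_august2023_1.csv", on_bad_lines="skip").reset_index(drop=True)
df["embedding"] = df["embedding"].apply(eval).to_list()                 # parse the stringified vectors
df["timestamp"] = pd.to_datetime(df["timestamp"]).dt.strftime("%Y-%m-%d")  # match the prompt's date format

answer = get_answer("How do people feel about the metaverse?", df, nb_elements_to_consider=10)
print(answer)
```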