├── .dockerignore ├── Dockerfile ├── README.md ├── config.yml ├── images ├── beautiful_maestro.png ├── beautiful_maestro_2.png ├── llava.png ├── maestro.png ├── memformer.png └── streamlit_app.png ├── requirements.txt └── src ├── app.py ├── data └── .gitignore └── utils ├── __pycache__ ├── app_utils.cpython-312.pyc ├── ddgs.cpython-312.pyc ├── load_config.cpython-312.pyc └── web_search.cpython-312.pyc ├── app_utils.py ├── arxiv_scraper.py └── load_config.py /.dockerignore: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM python:3.12.1 2 | 3 | WORKDIR /code 4 | 5 | COPY ./requirements.txt /code/requirements.txt 6 | COPY ./config.yml /code/config.yml 7 | COPY ./src /code/src/ 8 | COPY ./.streamlit /code/.streamlit/ 9 | COPY ./images/maestro.png /code/images/maestro.png 10 | COPY ./images/streamlit_app.png /code/images/streamlit_app.png 11 | 12 | EXPOSE 8501 13 | 14 | RUN pip install --no-cache-dir -r /code/requirements.txt 15 | 16 | 17 | CMD ["streamlit", "run", "src/app.py"] -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # RAG - Maestro 2 | RAG-Maestro is an up-to-date LLM assistant designed to provide clear and concise explanations of scientific concepts **and relevant papers**. As a Q&A bot, it does not keep track of your conversation and will treat each input independently. 3 | 4 | 5 | ![maestro](images/beautiful_maestro_2.png) 6 | 7 | 8 | # Examples 9 | 10 | What is LLava? | Do you know what the Memformer is? 
11 | :-------------------------:|:-------------------------: 12 | ![llava](images/llava.png) | ![memformer](images/memformer.png) 13 | 14 | # Implementation Details 15 | 16 | The bot is composed of three building blocks that work sequentially: 17 | 18 | ### A keyword extractor 19 | 20 | RAG-Maestro's first task is to extract the keywords to search for from your request. The [RAKE](https://www.analyticsvidhya.com/blog/2021/10/rapid-keyword-extraction-rake-algorithm-in-natural-language-processing/) algorithm (Rapid Automatic Keyword Extraction), backed by `nltk`, is used. 21 | 22 | ### A paper browser 23 | 24 | Once the keywords are extracted, they are used to retrieve the 5 most relevant papers from [arxiv.org](https://www.arxiv.org/). These papers are then downloaded and scraped. 25 | 26 | To build the scraper, I used the open-source `arxiv` API client and `PyPDF2` to ease PDF reading. 27 | 28 | ### A RAG pipeline ([paper](https://arxiv.org/pdf/2005.11401.pdf)) 29 | 30 | This pipeline retrieves the information most relevant to the query from the scraped papers and uses it as context for summarization. One of the main features I implemented (through prompt engineering) is that the bot *cites its sources*, so it becomes possible to assess the veracity of the provided answer. The pipeline uses OpenAI models (`embedding-v3` and `gpt-3.5-turbo`) for the retrieval and generation steps. Like every LLM, RAG-Maestro can be subject to hallucinations. **Making it cite its sources helps detect them**. 31 | 32 | 33 | I used [llama_index](https://docs.llamaindex.ai/en/stable/) to build the RAG pipeline, specifically picking the `tree_summarize` response mode for the query engine to generate the answer. All the hyperparameters are stored in an editable `config.yml` file.
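The keyword-extraction step above can be sketched in a few lines. This is a simplified, pure-Python stand-in for RAKE (the app itself uses `rake_nltk`; the small stopword list here is an illustrative assumption, not the `nltk` one): RAKE's core idea is to split the text on stopwords and keep the remaining runs of words as candidate phrases.

```python
# Simplified stand-in for the RAKE keyword-extraction step.
# NOTE: this stopword list is an illustrative assumption; the real app
# uses rake_nltk with nltk's full English stopword list.
STOPWORDS = {
    "what", "is", "the", "a", "an", "do", "does", "you",
    "know", "how", "to", "me", "explain", "it",
}

def refine_query(query: str) -> str:
    """Split the query on stopwords and keep the remaining phrases."""
    words = [w.strip("?.,!").lower() for w in query.split()]
    phrases, current = [], []
    for w in words:
        if w in STOPWORDS:
            if current:
                phrases.append(" ".join(current))
                current = []
        else:
            current.append(w)
    if current:
        phrases.append(" ".join(current))
    return " ".join(phrases)

print(refine_query("What is the Memformer?"))  # -> memformer
print(refine_query("How does RAG work?"))      # -> rag work
```

The resulting string is what gets passed to the arxiv search in `src/utils/arxiv_scraper.py`.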
34 | 35 | # Commands 36 | 37 | 38 | ### Running the app locally from this repository 39 | - Clone this repository 40 | - Create a new Python environment with `pip` available 41 | - Run `pip install -r requirements.txt` 42 | - Run `streamlit run src/app.py` 43 | - Open the 'External URL' in your browser. Enjoy the bot. 44 | 45 | ![Streamlit app](images/streamlit_app.png) -------------------------------------------------------------------------------- /config.yml: -------------------------------------------------------------------------------- 1 | gpt_model: gpt-3.5-turbo 2 | temperature: 0.9 3 | max_tokens: 1000 4 | chunk_size: 500 5 | similarity_top_k: 5 6 | articles_to_search: 5 7 | llm_system_role: 8 | "As a chatbot, your goal is to respond to the user's question respectfully and concisely.\ 9 | You will receive the user's new query, along with 5 articles retrieved for that query.\ 10 | Answer the user with the most relevant information. After answering, cite your sources and provide the url." 11 | llm_format_output: 12 | " \\ 13 | #Citing sources\ 14 | After giving your final answer, you will cite your sources the following way:\ 15 | 'REFERENCES: \ 16 | Title of article -> url \ 17 | Title of article -> url \ 18 | etc...'
" 19 | -------------------------------------------------------------------------------- /images/beautiful_maestro.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AymenKallala/RAG_Maestro/78a058c202920d5c6cb2b475cdb334ef800bbfa3/images/beautiful_maestro.png -------------------------------------------------------------------------------- /images/beautiful_maestro_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AymenKallala/RAG_Maestro/78a058c202920d5c6cb2b475cdb334ef800bbfa3/images/beautiful_maestro_2.png -------------------------------------------------------------------------------- /images/llava.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AymenKallala/RAG_Maestro/78a058c202920d5c6cb2b475cdb334ef800bbfa3/images/llava.png -------------------------------------------------------------------------------- /images/maestro.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AymenKallala/RAG_Maestro/78a058c202920d5c6cb2b475cdb334ef800bbfa3/images/maestro.png -------------------------------------------------------------------------------- /images/memformer.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AymenKallala/RAG_Maestro/78a058c202920d5c6cb2b475cdb334ef800bbfa3/images/memformer.png -------------------------------------------------------------------------------- /images/streamlit_app.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AymenKallala/RAG_Maestro/78a058c202920d5c6cb2b475cdb334ef800bbfa3/images/streamlit_app.png -------------------------------------------------------------------------------- /requirements.txt: 
-------------------------------------------------------------------------------- 1 | aiohttp==3.9.1 2 | aiosignal==1.3.1 3 | altair==5.2.0 4 | annotated-types==0.6.0 5 | anyio==4.2.0 6 | arxiv==2.1.0 7 | attrs==23.2.0 8 | Automat==22.10.0 9 | beautifulsoup4==4.12.2 10 | black==23.12.1 11 | blinker==1.7.0 12 | cachetools==5.3.2 13 | certifi==2023.11.17 14 | cffi==1.16.0 15 | chardet==5.2.0 16 | charset-normalizer==3.3.2 17 | click==8.1.7 18 | constantly==23.10.4 19 | cryptography==41.0.7 20 | cssselect==1.2.0 21 | curl-cffi==0.5.10 22 | dataclasses-json==0.6.3 23 | Deprecated==1.2.14 24 | distro==1.9.0 25 | duckduckgo_search==4.1.1 26 | feedparser==6.0.10 27 | filelock==3.13.1 28 | frozenlist==1.4.1 29 | fsspec==2023.12.2 30 | gitdb==4.0.11 31 | GitPython==3.1.40 32 | greenlet==3.0.3 33 | h11==0.14.0 34 | httpcore==1.0.2 35 | httpx==0.26.0 36 | hyperlink==21.0.0 37 | idna==3.6 38 | importlib-metadata==6.11.0 39 | incremental==22.10.0 40 | itemadapter==0.8.0 41 | itemloaders==1.1.0 42 | Jinja2==3.1.2 43 | jmespath==1.0.1 44 | joblib==1.3.2 45 | jsonschema==4.20.0 46 | jsonschema-specifications==2023.12.1 47 | llama-index==0.9.26 48 | lxml==5.0.1 49 | markdown-it-py==3.0.0 50 | MarkupSafe==2.1.3 51 | marshmallow==3.20.1 52 | mdurl==0.1.2 53 | multidict==6.0.4 54 | mypy-extensions==1.0.0 55 | nltk==3.8.1 56 | numpy==1.26.3 57 | openai==1.6.1 58 | outcome==1.3.0.post0 59 | pandas==2.1.4 60 | parsel==1.8.1 61 | pathspec==0.12.1 62 | pillow==10.2.0 63 | Protego==0.3.0 64 | protobuf==4.25.1 65 | pyarrow==14.0.2 66 | pyarxiv==1.0.3.1 67 | pyasn1==0.5.1 68 | pyasn1-modules==0.3.0 69 | pycparser==2.21 70 | pydantic==2.5.3 71 | pydantic_core==2.14.6 72 | pydeck==0.8.1b0 73 | PyDispatcher==2.0.7 74 | pyOpenSSL==23.3.0 75 | pypdf==3.17.4 76 | PyPDF2==3.0.1 77 | pyprojroot==0.3.0 78 | PySocks==1.7.1 79 | python-dotenv==1.0.0 80 | pytz==2023.3.post1 81 | PyYAML==6.0.1 82 | queuelib==1.6.2 83 | rake-nltk==1.0.6 84 | readability-lxml==0.8.1 85 | referencing==0.32.1 86 | 
regex==2023.12.25 87 | requests==2.31.0 88 | requests-file==1.5.1 89 | rich==13.7.0 90 | rpds-py==0.16.2 91 | Scrapy==2.11.0 92 | selenium==4.16.0 93 | service-identity==23.1.0 94 | setuptools==69.0.3 95 | sgmllib3k==1.0.0 96 | smmap==5.0.1 97 | sniffio==1.3.0 98 | sortedcontainers==2.4.0 99 | soupsieve==2.5 100 | SQLAlchemy==2.0.25 101 | streamlit==1.29.0 102 | streamlit-chat==0.1.1 103 | tenacity==8.2.3 104 | tiktoken==0.5.2 105 | tldextract==5.1.1 106 | toml==0.10.2 107 | toolz==0.12.0 108 | tqdm==4.66.1 109 | trio==0.23.2 110 | trio-websocket==0.11.1 111 | Twisted==22.10.0 112 | typing-inspect==0.9.0 113 | tzdata==2023.4 114 | tzlocal==5.2 115 | urllib3==2.1.0 116 | validators==0.22.0 117 | w3lib==2.1.2 118 | watchdog==3.0.0 119 | webdriver-manager==4.0.1 120 | wheel==0.42.0 121 | wrapt==1.16.0 122 | wsproto==1.2.0 123 | yarl==1.9.4 124 | youtube-transcript-api==0.6.2 125 | zope.interface==6.1 126 | -------------------------------------------------------------------------------- /src/app.py: -------------------------------------------------------------------------------- 1 | import streamlit as st 2 | from streamlit_chat import message 3 | from PIL import Image 4 | from utils.load_config import LoadConfig 5 | from utils.app_utils import load_data, RAG, delete_data 6 | import subprocess 7 | import os 8 | 9 | 10 | APPCFG = LoadConfig() 11 | 12 | # =================================== 13 | # Setting page title and header 14 | # =================================== 15 | im = Image.open("images/maestro.png") 16 | os.environ["OPENAI_API_KEY"] = st.secrets["openai_key"] 17 | 18 | st.set_page_config(page_title="RAG-Maestro", page_icon=im, layout="wide") 19 | st.markdown( 20 | "
<h1 style='text-align: center;'>RAG-Maestro (Scientific Assistant)</h1>", 21 | unsafe_allow_html=True, 22 | ) 23 | st.divider() 24 | st.markdown( 25 | "
<div style='text-align: center;'>RAG-Maestro is an up-to-date LLM assistant designed to provide clear and concise explanations of scientific concepts and relevant papers. As a Q&A bot, it does not keep track of your conversation and will treat each input independently. Do not hesitate to clear the conversation once in a while! We hope RAG-Maestro helps you get quick answers and expand your scientific knowledge.</div>", 26 | unsafe_allow_html=True, 27 | ) 28 | st.divider() 29 | 30 | # =================================== 31 | # Initialise session state variables 32 | # =================================== 33 | if "generated" not in st.session_state: 34 | st.session_state["generated"] = [] 35 | 36 | if "past" not in st.session_state: 37 | st.session_state["past"] = [] 38 | 39 | # ================================== 40 | # Sidebar: 41 | # ================================== 42 | counter_placeholder = st.sidebar.empty() 43 | with st.sidebar: 44 | st.markdown( 45 | "
<h2 style='text-align: center;'>Ask anything you need to brush up on!</h2>", 46 | unsafe_allow_html=True, 47 | ) 48 | st.markdown( 49 | "
<b>Examples:</b>", 50 | unsafe_allow_html=True, 51 | ) 52 | st.markdown( 53 | "
<i>What is GPT-4?</i>", 54 | unsafe_allow_html=True, 55 | ) 56 | st.markdown( 57 | "
<i>Explain Mixture of Experts (MoE) to me</i>", 58 | unsafe_allow_html=True, 59 | ) 60 | st.markdown( 61 | "
<i>How does RAG work?</i>", 62 | unsafe_allow_html=True, 63 | ) 64 | # st.sidebar.title("An agent that reads and summarizes papers for you") 65 | st.sidebar.image("images/maestro.png", use_column_width=True) 66 | clear_button = st.sidebar.button("Clear Conversation", key="clear") 67 | st.markdown( 68 | " Aymen Kallala", 69 | unsafe_allow_html=True, 70 | ) 71 | # ================================== 72 | # Reset everything (Clear button) 73 | if clear_button: 74 | st.session_state["generated"] = [] 75 | st.session_state["past"] = [] 76 | delete_data() 77 | 78 | response_container = st.container()  # container for message display 79 | 80 | if query := st.chat_input( 81 | "What do you need to know? I will explain it and point you to interesting readings." 82 | ): 83 | st.session_state["past"].append(query) 84 | try: 85 | with st.spinner("Browsing the best papers..."): 86 | process = subprocess.Popen( 87 | f"python src/utils/arxiv_scraper.py --query '{query}' --numresults {APPCFG.articles_to_search}", 88 | shell=True, 89 | ) 90 | out, err = process.communicate() 91 | errcode = process.returncode 92 | 93 | with st.spinner("Reading them..."): 94 | data = load_data() 95 | index = RAG(APPCFG, _docs=data) 96 | query_engine = index.as_query_engine( 97 | response_mode="tree_summarize", 98 | verbose=True, 99 | similarity_top_k=APPCFG.similarity_top_k, 100 | ) 101 | with st.spinner("Thinking..."): 102 | response = query_engine.query(query + APPCFG.llm_format_output) 103 | 104 | st.session_state["generated"].append(response.response) 105 | del index 106 | del query_engine 107 | 108 | with response_container: 109 | for i in range(len(st.session_state["generated"])): 110 | message(st.session_state["past"][i], is_user=True) 111 | 112 | message(st.session_state["generated"][i], is_user=False) 113 | 114 | except Exception as e: 115 | print(e) 116 | st.session_state["generated"].append( 117 | "An error occurred with the paper search; please modify your query.
118 | ) 119 | -------------------------------------------------------------------------------- /src/data/.gitignore: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AymenKallala/RAG_Maestro/78a058c202920d5c6cb2b475cdb334ef800bbfa3/src/data/.gitignore -------------------------------------------------------------------------------- /src/utils/__pycache__/app_utils.cpython-312.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AymenKallala/RAG_Maestro/78a058c202920d5c6cb2b475cdb334ef800bbfa3/src/utils/__pycache__/app_utils.cpython-312.pyc -------------------------------------------------------------------------------- /src/utils/__pycache__/ddgs.cpython-312.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AymenKallala/RAG_Maestro/78a058c202920d5c6cb2b475cdb334ef800bbfa3/src/utils/__pycache__/ddgs.cpython-312.pyc -------------------------------------------------------------------------------- /src/utils/__pycache__/load_config.cpython-312.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AymenKallala/RAG_Maestro/78a058c202920d5c6cb2b475cdb334ef800bbfa3/src/utils/__pycache__/load_config.cpython-312.pyc -------------------------------------------------------------------------------- /src/utils/__pycache__/web_search.cpython-312.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AymenKallala/RAG_Maestro/78a058c202920d5c6cb2b475cdb334ef800bbfa3/src/utils/__pycache__/web_search.cpython-312.pyc -------------------------------------------------------------------------------- /src/utils/app_utils.py: -------------------------------------------------------------------------------- 1 | from typing import List 2 | import streamlit as st 3 | 
import os, shutil 4 | from llama_index import VectorStoreIndex, ServiceContext, Document 5 | from llama_index.llms import OpenAI 6 | from llama_index import SimpleDirectoryReader 7 | 8 | 9 | def load_data(): 10 | reader = SimpleDirectoryReader(input_dir="src/data", recursive=True) 11 | docs = reader.load_data() 12 | return docs 13 | 14 | 15 | def RAG(_config, _docs): 16 | service_context = ServiceContext.from_defaults( 17 | llm=OpenAI( 18 | model=_config.gpt_model, 19 | temperature=_config.temperature, 20 | max_tokens=_config.max_tokens, 21 | system_prompt=_config.llm_system_role, 22 | ), 23 | chunk_size=_config.chunk_size, 24 | ) 25 | index = VectorStoreIndex.from_documents(_docs, service_context=service_context) 26 | return index 27 | 28 | 29 | def delete_data(): 30 | print("Cleaning the data folder") 31 | folder = "src/data" 32 | for filename in os.listdir(folder): 33 | if filename != ".gitignore": 34 | file_path = os.path.join(folder, filename) 35 | try: 36 | if os.path.isfile(file_path) or os.path.islink(file_path): 37 | os.unlink(file_path) 38 | elif os.path.isdir(file_path): 39 | shutil.rmtree(file_path) 40 | except Exception as e: 41 | print("Failed to delete %s. 
Reason: %s" % (file_path, e)) 42 | -------------------------------------------------------------------------------- /src/utils/arxiv_scraper.py: -------------------------------------------------------------------------------- 1 | import arxiv 2 | import argparse 3 | import PyPDF2 4 | import os 5 | import json 6 | import nltk 7 | from rake_nltk import Rake 8 | 9 | nltk.download("stopwords") 10 | nltk.download("punkt") 11 | 12 | 13 | def refine_query(query): 14 | rake = Rake() 15 | rake.extract_keywords_from_text(query) 16 | keywords = rake.get_ranked_phrases() 17 | return " ".join(keywords) 18 | 19 | 20 | def scrape_papers(args): 21 | refined_query = refine_query(args.query) 22 | results = [] 23 | 24 | search = arxiv.Search( 25 | query=refined_query, 26 | max_results=args.numresults, 27 | sort_by=arxiv.SortCriterion.Relevance, 28 | ) 29 | papers = list(search.results()) 30 | 31 | for i, p in enumerate(papers): 32 | text = "" 33 | file_path = f"src/data/data_{i}.pdf" 34 | p.download_pdf(filename=file_path) 35 | 36 | with open(f"src/data/data_{i}.pdf", "rb") as file: 37 | pdf = PyPDF2.PdfReader(file) 38 | 39 | for page in range(len(pdf.pages)): 40 | page_obj = pdf.pages[page] 41 | 42 | text += page_obj.extract_text() + " " 43 | 44 | os.unlink(file_path) 45 | paper_doc = {"url": p.pdf_url, "title": p.title, "text": text} 46 | results.append(paper_doc) 47 | return results 48 | 49 | 50 | if __name__ == "__main__": 51 | parser = argparse.ArgumentParser() 52 | parser.add_argument("--query", help="the query to search for", type=str) 53 | parser.add_argument( 54 | "--numresults", help="the number of results to return", type=int 55 | ) 56 | args = parser.parse_args() 57 | 58 | results = scrape_papers(args) 59 | for i, r in enumerate(results): 60 | with open(f"src/data/data_{i}.json", "w") as f: 61 | json.dump(r, f) 62 | -------------------------------------------------------------------------------- /src/utils/load_config.py: 
-------------------------------------------------------------------------------- 1 | import yaml 2 | from pyprojroot import here 3 | 4 | 5 | class LoadConfig: 6 | """ 7 | A class for loading application configuration settings. 8 | 9 | This class reads configuration parameters from `config.yml` and sets them as attributes. 10 | 11 | Attributes: 12 | gpt_model (str): The GPT model to be used. 13 | temperature (float): The temperature parameter for generating responses. 14 | max_tokens (int): The maximum number of tokens to generate. 15 | articles_to_search (int): The number of arxiv papers to retrieve. 16 | llm_system_role (str): The system role for the language model. 17 | llm_format_output (str): The formatting constraint for the language model's output. 18 | chunk_size (int): The chunk size used when indexing documents. 19 | similarity_top_k (int): The number of chunks to retrieve per query. 20 | """ 21 | 22 | def __init__(self) -> None: 23 | with open(here("config.yml")) as cfg: 24 | app_config = yaml.load(cfg, Loader=yaml.FullLoader) 25 | self.gpt_model = app_config["gpt_model"] 26 | self.temperature = app_config["temperature"] 27 | self.max_tokens = app_config["max_tokens"] 28 | self.articles_to_search = app_config["articles_to_search"] 29 | self.llm_system_role = app_config["llm_system_role"] 30 | self.llm_format_output = app_config["llm_format_output"] 31 | self.chunk_size = app_config["chunk_size"] 32 | self.similarity_top_k = app_config["similarity_top_k"] 33 | --------------------------------------------------------------------------------
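For unit tests, the attributes `LoadConfig` exposes can be mimicked without touching `config.yml` by feeding the same keys from a plain dict. The `DictConfig` class below is a hypothetical test helper, not part of the repo, and the two prompt strings are abbreviated stand-ins for the real values in `config.yml`:

```python
class DictConfig:
    """Hypothetical stand-in for LoadConfig that reads settings from a dict
    instead of config.yml -- handy for unit tests that need a config object."""

    def __init__(self, cfg: dict) -> None:
        # Expose every config key as an attribute, like LoadConfig does.
        for key, value in cfg.items():
            setattr(self, key, value)


# Mirrors the keys defined in config.yml (prompt strings abbreviated here).
app_config = {
    "gpt_model": "gpt-3.5-turbo",
    "temperature": 0.9,
    "max_tokens": 1000,
    "chunk_size": 500,
    "similarity_top_k": 5,
    "articles_to_search": 5,
    "llm_system_role": "Respond to the user's question concisely.",
    "llm_format_output": "After answering, cite your sources.",
}

cfg = DictConfig(app_config)
print(cfg.gpt_model, cfg.similarity_top_k)  # -> gpt-3.5-turbo 5
```

Anything that consumes `APPCFG` (the `RAG` builder, the arxiv scraper call) only reads these attributes, so this stand-in is enough to exercise that code without a YAML file on disk.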