├── .dockerignore ├── Dockerfile ├── README.md ├── config.yml ├── images ├── beautiful_maestro.png ├── beautiful_maestro_2.png ├── llava.png ├── maestro.png ├── memformer.png └── streamlit_app.png ├── requirements.txt └── src ├── app.py ├── data └── .gitignore └── utils ├── __pycache__ ├── app_utils.cpython-312.pyc ├── ddgs.cpython-312.pyc ├── load_config.cpython-312.pyc └── web_search.cpython-312.pyc ├── app_utils.py ├── arxiv_scraper.py └── load_config.py /.dockerignore: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM python:3.12.1 2 | 3 | WORKDIR /code 4 | 5 | COPY ./requirements.txt /code/requirements.txt 6 | COPY ./config.yml /code/config.yml 7 | COPY ./src /code/src/ 8 | COPY ./.streamlit /code/.streamlit/ 9 | COPY ./images/maestro.png /code/images/maestro.png 10 | COPY ./images/streamlit_app.png /code/images/streamlit_app.png 11 | 12 | EXPOSE 8501 13 | 14 | RUN pip install --no-cache-dir -r /code/requirements.txt 15 | 16 | 17 | CMD ["streamlit", "run", "src/app.py"] -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # RAG - Maestro 2 | RAG-Maestro is an up-to-date LLM assistant designed to provide clear and concise explanations of scientific concepts **and relevant papers**. As a Q&A bot, it does not keep track of your conversation and will treat each input independently. 3 | 4 | 5 | ![maestro](images/beautiful_maestro_2.png) 6 | 7 | 8 | # Examples 9 | 10 | What is LLava? | Do you know what the Memformer is? 
11 | :-------------------------:|:-------------------------: 12 | ![llava](images/llava.png) | ![memformer](images/memformer.png) 13 | 14 | # Implementation Details 15 | 16 | The bot is composed of three building blocks that work sequentially: 17 | 18 | ### A keyword extractor 19 | 20 | RAG-Maestro's first task is to extract the keywords to search for from your request. The [RAKE](https://www.analyticsvidhya.com/blog/2021/10/rapid-keyword-extraction-rake-algorithm-in-natural-language-processing/) algorithm (Rapid Automatic Keyword Extraction), backed by `nltk`, is used. 21 | 22 | ### A paper browser 23 | 24 | Once the keywords are extracted, they are used to retrieve the 5 most relevant papers from [arxiv.org](https://www.arxiv.org/). These papers are then downloaded and scraped. 25 | 26 | To build the scraper, I used the open-source `arxiv` API client and `PyPDF2` to ease PDF reading. 27 | 28 | ### A RAG pipeline ([paper](https://arxiv.org/pdf/2005.11401.pdf)) 29 | 30 | This pipeline retrieves the information most relevant to the query from the scraped papers and uses it as context for summarization. One of the main features I implemented (through prompt engineering) is that the bot *cites its sources*, so it becomes possible to assess the veracity of the provided answer. The pipeline uses OpenAI models (`embedding-v3` and `gpt-3.5-turbo`) for the retrieval and generation steps. Like every LLM, RAG-Maestro can be subject to hallucinations. **Making it cite its sources helps detect them**. 31 | 32 | 33 | I used [llama_index](https://docs.llamaindex.ai/en/stable/) to build the RAG pipeline, specifically picking the `tree_summarize` response mode for the query engine to generate the answer. All the hyperparameters are stored in an editable `config.yml` file.
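The keyword-extraction step above can be sketched in a few lines. This is a simplified, pure-Python stand-in for RAKE (the app itself uses `rake_nltk`; the small stopword list here is an illustrative assumption, not the `nltk` one): RAKE's core idea is to split the text on stopwords and keep the remaining runs of words as candidate phrases.

```python
# Simplified stand-in for the RAKE keyword-extraction step.
# NOTE: this stopword list is an illustrative assumption; the real app
# uses rake_nltk with nltk's full English stopword list.
STOPWORDS = {
    "what", "is", "the", "a", "an", "do", "does", "you",
    "know", "how", "to", "me", "explain", "it",
}

def refine_query(query: str) -> str:
    """Split the query on stopwords and keep the remaining phrases."""
    words = [w.strip("?.,!").lower() for w in query.split()]
    phrases, current = [], []
    for w in words:
        if w in STOPWORDS:
            if current:
                phrases.append(" ".join(current))
                current = []
        else:
            current.append(w)
    if current:
        phrases.append(" ".join(current))
    return " ".join(phrases)

print(refine_query("What is the Memformer?"))  # -> memformer
print(refine_query("How does RAG work?"))      # -> rag work
```

The resulting string is what gets passed to the arxiv search in `src/utils/arxiv_scraper.py`.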
34 | 35 | # Commands 36 | 37 | 38 | ### Running the app locally from this repository 39 | - Clone this repository 40 | - Create a new Python environment with `pip` available 41 | - Run `pip install -r requirements.txt` 42 | - Run `streamlit run src/app.py` 43 | - Open the 'External URL' in your browser. Enjoy the bot. 44 | 45 | ![Streamlit app](images/streamlit_app.png) -------------------------------------------------------------------------------- /config.yml: -------------------------------------------------------------------------------- 1 | gpt_model: gpt-3.5-turbo 2 | temperature: 0.9 3 | max_tokens: 1000 4 | chunk_size: 500 5 | similarity_top_k: 5 6 | articles_to_search: 5 7 | llm_system_role: 8 | "As a chatbot, your goal is to respond to the user's question respectfully and concisely.\ 9 | You will receive the user's new query, along with 5 articles retrieved for that query.\ 10 | Answer the user with the most relevant information. After answering, cite your sources and provide the url." 11 | llm_format_output: 12 | " \\ 13 | #Citing sources\ 14 | After giving your final answer, you will cite your sources the following way:\ 15 | 'REFERENCES: \ 16 | Title of article -> url \ 17 | Title of article -> url \ 18 | etc...'
" 19 | -------------------------------------------------------------------------------- /images/beautiful_maestro.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AymenKallala/RAG_Maestro/78a058c202920d5c6cb2b475cdb334ef800bbfa3/images/beautiful_maestro.png -------------------------------------------------------------------------------- /images/beautiful_maestro_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AymenKallala/RAG_Maestro/78a058c202920d5c6cb2b475cdb334ef800bbfa3/images/beautiful_maestro_2.png -------------------------------------------------------------------------------- /images/llava.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AymenKallala/RAG_Maestro/78a058c202920d5c6cb2b475cdb334ef800bbfa3/images/llava.png -------------------------------------------------------------------------------- /images/maestro.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AymenKallala/RAG_Maestro/78a058c202920d5c6cb2b475cdb334ef800bbfa3/images/maestro.png -------------------------------------------------------------------------------- /images/memformer.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AymenKallala/RAG_Maestro/78a058c202920d5c6cb2b475cdb334ef800bbfa3/images/memformer.png -------------------------------------------------------------------------------- /images/streamlit_app.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AymenKallala/RAG_Maestro/78a058c202920d5c6cb2b475cdb334ef800bbfa3/images/streamlit_app.png -------------------------------------------------------------------------------- /requirements.txt: 
-------------------------------------------------------------------------------- 1 | aiohttp==3.9.1 2 | aiosignal==1.3.1 3 | altair==5.2.0 4 | annotated-types==0.6.0 5 | anyio==4.2.0 6 | arxiv==2.1.0 7 | attrs==23.2.0 8 | Automat==22.10.0 9 | beautifulsoup4==4.12.2 10 | black==23.12.1 11 | blinker==1.7.0 12 | cachetools==5.3.2 13 | certifi==2023.11.17 14 | cffi==1.16.0 15 | chardet==5.2.0 16 | charset-normalizer==3.3.2 17 | click==8.1.7 18 | constantly==23.10.4 19 | cryptography==41.0.7 20 | cssselect==1.2.0 21 | curl-cffi==0.5.10 22 | dataclasses-json==0.6.3 23 | Deprecated==1.2.14 24 | distro==1.9.0 25 | duckduckgo_search==4.1.1 26 | feedparser==6.0.10 27 | filelock==3.13.1 28 | frozenlist==1.4.1 29 | fsspec==2023.12.2 30 | gitdb==4.0.11 31 | GitPython==3.1.40 32 | greenlet==3.0.3 33 | h11==0.14.0 34 | httpcore==1.0.2 35 | httpx==0.26.0 36 | hyperlink==21.0.0 37 | idna==3.6 38 | importlib-metadata==6.11.0 39 | incremental==22.10.0 40 | itemadapter==0.8.0 41 | itemloaders==1.1.0 42 | Jinja2==3.1.2 43 | jmespath==1.0.1 44 | joblib==1.3.2 45 | jsonschema==4.20.0 46 | jsonschema-specifications==2023.12.1 47 | llama-index==0.9.26 48 | lxml==5.0.1 49 | markdown-it-py==3.0.0 50 | MarkupSafe==2.1.3 51 | marshmallow==3.20.1 52 | mdurl==0.1.2 53 | multidict==6.0.4 54 | mypy-extensions==1.0.0 55 | nltk==3.8.1 56 | numpy==1.26.3 57 | openai==1.6.1 58 | outcome==1.3.0.post0 59 | pandas==2.1.4 60 | parsel==1.8.1 61 | pathspec==0.12.1 62 | pillow==10.2.0 63 | Protego==0.3.0 64 | protobuf==4.25.1 65 | pyarrow==14.0.2 66 | pyarxiv==1.0.3.1 67 | pyasn1==0.5.1 68 | pyasn1-modules==0.3.0 69 | pycparser==2.21 70 | pydantic==2.5.3 71 | pydantic_core==2.14.6 72 | pydeck==0.8.1b0 73 | PyDispatcher==2.0.7 74 | pyOpenSSL==23.3.0 75 | pypdf==3.17.4 76 | PyPDF2==3.0.1 77 | pyprojroot==0.3.0 78 | PySocks==1.7.1 79 | python-dotenv==1.0.0 80 | pytz==2023.3.post1 81 | PyYAML==6.0.1 82 | queuelib==1.6.2 83 | rake-nltk==1.0.6 84 | readability-lxml==0.8.1 85 | referencing==0.32.1 86 | 
regex==2023.12.25 87 | requests==2.31.0 88 | requests-file==1.5.1 89 | rich==13.7.0 90 | rpds-py==0.16.2 91 | Scrapy==2.11.0 92 | selenium==4.16.0 93 | service-identity==23.1.0 94 | setuptools==69.0.3 95 | sgmllib3k==1.0.0 96 | smmap==5.0.1 97 | sniffio==1.3.0 98 | sortedcontainers==2.4.0 99 | soupsieve==2.5 100 | SQLAlchemy==2.0.25 101 | streamlit==1.29.0 102 | streamlit-chat==0.1.1 103 | tenacity==8.2.3 104 | tiktoken==0.5.2 105 | tldextract==5.1.1 106 | toml==0.10.2 107 | toolz==0.12.0 108 | tqdm==4.66.1 109 | trio==0.23.2 110 | trio-websocket==0.11.1 111 | Twisted==22.10.0 112 | typing-inspect==0.9.0 113 | tzdata==2023.4 114 | tzlocal==5.2 115 | urllib3==2.1.0 116 | validators==0.22.0 117 | w3lib==2.1.2 118 | watchdog==3.0.0 119 | webdriver-manager==4.0.1 120 | wheel==0.42.0 121 | wrapt==1.16.0 122 | wsproto==1.2.0 123 | yarl==1.9.4 124 | youtube-transcript-api==0.6.2 125 | zope.interface==6.1 126 | -------------------------------------------------------------------------------- /src/app.py: -------------------------------------------------------------------------------- 1 | import streamlit as st 2 | from streamlit_chat import message 3 | from PIL import Image 4 | from utils.load_config import LoadConfig 5 | from utils.app_utils import load_data, RAG, delete_data 6 | import subprocess 7 | import os 8 | 9 | 10 | APPCFG = LoadConfig() 11 | 12 | # =================================== 13 | # Setting page title and header 14 | # =================================== 15 | im = Image.open("images/maestro.png") 16 | os.environ["OPENAI_API_KEY"] = st.secrets["openai_key"] 17 | 18 | st.set_page_config(page_title="RAG-Maestro", page_icon=im, layout="wide") 19 | st.markdown( 20 | "
<h1 style='text-align: center;'>RAG-Maestro (Scientific Assistant)</h1>", 21 | unsafe_allow_html=True, 22 | ) 23 | st.divider() 24 | st.markdown( 25 | "
<div style='text-align: center;'>RAG-Maestro is an up-to-date LLM assistant designed to provide clear and concise explanations of scientific concepts and relevant papers. As a Q&A bot, it does not keep track of your conversation and will treat each input independently. Do not hesitate to clear the conversation once in a while! We hope RAG-Maestro helps you get quick answers and expand your scientific knowledge.</div>", 26 | unsafe_allow_html=True, 27 | ) 28 | st.divider() 29 | 30 | # =================================== 31 | # Initialise session state variables 32 | # =================================== 33 | if "generated" not in st.session_state: 34 | st.session_state["generated"] = [] 35 | 36 | if "past" not in st.session_state: 37 | st.session_state["past"] = [] 38 | 39 | # ================================== 40 | # Sidebar: 41 | # ================================== 42 | counter_placeholder = st.sidebar.empty() 43 | with st.sidebar: 44 | st.markdown( 45 | "
<h2 style='text-align: center;'>Ask anything you need to brush up on!</h2>", 46 | unsafe_allow_html=True, 47 | ) 48 | st.markdown( 49 | "
<b>Examples:</b>", 50 | unsafe_allow_html=True, 51 | ) 52 | st.markdown( 53 | "
<i>What is GPT-4?</i>", 54 | unsafe_allow_html=True, 55 | ) 56 | st.markdown( 57 | "
<i>Explain Mixture of Experts (MoE) to me</i>", 58 | unsafe_allow_html=True, 59 | ) 60 | st.markdown( 61 | "
<i>How does RAG work?</i>", 62 | unsafe_allow_html=True, 63 | ) 64 | # st.sidebar.title("An agent that reads and summarizes papers for you") 65 | st.sidebar.image("images/maestro.png", use_column_width=True) 66 | clear_button = st.sidebar.button("Clear Conversation", key="clear") 67 | st.markdown( 68 | " Aymen Kallala", 69 | unsafe_allow_html=True, 70 | ) 71 | # ================================== 72 | # Reset everything (Clear button) 73 | if clear_button: 74 | st.session_state["generated"] = [] 75 | st.session_state["past"] = [] 76 | delete_data() 77 | 78 | response_container = st.container()  # container for message display 79 | 80 | if query := st.chat_input( 81 | "What do you need to know? I will explain it and point you to interesting readings." 82 | ): 83 | st.session_state["past"].append(query) 84 | try: 85 | with st.spinner("Browsing the best papers..."): 86 | process = subprocess.Popen( 87 | f"python src/utils/arxiv_scraper.py --query '{query}' --numresults {APPCFG.articles_to_search}", 88 | shell=True, 89 | ) 90 | out, err = process.communicate() 91 | errcode = process.returncode 92 | 93 | with st.spinner("Reading them..."): 94 | data = load_data() 95 | index = RAG(APPCFG, _docs=data) 96 | query_engine = index.as_query_engine( 97 | response_mode="tree_summarize", 98 | verbose=True, 99 | similarity_top_k=APPCFG.similarity_top_k, 100 | ) 101 | with st.spinner("Thinking..."): 102 | response = query_engine.query(query + APPCFG.llm_format_output) 103 | 104 | st.session_state["generated"].append(response.response) 105 | del index 106 | del query_engine 107 | 108 | with response_container: 109 | for i in range(len(st.session_state["generated"])): 110 | message(st.session_state["past"][i], is_user=True) 111 | 112 | message(st.session_state["generated"][i], is_user=False) 113 | 114 | except Exception as e: 115 | print(e) 116 | st.session_state["generated"].append( 117 | "An error occurred with the paper search; please modify your query.
118 | ) 119 | -------------------------------------------------------------------------------- /src/data/.gitignore: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AymenKallala/RAG_Maestro/78a058c202920d5c6cb2b475cdb334ef800bbfa3/src/data/.gitignore -------------------------------------------------------------------------------- /src/utils/__pycache__/app_utils.cpython-312.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AymenKallala/RAG_Maestro/78a058c202920d5c6cb2b475cdb334ef800bbfa3/src/utils/__pycache__/app_utils.cpython-312.pyc -------------------------------------------------------------------------------- /src/utils/__pycache__/ddgs.cpython-312.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AymenKallala/RAG_Maestro/78a058c202920d5c6cb2b475cdb334ef800bbfa3/src/utils/__pycache__/ddgs.cpython-312.pyc -------------------------------------------------------------------------------- /src/utils/__pycache__/load_config.cpython-312.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AymenKallala/RAG_Maestro/78a058c202920d5c6cb2b475cdb334ef800bbfa3/src/utils/__pycache__/load_config.cpython-312.pyc -------------------------------------------------------------------------------- /src/utils/__pycache__/web_search.cpython-312.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AymenKallala/RAG_Maestro/78a058c202920d5c6cb2b475cdb334ef800bbfa3/src/utils/__pycache__/web_search.cpython-312.pyc -------------------------------------------------------------------------------- /src/utils/app_utils.py: -------------------------------------------------------------------------------- 1 | from typing import List 2 | import streamlit as st 3 | 
import os, shutil 4 | from llama_index import VectorStoreIndex, ServiceContext, Document 5 | from llama_index.llms import OpenAI 6 | from llama_index import SimpleDirectoryReader 7 | 8 | 9 | def load_data(): 10 | reader = SimpleDirectoryReader(input_dir="src/data", recursive=True) 11 | docs = reader.load_data() 12 | return docs 13 | 14 | 15 | def RAG(_config, _docs): 16 | service_context = ServiceContext.from_defaults( 17 | llm=OpenAI( 18 | model=_config.gpt_model, 19 | temperature=_config.temperature, 20 | max_tokens=_config.max_tokens, 21 | system_prompt=_config.llm_system_role, 22 | ), 23 | chunk_size=_config.chunk_size, 24 | ) 25 | index = VectorStoreIndex.from_documents(_docs, service_context=service_context) 26 | return index 27 | 28 | 29 | def delete_data(): 30 | print("Cleaning the data folder") 31 | folder = "src/data" 32 | for filename in os.listdir(folder): 33 | if filename != ".gitignore": 34 | file_path = os.path.join(folder, filename) 35 | try: 36 | if os.path.isfile(file_path) or os.path.islink(file_path): 37 | os.unlink(file_path) 38 | elif os.path.isdir(file_path): 39 | shutil.rmtree(file_path) 40 | except Exception as e: 41 | print("Failed to delete %s. 
Reason: %s" % (file_path, e)) 42 | -------------------------------------------------------------------------------- /src/utils/arxiv_scraper.py: -------------------------------------------------------------------------------- 1 | import arxiv 2 | import argparse 3 | import PyPDF2 4 | import os 5 | import json 6 | import nltk 7 | from rake_nltk import Rake 8 | 9 | nltk.download("stopwords") 10 | nltk.download("punkt") 11 | 12 | 13 | def refine_query(query): 14 | rake = Rake() 15 | rake.extract_keywords_from_text(query) 16 | keywords = rake.get_ranked_phrases() 17 | return " ".join(keywords) 18 | 19 | 20 | def scrape_papers(args): 21 | refined_query = refine_query(args.query) 22 | results = [] 23 | 24 | search = arxiv.Search( 25 | query=refined_query, 26 | max_results=args.numresults, 27 | sort_by=arxiv.SortCriterion.Relevance, 28 | ) 29 | papers = list(search.results()) 30 | 31 | for i, p in enumerate(papers): 32 | text = "" 33 | file_path = f"src/data/data_{i}.pdf" 34 | p.download_pdf(filename=file_path) 35 | 36 | with open(f"src/data/data_{i}.pdf", "rb") as file: 37 | pdf = PyPDF2.PdfReader(file) 38 | 39 | for page in range(len(pdf.pages)): 40 | page_obj = pdf.pages[page] 41 | 42 | text += page_obj.extract_text() + " " 43 | 44 | os.unlink(file_path) 45 | paper_doc = {"url": p.pdf_url, "title": p.title, "text": text} 46 | results.append(paper_doc) 47 | return results 48 | 49 | 50 | if __name__ == "__main__": 51 | parser = argparse.ArgumentParser() 52 | parser.add_argument("--query", help="the query to search for", type=str) 53 | parser.add_argument( 54 | "--numresults", help="the number of results to return", type=int 55 | ) 56 | args = parser.parse_args() 57 | 58 | results = scrape_papers(args) 59 | for i, r in enumerate(results): 60 | with open(f"src/data/data_{i}.json", "w") as f: 61 | json.dump(r, f) 62 | -------------------------------------------------------------------------------- /src/utils/load_config.py: 
-------------------------------------------------------------------------------- 1 | import yaml 2 | from pyprojroot import here 3 | 4 | 5 | class LoadConfig: 6 | """ 7 | A class for loading application configuration settings. 8 | 9 | This class reads configuration parameters from `config.yml` and sets them as attributes. 10 | 11 | Attributes: 12 | gpt_model (str): The GPT model to be used. 13 | temperature (float): The temperature parameter for generating responses. 14 | max_tokens (int): The maximum number of tokens to generate. 15 | articles_to_search (int): The number of arxiv papers to retrieve. 16 | llm_system_role (str): The system role for the language model. 17 | llm_format_output (str): The formatting constraint for the language model's output. 18 | chunk_size (int): The chunk size used when indexing documents. 19 | similarity_top_k (int): The number of chunks to retrieve per query. 20 | """ 21 | 22 | def __init__(self) -> None: 23 | with open(here("config.yml")) as cfg: 24 | app_config = yaml.load(cfg, Loader=yaml.FullLoader) 25 | self.gpt_model = app_config["gpt_model"] 26 | self.temperature = app_config["temperature"] 27 | self.max_tokens = app_config["max_tokens"] 28 | self.articles_to_search = app_config["articles_to_search"] 29 | self.llm_system_role = app_config["llm_system_role"] 30 | self.llm_format_output = app_config["llm_format_output"] 31 | self.chunk_size = app_config["chunk_size"] 32 | self.similarity_top_k = app_config["similarity_top_k"] 33 | --------------------------------------------------------------------------------
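For unit tests, the attributes `LoadConfig` exposes can be mimicked without touching `config.yml` by feeding the same keys from a plain dict. The `DictConfig` class below is a hypothetical test helper, not part of the repo, and the two prompt strings are abbreviated stand-ins for the real values in `config.yml`:

```python
class DictConfig:
    """Hypothetical stand-in for LoadConfig that reads settings from a dict
    instead of config.yml -- handy for unit tests that need a config object."""

    def __init__(self, cfg: dict) -> None:
        # Expose every config key as an attribute, like LoadConfig does.
        for key, value in cfg.items():
            setattr(self, key, value)


# Mirrors the keys defined in config.yml (prompt strings abbreviated here).
app_config = {
    "gpt_model": "gpt-3.5-turbo",
    "temperature": 0.9,
    "max_tokens": 1000,
    "chunk_size": 500,
    "similarity_top_k": 5,
    "articles_to_search": 5,
    "llm_system_role": "Respond to the user's question concisely.",
    "llm_format_output": "After answering, cite your sources.",
}

cfg = DictConfig(app_config)
print(cfg.gpt_model, cfg.similarity_top_k)  # -> gpt-3.5-turbo 5
```

Anything that consumes `APPCFG` (the `RAG` builder, the arxiv scraper call) only reads these attributes, so this stand-in is enough to exercise that code without a YAML file on disk.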