├── .dockerignore
├── Dockerfile
├── README.md
├── config.yml
├── images
    ├── beautiful_maestro.png
    ├── beautiful_maestro_2.png
    ├── llava.png
    ├── maestro.png
    ├── memformer.png
    └── streamlit_app.png
├── requirements.txt
└── src
    ├── app.py
    ├── data
        └── .gitignore
    └── utils
        ├── __pycache__
            ├── app_utils.cpython-312.pyc
            ├── ddgs.cpython-312.pyc
            ├── load_config.cpython-312.pyc
            └── web_search.cpython-312.pyc
        ├── app_utils.py
        ├── arxiv_scraper.py
        └── load_config.py


/.dockerignore:
--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------
FROM python:3.12.1

WORKDIR /code

COPY ./requirements.txt /code/requirements.txt
COPY ./config.yml /code/config.yml
COPY ./src /code/src/
COPY ./.streamlit /code/.streamlit/
COPY ./images/maestro.png /code/images/maestro.png
COPY ./images/streamlit_app.png /code/images/streamlit_app.png

EXPOSE 8501

RUN pip install --no-cache-dir -r /code/requirements.txt


CMD ["streamlit", "run", "src/app.py"]

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# RAG - Maestro
RAG-Maestro is an up-to-date LLM assistant designed to provide clear and concise explanations of scientific concepts **and relevant papers**. As a Q&A bot, it does not keep track of your conversation and will treat each input independently.



# Examples

What is LLava? | Do you know what the Memformer is?
:-------------------------:|:-------------------------:
 | 

# Implementation Details

The bot is composed of three building blocks that work sequentially:

### A Keywords extractor

RAG-Maestro's first task is to extract from your request the keywords to search for. The [RAKE](https://www.analyticsvidhya.com/blog/2021/10/rapid-keyword-extraction-rake-algorithm-in-natural-language-processing/) (Rapid Automatic Keyword Extraction) algorithm is used, via the `rake-nltk` package pinned in `requirements.txt`, which builds on `nltk`.

### A Paper browser

Once the keywords are extracted, they are used to retrieve the 5 most relevant papers from [arxiv.org](https://www.arxiv.org/). These papers are then downloaded and scraped.

To build the scraper, I used the open-source `arxiv` API and `PyPDF2` to read the PDFs.

### A RAG Pipeline ([Paper](https://arxiv.org/pdf/2005.11401.pdf))

The pipeline retrieves the passages from the scraped papers that are most relevant to the query and uses them as context for a summarized answer. One of the main features I implemented (through prompt engineering) is that the bot *cites its sources*, which makes it possible to check the provided answer against them. The pipeline uses OpenAI models (`embedding-v3` and `gpt-3.5-turbo`) for the retrieval and generation steps. Like every LLM, RAG-Maestro can hallucinate. **Having it cite its sources helps detect hallucinations**.


I used [llama_index](https://docs.llamaindex.ai/en/stable/) to build the RAG pipeline, and specifically picked a "tree_summarizer" form of query engine to generate the answer. All the hyperparameters are stored in an editable `config.yml` file.
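
To give a feel for the keyword-extraction step described above, here is a simplified, dependency-free sketch of RAKE-style scoring. It is not the `rake-nltk` implementation and not the code in this repository: the tiny stopword list, the function names `candidate_phrases` and `rake_keywords`, and the `top_n` parameter are all assumptions made for the illustration.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption for this sketch;
# a real setup would rely on a full stopword corpus).
STOPWORDS = {
    "a", "an", "the", "is", "are", "what", "do", "does", "you", "know",
    "of", "in", "to", "and", "for", "how", "it", "on", "with",
}


def candidate_phrases(text):
    """Split text into runs of consecutive non-stopword words."""
    phrases = []
    # Punctuation always ends a phrase, so cut the text into fragments first.
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_keywords(text, top_n=3):
    """Rank candidate phrases by the sum of their words' degree/frequency."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Degree counts co-occurrences within the phrase, the word included.
            degree[word] += len(phrase)
    scores = {p: sum(degree[w] / freq[w] for w in p) for p in phrases}
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return [" ".join(p) for p in ranked[:top_n]]


print(rake_keywords("Do you know what the Memformer is?"))
```

With this stopword list, the example query from the table above reduces to the single keyword `memformer`, which is the kind of term that would then be sent to the paper browser.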

# Commands


### Running the app locally from this repository
- Clone this repository
- Create a new Python environment with pip available
- Run `pip install -r requirements.txt`
- Provide your OpenAI API key as `openai_key` in `.streamlit/secrets.toml` (the app reads it through `st.secrets["openai_key"]`)
- Run `streamlit run src/app.py`
- Open the 'External URL' in your browser and enjoy the bot.


--------------------------------------------------------------------------------
/config.yml:
--------------------------------------------------------------------------------
gpt_model: gpt-3.5-turbo
temperature: 0.9
max_tokens: 1000
chunk_size: 500
similarity_top_k: 5
articles_to_search: 5
llm_system_role:
  "As a chatbot, your goal is to respond to the user's question respectfully and concisely.\
  You will receive the user's new query, along with 3 articles from the web search result for that query.\
  Answer the user with the most relevant information. After answering, cite your sources and provide the url."
llm_format_output:
  " \\
  #Citing sources\
  After giving your final answer, you will cite your sources the following way:\
  'REFERENCES: \
  Title of article -> url \
  Title of article -> url \
  etc...'
  "

--------------------------------------------------------------------------------
/images/beautiful_maestro.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AymenKallala/RAG_Maestro/78a058c202920d5c6cb2b475cdb334ef800bbfa3/images/beautiful_maestro.png

--------------------------------------------------------------------------------
/images/beautiful_maestro_2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AymenKallala/RAG_Maestro/78a058c202920d5c6cb2b475cdb334ef800bbfa3/images/beautiful_maestro_2.png

--------------------------------------------------------------------------------
/images/llava.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AymenKallala/RAG_Maestro/78a058c202920d5c6cb2b475cdb334ef800bbfa3/images/llava.png

--------------------------------------------------------------------------------
/images/maestro.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AymenKallala/RAG_Maestro/78a058c202920d5c6cb2b475cdb334ef800bbfa3/images/maestro.png

--------------------------------------------------------------------------------
/images/memformer.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AymenKallala/RAG_Maestro/78a058c202920d5c6cb2b475cdb334ef800bbfa3/images/memformer.png

--------------------------------------------------------------------------------
/images/streamlit_app.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AymenKallala/RAG_Maestro/78a058c202920d5c6cb2b475cdb334ef800bbfa3/images/streamlit_app.png

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
aiohttp==3.9.1
aiosignal==1.3.1
altair==5.2.0
annotated-types==0.6.0
anyio==4.2.0
arxiv==2.1.0
attrs==23.2.0
Automat==22.10.0
beautifulsoup4==4.12.2
black==23.12.1
blinker==1.7.0
cachetools==5.3.2
certifi==2023.11.17
cffi==1.16.0
chardet==5.2.0
charset-normalizer==3.3.2
click==8.1.7
constantly==23.10.4
cryptography==41.0.7
cssselect==1.2.0
curl-cffi==0.5.10
dataclasses-json==0.6.3
Deprecated==1.2.14
distro==1.9.0
duckduckgo_search==4.1.1
feedparser==6.0.10
filelock==3.13.1
frozenlist==1.4.1
fsspec==2023.12.2
gitdb==4.0.11
GitPython==3.1.40
greenlet==3.0.3
h11==0.14.0
httpcore==1.0.2
httpx==0.26.0
hyperlink==21.0.0
idna==3.6
importlib-metadata==6.11.0
incremental==22.10.0
itemadapter==0.8.0
itemloaders==1.1.0
Jinja2==3.1.2
jmespath==1.0.1
joblib==1.3.2
jsonschema==4.20.0
jsonschema-specifications==2023.12.1
llama-index==0.9.26
lxml==5.0.1
markdown-it-py==3.0.0
MarkupSafe==2.1.3
marshmallow==3.20.1
mdurl==0.1.2
multidict==6.0.4
mypy-extensions==1.0.0
nltk==3.8.1
numpy==1.26.3
openai==1.6.1
outcome==1.3.0.post0
pandas==2.1.4
parsel==1.8.1
pathspec==0.12.1
pillow==10.2.0
Protego==0.3.0
protobuf==4.25.1
pyarrow==14.0.2
pyarxiv==1.0.3.1
pyasn1==0.5.1
pyasn1-modules==0.3.0
pycparser==2.21
pydantic==2.5.3
pydantic_core==2.14.6
pydeck==0.8.1b0
PyDispatcher==2.0.7
pyOpenSSL==23.3.0
pypdf==3.17.4
PyPDF2==3.0.1
pyprojroot==0.3.0
PySocks==1.7.1
python-dotenv==1.0.0
pytz==2023.3.post1
PyYAML==6.0.1
queuelib==1.6.2
rake-nltk==1.0.6
readability-lxml==0.8.1
referencing==0.32.1
regex==2023.12.25
requests==2.31.0
requests-file==1.5.1
rich==13.7.0
rpds-py==0.16.2
Scrapy==2.11.0
selenium==4.16.0
service-identity==23.1.0
setuptools==69.0.3
sgmllib3k==1.0.0
smmap==5.0.1
sniffio==1.3.0
sortedcontainers==2.4.0
soupsieve==2.5
SQLAlchemy==2.0.25
streamlit==1.29.0
streamlit-chat==0.1.1
tenacity==8.2.3
tiktoken==0.5.2
tldextract==5.1.1
toml==0.10.2
toolz==0.12.0
tqdm==4.66.1
trio==0.23.2
trio-websocket==0.11.1
Twisted==22.10.0
typing-inspect==0.9.0
tzdata==2023.4
tzlocal==5.2
urllib3==2.1.0
validators==0.22.0
w3lib==2.1.2
watchdog==3.0.0
webdriver-manager==4.0.1
wheel==0.42.0
wrapt==1.16.0
wsproto==1.2.0
yarl==1.9.4
youtube-transcript-api==0.6.2
zope.interface==6.1

--------------------------------------------------------------------------------
/src/app.py:
--------------------------------------------------------------------------------
import streamlit as st
from streamlit_chat import message
from PIL import Image
from utils.load_config import LoadConfig
from utils.app_utils import load_data, RAG, delete_data
import subprocess
import os


APPCFG = LoadConfig()

# ===================================
# Setting page title and header
# ===================================
im = Image.open("images/maestro.png")
os.environ["OPENAI_API_KEY"] = st.secrets["openai_key"]

st.set_page_config(page_title="RAG-Maestro", page_icon=im, layout="wide")
st.markdown(
    "