├── .env.example
├── .gitignore
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── compose.yaml
├── docker
│   ├── Dockerfile
│   ├── app.py
│   ├── conda_env.sh
│   ├── environment.yml
│   ├── getSecrets.py
│   ├── run.sh
│   ├── toolsFunctions.py
│   ├── usage.md
│   └── utils.py
├── environment.yml
├── flowchart.png
├── local_setup.ps1
├── local_setup.sh
├── logo.png
├── scripts
│   ├── app.py
│   ├── toolsFunctions.py
│   └── utils.py
├── start_services.ps1
├── start_services.sh
└── usage.md

/.env.example:
--------------------------------------------------------------------------------
1 | llamacloud_api_key="llx-xxx"
2 | mistral_api_key="*******************abc"
3 | phoenix_api_key="*******************def"
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | *.xml
2 | .env
3 | papers-parser.code-workspace
4 | data/
5 | qdrant_storage/
6 | scripts/__pycache__/
7 | huggingface_spaces/
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # Contributing to PapersChat
2 | 
3 | Do you want to contribute to this project? Make sure to read these guidelines first :)
4 | 
5 | ## Issue
6 | 
7 | **When to do it**:
8 | 
9 | - You found bugs but you don't know how to solve them, or don't have the time/will to fix them
10 | - You want new features but you don't know how to implement them, or don't have the time/will to implement them
11 | 
12 | > ⚠️ _Always check open and closed issues before you submit yours to avoid duplicates_
13 | 
14 | **How to do it**:
15 | 
16 | - Open an issue
17 | - Give the issue a meaningful title (a short but effective problem description)
18 | - Describe the problem following the issue template
19 | 
20 | ## Traditional contribution
21 | 
22 | **When to do it**:
23 | 
24 | - You found bugs and corrected them
25 | - You optimized/improved the code
26 | - You added new features that you think could be useful to others
27 | 
28 | **How to do it**:
29 | 
30 | 1. Fork this repository
31 | 2. Commit your changes
32 | 3. Submit a pull request (make sure to provide a thorough description of the changes)
33 | 
34 | 
35 | ## Showcase your PapersChat
36 | 
37 | **When to do it**:
38 | 
39 | - You modified the base application with new features but you don't want to (or can't) merge them into the original PapersChat
40 | 
41 | **How to do it**:
42 | 
43 | - Go to the [_GitHub Discussions > Show and tell_](https://github.com/AstraBert/PapersChat/discussions/categories/show-and-tell) page
44 | - Open a new discussion there, describing your PapersChat application
45 | 
46 | ### Thanks for contributing! 
-------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2025 Clelia (Astra) Bertelli 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |
<h1 align="center">PapersChat</h1>
2 | 
3 | <h3 align="center">Chatting With Papers Made Easy</h3>
4 | 
5 | <div align="center">If you find PapersChat useful, please consider supporting us through a donation:</div>
6 | <div align="center">
7 | <!-- GitHub Sponsors badge -->
8 | </div>
9 | 
10 | <div align="center">
11 | <img src="logo.png" alt="PapersChat Logo">
12 | </div>
13 | 
14 | **PapersChat** is an agentic AI application that allows you to chat with your papers and also gather information from papers on ArXiv and PubMed. It is powered by [LlamaIndex](https://www.llamaindex.ai/), [Qdrant](https://qdrant.tech) and [Mistral AI](https://mistral.ai/en).
15 | 
16 | ### Flowchart
17 | 
18 | <div align="center">
19 | <img src="flowchart.png" alt="PapersChat Flowchart">
20 | </div>
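
In code, the flow in the diagram boils down to a ReAct agent that combines a RAG query engine over your ingested papers with the ArXiv and PubMed search tools. Here is a minimal sketch, condensed from `scripts/app.py` (`index` stands for the vector index returned by the ingestion step, and the question is illustrative):

```python
# Condensed from scripts/app.py: wire the papers RAG engine and the
# ArXiv/PubMed search tools into a single ReAct agent.
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool, FunctionTool
from toolsFunctions import arxiv_tool, pubmed_tool

arxivtool = FunctionTool.from_defaults(arxiv_tool, name="arxiv_tool")
pubmedtool = FunctionTool.from_defaults(pubmed_tool, name="pubmed_tool")
# `index` is the VectorStoreIndex built from your uploaded PDFs
rag_tool = QueryEngineTool.from_defaults(index.as_query_engine(), name="papers_rag")

agent = ReActAgent.from_tools(tools=[rag_tool, pubmedtool, arxivtool], verbose=True)
print(agent.chat("What do my papers say about attention mechanisms?"))
```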
21 | 
22 | ### Install and launch it
23 | 
24 | Installation is a single step: simply clone the GitHub repository:
25 | 
26 | ```bash
27 | git clone https://github.com/AstraBert/PapersChat.git
28 | cd PapersChat/
29 | ```
30 | 
31 | To launch the app, you can follow two paths:
32 | 
33 | **1. Docker (recommended)**
34 | 
35 | > _Required: [Docker](https://docs.docker.com/desktop/) and [docker compose](https://docs.docker.com/compose/)_
36 | 
37 | - Set the `mistral_api_key`, `phoenix_api_key` and `llamacloud_api_key` variables in the [`.env.example`](./.env.example) file and rename the file to `.env`. Get these keys:
38 |     + [On Mistral AI](https://console.mistral.ai/api-keys/)
39 |     + [On LlamaCloud](https://cloud.llamaindex.ai/)
40 |     + [On Phoenix/Arize](https://llamatrace.com/projects)
41 | 
42 | ```bash
43 | # fill in your API keys, e.g. mistral_api_key="01234abc"
44 | mv .env.example .env
45 | ```
46 | 
47 | - Launch the docker application:
48 | 
49 | ```bash
50 | # If you are on Linux/macOS
51 | bash start_services.sh
52 | # If you are on Windows
53 | .\start_services.ps1
54 | ```
55 | 
56 | You will see the application running on http://localhost:7860 and you will be able to use it. Depending on your connection and your hardware, the setup might take some time (up to 30 minutes), but this is only for the first time you run it!
57 | 
58 | **2. Source code**
59 | 
60 | > _Required: [Docker](https://docs.docker.com/desktop/), [docker compose](https://docs.docker.com/compose/) and [conda](https://anaconda.org/anaconda/conda)_
61 | 
62 | - Set the `mistral_api_key`, `phoenix_api_key` and `llamacloud_api_key` variables in the [`.env.example`](./.env.example) file and rename the file to `.env`. Get these keys:
63 |     + [On Mistral AI](https://console.mistral.ai/api-keys/)
64 |     + [On LlamaCloud](https://cloud.llamaindex.ai/)
65 |     + [On Phoenix/Arize](https://llamatrace.com/projects)
66 | 
67 | ```bash
68 | mv .env.example .env
69 | # modify the variables, e.g.:
70 | # llamacloud_api_key="llx-000-abc"
71 | # mistral_api_key="01234abc"
72 | # phoenix_api_key="56789def"
73 | ```
74 | 
75 | - Alternatively, if you wish to use Azure OpenAI or Ollama, specify:
76 | 
77 | ```bash
78 | azure_openai_api_key="***" # if you wish to use Azure OpenAI
79 | ollama_model="gemma3:latest" # if you wish to use Ollama
80 | ```
81 | 
82 | > [!IMPORTANT]
83 | > _This is only possible when launching from source code; the Docker launch does not support this option_
84 | 
85 | - Set up PapersChat using the dedicated script:
86 | 
87 | ```bash
88 | # For macOS/Linux users
89 | bash local_setup.sh
90 | # For Windows users
91 | .\local_setup.ps1
92 | ```
93 | 
94 | - Or you can do it manually, if you prefer:
95 | 
96 | ```bash
97 | docker compose up db -d
98 | 
99 | conda env create -f environment.yml
100 | 
101 | conda activate papers-chat
102 | python3 scripts/app.py
103 | conda deactivate
104 | ```
105 | 
106 | ## Contributing
107 | 
108 | Contributions are always welcome! Follow the contribution guidelines reported [here](CONTRIBUTING.md).
109 | 
110 | ## License and rights of usage
111 | 
112 | The software is provided under the MIT [license](./LICENSE). 
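
As a quick sanity check after launch, you can inspect the Qdrant instance that compose starts directly from Python. A minimal sketch, assuming the default setup exposing Qdrant on `localhost:6333` (these are the same client calls the app itself uses):

```python
# Connect to the Qdrant instance started by `docker compose up db -d`
# and print every collection together with its point count.
from qdrant_client import QdrantClient

client = QdrantClient("http://localhost:6333")
for c in client.get_collections().collections:
    print(c.name, client.get_collection(c.name).points_count)
```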
113 | 114 | ### Full documentation will come soon!👷‍♀️ 115 | 116 | -------------------------------------------------------------------------------- /compose.yaml: -------------------------------------------------------------------------------- 1 | name: papers_chat 2 | 3 | services: 4 | app: 5 | build: 6 | context: ./docker/ 7 | dockerfile: Dockerfile 8 | ports: 9 | - 7860:7860 10 | secrets: 11 | - mistral 12 | - phoenix 13 | - llamacloud 14 | networks: 15 | - internal_net 16 | db: 17 | image: qdrant/qdrant 18 | ports: 19 | - 6333:6333 20 | - 6334:6334 21 | volumes: 22 | - "./qdrant_storage:/qdrant/storage" 23 | networks: 24 | - internal_net 25 | 26 | networks: 27 | internal_net: 28 | driver: bridge 29 | driver_opts: 30 | com.docker.network.bridge.host_binding_ipv4: "127.0.0.1" 31 | 32 | secrets: 33 | mistral: 34 | environment: mistral_api_key 35 | phoenix: 36 | environment: phoenix_api_key 37 | llamacloud: 38 | environment: llamacloud_api_key -------------------------------------------------------------------------------- /docker/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM condaforge/miniforge3 2 | 3 | WORKDIR /app/ 4 | COPY . /app/ 5 | RUN bash /app/conda_env.sh 6 | 7 | EXPOSE 7860 8 | CMD ["bash", "/app/run.sh"] -------------------------------------------------------------------------------- /docker/app.py: -------------------------------------------------------------------------------- 1 | from utils import ingest_documents, qdrant_client, List, QdrantVectorStore, VectorStoreIndex, embedder 2 | import gradio as gr 3 | from toolsFunctions import pubmed_tool, arxiv_tool 4 | from llama_index.core.tools import QueryEngineTool, FunctionTool 5 | from llama_index.core import Settings 6 | from llama_index.llms.mistralai import MistralAI 7 | from llama_index.core.llms import ChatMessage 8 | from llama_index.core.agent import ReActAgent 9 | from getSecrets import mistral_api_key, phoenix_api_key 10 | from phoenix.otel import register 11 | from openinference.instrumentation.llama_index import LlamaIndexInstrumentor 12 | import time 13 | import os 14 | 15 | ## Observing and tracing 16 | PHOENIX_API_KEY = phoenix_api_key 17 | os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={PHOENIX_API_KEY}" 18 | os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com" 19 | tracer_provider = register( 20 | project_name="llamaindex", 21 | ) 22 | LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider) 23 | 24 | ## Globals 25 | Settings.llm = MistralAI(model="mistral-small-latest", temperature=0, api_key=mistral_api_key) 26 | Settings.embed_model = embedder 27 | arxivtool = FunctionTool.from_defaults(arxiv_tool, name="arxiv_tool", description="A tool to search ArXiv (pre-print papers database) for specific papers") 28 | pubmedtool = FunctionTool.from_defaults(pubmed_tool, name="pubmed_tool", description="A tool to search PubMed (printed medical papers database) for specific papers") 29 | query_engine = None 30 | message_history = [ 31 | ChatMessage(role="system", content="You are a useful assistant that has to help the user with questions that they ask about several papers they uploaded. You should base your answers on the context you can retrieve from the PDFs and, if you cannot retrieve any, search ArXiv for a potential answer. 
If you cannot find any viable answer, please reply that you do not know the answer to the user's question") 32 | ] 33 | 34 | ## Functions 35 | def reply(message, history, files: List[str] | None, collection: str | None, llamaparse: bool = False): 36 | global message_history 37 | if message == "" or message is None: 38 | response = "You should provide a message" 39 | r = "" 40 | for char in response: 41 | r+=char 42 | time.sleep(0.001) 43 | yield r 44 | elif files is None and collection == "": 45 | res = "### WARNING! You did not specify any collection, so I only interrogated ArXiv and/or PubMed to answer your question\n\n" 46 | agent = ReActAgent.from_tools(tools=[pubmedtool, arxivtool], verbose=True) 47 | response = agent.chat(message = message, chat_history = message_history) 48 | response = str(response) 49 | message_history.append(ChatMessage(role="user", content=message)) 50 | message_history.append(ChatMessage(role="assistant", content=response)) 51 | response = res + response 52 | r = "" 53 | for char in response: 54 | r+=char 55 | time.sleep(0.001) 56 | yield r 57 | elif files is None and collection != "" and collection not in [c.name for c in qdrant_client.get_collections().collections]: 58 | response = "Make sure that the name of the existing collection to use as a knowledge base is correct, because the one you provided does not exist! You can check your existing collections and their features in the dedicated tab of the app :)" 59 | r = "" 60 | for char in response: 61 | r+=char 62 | time.sleep(0.001) 63 | yield r 64 | elif files is not None: 65 | if collection == "": 66 | response = "You should provide a collection name (new or existing) if you want to ingest files!" 67 | r = "" 68 | for char in response: 69 | r+=char 70 | time.sleep(0.001) 71 | yield r 72 | else: 73 | collection_name = collection 74 | index = ingest_documents(files, collection_name, llamaparse) 75 | query_engine = index.as_query_engine() 76 | rag_tool = QueryEngineTool.from_defaults(query_engine, name="papers_rag", description="A RAG engine with information from selected scientific papers") 77 | agent = ReActAgent.from_tools(tools=[rag_tool, pubmedtool, arxivtool], verbose=True) 78 | response = agent.chat(message = message, chat_history = message_history) 79 | response = str(response) 80 | message_history.append(ChatMessage(role="user", content=message)) 81 | message_history.append(ChatMessage(role="assistant", content=response)) 82 | r = "" 83 | for char in response: 84 | r+=char 85 | time.sleep(0.001) 86 | yield r 87 | else: 88 | vector_store = QdrantVectorStore(client = qdrant_client, collection_name=collection, enable_hybrid=True) 89 | index = VectorStoreIndex.from_vector_store(vector_store=vector_store) 90 | query_engine = index.as_query_engine() 91 | rag_tool = QueryEngineTool.from_defaults(query_engine, name="papers_rag", description="A RAG engine with information from selected scientific papers") 92 | agent = ReActAgent.from_tools(tools=[rag_tool, pubmedtool, arxivtool], verbose=True) 93 | response = agent.chat(message = message, chat_history = message_history) 94 | response = str(response) 95 | message_history.append(ChatMessage(role="user", content=message)) 96 | message_history.append(ChatMessage(role="assistant", content=response)) 97 | r = "" 98 | for char in response: 99 | r+=char 100 | time.sleep(0.001) 101 | yield r 102 | 103 | def to_markdown_color(grade: str): 104 | colors = {"red": "ff0000", "yellow": "ffcc00", "green": "33cc33"} 105 | mdcode = 
f"![#{colors[grade]}](https://placehold.co/15x15/{colors[grade]}/{colors[grade]}.png)" 106 | return mdcode 107 | 108 | def get_qdrant_collections_dets(): 109 | collections = [c.name for c in qdrant_client.get_collections().collections] 110 | details = [] 111 | counter = 0 112 | for collection in collections: 113 | counter += 1 114 | dets = qdrant_client.get_collection(collection) 115 | p = f"### {counter}. {collection}\n\n**Number of Points**: {dets.points_count}\n\n**Status**: {to_markdown_color(dets.status)} {dets.status}\n\n" 116 | details.append(p) 117 | final_text = "
## Available Collections
\n\n" 118 | final_text += "\n\n".join(details) 119 | return final_text 120 | 121 | ## Frontend 122 | accordion = gr.Accordion(label="⚠️Set up these parameters before you start chatting!⚠️") 123 | 124 | iface1 = gr.ChatInterface(fn=reply, additional_inputs=[gr.File(label="Upload Papers (only PDF allowed!) - Ingestion", file_count="multiple", file_types=[".pdf","pdf",".PDF","PDF"], value=None), gr.Textbox(label="Collection", info="Upload your papers to a collection (new or existing)", value=""), gr.Checkbox(label="Use LlamaParse", info="Needs the LlamaCloud API key", value=False)], additional_inputs_accordion=accordion) 125 | u = open("usage.md") 126 | content = u.read() 127 | u.close() 128 | iface2 = gr.Blocks() 129 | with iface2: 130 | with gr.Row(): 131 | gr.Markdown(content) 132 | iface3 = gr.Interface(fn=get_qdrant_collections_dets, inputs=None, outputs=gr.Markdown(label="Collections"), submit_btn="See your collections") 133 | iface = gr.TabbedInterface([iface1, iface2, iface3], ["Chat💬", "Usage Guide⚙️", "Your Collections🔎"], title="PapersChat📝") 134 | iface.launch(server_name="0.0.0.0", server_port=7860) -------------------------------------------------------------------------------- /docker/conda_env.sh: -------------------------------------------------------------------------------- 1 | eval "$(conda shell.bash hook)" 2 | 3 | conda env create -f /app/environment.yml -------------------------------------------------------------------------------- /docker/environment.yml: -------------------------------------------------------------------------------- 1 | name: papers-chat 2 | channels: 3 | - conda-forge 4 | dependencies: 5 | - _libgcc_mutex=0.1=conda_forge 6 | - _openmp_mutex=4.5=2_gnu 7 | - aiohappyeyeballs=2.4.6=pyhd8ed1ab_0 8 | - aiohttp=3.11.12=py311h2dc5d0c_0 9 | - aiosignal=1.3.2=pyhd8ed1ab_0 10 | - annotated-types=0.7.0=pyhd8ed1ab_1 11 | - anyio=4.8.0=pyhd8ed1ab_0 12 | - attrs=25.1.0=pyh71513ae_0 13 | - beautifulsoup4=4.13.3=pyha770c72_0 14 | - brotli-python=1.1.0=py311hfdbb021_2 15 | - bzip2=1.0.8=h4bc722e_7 16 | - ca-certificates=2025.1.31=hbcca054_0 17 | - certifi=2025.1.31=pyhd8ed1ab_0 18 | - cffi=1.17.1=py311hf29c0ef_0 19 | - charset-normalizer=3.4.1=pyhd8ed1ab_0 20 | - click=8.1.8=pyh707e725_0 21 | - colorama=0.4.6=pyhd8ed1ab_1 22 | - dataclasses-json=0.6.7=pyhd8ed1ab_1 23 | - deprecated=1.2.18=pyhd8ed1ab_0 24 | - dirtyjson=1.0.8=pyhd8ed1ab_1 25 | - distro=1.9.0=pyhd8ed1ab_1 26 | - eval-type-backport=0.2.2=pyhd8ed1ab_0 27 | - eval_type_backport=0.2.2=pyha770c72_0 28 | - exceptiongroup=1.2.2=pyhd8ed1ab_1 29 | - filetype=1.2.0=pyhd8ed1ab_0 30 | - freetype=2.12.1=h267a509_2 31 | - frozenlist=1.5.0=py311h2dc5d0c_1 32 | - fsspec=2025.2.0=pyhd8ed1ab_0 33 | - greenlet=3.1.1=py311hfdbb021_1 34 | - h11=0.14.0=pyhd8ed1ab_1 35 | - h2=4.2.0=pyhd8ed1ab_0 36 | - hpack=4.1.0=pyhd8ed1ab_0 37 | - httpcore=1.0.7=pyh29332c3_1 38 | - httpx=0.28.1=pyhd8ed1ab_0 39 | - hyperframe=6.1.0=pyhd8ed1ab_0 40 | - idna=3.10=pyhd8ed1ab_1 41 | - jiter=0.8.2=py311h9e33e62_0 42 | - joblib=1.4.2=pyhd8ed1ab_1 43 | - lcms2=2.17=h717163a_0 44 | - ld_impl_linux-64=2.43=h712a8e2_2 45 | - lerc=4.0.0=h27087fc_0 46 | - libblas=3.9.0=28_h59b9bed_openblas 47 | - libcblas=3.9.0=28_he106b2a_openblas 48 | - libdeflate=1.23=h4ddbbb0_0 49 | - libexpat=2.6.4=h5888daf_0 50 | - libffi=3.4.6=h2dba641_0 51 | - libgcc=14.2.0=h77fa898_1 52 | - libgcc-ng=14.2.0=h69a702a_1 53 | - libgfortran=14.2.0=h69a702a_1 54 | - libgfortran5=14.2.0=hd5240d6_1 55 | - libgomp=14.2.0=h77fa898_1 56 | - libjpeg-turbo=3.0.0=hd590300_1 57 | - 
liblapack=3.9.0=28_h7ac8fdf_openblas 58 | - liblzma=5.6.4=hb9d3cd8_0 59 | - libnsl=2.0.1=hd590300_0 60 | - libopenblas=0.3.28=pthreads_h94d23a6_1 61 | - libpng=1.6.46=h943b412_0 62 | - libsqlite=3.48.0=hee588c1_1 63 | - libstdcxx=14.2.0=hc0a3c3a_1 64 | - libstdcxx-ng=14.2.0=h4852527_1 65 | - libtiff=4.7.0=hd9ff511_3 66 | - libuuid=2.38.1=h0b41bf4_0 67 | - libwebp-base=1.5.0=h851e524_0 68 | - libxcb=1.17.0=h8a09558_0 69 | - libxcrypt=4.4.36=hd590300_1 70 | - libzlib=1.3.1=hb9d3cd8_2 71 | - llama-cloud=0.1.12=pyhd8ed1ab_0 72 | - llama-cloud-services=0.6.1=pyhd8ed1ab_0 73 | - llama-index=0.12.17=pyhd8ed1ab_0 74 | - llama-index-agent-openai=0.4.5=pyhd8ed1ab_0 75 | - llama-index-cli=0.4.0=pyhd8ed1ab_1 76 | - llama-index-core=0.12.17=pyhd8ed1ab_1 77 | - llama-index-embeddings-openai=0.3.1=pyhd8ed1ab_1 78 | - llama-index-indices-managed-llama-cloud=0.6.4=pyhd8ed1ab_0 79 | - llama-index-legacy=0.9.48.post4=pyhd8ed1ab_1 80 | - llama-index-llms-openai=0.3.19=pyhd8ed1ab_0 81 | - llama-index-multi-modal-llms-openai=0.4.3=pyhd8ed1ab_0 82 | - llama-index-program-openai=0.3.1=pyhd8ed1ab_1 83 | - llama-index-question-gen-openai=0.3.0=pyhd8ed1ab_1 84 | - llama-index-readers-file=0.4.5=pyhd8ed1ab_0 85 | - llama-index-readers-llama-parse=0.4.0=pyhd8ed1ab_1 86 | - llama-parse=0.6.1=pyhd8ed1ab_0 87 | - llamaindex-py-client=0.1.19=pyhd8ed1ab_1 88 | - marshmallow=3.26.1=pyhd8ed1ab_0 89 | - multidict=6.1.0=py311h2dc5d0c_2 90 | - mypy_extensions=1.0.0=pyha770c72_1 91 | - ncurses=6.5=h2d0b736_3 92 | - nest-asyncio=1.6.0=pyhd8ed1ab_1 93 | - networkx=3.4.2=pyh267e887_2 94 | - nltk=3.9.1=pyhd8ed1ab_1 95 | - numpy=2.2.3=py311h5d046bc_0 96 | - openai=1.63.0=pyhd8ed1ab_0 97 | - openjpeg=2.5.3=h5fbd93e_0 98 | - openssl=3.4.1=h7b32b05_0 99 | - packaging=24.2=pyhd8ed1ab_2 100 | - pandas=2.2.3=py311h7db5c69_1 101 | - pip=25.0.1=pyh8b19718_0 102 | - propcache=0.2.1=py311h2dc5d0c_1 103 | - pthread-stubs=0.4=hb9d3cd8_1002 104 | - pycparser=2.22=pyh29332c3_1 105 | - pydantic=2.10.6=pyh3cfb1c2_0 106 | - pydantic-core=2.27.2=py311h9e33e62_0 107 | - pypdf=5.3.0=pyh29332c3_0 108 | - pysocks=1.7.1=pyha55dd90_7 109 | - python=3.11.11=h9e4cc4f_1_cpython 110 | - python-dateutil=2.9.0.post0=pyhff2d567_1 111 | - python-dotenv=1.0.1=pyhd8ed1ab_1 112 | - python-tzdata=2025.1=pyhd8ed1ab_0 113 | - python_abi=3.11=5_cp311 114 | - pytz=2024.1=pyhd8ed1ab_0 115 | - pyyaml=6.0.2=py311h2dc5d0c_2 116 | - readline=8.2=h8228510_1 117 | - regex=2024.11.6=py311h9ecbd09_0 118 | - requests=2.32.3=pyhd8ed1ab_1 119 | - setuptools=75.8.0=pyhff2d567_0 120 | - six=1.17.0=pyhd8ed1ab_0 121 | - sniffio=1.3.1=pyhd8ed1ab_1 122 | - soupsieve=2.5=pyhd8ed1ab_1 123 | - sqlalchemy=2.0.38=py311h9ecbd09_0 124 | - striprtf=0.0.26=pyhd8ed1ab_0 125 | - tenacity=8.5.0=pyhd8ed1ab_0 126 | - tiktoken=0.9.0=py311hf1706b8_0 127 | - tk=8.6.13=noxft_h4845f30_101 128 | - tqdm=4.67.1=pyhd8ed1ab_1 129 | - typing-extensions=4.12.2=hd8ed1ab_1 130 | - typing_extensions=4.12.2=pyha770c72_1 131 | - typing_inspect=0.9.0=pyhd8ed1ab_1 132 | - tzdata=2025a=h78e105d_0 133 | - urllib3=2.3.0=pyhd8ed1ab_0 134 | - wheel=0.45.1=pyhd8ed1ab_1 135 | - wrapt=1.17.2=py311h9ecbd09_0 136 | - xorg-libxau=1.0.12=hb9d3cd8_0 137 | - xorg-libxdmcp=1.1.5=hb9d3cd8_0 138 | - yaml=0.2.5=h7f98852_2 139 | - yarl=1.18.3=py311h2dc5d0c_1 140 | - zstandard=0.23.0=py311hbc35293_1 141 | - zstd=1.5.6=ha6fb4c9_0 142 | - pip: 143 | - aiofiles==23.2.1 144 | - aioitertools==0.12.0 145 | - aiosqlite==0.21.0 146 | - alembic==1.14.1 147 | - arize-phoenix==7.12.2 148 | - arize-phoenix-evals==0.20.3 149 | - arize-phoenix-otel==0.7.1 
150 | - arxiv==2.1.3 151 | - authlib==1.4.1 152 | - azure-ai-documentintelligence==1.0.0 153 | - azure-core==1.32.0 154 | - azure-identity==1.20.0 155 | - biopython==1.85 156 | - cachetools==5.5.1 157 | - cobble==0.1.4 158 | - coloredlogs==15.0.1 159 | - cryptography==44.0.1 160 | - defusedxml==0.7.1 161 | - et-xmlfile==2.0.0 162 | - fastapi==0.115.8 163 | - fastembed==0.5.1 164 | - feedparser==6.0.11 165 | - ffmpy==0.5.0 166 | - filelock==3.17.0 167 | - flatbuffers==25.2.10 168 | - googleapis-common-protos==1.67.0 169 | - gradio==5.16.0 170 | - gradio-client==1.7.0 171 | - graphql-core==3.2.6 172 | - grpc-interceptor==0.15.4 173 | - grpcio==1.70.0 174 | - grpcio-tools==1.70.0 175 | - huggingface-hub==0.28.1 176 | - humanfriendly==10.0 177 | - importlib-metadata==8.5.0 178 | - isodate==0.7.2 179 | - jinja2==3.1.4 180 | - jsonpath-python==1.0.6 181 | - llama-index-embeddings-fastembed==0.3.0 182 | - llama-index-embeddings-huggingface==0.5.1 183 | - llama-index-llms-mistralai==0.3.2 184 | - llama-index-tools-arxiv==0.3.0 185 | - llama-index-vector-stores-qdrant==0.4.3 186 | - loguru==0.7.3 187 | - lxml==5.3.1 188 | - mako==1.3.9 189 | - mammoth==1.9.0 190 | - markdown-it-py==3.0.0 191 | - markdownify==0.14.1 192 | - markitdown==0.0.1a4 193 | - markupsafe==2.1.5 194 | - mdurl==0.1.2 195 | - mistralai==1.5.0 196 | - mmh3==4.1.0 197 | - mpmath==1.3.0 198 | - msal==1.31.1 199 | - msal-extensions==1.2.0 200 | - nvidia-cublas-cu12==12.4.5.8 201 | - nvidia-cuda-cupti-cu12==12.4.127 202 | - nvidia-cuda-nvrtc-cu12==12.4.127 203 | - nvidia-cuda-runtime-cu12==12.4.127 204 | - nvidia-cudnn-cu12==9.1.0.70 205 | - nvidia-cufft-cu12==11.2.1.3 206 | - nvidia-curand-cu12==10.3.5.147 207 | - nvidia-cusolver-cu12==11.6.1.9 208 | - nvidia-cusparse-cu12==12.3.1.170 209 | - nvidia-cusparselt-cu12==0.6.2 210 | - nvidia-nccl-cu12==2.21.5 211 | - nvidia-nvjitlink-cu12==12.4.127 212 | - nvidia-nvtx-cu12==12.4.127 213 | - olefile==0.47 214 | - onnxruntime==1.20.1 215 | - openinference-instrumentation==0.1.22 216 | - openinference-instrumentation-llama-index==3.2.0 217 | - openinference-semantic-conventions==0.1.14 218 | - openpyxl==3.1.5 219 | - opentelemetry-api==1.30.0 220 | - opentelemetry-exporter-otlp==1.30.0 221 | - opentelemetry-exporter-otlp-proto-common==1.30.0 222 | - opentelemetry-exporter-otlp-proto-grpc==1.30.0 223 | - opentelemetry-exporter-otlp-proto-http==1.30.0 224 | - opentelemetry-instrumentation==0.51b0 225 | - opentelemetry-proto==1.30.0 226 | - opentelemetry-sdk==1.30.0 227 | - opentelemetry-semantic-conventions==0.51b0 228 | - orjson==3.10.15 229 | - pathvalidate==3.2.3 230 | - pdfminer-six==20240706 231 | - pillow==10.4.0 232 | - portalocker==2.10.1 233 | - protobuf==5.29.3 234 | - psutil==7.0.0 235 | - puremagic==1.28 236 | - py-rust-stemmers==0.1.3 237 | - pyarrow==19.0.0 238 | - pydub==0.25.1 239 | - pygments==2.19.1 240 | - pyjwt==2.10.1 241 | - python-multipart==0.0.20 242 | - python-pptx==1.0.2 243 | - qdrant-client==1.13.2 244 | - rich==13.9.4 245 | - ruff==0.9.6 246 | - safehttpx==0.1.6 247 | - safetensors==0.5.2 248 | - scikit-learn==1.6.1 249 | - scipy==1.15.1 250 | - semantic-version==2.10.0 251 | - sentence-transformers==3.4.1 252 | - sgmllib3k==1.0.0 253 | - shellingham==1.5.4 254 | - speechrecognition==3.14.1 255 | - sqlean-py==3.47.0 256 | - starlette==0.45.3 257 | - strawberry-graphql==0.253.1 258 | - sympy==1.13.1 259 | - threadpoolctl==3.5.0 260 | - tokenizers==0.21.0 261 | - tomlkit==0.13.2 262 | - torch==2.6.0 263 | - torchaudio==2.6.0 264 | - torchvision==0.21.0 265 | - 
transformers==4.48.3
266 | - triton==3.2.0
267 | - typer==0.15.1
268 | - uvicorn==0.34.0
269 | - websockets==14.2
270 | - xlrd==2.0.1
271 | - xlsxwriter==3.2.2
272 | - youtube-transcript-api==0.6.3
273 | - zipp==3.21.0
--------------------------------------------------------------------------------
/docker/getSecrets.py:
--------------------------------------------------------------------------------
1 | m = open("/run/secrets/mistral")
2 | mistral_api_key = m.read()
3 | m.close()
4 | p = open("/run/secrets/phoenix")
5 | phoenix_api_key = p.read()
6 | p.close()
7 | l = open("/run/secrets/llamacloud")
8 | llamacloud_api_key = l.read()
9 | l.close()
--------------------------------------------------------------------------------
/docker/run.sh:
--------------------------------------------------------------------------------
1 | eval "$(conda shell.bash hook)"
2 | 
3 | conda activate papers-chat
4 | echo "Activated conda env"
5 | python3 /app/app.py
6 | 
--------------------------------------------------------------------------------
/docker/toolsFunctions.py:
--------------------------------------------------------------------------------
1 | import urllib, urllib.request
2 | from pydantic import Field
3 | from datetime import datetime
4 | from markitdown import MarkItDown
5 | from Bio import Entrez
6 | import xml.etree.ElementTree as ET
7 | 
8 | md = MarkItDown()
9 | 
10 | def format_today():
11 |     d = datetime.now()
12 |     if d.month < 10:
13 |         month = f"0{d.month}"
14 |     else:
15 |         month = d.month
16 |     if d.day < 10:
17 |         day = f"0{d.day}"
18 |     else:
19 |         day = d.day
20 |     if d.hour < 10:
21 |         hour = f"0{d.hour}"
22 |     else:
23 |         hour = d.hour
24 |     if d.minute < 10:
25 |         minute = f"0{d.minute}"
26 |     else:
27 |         minute = d.minute
28 |     today = f"{d.year}{month}{day}{hour}{minute}"
29 |     two_years_ago = f"{d.year-2}{month}{day}{hour}{minute}"
30 |     return today, two_years_ago
31 | 
32 | def arxiv_tool(search_query: str = Field(description="The query with which to search ArXiv database")):
33 |     """A tool to search ArXiv"""
34 |     today, two_years_ago = format_today()
35 |     query = search_query.replace(" ", "+")
36 |     url = f'http://export.arxiv.org/api/query?search_query=all:{query}+AND+submittedDate:[{two_years_ago}+TO+{today}]&start=0&max_results=3'
37 |     data = urllib.request.urlopen(url)
38 |     content = data.read().decode("utf-8")
39 |     f = open("arxiv_results.xml", "w")
40 |     f.write(content)
41 |     f.close()
42 |     result = md.convert("arxiv_results.xml")
43 |     return result.text_content
44 | 
45 | def search_pubmed(query):
46 |     Entrez.email = "astraberte9@gmail.com" # Replace with your email
47 |     handle = Entrez.esearch(db="pubmed", term=query, retmax=3)
48 |     record = Entrez.read(handle)
49 |     handle.close()
50 |     return record["IdList"]
51 | 
52 | def fetch_pubmed_details(pubmed_ids):
53 |     Entrez.email = "your.personal@email.com" # Replace with your email
54 |     handle = Entrez.efetch(db="pubmed", id=pubmed_ids, rettype="medline", retmode="xml")
55 |     records = handle.read()
56 |     handle.close()
57 |     recs = records.decode("utf-8")
58 |     f = open("biomed_results.xml", "w")
59 |     f.write(recs)
60 |     f.close()
61 | 
62 | def fetch_xml():
63 |     tree = ET.parse("biomed_results.xml")
64 |     root = tree.getroot()
65 |     parsed_articles = []
66 |     for article in root.findall('PubmedArticle'):
67 |         # Extract title
68 |         title = article.find('.//ArticleTitle')
69 |         title_text = title.text if title is not None else "No title"
70 |         # Extract abstract
71 |         abstract = article.find('.//Abstract/AbstractText')
72 |         abstract_text = abstract.text if
abstract is not None else "No abstract" 73 | # Format output 74 | formatted_entry = f"## {title_text}\n\n**Abstract**:\n\n{abstract_text}" 75 | parsed_articles.append(formatted_entry) 76 | return "\n\n".join(parsed_articles) 77 | 78 | def pubmed_tool(search_query: str = Field(description="The query with which to search PubMed database")): 79 | """A tool to search PubMed""" 80 | idlist = search_pubmed(search_query) 81 | if len(idlist) == 0: 82 | return "There is no significant match in PubMed" 83 | fetch_pubmed_details(idlist) 84 | content = fetch_xml() 85 | return content 86 | -------------------------------------------------------------------------------- /docker/usage.md: -------------------------------------------------------------------------------- 1 |
<h1 align="center">PapersChat Usage Guide</h1>
2 | 
3 | <div align="center">If you find PapersChat useful, please consider supporting us through a donation:</div>
4 | <div align="center">
5 | <!-- GitHub Sponsors badge -->
6 | </div>
7 | 
8 | > _This guide covers only how to use **the app**, not how to install or launch it, nor how it works internally. For that, please refer to the [GitHub repository](https://github.com/AstraBert/PapersChat)_
9 | 
10 | ## Use PapersChat with your documents
11 | 
12 | If you have papers that you would like to chat with, this is the right section of the guide!
13 | 
14 | In order to chat with your papers, you will need to upload them (**as PDF files**) through the dedicated "Upload Papers" widget at the bottom of the chat interface: you can upload one or more files there (remember: the more you upload, the slower their processing is going to be).
15 | 
16 | Once you have uploaded the files, before submitting them, you have to do two more things:
17 | 
18 | 1. Specify the collection that you want to upload the documents to (in the "Collection" area)
19 | 2. Write your first question/message to interrogate your papers (in the message input space)
20 | 
21 | As far as point (1) is concerned, you can give your collection whatever name you want: once you have created a new collection, you can always re-use it in the future by inputting the same name. If you do not remember all your collections, you can go to the "Your collections" tab in the application and click on "Generate" to see the list of your collections.
22 | 
23 | Point (2) is very important: if you do not send any message, PapersChat will tell you that you need to send one.
24 | 
25 | Once you have uploaded the papers, specified the collection and written the message, you can send the message and PapersChat will:
26 | 
27 | - Ingest your documents
28 | - Produce an answer to your question
29 | 
30 | Congrats! You now have your first collection and your first answer!
31 | 
32 | > _**NOTE**: there is still an option we haven't talked about, i.e. the 'LlamaParse' checkbox. If you select that checkbox, you will enable LlamaParse, a tool that LlamaIndex offers [as part of its LlamaCloud services](https://docs.llamaindex.ai/en/stable/llama_cloud/llama_parse/). LlamaParse employs enhanced parsing techniques to produce clean and well-structured data from (often messy) unstructured documents: the free tier allows parsing up to 1000 pages/day. While this approach generates very good data for your collections, take into account that it might take quite some time to parse your documents (especially if they are dense, have lots of text-in-images or are very long). By default the LlamaParse option is disabled_
33 | 
34 | ## Use PapersChat with a collection as knowledge base
35 | 
36 | Once you have uploaded all your documents, you might want to interrogate them without having to upload even more. That's where the "collection as knowledge base" option comes in handy. You can simply send a message selecting one of your existing collections as a knowledge base for PapersChat (without uploading any file) and... BAM! You will see that PapersChat replies to your questions :)
37 | 
38 | ## Use PapersChat to interrogate PubMed/ArXiv
39 | 
40 | PapersChat also has access to the PubMed and ArXiv paper archives: if you do not specify a collection name and do not upload any files, PapersChat uses your question to search these two online databases for an answer. 
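
If you are curious what this search looks like under the hood, it boils down to a Biopython Entrez query; a simplified sketch from `docker/toolsFunctions.py` (the e-mail and search term are illustrative placeholders):

```python
# Simplified from docker/toolsFunctions.py: fetch the IDs of the three
# best PubMed matches for a query through NCBI Entrez.
from Bio import Entrez

Entrez.email = "you@example.com"  # NCBI requires a contact e-mail
handle = Entrez.esearch(db="pubmed", term="CRISPR off-target effects", retmax=3)
ids = Entrez.read(handle)["IdList"]
handle.close()
print(ids)
```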
41 | 
42 | ## Monitor your collections
43 | 
44 | Under the "Your Collections" tab of the application you can click on "Generate" to see your collections: how many data points each collection contains (the number of data points **does not match** the number of papers you uploaded) and what the status of each collection is.
45 | 
46 | A brief guide to the collection statuses:
47 | 
48 | - "green": the collection is optimized and searchable
49 | - "yellow": the collection is being optimized and you can search it
50 | - "red": the collection is not optimized and will probably return an error if you try to search it
--------------------------------------------------------------------------------
/docker/utils.py:
--------------------------------------------------------------------------------
1 | from llama_index.embeddings.huggingface import HuggingFaceEmbedding
2 | from llama_index.core import Settings
3 | from qdrant_client import QdrantClient
4 | from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
5 | from llama_index.core import StorageContext
6 | from llama_index.vector_stores.qdrant import QdrantVectorStore
7 | from llama_cloud_services import LlamaParse
8 | from getSecrets import llamacloud_api_key
9 | from typing import List
10 | import torch
11 | 
12 | 
13 | qdrant_client = QdrantClient("http://127.0.0.1:6333")
14 | device = "cuda" if torch.cuda.is_available() else "cpu"
15 | embedder = HuggingFaceEmbedding(model_name="nomic-ai/modernbert-embed-base", device=device)
16 | Settings.embed_model = embedder
17 | 
18 | def ingest_documents(files: List[str], collection_name: str, llamaparse: bool):
19 |     vector_store = QdrantVectorStore(client=qdrant_client, collection_name=collection_name, enable_hybrid=True)
20 |     storage_context = StorageContext.from_defaults(vector_store=vector_store)
21 |     if llamaparse:
22 |         parser = LlamaParse(
23 |             result_type="markdown",
24 |             api_key=llamacloud_api_key
25 |         )
26 |         file_extractor = {".pdf": parser}
27 |         documents = SimpleDirectoryReader(input_files=files, file_extractor=file_extractor).load_data()
28 |     else:
29 |         documents = SimpleDirectoryReader(input_files=files).load_data()
30 |     index = VectorStoreIndex.from_documents(
31 |         documents,
32 |         storage_context=storage_context,
33 |     )
34 |     return index
35 | 
36 | 
--------------------------------------------------------------------------------
/environment.yml:
--------------------------------------------------------------------------------
1 | name: papers-chat
2 | channels:
3 |   - conda-forge
4 | dependencies:
5 |   - _libgcc_mutex=0.1=conda_forge
6 |   - _openmp_mutex=4.5=2_gnu
7 |   - aiohappyeyeballs=2.4.6=pyhd8ed1ab_0
8 |   - aiohttp=3.11.12=py311h2dc5d0c_0
9 |   - aiosignal=1.3.2=pyhd8ed1ab_0
10 |   - annotated-types=0.7.0=pyhd8ed1ab_1
11 |   - anyio=4.8.0=pyhd8ed1ab_0
12 |   - attrs=25.1.0=pyh71513ae_0
13 |   - beautifulsoup4=4.13.3=pyha770c72_0
14 |   - brotli-python=1.1.0=py311hfdbb021_2
15 |   - bzip2=1.0.8=h4bc722e_7
16 |   - ca-certificates=2025.1.31=hbcca054_0
17 |   - certifi=2025.1.31=pyhd8ed1ab_0
18 |   - cffi=1.17.1=py311hf29c0ef_0
19 |   - charset-normalizer=3.4.1=pyhd8ed1ab_0
20 |   - click=8.1.8=pyh707e725_0
21 |   - colorama=0.4.6=pyhd8ed1ab_1
22 |   - dataclasses-json=0.6.7=pyhd8ed1ab_1
23 |   - deprecated=1.2.18=pyhd8ed1ab_0
24 |   - dirtyjson=1.0.8=pyhd8ed1ab_1
25 |   - distro=1.9.0=pyhd8ed1ab_1
26 |   - eval-type-backport=0.2.2=pyhd8ed1ab_0
27 |   - eval_type_backport=0.2.2=pyha770c72_0
28 |   - exceptiongroup=1.2.2=pyhd8ed1ab_1
29 |   - filetype=1.2.0=pyhd8ed1ab_0
30 |   - freetype=2.12.1=h267a509_2
31 |   - 
frozenlist=1.5.0=py311h2dc5d0c_1 32 | - fsspec=2025.2.0=pyhd8ed1ab_0 33 | - greenlet=3.1.1=py311hfdbb021_1 34 | - h11=0.14.0=pyhd8ed1ab_1 35 | - h2=4.2.0=pyhd8ed1ab_0 36 | - hpack=4.1.0=pyhd8ed1ab_0 37 | - httpcore=1.0.7=pyh29332c3_1 38 | - httpx=0.28.1=pyhd8ed1ab_0 39 | - hyperframe=6.1.0=pyhd8ed1ab_0 40 | - idna=3.10=pyhd8ed1ab_1 41 | - jiter=0.8.2=py311h9e33e62_0 42 | - joblib=1.4.2=pyhd8ed1ab_1 43 | - lcms2=2.17=h717163a_0 44 | - ld_impl_linux-64=2.43=h712a8e2_2 45 | - lerc=4.0.0=h27087fc_0 46 | - libblas=3.9.0=28_h59b9bed_openblas 47 | - libcblas=3.9.0=28_he106b2a_openblas 48 | - libdeflate=1.23=h4ddbbb0_0 49 | - libexpat=2.6.4=h5888daf_0 50 | - libffi=3.4.6=h2dba641_0 51 | - libgcc=14.2.0=h77fa898_1 52 | - libgcc-ng=14.2.0=h69a702a_1 53 | - libgfortran=14.2.0=h69a702a_1 54 | - libgfortran5=14.2.0=hd5240d6_1 55 | - libgomp=14.2.0=h77fa898_1 56 | - libjpeg-turbo=3.0.0=hd590300_1 57 | - liblapack=3.9.0=28_h7ac8fdf_openblas 58 | - liblzma=5.6.4=hb9d3cd8_0 59 | - libnsl=2.0.1=hd590300_0 60 | - libopenblas=0.3.28=pthreads_h94d23a6_1 61 | - libpng=1.6.46=h943b412_0 62 | - libsqlite=3.48.0=hee588c1_1 63 | - libstdcxx=14.2.0=hc0a3c3a_1 64 | - libstdcxx-ng=14.2.0=h4852527_1 65 | - libtiff=4.7.0=hd9ff511_3 66 | - libuuid=2.38.1=h0b41bf4_0 67 | - libwebp-base=1.5.0=h851e524_0 68 | - libxcb=1.17.0=h8a09558_0 69 | - libxcrypt=4.4.36=hd590300_1 70 | - libzlib=1.3.1=hb9d3cd8_2 71 | - llama-cloud=0.1.12=pyhd8ed1ab_0 72 | - llama-cloud-services=0.6.1=pyhd8ed1ab_0 73 | - llama-index=0.12.17=pyhd8ed1ab_0 74 | - llama-index-agent-openai=0.4.5=pyhd8ed1ab_0 75 | - llama-index-cli=0.4.0=pyhd8ed1ab_1 76 | - llama-index-core=0.12.17=pyhd8ed1ab_1 77 | - llama-index-embeddings-openai=0.3.1=pyhd8ed1ab_1 78 | - llama-index-indices-managed-llama-cloud=0.6.4=pyhd8ed1ab_0 79 | - llama-index-legacy=0.9.48.post4=pyhd8ed1ab_1 80 | - llama-index-llms-openai=0.3.19=pyhd8ed1ab_0 81 | - llama-index-multi-modal-llms-openai=0.4.3=pyhd8ed1ab_0 82 | - llama-index-program-openai=0.3.1=pyhd8ed1ab_1 83 | - llama-index-question-gen-openai=0.3.0=pyhd8ed1ab_1 84 | - llama-index-readers-file=0.4.5=pyhd8ed1ab_0 85 | - llama-index-readers-llama-parse=0.4.0=pyhd8ed1ab_1 86 | - llama-parse=0.6.1=pyhd8ed1ab_0 87 | - llamaindex-py-client=0.1.19=pyhd8ed1ab_1 88 | - marshmallow=3.26.1=pyhd8ed1ab_0 89 | - multidict=6.1.0=py311h2dc5d0c_2 90 | - mypy_extensions=1.0.0=pyha770c72_1 91 | - ncurses=6.5=h2d0b736_3 92 | - nest-asyncio=1.6.0=pyhd8ed1ab_1 93 | - networkx=3.4.2=pyh267e887_2 94 | - nltk=3.9.1=pyhd8ed1ab_1 95 | - numpy=2.2.3=py311h5d046bc_0 96 | - openai=1.63.0=pyhd8ed1ab_0 97 | - openjpeg=2.5.3=h5fbd93e_0 98 | - openssl=3.4.1=h7b32b05_0 99 | - packaging=24.2=pyhd8ed1ab_2 100 | - pandas=2.2.3=py311h7db5c69_1 101 | - pip=25.0.1=pyh8b19718_0 102 | - propcache=0.2.1=py311h2dc5d0c_1 103 | - pthread-stubs=0.4=hb9d3cd8_1002 104 | - pycparser=2.22=pyh29332c3_1 105 | - pydantic=2.10.6=pyh3cfb1c2_0 106 | - pydantic-core=2.27.2=py311h9e33e62_0 107 | - pypdf=5.3.0=pyh29332c3_0 108 | - pysocks=1.7.1=pyha55dd90_7 109 | - python=3.11.11=h9e4cc4f_1_cpython 110 | - python-dateutil=2.9.0.post0=pyhff2d567_1 111 | - python-dotenv=1.0.1=pyhd8ed1ab_1 112 | - python-tzdata=2025.1=pyhd8ed1ab_0 113 | - python_abi=3.11=5_cp311 114 | - pytz=2024.1=pyhd8ed1ab_0 115 | - pyyaml=6.0.2=py311h2dc5d0c_2 116 | - readline=8.2=h8228510_1 117 | - regex=2024.11.6=py311h9ecbd09_0 118 | - requests=2.32.3=pyhd8ed1ab_1 119 | - setuptools=75.8.0=pyhff2d567_0 120 | - six=1.17.0=pyhd8ed1ab_0 121 | - sniffio=1.3.1=pyhd8ed1ab_1 122 | - soupsieve=2.5=pyhd8ed1ab_1 123 | - 
sqlalchemy=2.0.38=py311h9ecbd09_0 124 | - striprtf=0.0.26=pyhd8ed1ab_0 125 | - tenacity=8.5.0=pyhd8ed1ab_0 126 | - tiktoken=0.9.0=py311hf1706b8_0 127 | - tk=8.6.13=noxft_h4845f30_101 128 | - tqdm=4.67.1=pyhd8ed1ab_1 129 | - typing-extensions=4.12.2=hd8ed1ab_1 130 | - typing_extensions=4.12.2=pyha770c72_1 131 | - typing_inspect=0.9.0=pyhd8ed1ab_1 132 | - tzdata=2025a=h78e105d_0 133 | - urllib3=2.3.0=pyhd8ed1ab_0 134 | - wheel=0.45.1=pyhd8ed1ab_1 135 | - wrapt=1.17.2=py311h9ecbd09_0 136 | - xorg-libxau=1.0.12=hb9d3cd8_0 137 | - xorg-libxdmcp=1.1.5=hb9d3cd8_0 138 | - yaml=0.2.5=h7f98852_2 139 | - yarl=1.18.3=py311h2dc5d0c_1 140 | - zstandard=0.23.0=py311hbc35293_1 141 | - zstd=1.5.6=ha6fb4c9_0 142 | - pip: 143 | - aiofiles==23.2.1 144 | - aioitertools==0.12.0 145 | - aiosqlite==0.21.0 146 | - alembic==1.14.1 147 | - arize-phoenix==7.12.2 148 | - arize-phoenix-evals==0.20.3 149 | - arize-phoenix-otel==0.7.1 150 | - arxiv==2.1.3 151 | - authlib==1.4.1 152 | - azure-ai-documentintelligence==1.0.0 153 | - azure-core==1.32.0 154 | - azure-identity==1.20.0 155 | - biopython==1.85 156 | - cachetools==5.5.1 157 | - cobble==0.1.4 158 | - coloredlogs==15.0.1 159 | - cryptography==44.0.1 160 | - defusedxml==0.7.1 161 | - et-xmlfile==2.0.0 162 | - fastapi==0.115.8 163 | - fastembed==0.5.1 164 | - feedparser==6.0.11 165 | - ffmpy==0.5.0 166 | - filelock==3.17.0 167 | - flatbuffers==25.2.10 168 | - googleapis-common-protos==1.67.0 169 | - gradio==5.16.0 170 | - gradio-client==1.7.0 171 | - graphql-core==3.2.6 172 | - grpc-interceptor==0.15.4 173 | - grpcio==1.70.0 174 | - grpcio-tools==1.70.0 175 | - huggingface-hub==0.28.1 176 | - humanfriendly==10.0 177 | - importlib-metadata==8.5.0 178 | - isodate==0.7.2 179 | - jinja2==3.1.4 180 | - jsonpath-python==1.0.6 181 | - llama-index-embeddings-fastembed==0.3.0 182 | - llama-index-embeddings-huggingface==0.5.1 183 | - llama-index-llms-azure-openai==0.3.2 184 | - llama-index-llms-mistralai==0.3.2 185 | - llama-index-llms-ollama==0.5.4 186 | - llama-index-tools-arxiv==0.3.0 187 | - llama-index-vector-stores-qdrant==0.4.3 188 | - loguru==0.7.3 189 | - lxml==5.3.1 190 | - mako==1.3.9 191 | - mammoth==1.9.0 192 | - markdown-it-py==3.0.0 193 | - markdownify==0.14.1 194 | - markitdown==0.0.1a4 195 | - markupsafe==2.1.5 196 | - mdurl==0.1.2 197 | - mistralai==1.5.0 198 | - mmh3==4.1.0 199 | - mpmath==1.3.0 200 | - msal==1.31.1 201 | - msal-extensions==1.2.0 202 | - nvidia-cublas-cu12==12.4.5.8 203 | - nvidia-cuda-cupti-cu12==12.4.127 204 | - nvidia-cuda-nvrtc-cu12==12.4.127 205 | - nvidia-cuda-runtime-cu12==12.4.127 206 | - nvidia-cudnn-cu12==9.1.0.70 207 | - nvidia-cufft-cu12==11.2.1.3 208 | - nvidia-curand-cu12==10.3.5.147 209 | - nvidia-cusolver-cu12==11.6.1.9 210 | - nvidia-cusparse-cu12==12.3.1.170 211 | - nvidia-cusparselt-cu12==0.6.2 212 | - nvidia-nccl-cu12==2.21.5 213 | - nvidia-nvjitlink-cu12==12.4.127 214 | - nvidia-nvtx-cu12==12.4.127 215 | - olefile==0.47 216 | - ollama==0.4.8 217 | - onnxruntime==1.20.1 218 | - openinference-instrumentation==0.1.22 219 | - openinference-instrumentation-llama-index==3.2.0 220 | - openinference-semantic-conventions==0.1.14 221 | - openpyxl==3.1.5 222 | - opentelemetry-api==1.30.0 223 | - opentelemetry-exporter-otlp==1.30.0 224 | - opentelemetry-exporter-otlp-proto-common==1.30.0 225 | - opentelemetry-exporter-otlp-proto-grpc==1.30.0 226 | - opentelemetry-exporter-otlp-proto-http==1.30.0 227 | - opentelemetry-instrumentation==0.51b0 228 | - opentelemetry-proto==1.30.0 229 | - opentelemetry-sdk==1.30.0 230 | - 
opentelemetry-semantic-conventions==0.51b0 231 | - orjson==3.10.15 232 | - pathvalidate==3.2.3 233 | - pdfminer-six==20240706 234 | - pillow==10.4.0 235 | - portalocker==2.10.1 236 | - protobuf==5.29.3 237 | - psutil==7.0.0 238 | - puremagic==1.28 239 | - py-rust-stemmers==0.1.3 240 | - pyarrow==19.0.0 241 | - pydub==0.25.1 242 | - pygments==2.19.1 243 | - pyjwt==2.10.1 244 | - python-multipart==0.0.20 245 | - python-pptx==1.0.2 246 | - qdrant-client==1.13.2 247 | - rich==13.9.4 248 | - ruff==0.9.6 249 | - safehttpx==0.1.6 250 | - safetensors==0.5.2 251 | - scikit-learn==1.6.1 252 | - scipy==1.15.1 253 | - semantic-version==2.10.0 254 | - sentence-transformers==3.4.1 255 | - sgmllib3k==1.0.0 256 | - shellingham==1.5.4 257 | - speechrecognition==3.14.1 258 | - sqlean-py==3.47.0 259 | - starlette==0.45.3 260 | - strawberry-graphql==0.253.1 261 | - sympy==1.13.1 262 | - threadpoolctl==3.5.0 263 | - tokenizers==0.21.0 264 | - tomlkit==0.13.2 265 | - torch==2.6.0 266 | - torchaudio==2.6.0 267 | - torchvision==0.21.0 268 | - transformers==4.48.3 269 | - triton==3.2.0 270 | - typer==0.15.1 271 | - uvicorn==0.34.0 272 | - websockets==14.2 273 | - xlrd==2.0.1 274 | - xlsxwriter==3.2.2 275 | - youtube-transcript-api==0.6.3 276 | - zipp==3.21.0 -------------------------------------------------------------------------------- /flowchart.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AstraBert/PapersChat/1d91b36f92071b7f1e002204735a7bbba385f406/flowchart.png -------------------------------------------------------------------------------- /local_setup.ps1: -------------------------------------------------------------------------------- 1 | docker compose up db -d 2 | 3 | conda env create -f environment.yml 4 | 5 | conda activate papers-chat 6 | python3 scripts/app.py 7 | conda deactivate -------------------------------------------------------------------------------- /local_setup.sh: -------------------------------------------------------------------------------- 1 | docker compose up db -d 2 | 3 | conda env create -f environment.yml 4 | 5 | conda activate papers-chat 6 | python3 scripts/app.py 7 | conda deactivate -------------------------------------------------------------------------------- /logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AstraBert/PapersChat/1d91b36f92071b7f1e002204735a7bbba385f406/logo.png -------------------------------------------------------------------------------- /scripts/app.py: -------------------------------------------------------------------------------- 1 | from utils import ingest_documents, qdrant_client, List, QdrantVectorStore, VectorStoreIndex, embedder 2 | import sys 3 | import gradio as gr 4 | from toolsFunctions import pubmed_tool, arxiv_tool 5 | from llama_index.core.tools import QueryEngineTool, FunctionTool 6 | from llama_index.core import Settings 7 | from llama_index.llms.mistralai import MistralAI 8 | from llama_index.llms.azure_openai import AzureOpenAI 9 | from llama_index.llms.ollama import Ollama 10 | from llama_index.core.llms import ChatMessage 11 | from llama_index.core.agent import ReActAgent 12 | from dotenv import load_dotenv 13 | from phoenix.otel import register 14 | from openinference.instrumentation.llama_index import LlamaIndexInstrumentor 15 | import time 16 | import os 17 | 18 | load_dotenv() 19 | 20 | ## Observing and tracing 21 | PHOENIX_API_KEY = os.getenv("phoenix_api_key") 22 | 
os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={PHOENIX_API_KEY}" 23 | os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com" 24 | tracer_provider = register( 25 | project_name="llamaindex", 26 | ) 27 | LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider) 28 | 29 | ## Globals 30 | if os.getenv("mistral_api_key", None) is not None: 31 | Settings.llm = MistralAI(model="mistral-small-latest", temperature=0, api_key=os.getenv("mistral_api_key")) 32 | elif os.getenv("azure_openai_api_key", None) is not None: 33 | Settings.llm = AzureOpenAI(model="gpt-4.1", temperature=0, api_key=os.getenv("azure_openai_api_key")) 34 | elif os.getenv("ollama_model", None) is not None: 35 | Settings.llm = Ollama(model=os.getenv("ollama_model")) 36 | else: 37 | print("ERROR! No supported LLM can be loaded in PapersChat. Exiting...") 38 | sys.exit(1) 39 | 40 | Settings.embed_model = embedder 41 | arxivtool = FunctionTool.from_defaults(arxiv_tool, name="arxiv_tool", description="A tool to search ArXiv (pre-print papers database) for specific papers") 42 | pubmedtool = FunctionTool.from_defaults(pubmed_tool, name="pubmed_tool", description="A tool to search PubMed (printed medical papers database) for specific papers") 43 | query_engine = None 44 | message_history = [ 45 | ChatMessage(role="system", content="You are a useful assistant that has to help the user with questions that they ask about several papers they uploaded. You should base your answers on the context you can retrieve from the PDFs and, if you cannot retrieve any, search ArXiv for a potential answer. If you cannot find any viable answer, please reply that you do not know the answer to the user's question") 46 | ] 47 | 48 | ## Functions 49 | def reply(message, history, files: List[str] | None, collection: str | None, llamaparse: bool = False): 50 | global message_history 51 | if message == "" or message is None: 52 | response = "You should provide a message" 53 | r = "" 54 | for char in response: 55 | r+=char 56 | time.sleep(0.001) 57 | yield r 58 | elif files is None and collection == "": 59 | res = "### WARNING! You did not specify any collection, so I only interrogated ArXiv and/or PubMed to answer your question\n\n" 60 | agent = ReActAgent.from_tools(tools=[pubmedtool, arxivtool], verbose=True) 61 | response = agent.chat(message = message, chat_history = message_history) 62 | response = str(response) 63 | message_history.append(ChatMessage(role="user", content=message)) 64 | message_history.append(ChatMessage(role="assistant", content=response)) 65 | response = res + response 66 | r = "" 67 | for char in response: 68 | r+=char 69 | time.sleep(0.001) 70 | yield r 71 | elif files is None and collection != "" and collection not in [c.name for c in qdrant_client.get_collections().collections]: 72 | response = "Make sure that the name of the existing collection to use as a knowledge base is correct, because the one you provided does not exist! You can check your existing collections and their features in the dedicated tab of the app :)" 73 | r = "" 74 | for char in response: 75 | r+=char 76 | time.sleep(0.001) 77 | yield r 78 | elif files is not None: 79 | if collection == "": 80 | response = "You should provide a collection name (new or existing) if you want to ingest files!" 
81 | r = "" 82 | for char in response: 83 | r+=char 84 | time.sleep(0.001) 85 | yield r 86 | else: 87 | collection_name = collection 88 | index = ingest_documents(files, collection_name, llamaparse) 89 | query_engine = index.as_query_engine() 90 | rag_tool = QueryEngineTool.from_defaults(query_engine, name="papers_rag", description="A RAG engine with information from selected scientific papers") 91 | agent = ReActAgent.from_tools(tools=[rag_tool, pubmedtool, arxivtool], verbose=True) 92 | response = agent.chat(message = message, chat_history = message_history) 93 | response = str(response) 94 | message_history.append(ChatMessage(role="user", content=message)) 95 | message_history.append(ChatMessage(role="assistant", content=response)) 96 | r = "" 97 | for char in response: 98 | r+=char 99 | time.sleep(0.001) 100 | yield r 101 | else: 102 | vector_store = QdrantVectorStore(client = qdrant_client, collection_name=collection, enable_hybrid=True) 103 | index = VectorStoreIndex.from_vector_store(vector_store=vector_store) 104 | query_engine = index.as_query_engine() 105 | rag_tool = QueryEngineTool.from_defaults(query_engine, name="papers_rag", description="A RAG engine with information from selected scientific papers") 106 | agent = ReActAgent.from_tools(tools=[rag_tool, pubmedtool, arxivtool], verbose=True) 107 | response = agent.chat(message = message, chat_history = message_history) 108 | response = str(response) 109 | message_history.append(ChatMessage(role="user", content=message)) 110 | message_history.append(ChatMessage(role="assistant", content=response)) 111 | r = "" 112 | for char in response: 113 | r+=char 114 | time.sleep(0.001) 115 | yield r 116 | 117 | def to_markdown_color(grade: str): 118 | colors = {"red": "ff0000", "yellow": "ffcc00", "green": "33cc33"} 119 | mdcode = f"![#{colors[grade]}](https://placehold.co/15x15/{colors[grade]}/{colors[grade]}.png)" 120 | return mdcode 121 | 122 | def get_qdrant_collections_dets(): 123 | collections = [c.name for c in qdrant_client.get_collections().collections] 124 | details = [] 125 | counter = 0 126 | for collection in collections: 127 | counter += 1 128 | dets = qdrant_client.get_collection(collection) 129 | p = f"### {counter}. {collection}\n\n**Number of Points**: {dets.points_count}\n\n**Status**: {to_markdown_color(dets.status)} {dets.status}\n\n" 130 | details.append(p) 131 | final_text = "
## Available Collections
\n\n" 132 | final_text += "\n\n".join(details) 133 | return final_text 134 | 135 | ## Frontend 136 | accordion = gr.Accordion(label="⚠️Set up these parameters before you start chatting!⚠️") 137 | 138 | iface1 = gr.ChatInterface(fn=reply, additional_inputs=[gr.File(label="Upload Papers (only PDF allowed!)", file_count="multiple", file_types=[".pdf","pdf",".PDF","PDF"], value=None), gr.Textbox(label="Collection", info="Upload your papers to a collection (new or existing)", value=""), gr.Checkbox(label="Use LlamaParse", info="Needs the LlamaCloud API key", value=False)], additional_inputs_accordion=accordion) 139 | u = open("usage.md") 140 | content = u.read() 141 | u.close() 142 | iface2 = gr.Blocks() 143 | with iface2: 144 | with gr.Row(): 145 | gr.Markdown(content) 146 | iface3 = gr.Interface(fn=get_qdrant_collections_dets, inputs=None, outputs=gr.Markdown(label="Collections"), submit_btn="See your collections") 147 | iface = gr.TabbedInterface([iface1, iface2, iface3], ["Chat💬", "Usage Guide⚙️", "Your Collections🔎"], title="PapersChat📝") 148 | iface.launch(server_name="0.0.0.0", server_port=7860) -------------------------------------------------------------------------------- /scripts/toolsFunctions.py: -------------------------------------------------------------------------------- 1 | import urllib, urllib.request 2 | from pydantic import Field 3 | from datetime import datetime 4 | from markitdown import MarkItDown 5 | from Bio import Entrez 6 | import xml.etree.ElementTree as ET 7 | 8 | md = MarkItDown() 9 | 10 | def format_today(): 11 | d = datetime.now() 12 | if d.month < 10: 13 | month = f"0{d.month}" 14 | else: 15 | month = d.month 16 | if d.day < 10: 17 | day = f"0{d.day}" 18 | else: 19 | day = d.day 20 | if d.hour < 10: 21 | hour = f"0{d.hour}" 22 | else: 23 | hour = d.hour 24 | if d.minute < 10: 25 | minute = f"0{d.hour}" 26 | else: 27 | minute = d.minute 28 | today = f"{d.year}{month}{day}{hour}{minute}" 29 | two_years_ago = f"{d.year-2}{month}{day}{hour}{minute}" 30 | return today, two_years_ago 31 | 32 | def arxiv_tool(search_query: str = Field(description="The query with which to search ArXiv database")): 33 | """A tool to search ArXiv""" 34 | today, two_years_ago = format_today() 35 | query = search_query.replace(" ", "+") 36 | url = f'http://export.arxiv.org/api/query?search_query=all:{query}&submittedDate:[{two_years_ago}+TO+{today}]&start=0&max_results=3' 37 | data = urllib.request.urlopen(url) 38 | content = data.read().decode("utf-8") 39 | f = open("arxiv_results.xml", "w") 40 | f.write(content) 41 | f.close() 42 | result = md.convert("arxiv_results.xml") 43 | return result.text_content 44 | 45 | def search_pubmed(query): 46 | Entrez.email = "astraberte9@gmail.com" # Replace with your email 47 | handle = Entrez.esearch(db="pubmed", term=query, retmax=3) 48 | record = Entrez.read(handle) 49 | handle.close() 50 | return record["IdList"] 51 | 52 | def fetch_pubmed_details(pubmed_ids): 53 | Entrez.email = "your.personal@email.com" # Replace with your email 54 | handle = Entrez.efetch(db="pubmed", id=pubmed_ids, rettype="medline", retmode="xml") 55 | records = handle.read() 56 | handle.close() 57 | recs = records.decode("utf-8") 58 | f = open("biomed_results.xml", "w") 59 | f.write(recs) 60 | f.close() 61 | 62 | def fetch_xml(): 63 | tree = ET.parse("biomed_results.xml") 64 | root = tree.getroot() 65 | parsed_articles = [] 66 | for article in root.findall('PubmedArticle'): 67 | # Extract title 68 | title = article.find('.//ArticleTitle') 69 | title_text = title.text if 
--------------------------------------------------------------------------------
/scripts/utils.py:
--------------------------------------------------------------------------------
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings
from llama_index.llms.mistralai import MistralAI
from qdrant_client import QdrantClient
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import StorageContext
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_cloud_services import LlamaParse
from dotenv import load_dotenv
from typing import List
import torch
import os


load_dotenv()


qdrant_client = QdrantClient("http://localhost:6333")
device = "cuda" if torch.cuda.is_available() else "cpu"
embedder = HuggingFaceEmbedding(model_name="nomic-ai/modernbert-embed-base", device=device)
Settings.embed_model = embedder

def ingest_documents(files: List[str], collection_name: str, llamaparse: bool = True):
    # Signature fixed from `llamaparse: True`, which annotated the argument
    # with the literal True instead of typing it and giving it a default
    vector_store = QdrantVectorStore(client=qdrant_client, collection_name=collection_name, enable_hybrid=True)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    if llamaparse:
        parser = LlamaParse(
            result_type="markdown",
            api_key=os.getenv("llamacloud_api_key")
        )
        file_extractor = {".pdf": parser}
        documents = SimpleDirectoryReader(input_files=files, file_extractor=file_extractor).load_data()
    else:
        documents = SimpleDirectoryReader(input_files=files).load_data()
    index = VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context,
    )
    return index
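For reference, the ingestion pipeline in `scripts/utils.py` can also be exercised without the UI. A minimal sketch, under stated assumptions: the Qdrant container defined in `compose.yaml` is up on `localhost:6333`, and `sample.pdf` is a hypothetical local file:

```python
from utils import ingest_documents

# Ingest one PDF into a (new or existing) collection without LlamaParse,
# so no LlamaCloud API key is needed
index = ingest_documents(files=["sample.pdf"], collection_name="demo-papers", llamaparse=False)

# Dense-retrieval sanity check, using the embedder configured in utils.py
retriever = index.as_retriever(similarity_top_k=2)
for node in retriever.retrieve("What problem does this paper address?"):
    print(node.score, node.get_content()[:200])
```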

--------------------------------------------------------------------------------
/start_services.ps1:
--------------------------------------------------------------------------------
docker compose up db -d
docker compose up app -d
--------------------------------------------------------------------------------
/start_services.sh:
--------------------------------------------------------------------------------
docker compose up db -d
docker compose up app -d
--------------------------------------------------------------------------------
/usage.md:
--------------------------------------------------------------------------------
# PapersChat Usage Guide

If you find PapersChat useful, please consider supporting us with a donation:

*(GitHub Sponsors badge)*

> _This guide only covers how to use **the app**: it does not explain how to install or launch it, or how it works internally. For that, please refer to the [GitHub repository](https://github.com/AstraBert/PapersChat)_

## Use PapersChat with your documents

If you have papers that you would like to chat with, this is the right section of the guide!

To chat with your papers, you need to upload them (**as PDF files**) through the dedicated "Upload Papers" widget at the bottom of the chat interface: you can upload one or more files there (remember: the more you upload, the slower the processing will be).

Once you have uploaded the files, and before submitting them, you have to do two more things:

1. Specify the collection that you want to upload the documents to (in the "Collection" area)
2. Write your first question/message to interrogate your papers (in the message input space)

As for point (1), you can give your collection whatever name you want: once you have created a new collection, you can always reuse it in the future simply by entering the same name. If you do not remember all your collections, go to the "Your Collections" tab of the application and click on "See your collections" to list them.

Point (2) is very important: if you do not send any message, PapersChat will tell you that you need to send one.

Once you have uploaded the papers, specified the collection and written the message, you can send the message and PapersChat will:

- Ingest your documents
- Produce an answer to your question

Congrats! You now have your first collection and your first answer!

> _**NOTE**: there is one more option we haven't covered yet, i.e. the "LlamaParse" checkbox. Selecting it enables LlamaParse, a tool that LlamaIndex offers [as part of its LlamaCloud services](https://docs.llamaindex.ai/en/stable/llama_cloud/llama_parse/). LlamaParse employs enhanced parsing techniques to produce clean and well-structured data from (often messy) unstructured documents: the free tier allows parsing 1000 pages/day. While this approach generates very good data for your collections, keep in mind that it might take quite some time to parse your documents (especially if they are dense, contain lots of text in images, or are very long). By default, the LlamaParse option is disabled._

## Use PapersChat with a collection as knowledge base

Once you have uploaded all your documents, you might want to interrogate them without having to upload more. That's where the "collection as knowledge base" option comes in handy. Simply send a message with one of your existing collections selected as the knowledge base (without uploading any file) and... BAM! PapersChat replies to your questions :)

## Use PapersChat to interrogate PubMed/ArXiv

PapersChat also has access to the PubMed and ArXiv paper archives: if you do not specify a collection name and do not upload any files, PapersChat uses your question to search these two online databases for an answer.
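If you are curious about what happens under the hood in this fallback mode, the ArXiv call made by `arxiv_tool` in `scripts/toolsFunctions.py` can be replayed directly (a minimal sketch, assuming network access; the query string is just an example):

```python
import urllib.request

# Mirror the request arxiv_tool sends: up to 3 matches, returned as an Atom XML feed
query = "protein+folding"  # spaces replaced with '+'
url = f"http://export.arxiv.org/api/query?search_query=all:{query}&start=0&max_results=3"
with urllib.request.urlopen(url) as resp:
    print(resp.read().decode("utf-8")[:500])  # first 500 characters of the feed
```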
## Monitor your collections

Under the "Your Collections" tab of the application you can, by clicking on "See your collections", list your collections: you can see how many data points each collection contains (these data points **do not correspond** to the number of papers you uploaded, since each paper is split into several chunks during ingestion) and what the status of each collection is.

A brief guide to the collection statuses:

- "green": the collection is optimized and searchable
- "yellow": the collection is being optimized, but you can still search it
- "red": the collection is not optimized and will probably return an error if you try to search it
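These status colors map onto Qdrant's collection statuses, so you can also check them outside of the app. A minimal sketch, assuming the Qdrant container from `compose.yaml` is reachable on `localhost:6333`:

```python
from qdrant_client import QdrantClient

client = QdrantClient("http://localhost:6333")
for coll in client.get_collections().collections:
    info = client.get_collection(coll.name)
    # points_count counts indexed chunks, not uploaded papers
    print(f"{coll.name}: {info.points_count} points, status={info.status}")
```
--------------------------------------------------------------------------------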