├── .env.example
├── .gitignore
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── compose.yaml
├── docker
│   ├── Dockerfile
│   ├── app.py
│   ├── conda_env.sh
│   ├── environment.yml
│   ├── getSecrets.py
│   ├── run.sh
│   ├── toolsFunctions.py
│   ├── usage.md
│   └── utils.py
├── environment.yml
├── flowchart.png
├── local_setup.ps1
├── local_setup.sh
├── logo.png
├── scripts
│   ├── app.py
│   ├── toolsFunctions.py
│   └── utils.py
├── start_services.ps1
├── start_services.sh
└── usage.md
/.env.example:
--------------------------------------------------------------------------------
1 | llamacloud_api_key="llx-xxx"
2 | mistral_api_key="*******************abc"
3 | phoenix_api_key="*******************def"
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | *.xml
2 | .env
3 | papers-parser.code-workspace
4 | data/
5 | qdrant_storage/
6 | scripts/__pycache__/
7 | huggingface_spaces/
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # Contributing to PapersChat
2 |
3 | Do you want to contribute to this project? Make sure to read these guidelines first :)
4 |
5 | ## Issue
6 |
7 | **When to do it**:
8 |
9 | - You found bugs but you don't know how to solve them, or don't have the time/will to fix them
10 | - You want new features but you don't know how to implement them, or don't have the time/will to implement them
11 |
12 | > ⚠️ _Always check open and closed issues before you submit yours to avoid duplicates_
13 |
14 | **How to do it**:
15 |
16 | - Open an issue
17 | - Give the issue a meaningful title (short but effective problem description)
18 | - Describe the problem following the issue template
19 |
20 | ## Traditional contribution
21 |
22 | **When to do it**:
23 |
24 | - You found bugs and corrected them
25 | - You optimized/improved the code
26 | - You added new features that you think could be useful to others
27 |
28 | **How to do it**:
29 |
30 | 1. Fork this repository
31 | 2. Commit your changes
32 | 3. Submit a pull request (make sure to provide a thorough description of the changes)
33 |
34 |
35 | ## Showcase your PapersChat
36 |
37 | **When to do it**:
38 |
39 | - You modified the base application with new features, but you don't want to (or can't) merge them into the original PapersChat
40 |
41 | **How to do it**:
42 |
43 | - Go to the [_GitHub Discussions > Show and tell_](https://github.com/AstraBert/PrAIvateSearch/discussions/categories/show-and-tell) page
44 | - Open a new discussion there, describing your PapersChat application
45 |
46 | ### Thanks for contributing!
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2025 Clelia (Astra) Bertelli
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # PapersChat
2 |
3 | Chatting With Papers Made Easy
4 |
5 | If you find PapersChat useful, please consider supporting us through a donation:
6 |
7 | [donation badge]
8 |
9 |
10 |
11 |
12 |
13 |
14 | **PapersChat** is an agentic AI application that allows you to chat with your papers and gather also information from papers on ArXiv and on PubMed. It is powered by [LlamaIndex](https://www.llamaindex.ai/), [Qdrant](https://qdrant.tech) and [Mistral AI](https://mistral.ai/en).
15 |
16 | ### Flowchart
17 |
18 | ![Flowchart](./flowchart.png)
19 |
20 |
21 |
22 | ### Install and launch it
23 |
24 | Installation is a single process for both launch paths: you simply have to clone the GitHub repository:
25 |
26 | ```bash
27 | git clone https://github.com/AstraBert/PapersChat.git
28 | cd PapersChat/
29 | ```
30 |
31 | To launch the app, you can follow two paths:
32 |
33 | **1. Docker (recommended)**
34 |
35 | > _Required: [Docker](https://docs.docker.com/desktop/) and [docker compose](https://docs.docker.com/compose/)_
36 |
37 | - Add the `mistral_api_key`, `phoenix_api_key` and `llamacloud_api_key` variables to the [`.env.example`](./docker/.env.example) file and rename the file to `.env`. Get these keys:
38 | + [On Mistral AI](https://console.mistral.ai/api-keys/)
39 | + [On LlamaCloud](https://cloud.llamaindex.ai/)
40 | + [On Phoenix/Arize](https://llamatrace.com/projects)
41 |
42 | ```bash
43 | # after filling in your API keys, rename the file
44 | mv .env.example .env
45 | ```
46 |
47 | - Launch the docker application:
48 |
49 | ```bash
50 | # If you are on Linux/macOS
51 | bash start_services.sh
52 | # If you are on Windows
53 | .\start_services.ps1
54 | ```
55 |
56 | You will see the application running on http://localhost:7860 and you will be able to use it. Depending on your connection and your hardware, the setup might take some time (up to 30 minutes), but only the first time you run it!
57 |
58 | **2. Source code**
59 |
60 | > _Required: [Docker](https://docs.docker.com/desktop/), [docker compose](https://docs.docker.com/compose/) and [conda](https://anaconda.org/anaconda/conda)_
61 |
62 | - Add the `mistral_api_key`, `phoenix_api_key` and `llamacloud_api_key` variables to the [`.env.example`](./docker/.env.example) file and rename the file to `.env`. Get these keys:
63 | + [On Mistral AI](https://console.mistral.ai/api-keys/)
64 | + [On LlamaCloud](https://cloud.llamaindex.ai/)
65 | + [On Phoenix/Arize](https://llamatrace.com/projects)
66 |
67 | ```bash
68 | mv .env.example .env
69 | # modify the variables, e.g.:
70 | # llamacloud_api_key="llx-000-abc"
71 | # mistral_api_key="01234abc"
72 | # phoenix_api_key="56789def"
73 | ```
74 |
75 | - Alternatively, if you wish to use Azure OpenAI or Ollama, specify:
76 |
77 | ```bash
78 | azure_openai_api_key="***" # if you wish to use Azure OpenAI
79 | ollama_model="gemma3:latest" # if you wish to use Ollama
80 | ```
81 |
82 | >[!IMPORTANT]
83 | > _This is only possible when launching from the source code; the Docker launch does not support this option_
84 |
85 | - Set up PapersChat using the dedicated script:
86 |
87 | ```bash
88 | # For MacOs/Linux users
89 | bash local_setup.sh
90 | # For Windows users
91 | .\local_setup.ps1
92 | ```
93 |
94 | - Or you can do it manually, if you prefer:
95 |
96 | ```bash
97 | docker compose up db -d
98 |
99 | conda env create -f environment.yml
100 |
101 | conda activate papers-chat
102 | python3 scripts/app.py
103 | conda deactivate
104 | ```
105 |
106 | ## Contributing
107 |
108 | Contributions are always welcome! Follow the contribution guidelines reported [here](CONTRIBUTING.md).
109 |
110 | ## License and rights of usage
111 |
112 | The software is provided under MIT [license](./LICENSE).
113 |
114 | ### Full documentation will come soon! 👷‍♀️
115 |
116 |
--------------------------------------------------------------------------------
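A side note on the configuration step in the README above: the `.env` file follows the standard dotenv `key="value"` format, and the project's environment ships `python-dotenv` to load it. A minimal stdlib sketch of how such a file is parsed (variable names mirror `.env.example`; the key values here are placeholders, not real keys):

```python
import os

# Placeholder .env content in the format of .env.example (values are fake)
env_text = 'mistral_api_key="01234abc"\nphoenix_api_key="56789def"\n'

# Minimal parser: split each line on the first '=' and strip the quotes.
# python-dotenv (pinned in environment.yml) does this far more robustly.
for line in env_text.splitlines():
    key, _, value = line.partition("=")
    if key:
        os.environ[key] = value.strip('"')

print(os.environ["mistral_api_key"])  # 01234abc
```

In the Docker path the same three keys travel as Compose secrets instead, mounted as files under `/run/secrets/` and read by `docker/getSecrets.py`.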
/compose.yaml:
--------------------------------------------------------------------------------
1 | name: papers_chat
2 |
3 | services:
4 | app:
5 | build:
6 | context: ./docker/
7 | dockerfile: Dockerfile
8 | ports:
9 | - 7860:7860
10 | secrets:
11 | - mistral
12 | - phoenix
13 | - llamacloud
14 | networks:
15 | - internal_net
16 | db:
17 | image: qdrant/qdrant
18 | ports:
19 | - 6333:6333
20 | - 6334:6334
21 | volumes:
22 | - "./qdrant_storage:/qdrant/storage"
23 | networks:
24 | - internal_net
25 |
26 | networks:
27 | internal_net:
28 | driver: bridge
29 | driver_opts:
30 | com.docker.network.bridge.host_binding_ipv4: "127.0.0.1"
31 |
32 | secrets:
33 | mistral:
34 | environment: mistral_api_key
35 | phoenix:
36 | environment: phoenix_api_key
37 | llamacloud:
38 | environment: llamacloud_api_key
--------------------------------------------------------------------------------
/docker/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM condaforge/miniforge3
2 |
3 | WORKDIR /app/
4 | COPY . /app/
5 | RUN bash /app/conda_env.sh
6 |
7 | EXPOSE 7860
8 | CMD ["bash", "/app/run.sh"]
--------------------------------------------------------------------------------
/docker/app.py:
--------------------------------------------------------------------------------
1 | from utils import ingest_documents, qdrant_client, List, QdrantVectorStore, VectorStoreIndex, embedder
2 | import gradio as gr
3 | from toolsFunctions import pubmed_tool, arxiv_tool
4 | from llama_index.core.tools import QueryEngineTool, FunctionTool
5 | from llama_index.core import Settings
6 | from llama_index.llms.mistralai import MistralAI
7 | from llama_index.core.llms import ChatMessage
8 | from llama_index.core.agent import ReActAgent
9 | from getSecrets import mistral_api_key, phoenix_api_key
10 | from phoenix.otel import register
11 | from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
12 | import time
13 | import os
14 |
15 | ## Observing and tracing
16 | PHOENIX_API_KEY = phoenix_api_key
17 | os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={PHOENIX_API_KEY}"
18 | os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
19 | tracer_provider = register(
20 | project_name="llamaindex",
21 | )
22 | LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)
23 |
24 | ## Globals
25 | Settings.llm = MistralAI(model="mistral-small-latest", temperature=0, api_key=mistral_api_key)
26 | Settings.embed_model = embedder
27 | arxivtool = FunctionTool.from_defaults(arxiv_tool, name="arxiv_tool", description="A tool to search ArXiv (pre-print papers database) for specific papers")
28 | pubmedtool = FunctionTool.from_defaults(pubmed_tool, name="pubmed_tool", description="A tool to search PubMed (printed medical papers database) for specific papers")
29 | query_engine = None
30 | message_history = [
31 | ChatMessage(role="system", content="You are a useful assistant that has to help the user with questions that they ask about several papers they uploaded. You should base your answers on the context you can retrieve from the PDFs and, if you cannot retrieve any, search ArXiv for a potential answer. If you cannot find any viable answer, please reply that you do not know the answer to the user's question")
32 | ]
33 |
34 | ## Functions
35 | def reply(message, history, files: List[str] | None, collection: str | None, llamaparse: bool = False):
36 | global message_history
37 | if message == "" or message is None:
38 | response = "You should provide a message"
39 | r = ""
40 | for char in response:
41 | r+=char
42 | time.sleep(0.001)
43 | yield r
44 | elif files is None and collection == "":
45 | res = "### WARNING! You did not specify any collection, so I only interrogated ArXiv and/or PubMed to answer your question\n\n"
46 | agent = ReActAgent.from_tools(tools=[pubmedtool, arxivtool], verbose=True)
47 | response = agent.chat(message = message, chat_history = message_history)
48 | response = str(response)
49 | message_history.append(ChatMessage(role="user", content=message))
50 | message_history.append(ChatMessage(role="assistant", content=response))
51 | response = res + response
52 | r = ""
53 | for char in response:
54 | r+=char
55 | time.sleep(0.001)
56 | yield r
57 | elif files is None and collection != "" and collection not in [c.name for c in qdrant_client.get_collections().collections]:
58 | response = "Make sure that the name of the existing collection to use as a knowledge base is correct, because the one you provided does not exist! You can check your existing collections and their features in the dedicated tab of the app :)"
59 | r = ""
60 | for char in response:
61 | r+=char
62 | time.sleep(0.001)
63 | yield r
64 | elif files is not None:
65 | if collection == "":
66 | response = "You should provide a collection name (new or existing) if you want to ingest files!"
67 | r = ""
68 | for char in response:
69 | r+=char
70 | time.sleep(0.001)
71 | yield r
72 | else:
73 | collection_name = collection
74 | index = ingest_documents(files, collection_name, llamaparse)
75 | query_engine = index.as_query_engine()
76 | rag_tool = QueryEngineTool.from_defaults(query_engine, name="papers_rag", description="A RAG engine with information from selected scientific papers")
77 | agent = ReActAgent.from_tools(tools=[rag_tool, pubmedtool, arxivtool], verbose=True)
78 | response = agent.chat(message = message, chat_history = message_history)
79 | response = str(response)
80 | message_history.append(ChatMessage(role="user", content=message))
81 | message_history.append(ChatMessage(role="assistant", content=response))
82 | r = ""
83 | for char in response:
84 | r+=char
85 | time.sleep(0.001)
86 | yield r
87 | else:
88 | vector_store = QdrantVectorStore(client = qdrant_client, collection_name=collection, enable_hybrid=True)
89 | index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
90 | query_engine = index.as_query_engine()
91 | rag_tool = QueryEngineTool.from_defaults(query_engine, name="papers_rag", description="A RAG engine with information from selected scientific papers")
92 | agent = ReActAgent.from_tools(tools=[rag_tool, pubmedtool, arxivtool], verbose=True)
93 | response = agent.chat(message = message, chat_history = message_history)
94 | response = str(response)
95 | message_history.append(ChatMessage(role="user", content=message))
96 | message_history.append(ChatMessage(role="assistant", content=response))
97 | r = ""
98 | for char in response:
99 | r+=char
100 | time.sleep(0.001)
101 | yield r
102 |
103 | def to_markdown_color(grade: str):
104 | colors = {"red": "ff0000", "yellow": "ffcc00", "green": "33cc33"}
105 | mdcode = f"![#{colors[grade]}](https://placehold.co/15x15/{colors[grade]}/{colors[grade]}.png)"
106 | return mdcode
107 |
108 | def get_qdrant_collections_dets():
109 | collections = [c.name for c in qdrant_client.get_collections().collections]
110 | details = []
111 | counter = 0
112 | for collection in collections:
113 | counter += 1
114 | dets = qdrant_client.get_collection(collection)
115 | p = f"### {counter}. {collection}\n\n**Number of Points**: {dets.points_count}\n\n**Status**: {to_markdown_color(dets.status)} {dets.status}\n\n"
116 | details.append(p)
117 | final_text = "Available Collections\n\n"
118 | final_text += "\n\n".join(details)
119 | return final_text
120 |
121 | ## Frontend
122 | accordion = gr.Accordion(label="⚠️Set up these parameters before you start chatting!⚠️")
123 |
124 | iface1 = gr.ChatInterface(fn=reply, additional_inputs=[gr.File(label="Upload Papers (only PDF allowed!) - Ingestion", file_count="multiple", file_types=[".pdf","pdf",".PDF","PDF"], value=None), gr.Textbox(label="Collection", info="Upload your papers to a collection (new or existing)", value=""), gr.Checkbox(label="Use LlamaParse", info="Needs the LlamaCloud API key", value=False)], additional_inputs_accordion=accordion)
125 | with open("usage.md") as u:
126 |     content = u.read()
127 |
128 | iface2 = gr.Blocks()
129 | with iface2:
130 | with gr.Row():
131 | gr.Markdown(content)
132 | iface3 = gr.Interface(fn=get_qdrant_collections_dets, inputs=None, outputs=gr.Markdown(label="Collections"), submit_btn="See your collections")
133 | iface = gr.TabbedInterface([iface1, iface2, iface3], ["Chat💬", "Usage Guide⚙️", "Your Collections🔎"], title="PapersChat📝")
134 | iface.launch(server_name="0.0.0.0", server_port=7860)
--------------------------------------------------------------------------------
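A note on the pattern repeated throughout `reply()` above: responses are streamed character by character by yielding progressively longer prefixes of the final string, which Gradio re-renders on every yield to produce a typewriter effect. The pattern in isolation (the per-character delay is parameterized here; `reply()` hard-codes 0.001 s):

```python
import time

def stream_text(response: str, delay: float = 0.0):
    """Yield progressively longer prefixes of `response`, as reply() does."""
    r = ""
    for char in response:
        r += char
        time.sleep(delay)
        yield r

chunks = list(stream_text("Hi!"))
print(chunks)  # ['H', 'Hi', 'Hi!']
```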
/docker/conda_env.sh:
--------------------------------------------------------------------------------
1 | eval "$(conda shell.bash hook)"
2 |
3 | conda env create -f /app/environment.yml
--------------------------------------------------------------------------------
/docker/environment.yml:
--------------------------------------------------------------------------------
1 | name: papers-chat
2 | channels:
3 | - conda-forge
4 | dependencies:
5 | - _libgcc_mutex=0.1=conda_forge
6 | - _openmp_mutex=4.5=2_gnu
7 | - aiohappyeyeballs=2.4.6=pyhd8ed1ab_0
8 | - aiohttp=3.11.12=py311h2dc5d0c_0
9 | - aiosignal=1.3.2=pyhd8ed1ab_0
10 | - annotated-types=0.7.0=pyhd8ed1ab_1
11 | - anyio=4.8.0=pyhd8ed1ab_0
12 | - attrs=25.1.0=pyh71513ae_0
13 | - beautifulsoup4=4.13.3=pyha770c72_0
14 | - brotli-python=1.1.0=py311hfdbb021_2
15 | - bzip2=1.0.8=h4bc722e_7
16 | - ca-certificates=2025.1.31=hbcca054_0
17 | - certifi=2025.1.31=pyhd8ed1ab_0
18 | - cffi=1.17.1=py311hf29c0ef_0
19 | - charset-normalizer=3.4.1=pyhd8ed1ab_0
20 | - click=8.1.8=pyh707e725_0
21 | - colorama=0.4.6=pyhd8ed1ab_1
22 | - dataclasses-json=0.6.7=pyhd8ed1ab_1
23 | - deprecated=1.2.18=pyhd8ed1ab_0
24 | - dirtyjson=1.0.8=pyhd8ed1ab_1
25 | - distro=1.9.0=pyhd8ed1ab_1
26 | - eval-type-backport=0.2.2=pyhd8ed1ab_0
27 | - eval_type_backport=0.2.2=pyha770c72_0
28 | - exceptiongroup=1.2.2=pyhd8ed1ab_1
29 | - filetype=1.2.0=pyhd8ed1ab_0
30 | - freetype=2.12.1=h267a509_2
31 | - frozenlist=1.5.0=py311h2dc5d0c_1
32 | - fsspec=2025.2.0=pyhd8ed1ab_0
33 | - greenlet=3.1.1=py311hfdbb021_1
34 | - h11=0.14.0=pyhd8ed1ab_1
35 | - h2=4.2.0=pyhd8ed1ab_0
36 | - hpack=4.1.0=pyhd8ed1ab_0
37 | - httpcore=1.0.7=pyh29332c3_1
38 | - httpx=0.28.1=pyhd8ed1ab_0
39 | - hyperframe=6.1.0=pyhd8ed1ab_0
40 | - idna=3.10=pyhd8ed1ab_1
41 | - jiter=0.8.2=py311h9e33e62_0
42 | - joblib=1.4.2=pyhd8ed1ab_1
43 | - lcms2=2.17=h717163a_0
44 | - ld_impl_linux-64=2.43=h712a8e2_2
45 | - lerc=4.0.0=h27087fc_0
46 | - libblas=3.9.0=28_h59b9bed_openblas
47 | - libcblas=3.9.0=28_he106b2a_openblas
48 | - libdeflate=1.23=h4ddbbb0_0
49 | - libexpat=2.6.4=h5888daf_0
50 | - libffi=3.4.6=h2dba641_0
51 | - libgcc=14.2.0=h77fa898_1
52 | - libgcc-ng=14.2.0=h69a702a_1
53 | - libgfortran=14.2.0=h69a702a_1
54 | - libgfortran5=14.2.0=hd5240d6_1
55 | - libgomp=14.2.0=h77fa898_1
56 | - libjpeg-turbo=3.0.0=hd590300_1
57 | - liblapack=3.9.0=28_h7ac8fdf_openblas
58 | - liblzma=5.6.4=hb9d3cd8_0
59 | - libnsl=2.0.1=hd590300_0
60 | - libopenblas=0.3.28=pthreads_h94d23a6_1
61 | - libpng=1.6.46=h943b412_0
62 | - libsqlite=3.48.0=hee588c1_1
63 | - libstdcxx=14.2.0=hc0a3c3a_1
64 | - libstdcxx-ng=14.2.0=h4852527_1
65 | - libtiff=4.7.0=hd9ff511_3
66 | - libuuid=2.38.1=h0b41bf4_0
67 | - libwebp-base=1.5.0=h851e524_0
68 | - libxcb=1.17.0=h8a09558_0
69 | - libxcrypt=4.4.36=hd590300_1
70 | - libzlib=1.3.1=hb9d3cd8_2
71 | - llama-cloud=0.1.12=pyhd8ed1ab_0
72 | - llama-cloud-services=0.6.1=pyhd8ed1ab_0
73 | - llama-index=0.12.17=pyhd8ed1ab_0
74 | - llama-index-agent-openai=0.4.5=pyhd8ed1ab_0
75 | - llama-index-cli=0.4.0=pyhd8ed1ab_1
76 | - llama-index-core=0.12.17=pyhd8ed1ab_1
77 | - llama-index-embeddings-openai=0.3.1=pyhd8ed1ab_1
78 | - llama-index-indices-managed-llama-cloud=0.6.4=pyhd8ed1ab_0
79 | - llama-index-legacy=0.9.48.post4=pyhd8ed1ab_1
80 | - llama-index-llms-openai=0.3.19=pyhd8ed1ab_0
81 | - llama-index-multi-modal-llms-openai=0.4.3=pyhd8ed1ab_0
82 | - llama-index-program-openai=0.3.1=pyhd8ed1ab_1
83 | - llama-index-question-gen-openai=0.3.0=pyhd8ed1ab_1
84 | - llama-index-readers-file=0.4.5=pyhd8ed1ab_0
85 | - llama-index-readers-llama-parse=0.4.0=pyhd8ed1ab_1
86 | - llama-parse=0.6.1=pyhd8ed1ab_0
87 | - llamaindex-py-client=0.1.19=pyhd8ed1ab_1
88 | - marshmallow=3.26.1=pyhd8ed1ab_0
89 | - multidict=6.1.0=py311h2dc5d0c_2
90 | - mypy_extensions=1.0.0=pyha770c72_1
91 | - ncurses=6.5=h2d0b736_3
92 | - nest-asyncio=1.6.0=pyhd8ed1ab_1
93 | - networkx=3.4.2=pyh267e887_2
94 | - nltk=3.9.1=pyhd8ed1ab_1
95 | - numpy=2.2.3=py311h5d046bc_0
96 | - openai=1.63.0=pyhd8ed1ab_0
97 | - openjpeg=2.5.3=h5fbd93e_0
98 | - openssl=3.4.1=h7b32b05_0
99 | - packaging=24.2=pyhd8ed1ab_2
100 | - pandas=2.2.3=py311h7db5c69_1
101 | - pip=25.0.1=pyh8b19718_0
102 | - propcache=0.2.1=py311h2dc5d0c_1
103 | - pthread-stubs=0.4=hb9d3cd8_1002
104 | - pycparser=2.22=pyh29332c3_1
105 | - pydantic=2.10.6=pyh3cfb1c2_0
106 | - pydantic-core=2.27.2=py311h9e33e62_0
107 | - pypdf=5.3.0=pyh29332c3_0
108 | - pysocks=1.7.1=pyha55dd90_7
109 | - python=3.11.11=h9e4cc4f_1_cpython
110 | - python-dateutil=2.9.0.post0=pyhff2d567_1
111 | - python-dotenv=1.0.1=pyhd8ed1ab_1
112 | - python-tzdata=2025.1=pyhd8ed1ab_0
113 | - python_abi=3.11=5_cp311
114 | - pytz=2024.1=pyhd8ed1ab_0
115 | - pyyaml=6.0.2=py311h2dc5d0c_2
116 | - readline=8.2=h8228510_1
117 | - regex=2024.11.6=py311h9ecbd09_0
118 | - requests=2.32.3=pyhd8ed1ab_1
119 | - setuptools=75.8.0=pyhff2d567_0
120 | - six=1.17.0=pyhd8ed1ab_0
121 | - sniffio=1.3.1=pyhd8ed1ab_1
122 | - soupsieve=2.5=pyhd8ed1ab_1
123 | - sqlalchemy=2.0.38=py311h9ecbd09_0
124 | - striprtf=0.0.26=pyhd8ed1ab_0
125 | - tenacity=8.5.0=pyhd8ed1ab_0
126 | - tiktoken=0.9.0=py311hf1706b8_0
127 | - tk=8.6.13=noxft_h4845f30_101
128 | - tqdm=4.67.1=pyhd8ed1ab_1
129 | - typing-extensions=4.12.2=hd8ed1ab_1
130 | - typing_extensions=4.12.2=pyha770c72_1
131 | - typing_inspect=0.9.0=pyhd8ed1ab_1
132 | - tzdata=2025a=h78e105d_0
133 | - urllib3=2.3.0=pyhd8ed1ab_0
134 | - wheel=0.45.1=pyhd8ed1ab_1
135 | - wrapt=1.17.2=py311h9ecbd09_0
136 | - xorg-libxau=1.0.12=hb9d3cd8_0
137 | - xorg-libxdmcp=1.1.5=hb9d3cd8_0
138 | - yaml=0.2.5=h7f98852_2
139 | - yarl=1.18.3=py311h2dc5d0c_1
140 | - zstandard=0.23.0=py311hbc35293_1
141 | - zstd=1.5.6=ha6fb4c9_0
142 | - pip:
143 | - aiofiles==23.2.1
144 | - aioitertools==0.12.0
145 | - aiosqlite==0.21.0
146 | - alembic==1.14.1
147 | - arize-phoenix==7.12.2
148 | - arize-phoenix-evals==0.20.3
149 | - arize-phoenix-otel==0.7.1
150 | - arxiv==2.1.3
151 | - authlib==1.4.1
152 | - azure-ai-documentintelligence==1.0.0
153 | - azure-core==1.32.0
154 | - azure-identity==1.20.0
155 | - biopython==1.85
156 | - cachetools==5.5.1
157 | - cobble==0.1.4
158 | - coloredlogs==15.0.1
159 | - cryptography==44.0.1
160 | - defusedxml==0.7.1
161 | - et-xmlfile==2.0.0
162 | - fastapi==0.115.8
163 | - fastembed==0.5.1
164 | - feedparser==6.0.11
165 | - ffmpy==0.5.0
166 | - filelock==3.17.0
167 | - flatbuffers==25.2.10
168 | - googleapis-common-protos==1.67.0
169 | - gradio==5.16.0
170 | - gradio-client==1.7.0
171 | - graphql-core==3.2.6
172 | - grpc-interceptor==0.15.4
173 | - grpcio==1.70.0
174 | - grpcio-tools==1.70.0
175 | - huggingface-hub==0.28.1
176 | - humanfriendly==10.0
177 | - importlib-metadata==8.5.0
178 | - isodate==0.7.2
179 | - jinja2==3.1.4
180 | - jsonpath-python==1.0.6
181 | - llama-index-embeddings-fastembed==0.3.0
182 | - llama-index-embeddings-huggingface==0.5.1
183 | - llama-index-llms-mistralai==0.3.2
184 | - llama-index-tools-arxiv==0.3.0
185 | - llama-index-vector-stores-qdrant==0.4.3
186 | - loguru==0.7.3
187 | - lxml==5.3.1
188 | - mako==1.3.9
189 | - mammoth==1.9.0
190 | - markdown-it-py==3.0.0
191 | - markdownify==0.14.1
192 | - markitdown==0.0.1a4
193 | - markupsafe==2.1.5
194 | - mdurl==0.1.2
195 | - mistralai==1.5.0
196 | - mmh3==4.1.0
197 | - mpmath==1.3.0
198 | - msal==1.31.1
199 | - msal-extensions==1.2.0
200 | - nvidia-cublas-cu12==12.4.5.8
201 | - nvidia-cuda-cupti-cu12==12.4.127
202 | - nvidia-cuda-nvrtc-cu12==12.4.127
203 | - nvidia-cuda-runtime-cu12==12.4.127
204 | - nvidia-cudnn-cu12==9.1.0.70
205 | - nvidia-cufft-cu12==11.2.1.3
206 | - nvidia-curand-cu12==10.3.5.147
207 | - nvidia-cusolver-cu12==11.6.1.9
208 | - nvidia-cusparse-cu12==12.3.1.170
209 | - nvidia-cusparselt-cu12==0.6.2
210 | - nvidia-nccl-cu12==2.21.5
211 | - nvidia-nvjitlink-cu12==12.4.127
212 | - nvidia-nvtx-cu12==12.4.127
213 | - olefile==0.47
214 | - onnxruntime==1.20.1
215 | - openinference-instrumentation==0.1.22
216 | - openinference-instrumentation-llama-index==3.2.0
217 | - openinference-semantic-conventions==0.1.14
218 | - openpyxl==3.1.5
219 | - opentelemetry-api==1.30.0
220 | - opentelemetry-exporter-otlp==1.30.0
221 | - opentelemetry-exporter-otlp-proto-common==1.30.0
222 | - opentelemetry-exporter-otlp-proto-grpc==1.30.0
223 | - opentelemetry-exporter-otlp-proto-http==1.30.0
224 | - opentelemetry-instrumentation==0.51b0
225 | - opentelemetry-proto==1.30.0
226 | - opentelemetry-sdk==1.30.0
227 | - opentelemetry-semantic-conventions==0.51b0
228 | - orjson==3.10.15
229 | - pathvalidate==3.2.3
230 | - pdfminer-six==20240706
231 | - pillow==10.4.0
232 | - portalocker==2.10.1
233 | - protobuf==5.29.3
234 | - psutil==7.0.0
235 | - puremagic==1.28
236 | - py-rust-stemmers==0.1.3
237 | - pyarrow==19.0.0
238 | - pydub==0.25.1
239 | - pygments==2.19.1
240 | - pyjwt==2.10.1
241 | - python-multipart==0.0.20
242 | - python-pptx==1.0.2
243 | - qdrant-client==1.13.2
244 | - rich==13.9.4
245 | - ruff==0.9.6
246 | - safehttpx==0.1.6
247 | - safetensors==0.5.2
248 | - scikit-learn==1.6.1
249 | - scipy==1.15.1
250 | - semantic-version==2.10.0
251 | - sentence-transformers==3.4.1
252 | - sgmllib3k==1.0.0
253 | - shellingham==1.5.4
254 | - speechrecognition==3.14.1
255 | - sqlean-py==3.47.0
256 | - starlette==0.45.3
257 | - strawberry-graphql==0.253.1
258 | - sympy==1.13.1
259 | - threadpoolctl==3.5.0
260 | - tokenizers==0.21.0
261 | - tomlkit==0.13.2
262 | - torch==2.6.0
263 | - torchaudio==2.6.0
264 | - torchvision==0.21.0
265 | - transformers==4.48.3
266 | - triton==3.2.0
267 | - typer==0.15.1
268 | - uvicorn==0.34.0
269 | - websockets==14.2
270 | - xlrd==2.0.1
271 | - xlsxwriter==3.2.2
272 | - youtube-transcript-api==0.6.3
273 | - zipp==3.21.0
--------------------------------------------------------------------------------
/docker/getSecrets.py:
--------------------------------------------------------------------------------
1 | # Read the API keys from the Docker secrets files, stripping trailing
2 | # newlines so they do not end up inside the key strings
3 | with open("/run/secrets/mistral") as m:
4 |     mistral_api_key = m.read().strip()
5 | with open("/run/secrets/phoenix") as p:
6 |     phoenix_api_key = p.read().strip()
7 | with open("/run/secrets/llamacloud") as l:
8 |     llamacloud_api_key = l.read().strip()
9 |
--------------------------------------------------------------------------------
/docker/run.sh:
--------------------------------------------------------------------------------
1 | eval "$(conda shell.bash hook)"
2 |
3 | conda activate papers-chat
4 | echo "Activated conda env"
5 | python3 /app/app.py
6 |
--------------------------------------------------------------------------------
/docker/toolsFunctions.py:
--------------------------------------------------------------------------------
1 | import urllib, urllib.request
2 | from pydantic import Field
3 | from datetime import datetime
4 | from markitdown import MarkItDown
5 | from Bio import Entrez
6 | import xml.etree.ElementTree as ET
7 |
8 | md = MarkItDown()
9 |
10 | def format_today():
11 | d = datetime.now()
12 | if d.month < 10:
13 | month = f"0{d.month}"
14 | else:
15 | month = d.month
16 | if d.day < 10:
17 | day = f"0{d.day}"
18 | else:
19 | day = d.day
20 | if d.hour < 10:
21 | hour = f"0{d.hour}"
22 | else:
23 | hour = d.hour
24 | if d.minute < 10:
25 | minute = f"0{d.minute}"
26 | else:
27 | minute = d.minute
28 | today = f"{d.year}{month}{day}{hour}{minute}"
29 | two_years_ago = f"{d.year-2}{month}{day}{hour}{minute}"
30 | return today, two_years_ago
31 |
32 | def arxiv_tool(search_query: str = Field(description="The query with which to search ArXiv database")):
33 | """A tool to search ArXiv"""
34 | today, two_years_ago = format_today()
35 | query = search_query.replace(" ", "+")
36 | url = f'http://export.arxiv.org/api/query?search_query=all:{query}+AND+submittedDate:[{two_years_ago}+TO+{today}]&start=0&max_results=3'
37 | data = urllib.request.urlopen(url)
38 | content = data.read().decode("utf-8")
39 | f = open("arxiv_results.xml", "w")
40 | f.write(content)
41 | f.close()
42 | result = md.convert("arxiv_results.xml")
43 | return result.text_content
44 |
45 | def search_pubmed(query):
46 | Entrez.email = "astraberte9@gmail.com" # Replace with your email
47 | handle = Entrez.esearch(db="pubmed", term=query, retmax=3)
48 | record = Entrez.read(handle)
49 | handle.close()
50 | return record["IdList"]
51 |
52 | def fetch_pubmed_details(pubmed_ids):
53 | Entrez.email = "your.personal@email.com" # Replace with your email
54 | handle = Entrez.efetch(db="pubmed", id=pubmed_ids, rettype="medline", retmode="xml")
55 | records = handle.read()
56 | handle.close()
57 | recs = records.decode("utf-8")
58 | f = open("biomed_results.xml", "w")
59 | f.write(recs)
60 | f.close()
61 |
62 | def fetch_xml():
63 | tree = ET.parse("biomed_results.xml")
64 | root = tree.getroot()
65 | parsed_articles = []
66 | for article in root.findall('PubmedArticle'):
67 | # Extract title
68 | title = article.find('.//ArticleTitle')
69 | title_text = title.text if title is not None else "No title"
70 | # Extract abstract
71 | abstract = article.find('.//Abstract/AbstractText')
72 | abstract_text = abstract.text if abstract is not None else "No abstract"
73 | # Format output
74 | formatted_entry = f"## {title_text}\n\n**Abstract**:\n\n{abstract_text}"
75 | parsed_articles.append(formatted_entry)
76 | return "\n\n".join(parsed_articles)
77 |
78 | def pubmed_tool(search_query: str = Field(description="The query with which to search PubMed database")):
79 | """A tool to search PubMed"""
80 | idlist = search_pubmed(search_query)
81 | if len(idlist) == 0:
82 | return "There is no significant match in PubMed"
83 | fetch_pubmed_details(idlist)
84 | content = fetch_xml()
85 | return content
86 |
--------------------------------------------------------------------------------
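The `fetch_xml` helper above walks the PubMed XML with ElementTree; the same traversal can be exercised offline on a made-up snippet that mirrors the `PubmedArticle` structure the function expects (the sample XML and its text content are invented for illustration):

```python
import xml.etree.ElementTree as ET

# Made-up XML mirroring the layout of a PubMed efetch response
sample = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <Article>
        <ArticleTitle>A sample paper</ArticleTitle>
        <Abstract><AbstractText>A sample abstract.</AbstractText></Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

root = ET.fromstring(sample)
entries = []
for article in root.findall("PubmedArticle"):
    title = article.find(".//ArticleTitle")            # same XPath as fetch_xml
    abstract = article.find(".//Abstract/AbstractText")
    entries.append(f"## {title.text}\n\n**Abstract**:\n\n{abstract.text}")

result = "\n\n".join(entries)
print(result.splitlines()[0])  # ## A sample paper
```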
/docker/usage.md:
--------------------------------------------------------------------------------
1 | # PapersChat Usage Guide
2 |
3 | If you find PapersChat useful, please consider supporting us through a donation:
4 |
5 | [donation badge]
6 |
7 |
8 | > _This guide is only on how to use **the app**, not on how to install and/or launch it or on how it works internally. For that, please refer to the [GitHub repository](https://github.com/AstraBert/PapersChat)_
9 |
10 | ## Use PapersChat with your documents
11 |
12 | If you have papers that you would like to chat with, this is the right section of the guide!
13 |
14 | In order to chat with your papers, you will need to upload them (**as PDF files**) via the dedicated "Upload Papers" widget at the bottom of the chat interface: you can upload one or more files there (remember: the more you upload, the slower the processing is going to be).
15 |
16 | Once you have uploaded the files, before submitting them, you have to do two more things:
17 |
18 | 1. Specify the collection that you want to upload the documents to (in the "Collection" area)
19 | 2. Write your first question/message to interrogate your papers (in the message input space)
20 |
21 | As for point (1), you can give your collection whatever name you want: once you have created a new collection, you can always re-use it in the future just by entering the same name. If you do not remember all your collections, you can go to the "Your Collections" tab of the application and click on "Generate" to see the list of your collections.
22 |
23 | Point (2) is very important: if you do not send any message, PapersChat will tell you that you need to send one.
24 |
25 | Once you have uploaded the papers, specified the collection and written the message, you can send it and PapersChat will:
26 |
27 | - Ingest your documents
28 | - Produce an answer to your questions
29 |
30 | Congrats! You have now created your first collection and sent your first message!
31 |
32 | > _**NOTE**: there is one more option we haven't covered yet, i.e. the 'LlamaParse' checkbox. Selecting it enables LlamaParse, a tool that LlamaIndex offers [as part of its LlamaCloud services](https://docs.llamaindex.ai/en/stable/llama_cloud/llama_parse/). LlamaParse employs enhanced parsing techniques to produce clean and well-structured data from (often messy) unstructured documents: the free tier allows you to parse up to 1000 pages/day. While this approach generates very good data for your collections, bear in mind that parsing might take quite some time (especially if your documents are dense, contain lots of text-in-images or are very long). By default the LlamaParse option is disabled_
33 |
34 | ## Use PapersChat with a collection as knowledge base
35 |
36 | Once you have uploaded all your documents, you might want to interrogate them without having to upload even more. That's where the "collection as knowledge base" option comes in handy. You can simply send a message selecting one of your existing collections as a knowledge base for PapersChat (without uploading any file) and... BAM! You will see that PapersChat replies to your questions :)
37 |
38 | ## Use PapersChat to interrogate PubMed/ArXiv
39 |
40 | PapersChat also has access to the PubMed and ArXiv paper archives: if you do not specify a collection name and you do not upload any files, PapersChat uses your question to search these two online databases for an answer.
41 |
42 | ## Monitor your collections
43 |
44 | Under the "Your Collections" tab of the application you can, by clicking on "Generate", see your collections: how many data points each one contains (these data points **do not correspond** to the number of papers you uploaded) and what its status is.
45 |
46 | A brief guide to collection statuses:
47 |
48 | - "green": collection is optimized and searchable
49 | - "yellow": collection is being optimized and you can search it
50 | - "red": collection is not optimized and it will probably return an error if you try to search it
--------------------------------------------------------------------------------
/docker/utils.py:
--------------------------------------------------------------------------------
1 | from llama_index.embeddings.huggingface import HuggingFaceEmbedding
2 | from llama_index.core import Settings
3 | from qdrant_client import QdrantClient
4 | from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
5 | from llama_index.core import StorageContext
6 | from llama_index.vector_stores.qdrant import QdrantVectorStore
7 | from llama_cloud_services import LlamaParse
8 | from getSecrets import llamacloud_api_key
9 | from typing import List
10 | import torch
11 |
12 |
13 | qdrant_client = QdrantClient("http://127.0.0.1:6333")
14 | device = "cuda" if torch.cuda.is_available() else "cpu"
15 | embedder = HuggingFaceEmbedding(model_name="nomic-ai/modernbert-embed-base", device=device)
16 | Settings.embed_model = embedder
17 |
18 | def ingest_documents(files: List[str], collection_name: str, llamaparse: bool = True):
19 | vector_store = QdrantVectorStore(client=qdrant_client, collection_name=collection_name, enable_hybrid=True)
20 | storage_context = StorageContext.from_defaults(vector_store=vector_store)
21 | if llamaparse:
22 | parser = LlamaParse(
23 | result_type="markdown",
24 | api_key=llamacloud_api_key
25 | )
26 | file_extractor = {".pdf": parser}
27 | documents = SimpleDirectoryReader(input_files=files, file_extractor=file_extractor).load_data()
28 | else:
29 | documents = SimpleDirectoryReader(input_files=files).load_data()
30 | index = VectorStoreIndex.from_documents(
31 | documents,
32 | storage_context=storage_context,
33 | )
34 | return index
35 |
36 |
--------------------------------------------------------------------------------
/environment.yml:
--------------------------------------------------------------------------------
1 | name: papers-chat
2 | channels:
3 | - conda-forge
4 | dependencies:
5 | - _libgcc_mutex=0.1=conda_forge
6 | - _openmp_mutex=4.5=2_gnu
7 | - aiohappyeyeballs=2.4.6=pyhd8ed1ab_0
8 | - aiohttp=3.11.12=py311h2dc5d0c_0
9 | - aiosignal=1.3.2=pyhd8ed1ab_0
10 | - annotated-types=0.7.0=pyhd8ed1ab_1
11 | - anyio=4.8.0=pyhd8ed1ab_0
12 | - attrs=25.1.0=pyh71513ae_0
13 | - beautifulsoup4=4.13.3=pyha770c72_0
14 | - brotli-python=1.1.0=py311hfdbb021_2
15 | - bzip2=1.0.8=h4bc722e_7
16 | - ca-certificates=2025.1.31=hbcca054_0
17 | - certifi=2025.1.31=pyhd8ed1ab_0
18 | - cffi=1.17.1=py311hf29c0ef_0
19 | - charset-normalizer=3.4.1=pyhd8ed1ab_0
20 | - click=8.1.8=pyh707e725_0
21 | - colorama=0.4.6=pyhd8ed1ab_1
22 | - dataclasses-json=0.6.7=pyhd8ed1ab_1
23 | - deprecated=1.2.18=pyhd8ed1ab_0
24 | - dirtyjson=1.0.8=pyhd8ed1ab_1
25 | - distro=1.9.0=pyhd8ed1ab_1
26 | - eval-type-backport=0.2.2=pyhd8ed1ab_0
27 | - eval_type_backport=0.2.2=pyha770c72_0
28 | - exceptiongroup=1.2.2=pyhd8ed1ab_1
29 | - filetype=1.2.0=pyhd8ed1ab_0
30 | - freetype=2.12.1=h267a509_2
31 | - frozenlist=1.5.0=py311h2dc5d0c_1
32 | - fsspec=2025.2.0=pyhd8ed1ab_0
33 | - greenlet=3.1.1=py311hfdbb021_1
34 | - h11=0.14.0=pyhd8ed1ab_1
35 | - h2=4.2.0=pyhd8ed1ab_0
36 | - hpack=4.1.0=pyhd8ed1ab_0
37 | - httpcore=1.0.7=pyh29332c3_1
38 | - httpx=0.28.1=pyhd8ed1ab_0
39 | - hyperframe=6.1.0=pyhd8ed1ab_0
40 | - idna=3.10=pyhd8ed1ab_1
41 | - jiter=0.8.2=py311h9e33e62_0
42 | - joblib=1.4.2=pyhd8ed1ab_1
43 | - lcms2=2.17=h717163a_0
44 | - ld_impl_linux-64=2.43=h712a8e2_2
45 | - lerc=4.0.0=h27087fc_0
46 | - libblas=3.9.0=28_h59b9bed_openblas
47 | - libcblas=3.9.0=28_he106b2a_openblas
48 | - libdeflate=1.23=h4ddbbb0_0
49 | - libexpat=2.6.4=h5888daf_0
50 | - libffi=3.4.6=h2dba641_0
51 | - libgcc=14.2.0=h77fa898_1
52 | - libgcc-ng=14.2.0=h69a702a_1
53 | - libgfortran=14.2.0=h69a702a_1
54 | - libgfortran5=14.2.0=hd5240d6_1
55 | - libgomp=14.2.0=h77fa898_1
56 | - libjpeg-turbo=3.0.0=hd590300_1
57 | - liblapack=3.9.0=28_h7ac8fdf_openblas
58 | - liblzma=5.6.4=hb9d3cd8_0
59 | - libnsl=2.0.1=hd590300_0
60 | - libopenblas=0.3.28=pthreads_h94d23a6_1
61 | - libpng=1.6.46=h943b412_0
62 | - libsqlite=3.48.0=hee588c1_1
63 | - libstdcxx=14.2.0=hc0a3c3a_1
64 | - libstdcxx-ng=14.2.0=h4852527_1
65 | - libtiff=4.7.0=hd9ff511_3
66 | - libuuid=2.38.1=h0b41bf4_0
67 | - libwebp-base=1.5.0=h851e524_0
68 | - libxcb=1.17.0=h8a09558_0
69 | - libxcrypt=4.4.36=hd590300_1
70 | - libzlib=1.3.1=hb9d3cd8_2
71 | - llama-cloud=0.1.12=pyhd8ed1ab_0
72 | - llama-cloud-services=0.6.1=pyhd8ed1ab_0
73 | - llama-index=0.12.17=pyhd8ed1ab_0
74 | - llama-index-agent-openai=0.4.5=pyhd8ed1ab_0
75 | - llama-index-cli=0.4.0=pyhd8ed1ab_1
76 | - llama-index-core=0.12.17=pyhd8ed1ab_1
77 | - llama-index-embeddings-openai=0.3.1=pyhd8ed1ab_1
78 | - llama-index-indices-managed-llama-cloud=0.6.4=pyhd8ed1ab_0
79 | - llama-index-legacy=0.9.48.post4=pyhd8ed1ab_1
80 | - llama-index-llms-openai=0.3.19=pyhd8ed1ab_0
81 | - llama-index-multi-modal-llms-openai=0.4.3=pyhd8ed1ab_0
82 | - llama-index-program-openai=0.3.1=pyhd8ed1ab_1
83 | - llama-index-question-gen-openai=0.3.0=pyhd8ed1ab_1
84 | - llama-index-readers-file=0.4.5=pyhd8ed1ab_0
85 | - llama-index-readers-llama-parse=0.4.0=pyhd8ed1ab_1
86 | - llama-parse=0.6.1=pyhd8ed1ab_0
87 | - llamaindex-py-client=0.1.19=pyhd8ed1ab_1
88 | - marshmallow=3.26.1=pyhd8ed1ab_0
89 | - multidict=6.1.0=py311h2dc5d0c_2
90 | - mypy_extensions=1.0.0=pyha770c72_1
91 | - ncurses=6.5=h2d0b736_3
92 | - nest-asyncio=1.6.0=pyhd8ed1ab_1
93 | - networkx=3.4.2=pyh267e887_2
94 | - nltk=3.9.1=pyhd8ed1ab_1
95 | - numpy=2.2.3=py311h5d046bc_0
96 | - openai=1.63.0=pyhd8ed1ab_0
97 | - openjpeg=2.5.3=h5fbd93e_0
98 | - openssl=3.4.1=h7b32b05_0
99 | - packaging=24.2=pyhd8ed1ab_2
100 | - pandas=2.2.3=py311h7db5c69_1
101 | - pip=25.0.1=pyh8b19718_0
102 | - propcache=0.2.1=py311h2dc5d0c_1
103 | - pthread-stubs=0.4=hb9d3cd8_1002
104 | - pycparser=2.22=pyh29332c3_1
105 | - pydantic=2.10.6=pyh3cfb1c2_0
106 | - pydantic-core=2.27.2=py311h9e33e62_0
107 | - pypdf=5.3.0=pyh29332c3_0
108 | - pysocks=1.7.1=pyha55dd90_7
109 | - python=3.11.11=h9e4cc4f_1_cpython
110 | - python-dateutil=2.9.0.post0=pyhff2d567_1
111 | - python-dotenv=1.0.1=pyhd8ed1ab_1
112 | - python-tzdata=2025.1=pyhd8ed1ab_0
113 | - python_abi=3.11=5_cp311
114 | - pytz=2024.1=pyhd8ed1ab_0
115 | - pyyaml=6.0.2=py311h2dc5d0c_2
116 | - readline=8.2=h8228510_1
117 | - regex=2024.11.6=py311h9ecbd09_0
118 | - requests=2.32.3=pyhd8ed1ab_1
119 | - setuptools=75.8.0=pyhff2d567_0
120 | - six=1.17.0=pyhd8ed1ab_0
121 | - sniffio=1.3.1=pyhd8ed1ab_1
122 | - soupsieve=2.5=pyhd8ed1ab_1
123 | - sqlalchemy=2.0.38=py311h9ecbd09_0
124 | - striprtf=0.0.26=pyhd8ed1ab_0
125 | - tenacity=8.5.0=pyhd8ed1ab_0
126 | - tiktoken=0.9.0=py311hf1706b8_0
127 | - tk=8.6.13=noxft_h4845f30_101
128 | - tqdm=4.67.1=pyhd8ed1ab_1
129 | - typing-extensions=4.12.2=hd8ed1ab_1
130 | - typing_extensions=4.12.2=pyha770c72_1
131 | - typing_inspect=0.9.0=pyhd8ed1ab_1
132 | - tzdata=2025a=h78e105d_0
133 | - urllib3=2.3.0=pyhd8ed1ab_0
134 | - wheel=0.45.1=pyhd8ed1ab_1
135 | - wrapt=1.17.2=py311h9ecbd09_0
136 | - xorg-libxau=1.0.12=hb9d3cd8_0
137 | - xorg-libxdmcp=1.1.5=hb9d3cd8_0
138 | - yaml=0.2.5=h7f98852_2
139 | - yarl=1.18.3=py311h2dc5d0c_1
140 | - zstandard=0.23.0=py311hbc35293_1
141 | - zstd=1.5.6=ha6fb4c9_0
142 | - pip:
143 | - aiofiles==23.2.1
144 | - aioitertools==0.12.0
145 | - aiosqlite==0.21.0
146 | - alembic==1.14.1
147 | - arize-phoenix==7.12.2
148 | - arize-phoenix-evals==0.20.3
149 | - arize-phoenix-otel==0.7.1
150 | - arxiv==2.1.3
151 | - authlib==1.4.1
152 | - azure-ai-documentintelligence==1.0.0
153 | - azure-core==1.32.0
154 | - azure-identity==1.20.0
155 | - biopython==1.85
156 | - cachetools==5.5.1
157 | - cobble==0.1.4
158 | - coloredlogs==15.0.1
159 | - cryptography==44.0.1
160 | - defusedxml==0.7.1
161 | - et-xmlfile==2.0.0
162 | - fastapi==0.115.8
163 | - fastembed==0.5.1
164 | - feedparser==6.0.11
165 | - ffmpy==0.5.0
166 | - filelock==3.17.0
167 | - flatbuffers==25.2.10
168 | - googleapis-common-protos==1.67.0
169 | - gradio==5.16.0
170 | - gradio-client==1.7.0
171 | - graphql-core==3.2.6
172 | - grpc-interceptor==0.15.4
173 | - grpcio==1.70.0
174 | - grpcio-tools==1.70.0
175 | - huggingface-hub==0.28.1
176 | - humanfriendly==10.0
177 | - importlib-metadata==8.5.0
178 | - isodate==0.7.2
179 | - jinja2==3.1.4
180 | - jsonpath-python==1.0.6
181 | - llama-index-embeddings-fastembed==0.3.0
182 | - llama-index-embeddings-huggingface==0.5.1
183 | - llama-index-llms-azure-openai==0.3.2
184 | - llama-index-llms-mistralai==0.3.2
185 | - llama-index-llms-ollama==0.5.4
186 | - llama-index-tools-arxiv==0.3.0
187 | - llama-index-vector-stores-qdrant==0.4.3
188 | - loguru==0.7.3
189 | - lxml==5.3.1
190 | - mako==1.3.9
191 | - mammoth==1.9.0
192 | - markdown-it-py==3.0.0
193 | - markdownify==0.14.1
194 | - markitdown==0.0.1a4
195 | - markupsafe==2.1.5
196 | - mdurl==0.1.2
197 | - mistralai==1.5.0
198 | - mmh3==4.1.0
199 | - mpmath==1.3.0
200 | - msal==1.31.1
201 | - msal-extensions==1.2.0
202 | - nvidia-cublas-cu12==12.4.5.8
203 | - nvidia-cuda-cupti-cu12==12.4.127
204 | - nvidia-cuda-nvrtc-cu12==12.4.127
205 | - nvidia-cuda-runtime-cu12==12.4.127
206 | - nvidia-cudnn-cu12==9.1.0.70
207 | - nvidia-cufft-cu12==11.2.1.3
208 | - nvidia-curand-cu12==10.3.5.147
209 | - nvidia-cusolver-cu12==11.6.1.9
210 | - nvidia-cusparse-cu12==12.3.1.170
211 | - nvidia-cusparselt-cu12==0.6.2
212 | - nvidia-nccl-cu12==2.21.5
213 | - nvidia-nvjitlink-cu12==12.4.127
214 | - nvidia-nvtx-cu12==12.4.127
215 | - olefile==0.47
216 | - ollama==0.4.8
217 | - onnxruntime==1.20.1
218 | - openinference-instrumentation==0.1.22
219 | - openinference-instrumentation-llama-index==3.2.0
220 | - openinference-semantic-conventions==0.1.14
221 | - openpyxl==3.1.5
222 | - opentelemetry-api==1.30.0
223 | - opentelemetry-exporter-otlp==1.30.0
224 | - opentelemetry-exporter-otlp-proto-common==1.30.0
225 | - opentelemetry-exporter-otlp-proto-grpc==1.30.0
226 | - opentelemetry-exporter-otlp-proto-http==1.30.0
227 | - opentelemetry-instrumentation==0.51b0
228 | - opentelemetry-proto==1.30.0
229 | - opentelemetry-sdk==1.30.0
230 | - opentelemetry-semantic-conventions==0.51b0
231 | - orjson==3.10.15
232 | - pathvalidate==3.2.3
233 | - pdfminer-six==20240706
234 | - pillow==10.4.0
235 | - portalocker==2.10.1
236 | - protobuf==5.29.3
237 | - psutil==7.0.0
238 | - puremagic==1.28
239 | - py-rust-stemmers==0.1.3
240 | - pyarrow==19.0.0
241 | - pydub==0.25.1
242 | - pygments==2.19.1
243 | - pyjwt==2.10.1
244 | - python-multipart==0.0.20
245 | - python-pptx==1.0.2
246 | - qdrant-client==1.13.2
247 | - rich==13.9.4
248 | - ruff==0.9.6
249 | - safehttpx==0.1.6
250 | - safetensors==0.5.2
251 | - scikit-learn==1.6.1
252 | - scipy==1.15.1
253 | - semantic-version==2.10.0
254 | - sentence-transformers==3.4.1
255 | - sgmllib3k==1.0.0
256 | - shellingham==1.5.4
257 | - speechrecognition==3.14.1
258 | - sqlean-py==3.47.0
259 | - starlette==0.45.3
260 | - strawberry-graphql==0.253.1
261 | - sympy==1.13.1
262 | - threadpoolctl==3.5.0
263 | - tokenizers==0.21.0
264 | - tomlkit==0.13.2
265 | - torch==2.6.0
266 | - torchaudio==2.6.0
267 | - torchvision==0.21.0
268 | - transformers==4.48.3
269 | - triton==3.2.0
270 | - typer==0.15.1
271 | - uvicorn==0.34.0
272 | - websockets==14.2
273 | - xlrd==2.0.1
274 | - xlsxwriter==3.2.2
275 | - youtube-transcript-api==0.6.3
276 | - zipp==3.21.0
--------------------------------------------------------------------------------
/flowchart.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AstraBert/PapersChat/1d91b36f92071b7f1e002204735a7bbba385f406/flowchart.png
--------------------------------------------------------------------------------
/local_setup.ps1:
--------------------------------------------------------------------------------
1 | docker compose up db -d
2 |
3 | conda env create -f environment.yml
4 |
5 | conda activate papers-chat
6 | python3 scripts/app.py
7 | conda deactivate
--------------------------------------------------------------------------------
/local_setup.sh:
--------------------------------------------------------------------------------
1 | docker compose up db -d
2 |
3 | conda env create -f environment.yml
4 |
5 | conda activate papers-chat
6 | python3 scripts/app.py
7 | conda deactivate
--------------------------------------------------------------------------------
/logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AstraBert/PapersChat/1d91b36f92071b7f1e002204735a7bbba385f406/logo.png
--------------------------------------------------------------------------------
/scripts/app.py:
--------------------------------------------------------------------------------
1 | from utils import ingest_documents, qdrant_client, List, QdrantVectorStore, VectorStoreIndex, embedder
2 | import sys
3 | import gradio as gr
4 | from toolsFunctions import pubmed_tool, arxiv_tool
5 | from llama_index.core.tools import QueryEngineTool, FunctionTool
6 | from llama_index.core import Settings
7 | from llama_index.llms.mistralai import MistralAI
8 | from llama_index.llms.azure_openai import AzureOpenAI
9 | from llama_index.llms.ollama import Ollama
10 | from llama_index.core.llms import ChatMessage
11 | from llama_index.core.agent import ReActAgent
12 | from dotenv import load_dotenv
13 | from phoenix.otel import register
14 | from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
15 | import time
16 | import os
17 |
18 | load_dotenv()
19 |
20 | ## Observing and tracing
21 | PHOENIX_API_KEY = os.getenv("phoenix_api_key")
22 | os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={PHOENIX_API_KEY}"
23 | os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
24 | tracer_provider = register(
25 | project_name="llamaindex",
26 | )
27 | LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)
28 |
29 | ## Globals
30 | if os.getenv("mistral_api_key", None) is not None:
31 | Settings.llm = MistralAI(model="mistral-small-latest", temperature=0, api_key=os.getenv("mistral_api_key"))
32 | elif os.getenv("azure_openai_api_key", None) is not None:
33 | Settings.llm = AzureOpenAI(model="gpt-4.1", temperature=0, api_key=os.getenv("azure_openai_api_key"))
34 | elif os.getenv("ollama_model", None) is not None:
35 | Settings.llm = Ollama(model=os.getenv("ollama_model"))
36 | else:
37 | print("ERROR! No supported LLM can be loaded in PapersChat. Exiting...")
38 | sys.exit(1)
39 |
40 | Settings.embed_model = embedder
41 | arxivtool = FunctionTool.from_defaults(arxiv_tool, name="arxiv_tool", description="A tool to search ArXiv (pre-print papers database) for specific papers")
42 | pubmedtool = FunctionTool.from_defaults(pubmed_tool, name="pubmed_tool", description="A tool to search PubMed (printed medical papers database) for specific papers")
43 | query_engine = None
44 | message_history = [
45 | ChatMessage(role="system", content="You are a helpful assistant that answers the user's questions about the papers they have uploaded. You should base your answers on the context you can retrieve from the PDFs and, if you cannot retrieve any, search ArXiv for a potential answer. If you cannot find any viable answer, please reply that you do not know the answer to the user's question")
46 | ]
47 |
48 | ## Functions
49 | def reply(message, history, files: List[str] | None, collection: str | None, llamaparse: bool = False):
50 | global message_history
51 | if message == "" or message is None:
52 | response = "You should provide a message"
53 | r = ""
54 | for char in response:
55 | r+=char
56 | time.sleep(0.001)
57 | yield r
58 | elif files is None and collection == "":
59 | res = "### WARNING! You did not specify any collection, so I only interrogated ArXiv and/or PubMed to answer your question\n\n"
60 | agent = ReActAgent.from_tools(tools=[pubmedtool, arxivtool], verbose=True)
61 | response = agent.chat(message = message, chat_history = message_history)
62 | response = str(response)
63 | message_history.append(ChatMessage(role="user", content=message))
64 | message_history.append(ChatMessage(role="assistant", content=response))
65 | response = res + response
66 | r = ""
67 | for char in response:
68 | r+=char
69 | time.sleep(0.001)
70 | yield r
71 | elif files is None and collection != "" and collection not in [c.name for c in qdrant_client.get_collections().collections]:
72 | response = "Make sure that the name of the existing collection to use as a knowledge base is correct, because the one you provided does not exist! You can check your existing collections and their features in the dedicated tab of the app :)"
73 | r = ""
74 | for char in response:
75 | r+=char
76 | time.sleep(0.001)
77 | yield r
78 | elif files is not None:
79 | if collection == "":
80 | response = "You should provide a collection name (new or existing) if you want to ingest files!"
81 | r = ""
82 | for char in response:
83 | r+=char
84 | time.sleep(0.001)
85 | yield r
86 | else:
87 | collection_name = collection
88 | index = ingest_documents(files, collection_name, llamaparse)
89 | query_engine = index.as_query_engine()
90 | rag_tool = QueryEngineTool.from_defaults(query_engine, name="papers_rag", description="A RAG engine with information from selected scientific papers")
91 | agent = ReActAgent.from_tools(tools=[rag_tool, pubmedtool, arxivtool], verbose=True)
92 | response = agent.chat(message = message, chat_history = message_history)
93 | response = str(response)
94 | message_history.append(ChatMessage(role="user", content=message))
95 | message_history.append(ChatMessage(role="assistant", content=response))
96 | r = ""
97 | for char in response:
98 | r+=char
99 | time.sleep(0.001)
100 | yield r
101 | else:
102 | vector_store = QdrantVectorStore(client = qdrant_client, collection_name=collection, enable_hybrid=True)
103 | index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
104 | query_engine = index.as_query_engine()
105 | rag_tool = QueryEngineTool.from_defaults(query_engine, name="papers_rag", description="A RAG engine with information from selected scientific papers")
106 | agent = ReActAgent.from_tools(tools=[rag_tool, pubmedtool, arxivtool], verbose=True)
107 | response = agent.chat(message = message, chat_history = message_history)
108 | response = str(response)
109 | message_history.append(ChatMessage(role="user", content=message))
110 | message_history.append(ChatMessage(role="assistant", content=response))
111 | r = ""
112 | for char in response:
113 | r+=char
114 | time.sleep(0.001)
115 | yield r
116 |
117 | def to_markdown_color(grade: str):
118 | colors = {"red": "ff0000", "yellow": "ffcc00", "green": "33cc33"}
119 | mdcode = f"![#{colors[grade]}](https://placehold.co/15x15/{colors[grade]}/{colors[grade]}.png)"
120 | return mdcode
121 |
122 | def get_qdrant_collections_dets():
123 | collections = [c.name for c in qdrant_client.get_collections().collections]
124 | details = []
125 | counter = 0
126 | for collection in collections:
127 | counter += 1
128 | dets = qdrant_client.get_collection(collection)
129 | p = f"### {counter}. {collection}\n\n**Number of Points**: {dets.points_count}\n\n**Status**: {to_markdown_color(dets.status)} {dets.status}\n\n"
130 | details.append(p)
131 | final_text = "Available Collections\n\n"
132 | final_text += "\n\n".join(details)
133 | return final_text
134 |
135 | ## Frontend
136 | accordion = gr.Accordion(label="⚠️Set up these parameters before you start chatting!⚠️")
137 |
138 | iface1 = gr.ChatInterface(fn=reply, additional_inputs=[gr.File(label="Upload Papers (only PDF allowed!)", file_count="multiple", file_types=[".pdf","pdf",".PDF","PDF"], value=None), gr.Textbox(label="Collection", info="Upload your papers to a collection (new or existing)", value=""), gr.Checkbox(label="Use LlamaParse", info="Needs the LlamaCloud API key", value=False)], additional_inputs_accordion=accordion)
139 | with open("usage.md") as u:
140 |     content = u.read()
141 |
142 | iface2 = gr.Blocks()
143 | with iface2:
144 | with gr.Row():
145 | gr.Markdown(content)
146 | iface3 = gr.Interface(fn=get_qdrant_collections_dets, inputs=None, outputs=gr.Markdown(label="Collections"), submit_btn="See your collections")
147 | iface = gr.TabbedInterface([iface1, iface2, iface3], ["Chat💬", "Usage Guide⚙️", "Your Collections🔎"], title="PapersChat📝")
148 | iface.launch(server_name="0.0.0.0", server_port=7860)
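
Every branch of `reply` above ends with the same character-by-character streaming loop. As a sketch, that pattern can be factored into a standalone generator (the `stream_text` name and `delay` parameter are illustrative, not part of the app):

```python
import time

def stream_text(response: str, delay: float = 0.0):
    """Yield progressively longer prefixes of `response`, one character at a time.

    Gradio renders each yielded value, so yielding growing prefixes
    produces a typewriter effect in the chat window.
    """
    r = ""
    for char in response:
        r += char
        if delay:
            time.sleep(delay)
        yield r

chunks = list(stream_text("Hi!"))
# chunks == ["H", "Hi", "Hi!"]
```

Factoring the loop out this way would also remove the four near-identical copies of it inside `reply`.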
--------------------------------------------------------------------------------
/scripts/toolsFunctions.py:
--------------------------------------------------------------------------------
1 | import urllib, urllib.request
2 | from pydantic import Field
3 | from datetime import datetime
4 | from markitdown import MarkItDown
5 | from Bio import Entrez
6 | import xml.etree.ElementTree as ET
7 |
8 | md = MarkItDown()
9 |
10 | def format_today():
11 | d = datetime.now()
12 | if d.month < 10:
13 | month = f"0{d.month}"
14 | else:
15 | month = d.month
16 | if d.day < 10:
17 | day = f"0{d.day}"
18 | else:
19 | day = d.day
20 | if d.hour < 10:
21 | hour = f"0{d.hour}"
22 | else:
23 | hour = d.hour
24 | if d.minute < 10:
25 | minute = f"0{d.minute}"
26 | else:
27 | minute = d.minute
28 | today = f"{d.year}{month}{day}{hour}{minute}"
29 | two_years_ago = f"{d.year-2}{month}{day}{hour}{minute}"
30 | return today, two_years_ago
31 |
32 | def arxiv_tool(search_query: str = Field(description="The query with which to search ArXiv database")):
33 | """A tool to search ArXiv"""
34 | today, two_years_ago = format_today()
35 | query = search_query.replace(" ", "+")
36 | url = f'http://export.arxiv.org/api/query?search_query=all:{query}+AND+submittedDate:[{two_years_ago}+TO+{today}]&start=0&max_results=3'
37 | data = urllib.request.urlopen(url)
38 | content = data.read().decode("utf-8")
39 | f = open("arxiv_results.xml", "w")
40 | f.write(content)
41 | f.close()
42 | result = md.convert("arxiv_results.xml")
43 | return result.text_content
44 |
45 | def search_pubmed(query):
46 | Entrez.email = "astraberte9@gmail.com" # Replace with your email
47 | handle = Entrez.esearch(db="pubmed", term=query, retmax=3)
48 | record = Entrez.read(handle)
49 | handle.close()
50 | return record["IdList"]
51 |
52 | def fetch_pubmed_details(pubmed_ids):
53 | Entrez.email = "your.personal@email.com" # Replace with your email
54 | handle = Entrez.efetch(db="pubmed", id=pubmed_ids, rettype="medline", retmode="xml")
55 | records = handle.read()
56 | handle.close()
57 | recs = records.decode("utf-8")
58 | f = open("biomed_results.xml", "w")
59 | f.write(recs)
60 | f.close()
61 |
62 | def fetch_xml():
63 | tree = ET.parse("biomed_results.xml")
64 | root = tree.getroot()
65 | parsed_articles = []
66 | for article in root.findall('PubmedArticle'):
67 | # Extract title
68 | title = article.find('.//ArticleTitle')
69 | title_text = title.text if title is not None else "No title"
70 | # Extract abstract
71 | abstract = article.find('.//Abstract/AbstractText')
72 | abstract_text = abstract.text if abstract is not None else "No abstract"
73 | # Format output
74 | formatted_entry = f"## {title_text}\n\n**Abstract**:\n\n{abstract_text}"
75 | parsed_articles.append(formatted_entry)
76 | return "\n\n".join(parsed_articles)
77 |
78 | def pubmed_tool(search_query: str = Field(description="The query with which to search PubMed database")):
79 | """A tool to search PubMed"""
80 | idlist = search_pubmed(search_query)
81 | if len(idlist) == 0:
82 | return "There is no significant match in PubMed"
83 | fetch_pubmed_details(idlist)
84 | content = fetch_xml()
85 | return content
86 |
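
For reference, `format_today`'s manual zero-padding collapses into a single `strftime` call, and the request URL can be assembled the same way `arxiv_tool` does. This is a sketch: the search term is illustrative, and the date filter is joined to the query with `+AND+submittedDate:` per the ArXiv API query syntax:

```python
from datetime import datetime

def format_today_strftime():
    """Equivalent of format_today: zero-padded YYYYMMDDHHMM stamps for now and two years ago."""
    d = datetime.now()
    stamp = d.strftime("%m%d%H%M")  # strftime zero-pads month, day, hour and minute
    return f"{d.year}{stamp}", f"{d.year - 2}{stamp}"

today, two_years_ago = format_today_strftime()
query = "large language models".replace(" ", "+")  # illustrative search term
url = (
    "http://export.arxiv.org/api/query"
    f"?search_query=all:{query}+AND+submittedDate:[{two_years_ago}+TO+{today}]"
    "&start=0&max_results=3"
)
```

Keeping the year outside `strftime` mirrors the original's year arithmetic and sidesteps the `ValueError` that `d.replace(year=d.year - 2)` would raise on February 29.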
--------------------------------------------------------------------------------
/scripts/utils.py:
--------------------------------------------------------------------------------
1 | from llama_index.embeddings.huggingface import HuggingFaceEmbedding
2 | from llama_index.core import Settings
3 | from llama_index.llms.mistralai import MistralAI
4 | from qdrant_client import QdrantClient
5 | from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
6 | from llama_index.core import StorageContext
7 | from llama_index.vector_stores.qdrant import QdrantVectorStore
8 | from llama_cloud_services import LlamaParse
9 | from dotenv import load_dotenv
10 | from typing import List
11 | import torch
12 | import os
13 |
14 |
15 | load_dotenv()
16 |
17 |
18 | qdrant_client = QdrantClient("http://localhost:6333")
19 | device = "cuda" if torch.cuda.is_available() else "cpu"
20 | embedder = HuggingFaceEmbedding(model_name="nomic-ai/modernbert-embed-base", device=device)
21 | Settings.embed_model = embedder
22 |
23 | def ingest_documents(files: List[str], collection_name: str, llamaparse: bool = True):
24 | vector_store = QdrantVectorStore(client=qdrant_client, collection_name=collection_name, enable_hybrid=True)
25 | storage_context = StorageContext.from_defaults(vector_store=vector_store)
26 | if llamaparse:
27 | parser = LlamaParse(
28 | result_type="markdown",
29 | api_key=os.getenv("llamacloud_api_key")
30 | )
31 | file_extractor = {".pdf": parser}
32 | documents = SimpleDirectoryReader(input_files=files, file_extractor=file_extractor).load_data()
33 | else:
34 | documents = SimpleDirectoryReader(input_files=files).load_data()
35 | index = VectorStoreIndex.from_documents(
36 | documents,
37 | storage_context=storage_context,
38 | )
39 | return index
40 |
41 |
42 |
--------------------------------------------------------------------------------
/start_services.ps1:
--------------------------------------------------------------------------------
1 | docker compose up db -d
2 | docker compose up app -d
--------------------------------------------------------------------------------
/start_services.sh:
--------------------------------------------------------------------------------
1 | docker compose up db -d
2 | docker compose up app -d
--------------------------------------------------------------------------------
/usage.md:
--------------------------------------------------------------------------------
1 | PapersChat Usage Guide
2 |
3 | If you find PapersChat useful, please consider supporting us through a donation:
4 |
5 | 
6 |
7 |
8 | > _This guide covers only how to use **the app**, not how to install or launch it, nor how it works internally. For that, please refer to the [GitHub repository](https://github.com/AstraBert/PapersChat)_
9 |
10 | ## Use PapersChat with your documents
11 |
12 | If you have papers that you would like to chat with, this is the right section of the guide!
13 |
14 | In order to chat with your papers, you will need to upload them (**as PDF files**) to the dedicated "Upload Papers" widget at the bottom of the chat interface: you can upload one or more files there (remember: the more you upload, the slower processing will be).
15 |
16 | Once you have uploaded the files, and before submitting them, you need to do two more things:
17 |
18 | 1. Specify the collection that you want to upload the documents to (in the "Collection" area)
19 | 2. Write your first question/message to interrogate your papers (in the message input space)
20 |
21 | As for point (1), you can give your collection whatever name you want: once you have created a collection, you can always re-use it in the future simply by entering the same name. If you do not remember all your collections, go to the "Your collections" tab in the application and click on "Generate" to see the list.
22 |
23 | Point (2) is very important: if you do not send any message, PapersChat will tell you that you need to send one.
24 |
25 | Once you have uploaded the papers, specified the collection and written the message, you can send it and PapersChat will:
26 |
27 | - Ingest your documents
28 | - Produce an answer to your questions
29 |
30 | Congrats! You have now created your first collection and sent your first message!
31 |
32 | > _**NOTE**: there is one more option we haven't covered yet, i.e. the 'LlamaParse' checkbox. Selecting it enables LlamaParse, a tool that LlamaIndex offers [as part of its LlamaCloud services](https://docs.llamaindex.ai/en/stable/llama_cloud/llama_parse/). LlamaParse employs enhanced parsing techniques to produce clean and well-structured data from (often messy) unstructured documents: the free tier allows you to parse up to 1000 pages/day. While this approach generates very good data for your collections, bear in mind that parsing might take quite some time (especially if your documents are dense, contain lots of text-in-images or are very long). By default the LlamaParse option is disabled_
33 |
34 | ## Use PapersChat with a collection as knowledge base
35 |
36 | Once you have uploaded all your documents, you might want to interrogate them without having to upload even more. That's where the "collection as knowledge base" option comes in handy. You can simply send a message selecting one of your existing collections as a knowledge base for PapersChat (without uploading any file) and... BAM! You will see that PapersChat replies to your questions :)
37 |
38 | ## Use PapersChat to interrogate PubMed/ArXiv
39 |
40 | PapersChat also has access to the PubMed and ArXiv paper archives: if you do not specify a collection name and you do not upload any files, PapersChat uses your question to search these two online databases for an answer.
41 |
42 | ## Monitor your collections
43 |
44 | Under the "Your Collections" tab of the application you can, by clicking on "Generate", see your collections: how many data points each one contains (these data points **do not correspond** to the number of papers you uploaded) and what its status is.
45 |
46 | A brief guide to collection statuses:
47 |
48 | - "green": collection is optimized and searchable
49 | - "yellow": collection is being optimized and you can search it
50 | - "red": collection is not optimized and it will probably return an error if you try to search it
--------------------------------------------------------------------------------