├── .env.example
├── .gitignore
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── compose.yaml
├── docker
│   ├── Dockerfile
│   ├── app.py
│   ├── conda_env.sh
│   ├── environment.yml
│   ├── getSecrets.py
│   ├── run.sh
│   ├── toolsFunctions.py
│   ├── usage.md
│   └── utils.py
├── environment.yml
├── flowchart.png
├── local_setup.ps1
├── local_setup.sh
├── logo.png
├── scripts
│   ├── app.py
│   ├── toolsFunctions.py
│   └── utils.py
├── start_services.ps1
├── start_services.sh
└── usage.md
/.env.example:
--------------------------------------------------------------------------------
1 | llamacloud_api_key="llx-xxx"
2 | mistral_api_key="*******************abc"
3 | phoenix_api_key="*******************def"
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | *.xml
2 | .env
3 | papers-parser.code-workspace
4 | data/
5 | qdrant_storage/
6 | scripts/__pycache__/
7 | huggingface_spaces/
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # Contributing to PapersChat
2 |
3 | Do you want to contribute to this project? Make sure to read these guidelines first :)
4 |
5 | ## Issue
6 |
7 | **When to do it**:
8 |
9 | - You found bugs but you don't know how to solve them, or don't have the time/will to fix them
10 | - You want new features but you don't know how to implement them, or don't have the time/will to implement them
11 |
12 | > ⚠️ _Always check open and closed issues before you submit yours to avoid duplicates_
13 |
14 | **How to do it**:
15 |
16 | - Open an issue
17 | - Give the issue a meaningful title (short but effective problem description)
18 | - Describe the problem following the issue template
19 |
20 | ## Traditional contribution
21 |
22 | **When to do it**:
23 |
24 | - You found bugs and corrected them
25 | - You optimized/improved the code
26 | - You added new features that you think could be useful to others
27 |
28 | **How to do it**:
29 |
30 | 1. Fork this repository
31 | 2. Commit your changes
32 | 3. Submit a pull request (make sure to provide a thorough description of the changes)
33 |
34 |
35 | ## Showcase your PapersChat
36 |
37 | **When to do it**:
38 |
39 | - You modified the base application with new features, but you don't want to (or can't) merge them into the original PapersChat
40 |
41 | **How to do it**:
42 |
43 | - Go to the [_GitHub Discussions > Show and tell_](https://github.com/AstraBert/PrAIvateSearch/discussions/categories/show-and-tell) page
44 | - Open a new discussion there, describing your PapersChat application
45 |
46 | ### Thanks for contributing!
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2025 Clelia (Astra) Bertelli
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # PapersChat
2 |
3 | Chatting With Papers Made Easy
4 |
5 | If you find PapersChat useful, please consider supporting us through a donation:
6 |
7 | [donation badge]
8 |
9 |
10 |
11 |
12 |
13 |
14 | **PapersChat** is an agentic AI application that allows you to chat with your papers and gather also information from papers on ArXiv and on PubMed. It is powered by [LlamaIndex](https://www.llamaindex.ai/), [Qdrant](https://qdrant.tech) and [Mistral AI](https://mistral.ai/en).
15 |
16 | ### Flowchart
17 |
18 | ![Flowchart](./flowchart.png)
19 |
20 |
21 |
22 | ### Install and launch it
23 |
24 | Installation is a single process for both launch paths: you simply have to clone the GitHub repository:
25 |
26 | ```bash
27 | git clone https://github.com/AstraBert/PapersChat.git
28 | cd PapersChat/
29 | ```
30 |
31 | To launch the app, you can follow two paths:
32 |
33 | **1. Docker (recommended)**
34 |
35 | > _Required: [Docker](https://docs.docker.com/desktop/) and [docker compose](https://docs.docker.com/compose/)_
36 |
37 | - Add the `mistral_api_key`, `phoenix_api_key` and `llamacloud_api_key` variables to the [`.env.example`](./docker/.env.example) file and rename the file to `.env`. Get these keys:
38 | + [On Mistral AI](https://console.mistral.ai/api-keys/)
39 | + [On LlamaCloud](https://cloud.llamaindex.ai/)
40 | + [On Phoenix/Arize](https://llamatrace.com/projects)
41 |
42 | ```bash
43 | # after filling in your API keys, rename the file
44 | mv .env.example .env
45 | ```
46 |
47 | - Launch the docker application:
48 |
49 | ```bash
50 | # If you are on Linux/macOS
51 | bash start_services.sh
52 | # If you are on Windows
53 | .\start_services.ps1
54 | ```
55 |
56 | You will see the application running on http://localhost:7860 and you will be able to use it. Depending on your connection and your hardware, the setup might take some time (up to 30 minutes), but only the first time you run it!
57 |
58 | **2. Source code**
59 |
60 | > _Required: [Docker](https://docs.docker.com/desktop/), [docker compose](https://docs.docker.com/compose/) and [conda](https://anaconda.org/anaconda/conda)_
61 |
62 | - Add the `mistral_api_key`, `phoenix_api_key` and `llamacloud_api_key` variables to the [`.env.example`](./docker/.env.example) file and rename the file to `.env`. Get these keys:
63 | + [On Mistral AI](https://console.mistral.ai/api-keys/)
64 | + [On LlamaCloud](https://cloud.llamaindex.ai/)
65 | + [On Phoenix/Arize](https://llamatrace.com/projects)
66 |
67 | ```bash
68 | mv .env.example .env
69 | # modify the variables, e.g.:
70 | # llamacloud_api_key="llx-000-abc"
71 | # mistral_api_key="01234abc"
72 | # phoenix_api_key="56789def"
73 | ```
74 |
75 | - Alternatively, if you wish to use Azure OpenAI or Ollama, specify:
76 |
77 | ```bash
78 | azure_openai_api_key="***" # if you wish to use Azure OpenAI
79 | ollama_model="gemma3:latest" # if you wish to use Ollama
80 | ```
81 |
82 | >[!IMPORTANT]
83 | > _This is only possible when launching from the source code; the Docker launch does not support this option_
84 |
85 | - Set up PapersChat using the dedicated script:
86 |
87 | ```bash
88 | # For MacOs/Linux users
89 | bash local_setup.sh
90 | # For Windows users
91 | .\local_setup.ps1
92 | ```
93 |
94 | - Or you can do it manually, if you prefer:
95 |
96 | ```bash
97 | docker compose up db -d
98 |
99 | conda env create -f environment.yml
100 |
101 | conda activate papers-chat
102 | python3 scripts/app.py
103 | conda deactivate
104 | ```
105 |
106 | ## Contributing
107 |
108 | Contributions are always welcome! Follow the contribution guidelines reported [here](CONTRIBUTING.md).
109 |
110 | ## License and rights of usage
111 |
112 | The software is provided under MIT [license](./LICENSE).
113 |
114 | ### Full documentation will come soon! 👷‍♀️
115 |
116 |
--------------------------------------------------------------------------------
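A side note on the configuration step in the README above: the `.env` file follows the standard dotenv `key="value"` format, and the project's environment ships `python-dotenv` to load it. A minimal stdlib sketch of how such a file is parsed (variable names mirror `.env.example`; the key values here are placeholders, not real keys):

```python
import os

# Placeholder .env content in the format of .env.example (values are fake)
env_text = 'mistral_api_key="01234abc"\nphoenix_api_key="56789def"\n'

# Minimal parser: split each line on the first '=' and strip the quotes.
# python-dotenv (pinned in environment.yml) does this far more robustly.
for line in env_text.splitlines():
    key, _, value = line.partition("=")
    if key:
        os.environ[key] = value.strip('"')

print(os.environ["mistral_api_key"])  # 01234abc
```

In the Docker path the same three keys travel as Compose secrets instead, mounted as files under `/run/secrets/` and read by `docker/getSecrets.py`.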
/compose.yaml:
--------------------------------------------------------------------------------
1 | name: papers_chat
2 |
3 | services:
4 | app:
5 | build:
6 | context: ./docker/
7 | dockerfile: Dockerfile
8 | ports:
9 | - 7860:7860
10 | secrets:
11 | - mistral
12 | - phoenix
13 | - llamacloud
14 | networks:
15 | - internal_net
16 | db:
17 | image: qdrant/qdrant
18 | ports:
19 | - 6333:6333
20 | - 6334:6334
21 | volumes:
22 | - "./qdrant_storage:/qdrant/storage"
23 | networks:
24 | - internal_net
25 |
26 | networks:
27 | internal_net:
28 | driver: bridge
29 | driver_opts:
30 | com.docker.network.bridge.host_binding_ipv4: "127.0.0.1"
31 |
32 | secrets:
33 | mistral:
34 | environment: mistral_api_key
35 | phoenix:
36 | environment: phoenix_api_key
37 | llamacloud:
38 | environment: llamacloud_api_key
--------------------------------------------------------------------------------
/docker/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM condaforge/miniforge3
2 |
3 | WORKDIR /app/
4 | COPY . /app/
5 | RUN bash /app/conda_env.sh
6 |
7 | EXPOSE 7860
8 | CMD ["bash", "/app/run.sh"]
--------------------------------------------------------------------------------
/docker/app.py:
--------------------------------------------------------------------------------
1 | from utils import ingest_documents, qdrant_client, List, QdrantVectorStore, VectorStoreIndex, embedder
2 | import gradio as gr
3 | from toolsFunctions import pubmed_tool, arxiv_tool
4 | from llama_index.core.tools import QueryEngineTool, FunctionTool
5 | from llama_index.core import Settings
6 | from llama_index.llms.mistralai import MistralAI
7 | from llama_index.core.llms import ChatMessage
8 | from llama_index.core.agent import ReActAgent
9 | from getSecrets import mistral_api_key, phoenix_api_key
10 | from phoenix.otel import register
11 | from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
12 | import time
13 | import os
14 |
15 | ## Observing and tracing
16 | PHOENIX_API_KEY = phoenix_api_key
17 | os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={PHOENIX_API_KEY}"
18 | os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
19 | tracer_provider = register(
20 | project_name="llamaindex",
21 | )
22 | LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)
23 |
24 | ## Globals
25 | Settings.llm = MistralAI(model="mistral-small-latest", temperature=0, api_key=mistral_api_key)
26 | Settings.embed_model = embedder
27 | arxivtool = FunctionTool.from_defaults(arxiv_tool, name="arxiv_tool", description="A tool to search ArXiv (pre-print papers database) for specific papers")
28 | pubmedtool = FunctionTool.from_defaults(pubmed_tool, name="pubmed_tool", description="A tool to search PubMed (printed medical papers database) for specific papers")
29 | query_engine = None
30 | message_history = [
31 | ChatMessage(role="system", content="You are a useful assistant that has to help the user with questions that they ask about several papers they uploaded. You should base your answers on the context you can retrieve from the PDFs and, if you cannot retrieve any, search ArXiv for a potential answer. If you cannot find any viable answer, please reply that you do not know the answer to the user's question")
32 | ]
33 |
34 | ## Functions
35 | def reply(message, history, files: List[str] | None, collection: str | None, llamaparse: bool = False):
36 | global message_history
37 | if message == "" or message is None:
38 | response = "You should provide a message"
39 | r = ""
40 | for char in response:
41 | r+=char
42 | time.sleep(0.001)
43 | yield r
44 | elif files is None and collection == "":
45 | res = "### WARNING! You did not specify any collection, so I only interrogated ArXiv and/or PubMed to answer your question\n\n"
46 | agent = ReActAgent.from_tools(tools=[pubmedtool, arxivtool], verbose=True)
47 | response = agent.chat(message = message, chat_history = message_history)
48 | response = str(response)
49 | message_history.append(ChatMessage(role="user", content=message))
50 | message_history.append(ChatMessage(role="assistant", content=response))
51 | response = res + response
52 | r = ""
53 | for char in response:
54 | r+=char
55 | time.sleep(0.001)
56 | yield r
57 | elif files is None and collection != "" and collection not in [c.name for c in qdrant_client.get_collections().collections]:
58 | response = "Make sure that the name of the existing collection to use as a knowledge base is correct, because the one you provided does not exist! You can check your existing collections and their features in the dedicated tab of the app :)"
59 | r = ""
60 | for char in response:
61 | r+=char
62 | time.sleep(0.001)
63 | yield r
64 | elif files is not None:
65 | if collection == "":
66 | response = "You should provide a collection name (new or existing) if you want to ingest files!"
67 | r = ""
68 | for char in response:
69 | r+=char
70 | time.sleep(0.001)
71 | yield r
72 | else:
73 | collection_name = collection
74 | index = ingest_documents(files, collection_name, llamaparse)
75 | query_engine = index.as_query_engine()
76 | rag_tool = QueryEngineTool.from_defaults(query_engine, name="papers_rag", description="A RAG engine with information from selected scientific papers")
77 | agent = ReActAgent.from_tools(tools=[rag_tool, pubmedtool, arxivtool], verbose=True)
78 | response = agent.chat(message = message, chat_history = message_history)
79 | response = str(response)
80 | message_history.append(ChatMessage(role="user", content=message))
81 | message_history.append(ChatMessage(role="assistant", content=response))
82 | r = ""
83 | for char in response:
84 | r+=char
85 | time.sleep(0.001)
86 | yield r
87 | else:
88 | vector_store = QdrantVectorStore(client = qdrant_client, collection_name=collection, enable_hybrid=True)
89 | index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
90 | query_engine = index.as_query_engine()
91 | rag_tool = QueryEngineTool.from_defaults(query_engine, name="papers_rag", description="A RAG engine with information from selected scientific papers")
92 | agent = ReActAgent.from_tools(tools=[rag_tool, pubmedtool, arxivtool], verbose=True)
93 | response = agent.chat(message = message, chat_history = message_history)
94 | response = str(response)
95 | message_history.append(ChatMessage(role="user", content=message))
96 | message_history.append(ChatMessage(role="assistant", content=response))
97 | r = ""
98 | for char in response:
99 | r+=char
100 | time.sleep(0.001)
101 | yield r
102 |
103 | def to_markdown_color(grade: str):
104 | colors = {"red": "ff0000", "yellow": "ffcc00", "green": "33cc33"}
105 | mdcode = f"![#{colors[grade]}](https://placehold.co/15x15/{colors[grade]}/{colors[grade]}.png)"
106 | return mdcode
107 |
108 | def get_qdrant_collections_dets():
109 | collections = [c.name for c in qdrant_client.get_collections().collections]
110 | details = []
111 | counter = 0
112 | for collection in collections:
113 | counter += 1
114 | dets = qdrant_client.get_collection(collection)
115 | p = f"### {counter}. {collection}\n\n**Number of Points**: {dets.points_count}\n\n**Status**: {to_markdown_color(dets.status)} {dets.status}\n\n"
116 | details.append(p)
117 | final_text = "Available Collections\n\n"
118 | final_text += "\n\n".join(details)
119 | return final_text
120 |
121 | ## Frontend
122 | accordion = gr.Accordion(label="⚠️Set up these parameters before you start chatting!⚠️")
123 |
124 | iface1 = gr.ChatInterface(fn=reply, additional_inputs=[gr.File(label="Upload Papers (only PDF allowed!) - Ingestion", file_count="multiple", file_types=[".pdf","pdf",".PDF","PDF"], value=None), gr.Textbox(label="Collection", info="Upload your papers to a collection (new or existing)", value=""), gr.Checkbox(label="Use LlamaParse", info="Needs the LlamaCloud API key", value=False)], additional_inputs_accordion=accordion)
125 | with open("usage.md") as u:
126 |     content = u.read()
127 |
128 | iface2 = gr.Blocks()
129 | with iface2:
130 | with gr.Row():
131 | gr.Markdown(content)
132 | iface3 = gr.Interface(fn=get_qdrant_collections_dets, inputs=None, outputs=gr.Markdown(label="Collections"), submit_btn="See your collections")
133 | iface = gr.TabbedInterface([iface1, iface2, iface3], ["Chat💬", "Usage Guide⚙️", "Your Collections🔎"], title="PapersChat📝")
134 | iface.launch(server_name="0.0.0.0", server_port=7860)
--------------------------------------------------------------------------------
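A note on the pattern repeated throughout `reply()` above: responses are streamed character by character by yielding progressively longer prefixes of the final string, which Gradio re-renders on every yield to produce a typewriter effect. The pattern in isolation (the per-character delay is parameterized here; `reply()` hard-codes 0.001 s):

```python
import time

def stream_text(response: str, delay: float = 0.0):
    """Yield progressively longer prefixes of `response`, as reply() does."""
    r = ""
    for char in response:
        r += char
        time.sleep(delay)
        yield r

chunks = list(stream_text("Hi!"))
print(chunks)  # ['H', 'Hi', 'Hi!']
```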
/docker/conda_env.sh:
--------------------------------------------------------------------------------
1 | eval "$(conda shell.bash hook)"
2 |
3 | conda env create -f /app/environment.yml
--------------------------------------------------------------------------------
/docker/environment.yml:
--------------------------------------------------------------------------------
1 | name: papers-chat
2 | channels:
3 | - conda-forge
4 | dependencies:
5 | - _libgcc_mutex=0.1=conda_forge
6 | - _openmp_mutex=4.5=2_gnu
7 | - aiohappyeyeballs=2.4.6=pyhd8ed1ab_0
8 | - aiohttp=3.11.12=py311h2dc5d0c_0
9 | - aiosignal=1.3.2=pyhd8ed1ab_0
10 | - annotated-types=0.7.0=pyhd8ed1ab_1
11 | - anyio=4.8.0=pyhd8ed1ab_0
12 | - attrs=25.1.0=pyh71513ae_0
13 | - beautifulsoup4=4.13.3=pyha770c72_0
14 | - brotli-python=1.1.0=py311hfdbb021_2
15 | - bzip2=1.0.8=h4bc722e_7
16 | - ca-certificates=2025.1.31=hbcca054_0
17 | - certifi=2025.1.31=pyhd8ed1ab_0
18 | - cffi=1.17.1=py311hf29c0ef_0
19 | - charset-normalizer=3.4.1=pyhd8ed1ab_0
20 | - click=8.1.8=pyh707e725_0
21 | - colorama=0.4.6=pyhd8ed1ab_1
22 | - dataclasses-json=0.6.7=pyhd8ed1ab_1
23 | - deprecated=1.2.18=pyhd8ed1ab_0
24 | - dirtyjson=1.0.8=pyhd8ed1ab_1
25 | - distro=1.9.0=pyhd8ed1ab_1
26 | - eval-type-backport=0.2.2=pyhd8ed1ab_0
27 | - eval_type_backport=0.2.2=pyha770c72_0
28 | - exceptiongroup=1.2.2=pyhd8ed1ab_1
29 | - filetype=1.2.0=pyhd8ed1ab_0
30 | - freetype=2.12.1=h267a509_2
31 | - frozenlist=1.5.0=py311h2dc5d0c_1
32 | - fsspec=2025.2.0=pyhd8ed1ab_0
33 | - greenlet=3.1.1=py311hfdbb021_1
34 | - h11=0.14.0=pyhd8ed1ab_1
35 | - h2=4.2.0=pyhd8ed1ab_0
36 | - hpack=4.1.0=pyhd8ed1ab_0
37 | - httpcore=1.0.7=pyh29332c3_1
38 | - httpx=0.28.1=pyhd8ed1ab_0
39 | - hyperframe=6.1.0=pyhd8ed1ab_0
40 | - idna=3.10=pyhd8ed1ab_1
41 | - jiter=0.8.2=py311h9e33e62_0
42 | - joblib=1.4.2=pyhd8ed1ab_1
43 | - lcms2=2.17=h717163a_0
44 | - ld_impl_linux-64=2.43=h712a8e2_2
45 | - lerc=4.0.0=h27087fc_0
46 | - libblas=3.9.0=28_h59b9bed_openblas
47 | - libcblas=3.9.0=28_he106b2a_openblas
48 | - libdeflate=1.23=h4ddbbb0_0
49 | - libexpat=2.6.4=h5888daf_0
50 | - libffi=3.4.6=h2dba641_0
51 | - libgcc=14.2.0=h77fa898_1
52 | - libgcc-ng=14.2.0=h69a702a_1
53 | - libgfortran=14.2.0=h69a702a_1
54 | - libgfortran5=14.2.0=hd5240d6_1
55 | - libgomp=14.2.0=h77fa898_1
56 | - libjpeg-turbo=3.0.0=hd590300_1
57 | - liblapack=3.9.0=28_h7ac8fdf_openblas
58 | - liblzma=5.6.4=hb9d3cd8_0
59 | - libnsl=2.0.1=hd590300_0
60 | - libopenblas=0.3.28=pthreads_h94d23a6_1
61 | - libpng=1.6.46=h943b412_0
62 | - libsqlite=3.48.0=hee588c1_1
63 | - libstdcxx=14.2.0=hc0a3c3a_1
64 | - libstdcxx-ng=14.2.0=h4852527_1
65 | - libtiff=4.7.0=hd9ff511_3
66 | - libuuid=2.38.1=h0b41bf4_0
67 | - libwebp-base=1.5.0=h851e524_0
68 | - libxcb=1.17.0=h8a09558_0
69 | - libxcrypt=4.4.36=hd590300_1
70 | - libzlib=1.3.1=hb9d3cd8_2
71 | - llama-cloud=0.1.12=pyhd8ed1ab_0
72 | - llama-cloud-services=0.6.1=pyhd8ed1ab_0
73 | - llama-index=0.12.17=pyhd8ed1ab_0
74 | - llama-index-agent-openai=0.4.5=pyhd8ed1ab_0
75 | - llama-index-cli=0.4.0=pyhd8ed1ab_1
76 | - llama-index-core=0.12.17=pyhd8ed1ab_1
77 | - llama-index-embeddings-openai=0.3.1=pyhd8ed1ab_1
78 | - llama-index-indices-managed-llama-cloud=0.6.4=pyhd8ed1ab_0
79 | - llama-index-legacy=0.9.48.post4=pyhd8ed1ab_1
80 | - llama-index-llms-openai=0.3.19=pyhd8ed1ab_0
81 | - llama-index-multi-modal-llms-openai=0.4.3=pyhd8ed1ab_0
82 | - llama-index-program-openai=0.3.1=pyhd8ed1ab_1
83 | - llama-index-question-gen-openai=0.3.0=pyhd8ed1ab_1
84 | - llama-index-readers-file=0.4.5=pyhd8ed1ab_0
85 | - llama-index-readers-llama-parse=0.4.0=pyhd8ed1ab_1
86 | - llama-parse=0.6.1=pyhd8ed1ab_0
87 | - llamaindex-py-client=0.1.19=pyhd8ed1ab_1
88 | - marshmallow=3.26.1=pyhd8ed1ab_0
89 | - multidict=6.1.0=py311h2dc5d0c_2
90 | - mypy_extensions=1.0.0=pyha770c72_1
91 | - ncurses=6.5=h2d0b736_3
92 | - nest-asyncio=1.6.0=pyhd8ed1ab_1
93 | - networkx=3.4.2=pyh267e887_2
94 | - nltk=3.9.1=pyhd8ed1ab_1
95 | - numpy=2.2.3=py311h5d046bc_0
96 | - openai=1.63.0=pyhd8ed1ab_0
97 | - openjpeg=2.5.3=h5fbd93e_0
98 | - openssl=3.4.1=h7b32b05_0
99 | - packaging=24.2=pyhd8ed1ab_2
100 | - pandas=2.2.3=py311h7db5c69_1
101 | - pip=25.0.1=pyh8b19718_0
102 | - propcache=0.2.1=py311h2dc5d0c_1
103 | - pthread-stubs=0.4=hb9d3cd8_1002
104 | - pycparser=2.22=pyh29332c3_1
105 | - pydantic=2.10.6=pyh3cfb1c2_0
106 | - pydantic-core=2.27.2=py311h9e33e62_0
107 | - pypdf=5.3.0=pyh29332c3_0
108 | - pysocks=1.7.1=pyha55dd90_7
109 | - python=3.11.11=h9e4cc4f_1_cpython
110 | - python-dateutil=2.9.0.post0=pyhff2d567_1
111 | - python-dotenv=1.0.1=pyhd8ed1ab_1
112 | - python-tzdata=2025.1=pyhd8ed1ab_0
113 | - python_abi=3.11=5_cp311
114 | - pytz=2024.1=pyhd8ed1ab_0
115 | - pyyaml=6.0.2=py311h2dc5d0c_2
116 | - readline=8.2=h8228510_1
117 | - regex=2024.11.6=py311h9ecbd09_0
118 | - requests=2.32.3=pyhd8ed1ab_1
119 | - setuptools=75.8.0=pyhff2d567_0
120 | - six=1.17.0=pyhd8ed1ab_0
121 | - sniffio=1.3.1=pyhd8ed1ab_1
122 | - soupsieve=2.5=pyhd8ed1ab_1
123 | - sqlalchemy=2.0.38=py311h9ecbd09_0
124 | - striprtf=0.0.26=pyhd8ed1ab_0
125 | - tenacity=8.5.0=pyhd8ed1ab_0
126 | - tiktoken=0.9.0=py311hf1706b8_0
127 | - tk=8.6.13=noxft_h4845f30_101
128 | - tqdm=4.67.1=pyhd8ed1ab_1
129 | - typing-extensions=4.12.2=hd8ed1ab_1
130 | - typing_extensions=4.12.2=pyha770c72_1
131 | - typing_inspect=0.9.0=pyhd8ed1ab_1
132 | - tzdata=2025a=h78e105d_0
133 | - urllib3=2.3.0=pyhd8ed1ab_0
134 | - wheel=0.45.1=pyhd8ed1ab_1
135 | - wrapt=1.17.2=py311h9ecbd09_0
136 | - xorg-libxau=1.0.12=hb9d3cd8_0
137 | - xorg-libxdmcp=1.1.5=hb9d3cd8_0
138 | - yaml=0.2.5=h7f98852_2
139 | - yarl=1.18.3=py311h2dc5d0c_1
140 | - zstandard=0.23.0=py311hbc35293_1
141 | - zstd=1.5.6=ha6fb4c9_0
142 | - pip:
143 | - aiofiles==23.2.1
144 | - aioitertools==0.12.0
145 | - aiosqlite==0.21.0
146 | - alembic==1.14.1
147 | - arize-phoenix==7.12.2
148 | - arize-phoenix-evals==0.20.3
149 | - arize-phoenix-otel==0.7.1
150 | - arxiv==2.1.3
151 | - authlib==1.4.1
152 | - azure-ai-documentintelligence==1.0.0
153 | - azure-core==1.32.0
154 | - azure-identity==1.20.0
155 | - biopython==1.85
156 | - cachetools==5.5.1
157 | - cobble==0.1.4
158 | - coloredlogs==15.0.1
159 | - cryptography==44.0.1
160 | - defusedxml==0.7.1
161 | - et-xmlfile==2.0.0
162 | - fastapi==0.115.8
163 | - fastembed==0.5.1
164 | - feedparser==6.0.11
165 | - ffmpy==0.5.0
166 | - filelock==3.17.0
167 | - flatbuffers==25.2.10
168 | - googleapis-common-protos==1.67.0
169 | - gradio==5.16.0
170 | - gradio-client==1.7.0
171 | - graphql-core==3.2.6
172 | - grpc-interceptor==0.15.4
173 | - grpcio==1.70.0
174 | - grpcio-tools==1.70.0
175 | - huggingface-hub==0.28.1
176 | - humanfriendly==10.0
177 | - importlib-metadata==8.5.0
178 | - isodate==0.7.2
179 | - jinja2==3.1.4
180 | - jsonpath-python==1.0.6
181 | - llama-index-embeddings-fastembed==0.3.0
182 | - llama-index-embeddings-huggingface==0.5.1
183 | - llama-index-llms-mistralai==0.3.2
184 | - llama-index-tools-arxiv==0.3.0
185 | - llama-index-vector-stores-qdrant==0.4.3
186 | - loguru==0.7.3
187 | - lxml==5.3.1
188 | - mako==1.3.9
189 | - mammoth==1.9.0
190 | - markdown-it-py==3.0.0
191 | - markdownify==0.14.1
192 | - markitdown==0.0.1a4
193 | - markupsafe==2.1.5
194 | - mdurl==0.1.2
195 | - mistralai==1.5.0
196 | - mmh3==4.1.0
197 | - mpmath==1.3.0
198 | - msal==1.31.1
199 | - msal-extensions==1.2.0
200 | - nvidia-cublas-cu12==12.4.5.8
201 | - nvidia-cuda-cupti-cu12==12.4.127
202 | - nvidia-cuda-nvrtc-cu12==12.4.127
203 | - nvidia-cuda-runtime-cu12==12.4.127
204 | - nvidia-cudnn-cu12==9.1.0.70
205 | - nvidia-cufft-cu12==11.2.1.3
206 | - nvidia-curand-cu12==10.3.5.147
207 | - nvidia-cusolver-cu12==11.6.1.9
208 | - nvidia-cusparse-cu12==12.3.1.170
209 | - nvidia-cusparselt-cu12==0.6.2
210 | - nvidia-nccl-cu12==2.21.5
211 | - nvidia-nvjitlink-cu12==12.4.127
212 | - nvidia-nvtx-cu12==12.4.127
213 | - olefile==0.47
214 | - onnxruntime==1.20.1
215 | - openinference-instrumentation==0.1.22
216 | - openinference-instrumentation-llama-index==3.2.0
217 | - openinference-semantic-conventions==0.1.14
218 | - openpyxl==3.1.5
219 | - opentelemetry-api==1.30.0
220 | - opentelemetry-exporter-otlp==1.30.0
221 | - opentelemetry-exporter-otlp-proto-common==1.30.0
222 | - opentelemetry-exporter-otlp-proto-grpc==1.30.0
223 | - opentelemetry-exporter-otlp-proto-http==1.30.0
224 | - opentelemetry-instrumentation==0.51b0
225 | - opentelemetry-proto==1.30.0
226 | - opentelemetry-sdk==1.30.0
227 | - opentelemetry-semantic-conventions==0.51b0
228 | - orjson==3.10.15
229 | - pathvalidate==3.2.3
230 | - pdfminer-six==20240706
231 | - pillow==10.4.0
232 | - portalocker==2.10.1
233 | - protobuf==5.29.3
234 | - psutil==7.0.0
235 | - puremagic==1.28
236 | - py-rust-stemmers==0.1.3
237 | - pyarrow==19.0.0
238 | - pydub==0.25.1
239 | - pygments==2.19.1
240 | - pyjwt==2.10.1
241 | - python-multipart==0.0.20
242 | - python-pptx==1.0.2
243 | - qdrant-client==1.13.2
244 | - rich==13.9.4
245 | - ruff==0.9.6
246 | - safehttpx==0.1.6
247 | - safetensors==0.5.2
248 | - scikit-learn==1.6.1
249 | - scipy==1.15.1
250 | - semantic-version==2.10.0
251 | - sentence-transformers==3.4.1
252 | - sgmllib3k==1.0.0
253 | - shellingham==1.5.4
254 | - speechrecognition==3.14.1
255 | - sqlean-py==3.47.0
256 | - starlette==0.45.3
257 | - strawberry-graphql==0.253.1
258 | - sympy==1.13.1
259 | - threadpoolctl==3.5.0
260 | - tokenizers==0.21.0
261 | - tomlkit==0.13.2
262 | - torch==2.6.0
263 | - torchaudio==2.6.0
264 | - torchvision==0.21.0
265 | - transformers==4.48.3
266 | - triton==3.2.0
267 | - typer==0.15.1
268 | - uvicorn==0.34.0
269 | - websockets==14.2
270 | - xlrd==2.0.1
271 | - xlsxwriter==3.2.2
272 | - youtube-transcript-api==0.6.3
273 | - zipp==3.21.0
--------------------------------------------------------------------------------
/docker/getSecrets.py:
--------------------------------------------------------------------------------
1 | # Read the API keys from the Docker secrets files, stripping trailing
2 | # newlines so they do not end up inside the key strings
3 | with open("/run/secrets/mistral") as m:
4 |     mistral_api_key = m.read().strip()
5 | with open("/run/secrets/phoenix") as p:
6 |     phoenix_api_key = p.read().strip()
7 | with open("/run/secrets/llamacloud") as l:
8 |     llamacloud_api_key = l.read().strip()
9 |
--------------------------------------------------------------------------------
/docker/run.sh:
--------------------------------------------------------------------------------
1 | eval "$(conda shell.bash hook)"
2 |
3 | conda activate papers-chat
4 | echo "Activated conda env"
5 | python3 /app/app.py
6 |
--------------------------------------------------------------------------------
/docker/toolsFunctions.py:
--------------------------------------------------------------------------------
1 | import urllib, urllib.request
2 | from pydantic import Field
3 | from datetime import datetime
4 | from markitdown import MarkItDown
5 | from Bio import Entrez
6 | import xml.etree.ElementTree as ET
7 |
8 | md = MarkItDown()
9 |
10 | def format_today():
11 | d = datetime.now()
12 | if d.month < 10:
13 | month = f"0{d.month}"
14 | else:
15 | month = d.month
16 | if d.day < 10:
17 | day = f"0{d.day}"
18 | else:
19 | day = d.day
20 | if d.hour < 10:
21 | hour = f"0{d.hour}"
22 | else:
23 | hour = d.hour
24 | if d.minute < 10:
25 | minute = f"0{d.minute}"
26 | else:
27 | minute = d.minute
28 | today = f"{d.year}{month}{day}{hour}{minute}"
29 | two_years_ago = f"{d.year-2}{month}{day}{hour}{minute}"
30 | return today, two_years_ago
31 |
32 | def arxiv_tool(search_query: str = Field(description="The query with which to search ArXiv database")):
33 | """A tool to search ArXiv"""
34 | today, two_years_ago = format_today()
35 | query = search_query.replace(" ", "+")
36 | url = f'http://export.arxiv.org/api/query?search_query=all:{query}+AND+submittedDate:[{two_years_ago}+TO+{today}]&start=0&max_results=3'
37 | data = urllib.request.urlopen(url)
38 | content = data.read().decode("utf-8")
39 | f = open("arxiv_results.xml", "w")
40 | f.write(content)
41 | f.close()
42 | result = md.convert("arxiv_results.xml")
43 | return result.text_content
44 |
45 | def search_pubmed(query):
46 | Entrez.email = "astraberte9@gmail.com" # Replace with your email
47 | handle = Entrez.esearch(db="pubmed", term=query, retmax=3)
48 | record = Entrez.read(handle)
49 | handle.close()
50 | return record["IdList"]
51 |
52 | def fetch_pubmed_details(pubmed_ids):
53 | Entrez.email = "your.personal@email.com" # Replace with your email
54 | handle = Entrez.efetch(db="pubmed", id=pubmed_ids, rettype="medline", retmode="xml")
55 | records = handle.read()
56 | handle.close()
57 | recs = records.decode("utf-8")
58 | f = open("biomed_results.xml", "w")
59 | f.write(recs)
60 | f.close()
61 |
62 | def fetch_xml():
63 | tree = ET.parse("biomed_results.xml")
64 | root = tree.getroot()
65 | parsed_articles = []
66 | for article in root.findall('PubmedArticle'):
67 | # Extract title
68 | title = article.find('.//ArticleTitle')
69 | title_text = title.text if title is not None else "No title"
70 | # Extract abstract
71 | abstract = article.find('.//Abstract/AbstractText')
72 | abstract_text = abstract.text if abstract is not None else "No abstract"
73 | # Format output
74 | formatted_entry = f"## {title_text}\n\n**Abstract**:\n\n{abstract_text}"
75 | parsed_articles.append(formatted_entry)
76 | return "\n\n".join(parsed_articles)
77 |
78 | def pubmed_tool(search_query: str = Field(description="The query with which to search PubMed database")):
79 | """A tool to search PubMed"""
80 | idlist = search_pubmed(search_query)
81 | if len(idlist) == 0:
82 | return "There is no significant match in PubMed"
83 | fetch_pubmed_details(idlist)
84 | content = fetch_xml()
85 | return content
86 |
--------------------------------------------------------------------------------
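The `fetch_xml` helper above walks the PubMed XML with ElementTree; the same traversal can be exercised offline on a made-up snippet that mirrors the `PubmedArticle` structure the function expects (the sample XML and its text content are invented for illustration):

```python
import xml.etree.ElementTree as ET

# Made-up XML mirroring the layout of a PubMed efetch response
sample = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <Article>
        <ArticleTitle>A sample paper</ArticleTitle>
        <Abstract><AbstractText>A sample abstract.</AbstractText></Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

root = ET.fromstring(sample)
entries = []
for article in root.findall("PubmedArticle"):
    title = article.find(".//ArticleTitle")            # same XPath as fetch_xml
    abstract = article.find(".//Abstract/AbstractText")
    entries.append(f"## {title.text}\n\n**Abstract**:\n\n{abstract.text}")

result = "\n\n".join(entries)
print(result.splitlines()[0])  # ## A sample paper
```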
/docker/usage.md:
--------------------------------------------------------------------------------
1 | # PapersChat Usage Guide
2 |
3 | If you find PapersChat useful, please consider supporting us through a donation:
4 |
5 | [donation badge]
6 |
7 |
8 | > _This guide is only on how to use **the app**, not on how to install and/or launch it or on how it works internally. For that, please refer to the [GitHub repository](https://github.com/AstraBert/PapersChat)_
9 |
10 | ## Use PapersChat with your documents
11 |
12 | If you have papers that you would like to chat with, this is the right section of the guide!
13 |
14 | In order to chat with your papers, you will need to upload them (**as PDF files**) via the dedicated "Upload Papers" widget at the bottom of the chat interface: you can upload one or more files there (remember: the more you upload, the slower the processing is going to be).
15 |
16 | Once you have uploaded the files, before submitting them, you have to do two more things:
17 |
18 | 1. Specify the collection that you want to upload the documents to (in the "Collection" area)
19 | 2. Write your first question/message to interrogate your papers (in the message input space)
20 |
21 | As for point (1), you can give your collection whatever name you want: once you have created a new collection, you can always re-use it in the future just by entering the same name. If you do not remember all your collections, you can go to the "Your Collections" tab of the application and click on "Generate" to see the list of your collections.
22 |
23 | Point (2) is very important: if you do not send any message, PapersChat will tell you that you need to send one.
24 |
25 | Once you have uploaded the papers, specified the collection and written the message, you can send it and PapersChat will:
26 |
27 | - Ingest your documents
28 | - Produce an answer to your questions
29 |
30 | Congrats! You have now created your first collection and sent your first message!
31 |
32 | > _**NOTE**: there is one more option we haven't covered yet, i.e. the 'LlamaParse' checkbox. Selecting it enables LlamaParse, a tool that LlamaIndex offers [as part of its LlamaCloud services](https://docs.llamaindex.ai/en/stable/llama_cloud/llama_parse/). LlamaParse employs enhanced parsing techniques to produce clean and well-structured data from (often messy) unstructured documents: the free tier allows you to parse up to 1000 pages/day. While this approach generates very good data for your collections, bear in mind that parsing might take quite some time (especially if your documents are dense, contain lots of text-in-images or are very long). By default the LlamaParse option is disabled_
33 |
34 | ## Use PapersChat with a collection as knowledge base
35 |
36 | Once you have uploaded all your documents, you might want to interrogate them without having to upload even more. That's where the "collection as knowledge base" option comes in handy. You can simply send a message selecting one of your existing collections as a knowledge base for PapersChat (without uploading any file) and... BAM! You will see that PapersChat replies to your questions :)
37 |
38 | ## Use PapersChat to interrogate PubMed/ArXiv
39 |
40 | PapersChat also has access to the PubMed and ArXiv paper archives: if you do not specify a collection name and you do not upload any files, PapersChat uses your question to search these two online databases for an answer.
41 |
42 | ## Monitor your collections
43 |
44 | Under the "Your Collections" tab of the application you can, by clicking on "Generate", see your collections: how many data points each one contains (these data points **do not correspond** to the number of papers you uploaded) and what its status is.
45 |
46 | A brief guide to collection statuses:
47 |
48 | - "green": collection is optimized and searchable
49 | - "yellow": collection is being optimized and you can search it
50 | - "red": collection is not optimized and it will probably return an error if you try to search it
--------------------------------------------------------------------------------
/docker/utils.py:
--------------------------------------------------------------------------------
1 | from llama_index.embeddings.huggingface import HuggingFaceEmbedding
2 | from llama_index.core import Settings
3 | from qdrant_client import QdrantClient
4 | from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
5 | from llama_index.core import StorageContext
6 | from llama_index.vector_stores.qdrant import QdrantVectorStore
7 | from llama_cloud_services import LlamaParse
8 | from getSecrets import llamacloud_api_key
9 | from typing import List
10 | import torch
11 |
12 |
13 | qdrant_client = QdrantClient("http://127.0.0.1:6333")
14 | device = "cuda" if torch.cuda.is_available() else "cpu"
15 | embedder = HuggingFaceEmbedding(model_name="nomic-ai/modernbert-embed-base", device=device)
16 | Settings.embed_model = embedder
17 |
18 | def ingest_documents(files: List[str], collection_name: str, llamaparse: bool = True):
19 | vector_store = QdrantVectorStore(client=qdrant_client, collection_name=collection_name, enable_hybrid=True)
20 | storage_context = StorageContext.from_defaults(vector_store=vector_store)
21 | if llamaparse:
22 | parser = LlamaParse(
23 | result_type="markdown",
24 | api_key=llamacloud_api_key
25 | )
26 | file_extractor = {".pdf": parser}
27 | documents = SimpleDirectoryReader(input_files=files, file_extractor=file_extractor).load_data()
28 | else:
29 | documents = SimpleDirectoryReader(input_files=files).load_data()
30 | index = VectorStoreIndex.from_documents(
31 | documents,
32 | storage_context=storage_context,
33 | )
34 | return index
35 |
36 |
--------------------------------------------------------------------------------
/environment.yml:
--------------------------------------------------------------------------------
1 | name: papers-chat
2 | channels:
3 | - conda-forge
4 | dependencies:
5 | - _libgcc_mutex=0.1=conda_forge
6 | - _openmp_mutex=4.5=2_gnu
7 | - aiohappyeyeballs=2.4.6=pyhd8ed1ab_0
8 | - aiohttp=3.11.12=py311h2dc5d0c_0
9 | - aiosignal=1.3.2=pyhd8ed1ab_0
10 | - annotated-types=0.7.0=pyhd8ed1ab_1
11 | - anyio=4.8.0=pyhd8ed1ab_0
12 | - attrs=25.1.0=pyh71513ae_0
13 | - beautifulsoup4=4.13.3=pyha770c72_0
14 | - brotli-python=1.1.0=py311hfdbb021_2
15 | - bzip2=1.0.8=h4bc722e_7
16 | - ca-certificates=2025.1.31=hbcca054_0
17 | - certifi=2025.1.31=pyhd8ed1ab_0
18 | - cffi=1.17.1=py311hf29c0ef_0
19 | - charset-normalizer=3.4.1=pyhd8ed1ab_0
20 | - click=8.1.8=pyh707e725_0
21 | - colorama=0.4.6=pyhd8ed1ab_1
22 | - dataclasses-json=0.6.7=pyhd8ed1ab_1
23 | - deprecated=1.2.18=pyhd8ed1ab_0
24 | - dirtyjson=1.0.8=pyhd8ed1ab_1
25 | - distro=1.9.0=pyhd8ed1ab_1
26 | - eval-type-backport=0.2.2=pyhd8ed1ab_0
27 | - eval_type_backport=0.2.2=pyha770c72_0
28 | - exceptiongroup=1.2.2=pyhd8ed1ab_1
29 | - filetype=1.2.0=pyhd8ed1ab_0
30 | - freetype=2.12.1=h267a509_2
31 | - frozenlist=1.5.0=py311h2dc5d0c_1
32 | - fsspec=2025.2.0=pyhd8ed1ab_0
33 | - greenlet=3.1.1=py311hfdbb021_1
34 | - h11=0.14.0=pyhd8ed1ab_1
35 | - h2=4.2.0=pyhd8ed1ab_0
36 | - hpack=4.1.0=pyhd8ed1ab_0
37 | - httpcore=1.0.7=pyh29332c3_1
38 | - httpx=0.28.1=pyhd8ed1ab_0
39 | - hyperframe=6.1.0=pyhd8ed1ab_0
40 | - idna=3.10=pyhd8ed1ab_1
41 | - jiter=0.8.2=py311h9e33e62_0
42 | - joblib=1.4.2=pyhd8ed1ab_1
43 | - lcms2=2.17=h717163a_0
44 | - ld_impl_linux-64=2.43=h712a8e2_2
45 | - lerc=4.0.0=h27087fc_0
46 | - libblas=3.9.0=28_h59b9bed_openblas
47 | - libcblas=3.9.0=28_he106b2a_openblas
48 | - libdeflate=1.23=h4ddbbb0_0
49 | - libexpat=2.6.4=h5888daf_0
50 | - libffi=3.4.6=h2dba641_0
51 | - libgcc=14.2.0=h77fa898_1
52 | - libgcc-ng=14.2.0=h69a702a_1
53 | - libgfortran=14.2.0=h69a702a_1
54 | - libgfortran5=14.2.0=hd5240d6_1
55 | - libgomp=14.2.0=h77fa898_1
56 | - libjpeg-turbo=3.0.0=hd590300_1
57 | - liblapack=3.9.0=28_h7ac8fdf_openblas
58 | - liblzma=5.6.4=hb9d3cd8_0
59 | - libnsl=2.0.1=hd590300_0
60 | - libopenblas=0.3.28=pthreads_h94d23a6_1
61 | - libpng=1.6.46=h943b412_0
62 | - libsqlite=3.48.0=hee588c1_1
63 | - libstdcxx=14.2.0=hc0a3c3a_1
64 | - libstdcxx-ng=14.2.0=h4852527_1
65 | - libtiff=4.7.0=hd9ff511_3
66 | - libuuid=2.38.1=h0b41bf4_0
67 | - libwebp-base=1.5.0=h851e524_0
68 | - libxcb=1.17.0=h8a09558_0
69 | - libxcrypt=4.4.36=hd590300_1
70 | - libzlib=1.3.1=hb9d3cd8_2
71 | - llama-cloud=0.1.12=pyhd8ed1ab_0
72 | - llama-cloud-services=0.6.1=pyhd8ed1ab_0
73 | - llama-index=0.12.17=pyhd8ed1ab_0
74 | - llama-index-agent-openai=0.4.5=pyhd8ed1ab_0
75 | - llama-index-cli=0.4.0=pyhd8ed1ab_1
76 | - llama-index-core=0.12.17=pyhd8ed1ab_1
77 | - llama-index-embeddings-openai=0.3.1=pyhd8ed1ab_1
78 | - llama-index-indices-managed-llama-cloud=0.6.4=pyhd8ed1ab_0
79 | - llama-index-legacy=0.9.48.post4=pyhd8ed1ab_1
80 | - llama-index-llms-openai=0.3.19=pyhd8ed1ab_0
81 | - llama-index-multi-modal-llms-openai=0.4.3=pyhd8ed1ab_0
82 | - llama-index-program-openai=0.3.1=pyhd8ed1ab_1
83 | - llama-index-question-gen-openai=0.3.0=pyhd8ed1ab_1
84 | - llama-index-readers-file=0.4.5=pyhd8ed1ab_0
85 | - llama-index-readers-llama-parse=0.4.0=pyhd8ed1ab_1
86 | - llama-parse=0.6.1=pyhd8ed1ab_0
87 | - llamaindex-py-client=0.1.19=pyhd8ed1ab_1
88 | - marshmallow=3.26.1=pyhd8ed1ab_0
89 | - multidict=6.1.0=py311h2dc5d0c_2
90 | - mypy_extensions=1.0.0=pyha770c72_1
91 | - ncurses=6.5=h2d0b736_3
92 | - nest-asyncio=1.6.0=pyhd8ed1ab_1
93 | - networkx=3.4.2=pyh267e887_2
94 | - nltk=3.9.1=pyhd8ed1ab_1
95 | - numpy=2.2.3=py311h5d046bc_0
96 | - openai=1.63.0=pyhd8ed1ab_0
97 | - openjpeg=2.5.3=h5fbd93e_0
98 | - openssl=3.4.1=h7b32b05_0
99 | - packaging=24.2=pyhd8ed1ab_2
100 | - pandas=2.2.3=py311h7db5c69_1
101 | - pip=25.0.1=pyh8b19718_0
102 | - propcache=0.2.1=py311h2dc5d0c_1
103 | - pthread-stubs=0.4=hb9d3cd8_1002
104 | - pycparser=2.22=pyh29332c3_1
105 | - pydantic=2.10.6=pyh3cfb1c2_0
106 | - pydantic-core=2.27.2=py311h9e33e62_0
107 | - pypdf=5.3.0=pyh29332c3_0
108 | - pysocks=1.7.1=pyha55dd90_7
109 | - python=3.11.11=h9e4cc4f_1_cpython
110 | - python-dateutil=2.9.0.post0=pyhff2d567_1
111 | - python-dotenv=1.0.1=pyhd8ed1ab_1
112 | - python-tzdata=2025.1=pyhd8ed1ab_0
113 | - python_abi=3.11=5_cp311
114 | - pytz=2024.1=pyhd8ed1ab_0
115 | - pyyaml=6.0.2=py311h2dc5d0c_2
116 | - readline=8.2=h8228510_1
117 | - regex=2024.11.6=py311h9ecbd09_0
118 | - requests=2.32.3=pyhd8ed1ab_1
119 | - setuptools=75.8.0=pyhff2d567_0
120 | - six=1.17.0=pyhd8ed1ab_0
121 | - sniffio=1.3.1=pyhd8ed1ab_1
122 | - soupsieve=2.5=pyhd8ed1ab_1
123 | - sqlalchemy=2.0.38=py311h9ecbd09_0
124 | - striprtf=0.0.26=pyhd8ed1ab_0
125 | - tenacity=8.5.0=pyhd8ed1ab_0
126 | - tiktoken=0.9.0=py311hf1706b8_0
127 | - tk=8.6.13=noxft_h4845f30_101
128 | - tqdm=4.67.1=pyhd8ed1ab_1
129 | - typing-extensions=4.12.2=hd8ed1ab_1
130 | - typing_extensions=4.12.2=pyha770c72_1
131 | - typing_inspect=0.9.0=pyhd8ed1ab_1
132 | - tzdata=2025a=h78e105d_0
133 | - urllib3=2.3.0=pyhd8ed1ab_0
134 | - wheel=0.45.1=pyhd8ed1ab_1
135 | - wrapt=1.17.2=py311h9ecbd09_0
136 | - xorg-libxau=1.0.12=hb9d3cd8_0
137 | - xorg-libxdmcp=1.1.5=hb9d3cd8_0
138 | - yaml=0.2.5=h7f98852_2
139 | - yarl=1.18.3=py311h2dc5d0c_1
140 | - zstandard=0.23.0=py311hbc35293_1
141 | - zstd=1.5.6=ha6fb4c9_0
142 | - pip:
143 | - aiofiles==23.2.1
144 | - aioitertools==0.12.0
145 | - aiosqlite==0.21.0
146 | - alembic==1.14.1
147 | - arize-phoenix==7.12.2
148 | - arize-phoenix-evals==0.20.3
149 | - arize-phoenix-otel==0.7.1
150 | - arxiv==2.1.3
151 | - authlib==1.4.1
152 | - azure-ai-documentintelligence==1.0.0
153 | - azure-core==1.32.0
154 | - azure-identity==1.20.0
155 | - biopython==1.85
156 | - cachetools==5.5.1
157 | - cobble==0.1.4
158 | - coloredlogs==15.0.1
159 | - cryptography==44.0.1
160 | - defusedxml==0.7.1
161 | - et-xmlfile==2.0.0
162 | - fastapi==0.115.8
163 | - fastembed==0.5.1
164 | - feedparser==6.0.11
165 | - ffmpy==0.5.0
166 | - filelock==3.17.0
167 | - flatbuffers==25.2.10
168 | - googleapis-common-protos==1.67.0
169 | - gradio==5.16.0
170 | - gradio-client==1.7.0
171 | - graphql-core==3.2.6
172 | - grpc-interceptor==0.15.4
173 | - grpcio==1.70.0
174 | - grpcio-tools==1.70.0
175 | - huggingface-hub==0.28.1
176 | - humanfriendly==10.0
177 | - importlib-metadata==8.5.0
178 | - isodate==0.7.2
179 | - jinja2==3.1.4
180 | - jsonpath-python==1.0.6
181 | - llama-index-embeddings-fastembed==0.3.0
182 | - llama-index-embeddings-huggingface==0.5.1
183 | - llama-index-llms-azure-openai==0.3.2
184 | - llama-index-llms-mistralai==0.3.2
185 | - llama-index-llms-ollama==0.5.4
186 | - llama-index-tools-arxiv==0.3.0
187 | - llama-index-vector-stores-qdrant==0.4.3
188 | - loguru==0.7.3
189 | - lxml==5.3.1
190 | - mako==1.3.9
191 | - mammoth==1.9.0
192 | - markdown-it-py==3.0.0
193 | - markdownify==0.14.1
194 | - markitdown==0.0.1a4
195 | - markupsafe==2.1.5
196 | - mdurl==0.1.2
197 | - mistralai==1.5.0
198 | - mmh3==4.1.0
199 | - mpmath==1.3.0
200 | - msal==1.31.1
201 | - msal-extensions==1.2.0
202 | - nvidia-cublas-cu12==12.4.5.8
203 | - nvidia-cuda-cupti-cu12==12.4.127
204 | - nvidia-cuda-nvrtc-cu12==12.4.127
205 | - nvidia-cuda-runtime-cu12==12.4.127
206 | - nvidia-cudnn-cu12==9.1.0.70
207 | - nvidia-cufft-cu12==11.2.1.3
208 | - nvidia-curand-cu12==10.3.5.147
209 | - nvidia-cusolver-cu12==11.6.1.9
210 | - nvidia-cusparse-cu12==12.3.1.170
211 | - nvidia-cusparselt-cu12==0.6.2
212 | - nvidia-nccl-cu12==2.21.5
213 | - nvidia-nvjitlink-cu12==12.4.127
214 | - nvidia-nvtx-cu12==12.4.127
215 | - olefile==0.47
216 | - ollama==0.4.8
217 | - onnxruntime==1.20.1
218 | - openinference-instrumentation==0.1.22
219 | - openinference-instrumentation-llama-index==3.2.0
220 | - openinference-semantic-conventions==0.1.14
221 | - openpyxl==3.1.5
222 | - opentelemetry-api==1.30.0
223 | - opentelemetry-exporter-otlp==1.30.0
224 | - opentelemetry-exporter-otlp-proto-common==1.30.0
225 | - opentelemetry-exporter-otlp-proto-grpc==1.30.0
226 | - opentelemetry-exporter-otlp-proto-http==1.30.0
227 | - opentelemetry-instrumentation==0.51b0
228 | - opentelemetry-proto==1.30.0
229 | - opentelemetry-sdk==1.30.0
230 | - opentelemetry-semantic-conventions==0.51b0
231 | - orjson==3.10.15
232 | - pathvalidate==3.2.3
233 | - pdfminer-six==20240706
234 | - pillow==10.4.0
235 | - portalocker==2.10.1
236 | - protobuf==5.29.3
237 | - psutil==7.0.0
238 | - puremagic==1.28
239 | - py-rust-stemmers==0.1.3
240 | - pyarrow==19.0.0
241 | - pydub==0.25.1
242 | - pygments==2.19.1
243 | - pyjwt==2.10.1
244 | - python-multipart==0.0.20
245 | - python-pptx==1.0.2
246 | - qdrant-client==1.13.2
247 | - rich==13.9.4
248 | - ruff==0.9.6
249 | - safehttpx==0.1.6
250 | - safetensors==0.5.2
251 | - scikit-learn==1.6.1
252 | - scipy==1.15.1
253 | - semantic-version==2.10.0
254 | - sentence-transformers==3.4.1
255 | - sgmllib3k==1.0.0
256 | - shellingham==1.5.4
257 | - speechrecognition==3.14.1
258 | - sqlean-py==3.47.0
259 | - starlette==0.45.3
260 | - strawberry-graphql==0.253.1
261 | - sympy==1.13.1
262 | - threadpoolctl==3.5.0
263 | - tokenizers==0.21.0
264 | - tomlkit==0.13.2
265 | - torch==2.6.0
266 | - torchaudio==2.6.0
267 | - torchvision==0.21.0
268 | - transformers==4.48.3
269 | - triton==3.2.0
270 | - typer==0.15.1
271 | - uvicorn==0.34.0
272 | - websockets==14.2
273 | - xlrd==2.0.1
274 | - xlsxwriter==3.2.2
275 | - youtube-transcript-api==0.6.3
276 | - zipp==3.21.0
--------------------------------------------------------------------------------
/flowchart.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AstraBert/PapersChat/1d91b36f92071b7f1e002204735a7bbba385f406/flowchart.png
--------------------------------------------------------------------------------
/local_setup.ps1:
--------------------------------------------------------------------------------
1 | docker compose up db -d
2 |
3 | conda env create -f environment.yml
4 |
5 | conda activate papers-chat
6 | python3 scripts/app.py
7 | conda deactivate
--------------------------------------------------------------------------------
/local_setup.sh:
--------------------------------------------------------------------------------
1 | docker compose up db -d
2 |
3 | conda env create -f environment.yml
4 |
5 | conda activate papers-chat
6 | python3 scripts/app.py
7 | conda deactivate
--------------------------------------------------------------------------------
/logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AstraBert/PapersChat/1d91b36f92071b7f1e002204735a7bbba385f406/logo.png
--------------------------------------------------------------------------------
/scripts/app.py:
--------------------------------------------------------------------------------
1 | from utils import ingest_documents, qdrant_client, List, QdrantVectorStore, VectorStoreIndex, embedder
2 | import sys
3 | import gradio as gr
4 | from toolsFunctions import pubmed_tool, arxiv_tool
5 | from llama_index.core.tools import QueryEngineTool, FunctionTool
6 | from llama_index.core import Settings
7 | from llama_index.llms.mistralai import MistralAI
8 | from llama_index.llms.azure_openai import AzureOpenAI
9 | from llama_index.llms.ollama import Ollama
10 | from llama_index.core.llms import ChatMessage
11 | from llama_index.core.agent import ReActAgent
12 | from dotenv import load_dotenv
13 | from phoenix.otel import register
14 | from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
15 | import time
16 | import os
17 |
18 | load_dotenv()
19 |
20 | ## Observing and tracing
21 | PHOENIX_API_KEY = os.getenv("phoenix_api_key")
22 | os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={PHOENIX_API_KEY}"
23 | os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
24 | tracer_provider = register(
25 | project_name="llamaindex",
26 | )
27 | LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)
28 |
29 | ## Globals
30 | if os.getenv("mistral_api_key", None) is not None:
31 | Settings.llm = MistralAI(model="mistral-small-latest", temperature=0, api_key=os.getenv("mistral_api_key"))
32 | elif os.getenv("azure_openai_api_key", None) is not None:
33 | Settings.llm = AzureOpenAI(model="gpt-4.1", temperature=0, api_key=os.getenv("azure_openai_api_key"))
34 | elif os.getenv("ollama_model", None) is not None:
35 | Settings.llm = Ollama(model=os.getenv("ollama_model"))
36 | else:
37 | print("ERROR! No supported LLM can be loaded in PapersChat. Exiting...")
38 | sys.exit(1)
39 |
40 | Settings.embed_model = embedder
41 | arxivtool = FunctionTool.from_defaults(arxiv_tool, name="arxiv_tool", description="A tool to search ArXiv (pre-print papers database) for specific papers")
42 | pubmedtool = FunctionTool.from_defaults(pubmed_tool, name="pubmed_tool", description="A tool to search PubMed (printed medical papers database) for specific papers")
43 | query_engine = None
44 | message_history = [
45 | ChatMessage(role="system", content="You are a helpful assistant that answers the user's questions about the papers they have uploaded. You should base your answers on the context you can retrieve from the PDFs and, if you cannot retrieve any, search ArXiv for a potential answer. If you cannot find any viable answer, please reply that you do not know the answer to the user's question")
46 | ]
47 |
48 | ## Functions
49 | def reply(message, history, files: List[str] | None, collection: str | None, llamaparse: bool = False):
50 | global message_history
51 | if message == "" or message is None:
52 | response = "You should provide a message"
53 | r = ""
54 | for char in response:
55 | r+=char
56 | time.sleep(0.001)
57 | yield r
58 | elif files is None and collection == "":
59 | res = "### WARNING! You did not specify any collection, so I only interrogated ArXiv and/or PubMed to answer your question\n\n"
60 | agent = ReActAgent.from_tools(tools=[pubmedtool, arxivtool], verbose=True)
61 | response = agent.chat(message = message, chat_history = message_history)
62 | response = str(response)
63 | message_history.append(ChatMessage(role="user", content=message))
64 | message_history.append(ChatMessage(role="assistant", content=response))
65 | response = res + response
66 | r = ""
67 | for char in response:
68 | r+=char
69 | time.sleep(0.001)
70 | yield r
71 | elif files is None and collection != "" and collection not in [c.name for c in qdrant_client.get_collections().collections]:
72 | response = "Make sure that the name of the existing collection to use as a knowledge base is correct, because the one you provided does not exist! You can check your existing collections and their features in the dedicated tab of the app :)"
73 | r = ""
74 | for char in response:
75 | r+=char
76 | time.sleep(0.001)
77 | yield r
78 | elif files is not None:
79 | if collection == "":
80 | response = "You should provide a collection name (new or existing) if you want to ingest files!"
81 | r = ""
82 | for char in response:
83 | r+=char
84 | time.sleep(0.001)
85 | yield r
86 | else:
87 | collection_name = collection
88 | index = ingest_documents(files, collection_name, llamaparse)
89 | query_engine = index.as_query_engine()
90 | rag_tool = QueryEngineTool.from_defaults(query_engine, name="papers_rag", description="A RAG engine with information from selected scientific papers")
91 | agent = ReActAgent.from_tools(tools=[rag_tool, pubmedtool, arxivtool], verbose=True)
92 | response = agent.chat(message = message, chat_history = message_history)
93 | response = str(response)
94 | message_history.append(ChatMessage(role="user", content=message))
95 | message_history.append(ChatMessage(role="assistant", content=response))
96 | r = ""
97 | for char in response:
98 | r+=char
99 | time.sleep(0.001)
100 | yield r
101 | else:
102 | vector_store = QdrantVectorStore(client = qdrant_client, collection_name=collection, enable_hybrid=True)
103 | index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
104 | query_engine = index.as_query_engine()
105 | rag_tool = QueryEngineTool.from_defaults(query_engine, name="papers_rag", description="A RAG engine with information from selected scientific papers")
106 | agent = ReActAgent.from_tools(tools=[rag_tool, pubmedtool, arxivtool], verbose=True)
107 | response = agent.chat(message = message, chat_history = message_history)
108 | response = str(response)
109 | message_history.append(ChatMessage(role="user", content=message))
110 | message_history.append(ChatMessage(role="assistant", content=response))
111 | r = ""
112 | for char in response:
113 | r+=char
114 | time.sleep(0.001)
115 | yield r
116 |
117 | def to_markdown_color(grade: str):
118 | colors = {"red": "ff0000", "yellow": "ffcc00", "green": "33cc33"}
119 | mdcode = f"![#{colors[grade]}](https://placehold.co/15x15/{colors[grade]}/{colors[grade]}.png)"
120 | return mdcode
121 |
122 | def get_qdrant_collections_dets():
123 | collections = [c.name for c in qdrant_client.get_collections().collections]
124 | details = []
125 | counter = 0
126 | for collection in collections:
127 | counter += 1
128 | dets = qdrant_client.get_collection(collection)
129 | p = f"### {counter}. {collection}\n\n**Number of Points**: {dets.points_count}\n\n**Status**: {to_markdown_color(dets.status)} {dets.status}\n\n"
130 | details.append(p)
131 | final_text = "Available Collections\n\n"
132 | final_text += "\n\n".join(details)
133 | return final_text
134 |
135 | ## Frontend
136 | accordion = gr.Accordion(label="⚠️Set up these parameters before you start chatting!⚠️")
137 |
138 | iface1 = gr.ChatInterface(fn=reply, additional_inputs=[gr.File(label="Upload Papers (only PDF allowed!)", file_count="multiple", file_types=[".pdf","pdf",".PDF","PDF"], value=None), gr.Textbox(label="Collection", info="Upload your papers to a collection (new or existing)", value=""), gr.Checkbox(label="Use LlamaParse", info="Needs the LlamaCloud API key", value=False)], additional_inputs_accordion=accordion)
139 | with open("usage.md") as u:
140 |     content = u.read()
141 |
142 | iface2 = gr.Blocks()
143 | with iface2:
144 | with gr.Row():
145 | gr.Markdown(content)
146 | iface3 = gr.Interface(fn=get_qdrant_collections_dets, inputs=None, outputs=gr.Markdown(label="Collections"), submit_btn="See your collections")
147 | iface = gr.TabbedInterface([iface1, iface2, iface3], ["Chat💬", "Usage Guide⚙️", "Your Collections🔎"], title="PapersChat📝")
148 | iface.launch(server_name="0.0.0.0", server_port=7860)
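
Every branch of `reply` above ends with the same character-by-character streaming loop. As a sketch, that pattern can be factored into a standalone generator (the `stream_text` name and `delay` parameter are illustrative, not part of the app):

```python
import time

def stream_text(response: str, delay: float = 0.0):
    """Yield progressively longer prefixes of `response`, one character at a time.

    Gradio renders each yielded value, so yielding growing prefixes
    produces a typewriter effect in the chat window.
    """
    r = ""
    for char in response:
        r += char
        if delay:
            time.sleep(delay)
        yield r

chunks = list(stream_text("Hi!"))
# chunks == ["H", "Hi", "Hi!"]
```

Factoring the loop out this way would also remove the four near-identical copies of it inside `reply`.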
--------------------------------------------------------------------------------
/scripts/toolsFunctions.py:
--------------------------------------------------------------------------------
1 | import urllib, urllib.request
2 | from pydantic import Field
3 | from datetime import datetime
4 | from markitdown import MarkItDown
5 | from Bio import Entrez
6 | import xml.etree.ElementTree as ET
7 |
8 | md = MarkItDown()
9 |
10 | def format_today():
11 | d = datetime.now()
12 | if d.month < 10:
13 | month = f"0{d.month}"
14 | else:
15 | month = d.month
16 | if d.day < 10:
17 | day = f"0{d.day}"
18 | else:
19 | day = d.day
20 | if d.hour < 10:
21 | hour = f"0{d.hour}"
22 | else:
23 | hour = d.hour
24 | if d.minute < 10:
25 | minute = f"0{d.minute}"
26 | else:
27 | minute = d.minute
28 | today = f"{d.year}{month}{day}{hour}{minute}"
29 | two_years_ago = f"{d.year-2}{month}{day}{hour}{minute}"
30 | return today, two_years_ago
31 |
32 | def arxiv_tool(search_query: str = Field(description="The query with which to search ArXiv database")):
33 | """A tool to search ArXiv"""
34 | today, two_years_ago = format_today()
35 | query = search_query.replace(" ", "+")
36 | url = f'http://export.arxiv.org/api/query?search_query=all:{query}+AND+submittedDate:[{two_years_ago}+TO+{today}]&start=0&max_results=3'
37 | data = urllib.request.urlopen(url)
38 | content = data.read().decode("utf-8")
39 | f = open("arxiv_results.xml", "w")
40 | f.write(content)
41 | f.close()
42 | result = md.convert("arxiv_results.xml")
43 | return result.text_content
44 |
45 | def search_pubmed(query):
46 | Entrez.email = "astraberte9@gmail.com" # Replace with your email
47 | handle = Entrez.esearch(db="pubmed", term=query, retmax=3)
48 | record = Entrez.read(handle)
49 | handle.close()
50 | return record["IdList"]
51 |
52 | def fetch_pubmed_details(pubmed_ids):
53 | Entrez.email = "your.personal@email.com" # Replace with your email
54 | handle = Entrez.efetch(db="pubmed", id=pubmed_ids, rettype="medline", retmode="xml")
55 | records = handle.read()
56 | handle.close()
57 | recs = records.decode("utf-8")
58 | f = open("biomed_results.xml", "w")
59 | f.write(recs)
60 | f.close()
61 |
62 | def fetch_xml():
63 | tree = ET.parse("biomed_results.xml")
64 | root = tree.getroot()
65 | parsed_articles = []
66 | for article in root.findall('PubmedArticle'):
67 | # Extract title
68 | title = article.find('.//ArticleTitle')
69 | title_text = title.text if title is not None else "No title"
70 | # Extract abstract
71 | abstract = article.find('.//Abstract/AbstractText')
72 | abstract_text = abstract.text if abstract is not None else "No abstract"
73 | # Format output
74 | formatted_entry = f"## {title_text}\n\n**Abstract**:\n\n{abstract_text}"
75 | parsed_articles.append(formatted_entry)
76 | return "\n\n".join(parsed_articles)
77 |
78 | def pubmed_tool(search_query: str = Field(description="The query with which to search PubMed database")):
79 | """A tool to search PubMed"""
80 | idlist = search_pubmed(search_query)
81 | if len(idlist) == 0:
82 | return "There is no significant match in PubMed"
83 | fetch_pubmed_details(idlist)
84 | content = fetch_xml()
85 | return content
86 |
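
For reference, `format_today`'s manual zero-padding collapses into a single `strftime` call, and the request URL can be assembled the same way `arxiv_tool` does. This is a sketch: the search term is illustrative, and the date filter is joined to the query with `+AND+submittedDate:` per the ArXiv API query syntax:

```python
from datetime import datetime

def format_today_strftime():
    """Equivalent of format_today: zero-padded YYYYMMDDHHMM stamps for now and two years ago."""
    d = datetime.now()
    stamp = d.strftime("%m%d%H%M")  # strftime zero-pads month, day, hour and minute
    return f"{d.year}{stamp}", f"{d.year - 2}{stamp}"

today, two_years_ago = format_today_strftime()
query = "large language models".replace(" ", "+")  # illustrative search term
url = (
    "http://export.arxiv.org/api/query"
    f"?search_query=all:{query}+AND+submittedDate:[{two_years_ago}+TO+{today}]"
    "&start=0&max_results=3"
)
```

Keeping the year outside `strftime` mirrors the original's year arithmetic and sidesteps the `ValueError` that `d.replace(year=d.year - 2)` would raise on February 29.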
--------------------------------------------------------------------------------
/scripts/utils.py:
--------------------------------------------------------------------------------
1 | from llama_index.embeddings.huggingface import HuggingFaceEmbedding
2 | from llama_index.core import Settings
3 | from llama_index.llms.mistralai import MistralAI
4 | from qdrant_client import QdrantClient
5 | from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
6 | from llama_index.core import StorageContext
7 | from llama_index.vector_stores.qdrant import QdrantVectorStore
8 | from llama_cloud_services import LlamaParse
9 | from dotenv import load_dotenv
10 | from typing import List
11 | import torch
12 | import os
13 |
14 |
15 | load_dotenv()
16 |
17 |
18 | qdrant_client = QdrantClient("http://localhost:6333")
19 | device = "cuda" if torch.cuda.is_available() else "cpu"
20 | embedder = HuggingFaceEmbedding(model_name="nomic-ai/modernbert-embed-base", device=device)
21 | Settings.embed_model = embedder
22 |
23 | def ingest_documents(files: List[str], collection_name: str, llamaparse: bool = True):
24 | vector_store = QdrantVectorStore(client=qdrant_client, collection_name=collection_name, enable_hybrid=True)
25 | storage_context = StorageContext.from_defaults(vector_store=vector_store)
26 | if llamaparse:
27 | parser = LlamaParse(
28 | result_type="markdown",
29 | api_key=os.getenv("llamacloud_api_key")
30 | )
31 | file_extractor = {".pdf": parser}
32 | documents = SimpleDirectoryReader(input_files=files, file_extractor=file_extractor).load_data()
33 | else:
34 | documents = SimpleDirectoryReader(input_files=files).load_data()
35 | index = VectorStoreIndex.from_documents(
36 | documents,
37 | storage_context=storage_context,
38 | )
39 | return index
40 |
41 |
42 |
--------------------------------------------------------------------------------
/start_services.ps1:
--------------------------------------------------------------------------------
1 | docker compose up db -d
2 | docker compose up app -d
--------------------------------------------------------------------------------
/start_services.sh:
--------------------------------------------------------------------------------
1 | docker compose up db -d
2 | docker compose up app -d
--------------------------------------------------------------------------------
/usage.md:
--------------------------------------------------------------------------------
1 | PapersChat Usage Guide
2 |
3 | If you find PapersChat useful, please consider supporting us through a donation:
4 |
5 | 
6 |
7 |
8 | > _This guide covers only how to use **the app**, not how to install or launch it, nor how it works internally. For that, please refer to the [GitHub repository](https://github.com/AstraBert/PapersChat)_
9 |
10 | ## Use PapersChat with your documents
11 |
12 | If you have papers that you would like to chat with, this is the right section of the guide!
13 |
14 | In order to chat with your papers, you will need to upload them (**as PDF files**) to the dedicated "Upload Papers" widget at the bottom of the chat interface: you can upload one or more files there (remember: the more you upload, the slower processing will be).
15 |
16 | Once you have uploaded the files, and before submitting them, you need to do two more things:
17 |
18 | 1. Specify the collection that you want to upload the documents to (in the "Collection" area)
19 | 2. Write your first question/message to interrogate your papers (in the message input space)
20 |
21 | As for point (1), you can give your collection whatever name you want: once you have created a collection, you can always re-use it in the future simply by entering the same name. If you do not remember all your collections, go to the "Your collections" tab in the application and click on "Generate" to see the list.
22 |
23 | Point (2) is very important: if you do not send any message, PapersChat will tell you that you need to send one.
24 |
25 | Once you have uploaded the papers, specified the collection and written the message, you can send it and PapersChat will:
26 |
27 | - Ingest your documents
28 | - Produce an answer to your questions
29 |
30 | Congrats! You have now created your first collection and sent your first message!
31 |
32 | > _**NOTE**: there is one more option we haven't covered yet, i.e. the 'LlamaParse' checkbox. Selecting it enables LlamaParse, a tool that LlamaIndex offers [as part of its LlamaCloud services](https://docs.llamaindex.ai/en/stable/llama_cloud/llama_parse/). LlamaParse employs enhanced parsing techniques to produce clean and well-structured data from (often messy) unstructured documents: the free tier allows you to parse up to 1000 pages/day. While this approach generates very good data for your collections, bear in mind that parsing might take quite some time (especially if your documents are dense, contain lots of text-in-images or are very long). By default the LlamaParse option is disabled_
33 |
34 | ## Use PapersChat with a collection as knowledge base
35 |
36 | Once you have uploaded all your documents, you might want to interrogate them without having to upload even more. That's where the "collection as knowledge base" option comes in handy. You can simply send a message selecting one of your existing collections as a knowledge base for PapersChat (without uploading any file) and... BAM! You will see that PapersChat replies to your questions :)
37 |
38 | ## Use PapersChat to interrogate PubMed/ArXiv
39 |
40 | PapersChat also has access to the PubMed and ArXiv paper archives: if you do not specify a collection name and you do not upload any files, PapersChat uses your question to search these two online databases for an answer.
41 |
42 | ## Monitor your collections
43 |
44 | Under the "Your Collections" tab of the application you can, by clicking on "Generate", see your collections: how many data points each one contains (these data points **do not correspond** to the number of papers you uploaded) and what its status is.
45 |
46 | A brief guide to collection statuses:
47 |
48 | - "green": collection is optimized and searchable
49 | - "yellow": collection is being optimized and you can search it
50 | - "red": collection is not optimized and it will probably return an error if you try to search it
--------------------------------------------------------------------------------